How to Improve AI Apps with (Automated) Evals

4.0K views · 148 likes · 29:53 · Jun 22, 2025

🤝 Want your team maximizing Claude? I run 1:1 and team AI workshops for companies doing $1M+ per year: https://aibuilder.academy/yt/ayGdRbMDZcU

Although LLMs can perform arbitrary tasks, judging the quality of their output on open-ended tasks typically requires human evaluation. In this video, I'll discuss how we can scale this process up using automated evals.

📰 Read More: https://medium.com/@shawhin/llm-in-a-loop-improving-outputs-with-evals-5620e00f7258?sk=95956863ff584b8d1fd3664b0ec8a6bc
💻 Example Code: https://github.com/ShawhinT/linkedin-ghostwriter-dev

References
[1] https://youtu.be/-sL7QzDFW-4
[2] https://youtu.be/982V2ituTdc
[3] https://youtu.be/GL0XhAj5LPE
[4] https://maven.com/parlance-labs/evals

Introduction - 0:00
The Typical LLM Workflow - 0:21
The Problem - 1:11
Automated Evals - 1:50
2 Types of (Automated) Evals - 4:25
Example: Eval-driven LinkedIn Ghostwriter - 7:03
Step 1: Identify Failure Modes - 9:36
Step 2: Create LLM Judge - 10:49
Step 3: Curate User Inputs - 19:49
Step 4: Generate LI Posts - 20:30
Step 5: Apply Evals - 21:12
Step 6: Review Results and Refine - 22:06
The Results - 25:19
Demo - 26:59
