Andrew Ng on AI Agent Development: Evaluation and Error Analysis Are the Core Competitive Advantage

Andrew Ng says mastering evals and error analysis, not model choice, is what separates great Agent developers.
In his Agentic AI course, Andrew Ng argues that the biggest gap between effective and ineffective AI Agent developers isn't about models or frameworks — it's about maintaining a disciplined development process centered on evaluations (evals) and error analysis. He highlights how Agentic Workflows are enabling previously impossible applications across domains like legal, medical, and research, while cautioning against the hype surrounding the term.
Introduction: When "Agentic" Becomes an Overused Label
Andrew Ng opens this course on Agentic AI with a candid and interesting observation: when he first coined the term "agentic" to describe an important and rapidly growing trend in AI application development, he never anticipated that a wave of marketers would slap the word onto virtually every product in sight. This directly led to a dramatic surge in Agentic AI hype.
But the good news is — hype aside, the number of truly valuable and useful Agentic AI applications is also growing rapidly, even if not as explosively as the hype. The core goal of this course is to teach you best practices for building Agentic AI applications and open up entirely new development possibilities.
The core concept of Agentic AI lies in empowering AI systems with the ability to autonomously plan, make decisions, and execute multi-step tasks — a fundamental departure from traditional single-turn Q&A interactions. Andrew Ng systematically articulated this vision in multiple public talks in early 2024, connecting it to technical paradigms such as ReAct (Reasoning and Acting), Tool Use, and multi-agent collaboration. The term was quickly adopted by industry, but it also triggered concept dilution — everything from simple API call chains to complex autonomous decision-making systems got labeled as "Agentic," making it difficult for developers and decision-makers to distinguish genuine technical breakthroughs from marketing packaging.

What Is an Agentic Workflow? What Are the Real-World Use Cases?
In the course, Andrew Ng highlights several key areas where Agentic Workflows are being widely applied:
- Customer Support Agents: Building AI agents that can autonomously handle customer issues
- Deep Research: Helping produce research reports with deep insights
- Legal Document Processing: Handling complex and challenging legal documents
- Medical Diagnosis Assistance: Analyzing patient input and suggesting possible medical diagnoses

The underlying technical architecture of an Agentic Workflow typically includes several key components: a large language model as the reasoning engine, function calling interfaces for interacting with external systems, memory modules (short-term context and long-term storage) for maintaining task state, and a planning module responsible for task decomposition and execution strategy. Unlike traditional deterministic workflows, Agentic Workflows allow AI to dynamically adjust its strategy based on intermediate results during execution — and even backtrack to correct earlier decisions. This flexibility is precisely why they can handle unstructured, complex tasks like legal document review and medical diagnosis.
Andrew Ng particularly emphasizes a key fact: In many of his teams, a large number of projects would simply be impossible without Agentic Workflows. This isn't a nice-to-have technology — it's a qualitative leap from "impossible" to "possible." This means that mastering how to build Agentic Workflows has become one of the most important and valuable skills in AI today.
The Core Gap Between Expert and Novice AI Agent Developers: It's Not the Model — It's the Process
This is the most thought-provoking part of the entire opening. Andrew Ng points out that the biggest difference he's observed between people who truly know how to build Agentic Workflows and those who are less effective isn't about using more advanced models or fancier frameworks. Instead, it comes down to a seemingly simple but critically important capability —
Driving a disciplined development process, particularly one focused on evaluations (evals) and error analysis.

This point deserves deeper analysis. In current AI development practice, many people pour enormous effort into choosing models, tweaking prompts, and experimenting with different frameworks, while neglecting the most fundamental and important engineering practice:
Evaluations (Evals): Test-Driven Thinking for Agent Development
Building an Agent is not a one-and-done task. You need a systematic evaluation methodology to measure how your Agent performs across various scenarios. Without a reliable eval framework, you simply can't tell whether your changes made the system better or worse. It's like Test-Driven Development (TDD) in software engineering, except that in AI, designing evaluations is far more complex and nuanced.
Evaluating AI Agents is much more complex than traditional software testing because outputs are non-deterministic — the same input may produce different but equally valid outputs. Current industry evaluation methods include: gold-standard test sets based on human annotations, LLM-as-Judge (using another large model to assess output quality), and automated metrics targeting specific dimensions (such as task completion rate, tool call accuracy, hallucination rate, etc.). The eval framework that Andrew Ng emphasizes essentially requires developers to establish a quantifiable, reproducible quality measurement system rather than relying on subjective feelings to judge whether the system is good or bad.
Error Analysis: Finding the Real Bottlenecks in Your Agent
When an Agent makes mistakes — and it will — you need to systematically analyze error patterns and root causes. Is the prompt unclear? Is the tool-calling logic flawed? Is the context window too short, causing information loss? Or is the task decomposition granularity unreasonable? Only through rigorous error analysis can you identify the real bottlenecks and optimize accordingly.
In Agent development, errors often have cascading effects — a small deviation in an early step can be amplified through the subsequent reasoning chain, ultimately leading to completely wrong results. Systematic error analysis methods typically include: step-by-step trace analysis of the Agent's intermediate reasoning steps, classification and clustering of failure cases to discover common patterns, and ablation studies to isolate the specific component causing the error. This is fundamentally different from model debugging in machine learning — Agent errors more often stem from system design issues rather than model capability itself.

Why Is Andrew Ng's Agentic AI Course Worth Taking?
Andrew Ng's influence in AI education is undeniable. From his early Coursera Machine Learning course to the later Deep Learning Specialization, he has consistently excelled at transforming complex technical concepts into actionable practical guides.
The value of this Agentic AI course lies in its pragmatic orientation. In an era flooded with "Agents can do everything" rhetoric and an endless parade of new frameworks, Andrew Ng chooses to return to engineering fundamentals:
- Not chasing hype: Clearly distinguishing between hype and real value
- Focusing on methodology: Emphasizing "unglamorous but effective" practices like evals and error analysis
- Practice-oriented: Driven by real-world application scenarios
For developers looking to get started with or level up their Agent development skills, this kind of methodological guidance often has more long-term value than specific code tutorials. Because frameworks change and models iterate, but a systematic development process and evaluation mindset are transferable core competencies.
Conclusion
The opening of Andrew Ng's course sends a clear signal: the real value of Agentic AI isn't in the hype — it's in the fact that it genuinely makes previously impossible applications a reality. And to truly master this skill, the key isn't chasing the latest tools and frameworks, but rather establishing a rigorous development process centered on evaluation and error analysis.
In today's rapidly evolving AI landscape, this philosophy of "getting back to fundamentals" is more valuable than ever.
Related articles

Claude Code Installation Guide & The Five Stages of AI Programming Tools Explained
Complete Claude Code installation guide with the five stages of AI programming tools, from manual coding to agents. Learn 0-to-1 project building and 1-to-100 iteration challenges.

Enterprise-Level AI Project Rules Files: 5 Hard Rules + 6 Writing Techniques
AI keeps messing up your code? Learn 5 hard rules and 6 writing techniques for enterprise-level Rules files in Claude Code, Cursor & more, with templates.

Building Cloud Computing Clusters from Old Phones: Google and UCSD Explore a New Path to Sustainable Computing
Google and UCSD explore building cloud clusters from old phones, leveraging ARM chip efficiency to cut e-waste and data center carbon footprints.