LLM Observability & Eval in Practice: A Complete Guide to Monitoring and Evaluating Large Models in Production

Your AI Agent is finally live — now what? How do you know what it's actually doing in production? How do you pinpoint issues when things go wrong? And how do you make sure fixing one bug doesn't create three new ones?

At the AI Engineer conference, Dat, an AI architect at Arise AI, shared his deep practical experience with enterprise-grade LLM observability, evaluation (Eval), and iterative experimentation. As a heavy practitioner whose work consumed between 100 billion and 1 trillion tokens last year, he collaborates with some of the world's largest enterprises and has witnessed firsthand the core pain points they encounter during AI transformation.

This article systematically covers the three major themes from his talk: observability, evaluation systems, and the experimentation-improvement feedback loop.

That's kind of step one.

be able to run some out-of-the-box template or customize some eval here.

Then you can start to run experiments.

Observability: What Is Your Agent Actually Doing?

Start with Traces, but Don't Stop There

Dat opened with a key insight: AI is fundamentally a reimagination of software engineering — same patterns, different flavor. Since it's an engineering problem, the first step is observability — understanding what's actually happening inside the system you've built.

Arise's tech stack is built entirely on OpenTelemetry (OTEL), the widely validated observability standard from traditional software engineering. Through auto-instrumentation, developers only need to add a single line of code, and the system automatically captures call chains across frameworks and SDKs, generating trace and span views.

There's an important mindset shift here: In AI systems, the code itself cannot audit an Agent's behavior — what actually can is telemetry data. Traces record the complete behavioral trajectory of an Agent and form the foundation of observability.

Sessions and Distributed Views

But traces are just the starting point. Dat referenced Anthropic's Managed Agents paper to introduce the concept of Sessions. Sessions focus on state — the state transitions across multi-turn conversations and multiple runs between an Agent and a user. Enterprise users often don't care about the details of which tool an Agent called; what they care about is: Is the end user satisfied? Were all their questions answered?

Arise AX offers a unique distributed view feature. Since Agent behavior is non-deterministic, the same Agent may take different branching paths. This view lets you get a bird's-eye perspective on the behavioral distribution across all instantiated Agents:

How much traffic went down branch A versus branch B?
Which component in a specific branch caused significant latency?
How do evaluation signals differ across different paths?

This leads directly to the concept of Trajectory Eval: when an Agent takes path A everything works fine, but signals drop on path B. The root cause might be an incorrect component invocation order — for example, B depends on A's output, but the LLM decided to call B first.

Evaluation System: Five Signal Types and Four Scope Levels

Five Types of Evaluation Signals

Dat categorized evaluation signals into five types, forming a complete evaluation matrix:

1. LLM-as-a-Judge

It seems simple but can actually be quite complex. This is currently the most popular automated evaluation approach, using one large model to judge the output quality of another.

2. Human Feedback

Whether it's end-user usage feedback or a product manager's subjective judgment, human signals are extremely valuable — they serve as anchor points for calibrating automated evaluations.

3. Golden Dataset

High-quality datasets annotated by domain experts, representing your most trusted evaluation standards. A key technique: use golden datasets to calibrate your LLM-as-a-Judge, training the LLM to approximate expert judgment.

4. Deterministic Evaluation

Not all evaluations require an LLM. For example, checking whether JSON output is valid, whether the schema is correct, or whether required fields are non-empty — simple logical checks suffice, at virtually zero cost.

5. Business Metrics

Ultimately, every AI product serves one of three purposes — making more money, saving more money, or saving more time. Business metrics are the ultimate yardstick for measuring AI value.

Four Scope Levels of Evaluation

Evaluations differ not only by type but also by scope — a dimension many teams tend to overlook:

Span level: Evaluating the input/output of a single component — the simplest and most direct
Multi-Span level: Requiring data across multiple components to complete an evaluation, such as assessing data handoff quality between multiple Agents
Trajectory level: Evaluating whether the entire call chain trajectory is correct
Session level: The most macro perspective — evaluating the entire session state machine. Did the user ever feel frustrated? Were all questions answered?

Dat particularly emphasized a pragmatic principle: Just because you can evaluate something doesn't mean you should. You need to find the minimal set of evaluations to determine whether the system is working as expected, because every evaluation has a cost.

Experimentation and Improvement: From Finding Problems to Automated Feedback Loops

Dataset-Driven Experimentation Workflow

With observability and an evaluation system in place, the next step is improvement. The workflow Dat described is very clear:

Identify data points with anomalous signals from traces
Collect these data points into datasets (or directly upload input/output pairs)
Run experiments on the datasets — modify prompts, swap models, adjust orchestration logic, change configurations
Compare experimental results and quantify improvement effects

But there's a critical warning here: In non-deterministic systems, what you think is a fix for one problem may simultaneously create two or three regressions. This is why experimentation must be systematic rather than based on intuition.

Efficient Collaboration Between Two Roles

Dat observed that excellent AI product teams typically feature close collaboration between two types of roles:

Technical staff (AI engineers, developers): Excel at building systems and automation pipelines
Domain experts (product managers, business specialists): Understand what the AI experience should look and feel like

The best practice is to let people who can write code focus on engineering implementation, while domain experts handle prompt engineering and evaluation criteria definition. Arise's product design reflects this — supporting both non-technical UI operations and programmatic evaluation runs.

Future Direction: Automating the Entire Improvement Flywheel

The most forward-looking insight Dat shared: The entire observability → evaluation → experimentation flywheel should be automated.

Arise has a built-in AI system called Alex that can be invoked via CLI, toolsets, or directly by coding Agents like Claude Code. You can simply ask Alex: "Is there anything wrong with my application?" It will automatically analyze trace data, discover anomalies like high latency and errors, and even proactively create new evaluation metrics.

Dat's ultimate vision: Users shouldn't even need to choose their own evaluation metrics — AI should automatically determine what evaluations are needed based on trace context and automatically create new evaluations when changes are detected. It's not magic, but it should feel like magic.

Product Ecosystem: Open Source and Enterprise Solutions

Arise currently has two product lines:

Arise Phoenix: The open-source version — single-container deployment, no Kubernetes required, ideal for engineers to quickly get started and validate
Arise AX: The enterprise version, serving large companies like Uber, Booking, and Reddit, providing complete distributed views and collaboration features

Conclusion

A core framework can be distilled from this talk: Productionizing LLM applications is not a one-time deployment — it's a continuous observe → evaluate → experiment → improve feedback loop.

Within this loop:

Observability is the infrastructure that lets you see the system's true behavior
Evaluation systems are the signal sources that tell you what's working and what's broken
Experimentation and improvement is the value delivery mechanism that turns discovered problems into actual optimizations
Automation is the ultimate direction, making the entire flywheel spin faster and faster

For teams pushing AI Agents into production, the most practical advice might be: start by getting OpenTelemetry integration right, then begin with a minimal set of evaluations, and gradually build out a complete quality assurance system. Don't try to do everything at once — but don't go live naked without any observability either.