Agents Need Control Flow, Not More Prompts

Build AI Agents with code-based control flow instead of overly long Prompts for orchestration
The article argues that the root cause of AI Agent unreliability is embedding control flow in natural language Prompts, not insufficient LLM capability. Due to the probabilistic nature of LLMs and the "Lost in the Middle" effect, overly long Prompts dilute critical instructions. The correct approach is a "Control Flow First" architecture: use code to define state machines and orchestrate workflows while Prompts handle only single tasks, boosting multi-step reliability from 40% to over 90%.
A Counterintuitive Fact: Longer Prompts Make Agents Less Reliable
Many developers fall into a common trap when building AI Agents — when an Agent underperforms, their first instinct is to optimize the Prompt by adding more conditional logic, more detailed instructions, and more comprehensive examples. The result? The Prompt grows longer and longer, while the Agent becomes increasingly unstable.
This isn't your fault. The architecture is wrong.
One team wrote a Prompt exceeding 4,000 tokens for their Agent, cramming all conditional branches, exception handling, and flow control into natural language descriptions. The system still crashed randomly and performed erratically. The root cause wasn't the LLM's capability. It was that they had embedded control flow in natural language, and natural language is a terrible programming language.

Why Natural Language Is Unfit for Control Flow
Programming languages exist because they possess a critical property that natural language lacks: determinism. An if-else branch in code behaves 100% predictably, but the same logic described in natural language to an LLM may produce different results every time.
This non-determinism has deep technical roots. Large Language Models (LLMs) work by predicting a probability distribution over the next token, so their outputs are probabilistic by construction. Setting Temperature to 0 makes decoding far more repeatable, but it does not make the model follow complex multi-step instructions more faithfully: the same wording can still be misread, and a small change in the input can flip the outcome. Length makes this worse. As a Prompt grows, each instruction competes with more text for the model's attention, and critical instructions become more likely to be overlooked or misinterpreted. Researchers call one form of this the "Lost in the Middle" effect, reported in a 2023 study led by Stanford researchers: when the relevant information sits in the middle of a long context, the model's accuracy drops significantly compared with placing it at the beginning or end. This is a key reason why a 4,000-token Prompt can make an Agent less stable rather than more.
When we write instructions like "If the user mentions pricing, first query the database, then generate a comparison report, and if the query fails, return default values" in a Prompt, we're essentially using a fuzzy, probabilistic medium to express logic that requires precise execution. The LLM might:
- Skip certain steps
- Misinterpret the boundaries of conditional judgments
- Lose context in multi-step tasks
- Interpret "failure" ambiguously
These problems won't completely disappear as models become more capable, because they are inherent limitations of natural language itself.
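The pricing instruction quoted above can be written directly as code. The following is a minimal sketch, not a reference implementation: `handle_message`, `fake_llm`, `failing_query`, and `DEFAULT_PRICES` are hypothetical names introduced for illustration, and the keyword check is a deliberately crude stand-in for whatever intent detection a real system would use.

```python
from typing import Callable

# Hypothetical fallback values, used only when the database query fails.
DEFAULT_PRICES = {"basic": 0.0, "pro": 0.0}


def handle_message(
    message: str,
    query_prices: Callable[[], dict[str, float]],
    llm: Callable[[str], str],
) -> str:
    """Route one user message; the branch and the fallback live in code."""
    text = message.lower()
    # The condition is an explicit, testable predicate, not a phrase the model interprets.
    if not any(word in text for word in ("price", "pricing")):
        return llm(f"Answer the user briefly: {message}")

    # "Failure" has exactly one meaning here: the query raised an exception.
    try:
        prices = query_prices()
    except Exception:
        prices = DEFAULT_PRICES

    # The LLM gets a single narrow task: turn known data into a report.
    return llm(f"Write a short price comparison for: {sorted(prices.items())}")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return f"[LLM output for: {prompt}]"


def failing_query() -> dict[str, float]:
    """Simulates a database outage."""
    raise ConnectionError("database unavailable")


print(handle_message("What is your pricing?", failing_query, fake_llm))
```

Each of the four failure modes listed above now has a fixed answer: no step can be skipped, the condition boundary is the predicate, the only context passed along is the `prices` value, and failure means an exception.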
The Real Solution: Control Flow First Architecture
The real solution is a paradigm shift — Control Flow First. The core idea is simple: treat the LLM as a function call, use code to define state machines, loops, and error handling, and let Prompts handle only single, well-defined tasks.

How to Do It in Practice
Consider a common data analysis Agent with the task: "Extract data → Query database → Generate report."
Prompt-first approach: Use one super-long instruction to string the entire workflow together, letting the LLM decide how to execute each step. Measured success rate: under 40%.
Control flow first approach: Define the entire workflow as a directed graph, where each node corresponds to an orchestrated step, and every step has type checking and retry logic. The LLM only completes a single task within each node (e.g., "Extract dates and amounts from this text"), while flow progression, branching, and exception handling are all controlled by code.
The "directed graph" here isn't an abstract concept. It is a classic software engineering tool for handling complex workflows: a directed graph of steps (acyclic for a straight pipeline, cyclic once retries and loops are involved), which is in effect a State Machine. A state machine explicitly defines the "state" the system is in at any given moment and the "events" that trigger state transitions, which is precisely the determinism that natural language lacks. The LangGraph framework brings this concept into Agent orchestration: each node represents one operation, edges (including conditional edges) define the allowed transitions, and the Agent's execution path becomes traceable and debuggable at the code level, with checkpointing available to resume or roll back a run. The LLM's output inside a node is still probabilistic, but the non-determinism of "letting the LLM decide the flow" is removed.
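The "Extract → Query database → Generate report" workflow can be sketched as a small state machine in plain Python. This is an illustrative sketch and deliberately does not use LangGraph's API: the `State` enum, the handlers, the in-memory `db` dictionary, and `make_flaky_llm` are all hypothetical names chosen for the example.

```python
import re
from enum import Enum, auto
from typing import Callable


class State(Enum):
    EXTRACT = auto()
    QUERY = auto()
    REPORT = auto()
    DONE = auto()
    FAILED = auto()


# Deterministic transition table: code, not the model, picks the next state.
NEXT_STATE = {State.EXTRACT: State.QUERY, State.QUERY: State.REPORT, State.REPORT: State.DONE}


def extract(ctx: dict, llm: Callable[[str], str]) -> None:
    """Single-task LLM call; the answer must look like an ISO date."""
    answer = llm(f"Extract the date (YYYY-MM-DD) from this text: {ctx['text']}").strip()
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", answer):
        raise ValueError(f"not a date: {answer!r}")
    ctx["date"] = answer


def query(ctx: dict, llm: Callable[[str], str]) -> None:
    """Plain code, no LLM: look the date up in a hypothetical in-memory table."""
    ctx["rows"] = ctx["db"].get(ctx["date"], [])


def report(ctx: dict, llm: Callable[[str], str]) -> None:
    """Single-task LLM call: summarize rows that code has already fetched."""
    ctx["report"] = llm(f"Summarize these amounts in one sentence: {ctx['rows']}")


HANDLERS = {State.EXTRACT: extract, State.QUERY: query, State.REPORT: report}


def run(ctx: dict, llm: Callable[[str], str], max_retries: int = 2) -> State:
    """Drive the workflow; retries and transitions are decided here, in code."""
    state, trace = State.EXTRACT, []
    while state not in (State.DONE, State.FAILED):
        for _ in range(max_retries + 1):
            try:
                HANDLERS[state](ctx, llm)
            except ValueError:
                continue  # validation failed: retry the same node
            trace.append(state.name)
            state = NEXT_STATE[state]
            break
        else:  # every attempt failed validation
            trace.append(f"{state.name}:failed")
            state = State.FAILED
    ctx["trace"] = trace
    return state


def make_flaky_llm() -> Callable[[str], str]:
    """Offline stand-in for a model that gets the first extraction wrong."""
    calls = {"n": 0}

    def llm(prompt: str) -> str:
        calls["n"] += 1
        if prompt.startswith("Extract"):
            return "sometime in March" if calls["n"] == 1 else "2024-03-05"
        return "Total spend was 120.5."

    return llm


ctx = {"text": "Invoice dated 2024-03-05", "db": {"2024-03-05": [100.0, 20.5]}}
print(run(ctx, make_flaky_llm()), ctx["trace"])
```

The bad first answer never reaches the database step: the node validates it, the loop retries, and the recorded trace shows exactly which states ran.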

Performance Comparison
The improvements from this architectural shift are dramatic:
- 80x reduction in per-step task cost: Because each Prompt becomes extremely short and focused, token consumption drops dramatically
- Multi-step reliability jumps from 40% to over 90%: Because flow control no longer depends on the LLM's "understanding" but is guaranteed by deterministic code
The model didn't get smarter — the architecture got it right.
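One way to see why per-step validation and retries can move reliability this much is to work through the arithmetic. The sketch below is a simplified model, not a reproduction of the figures above: it assumes attempts are independent and that a validator always detects a failed attempt, and the 0.9 per-step success rate is an illustrative number, not a measurement from this article. `pipeline_success` is a hypothetical helper name.

```python
def pipeline_success(step_success: float, steps: int, attempts: int = 1) -> float:
    """Probability that every step succeeds when each step gets `attempts` tries.

    Simplifying assumptions: attempts are independent, and a failed attempt
    is always caught by validation so it can be retried.
    """
    per_step = 1.0 - (1.0 - step_success) ** attempts
    return per_step ** steps


# Three chained steps, each 90% reliable, no retries: 0.9 ** 3.
print(round(pipeline_success(0.9, steps=3), 3))              # 0.729
# Same steps, but each validated and retried once: 0.99 ** 3.
print(round(pipeline_success(0.9, steps=3, attempts=2), 3))  # 0.97
```

Errors compound multiplicatively across steps, so catching and retrying them at each node pays off more as the workflow gets longer.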

The DSPy Insight: Don't Write Prompts, Write Programs
This philosophy didn't emerge from thin air. Stanford University's DSPy framework embodies the core principle of "Don't write Prompts, write programs." DSPy abstracts LLM interactions into programmable modules. Developers define input/output Signatures, the framework automatically optimizes the underlying Prompts, and developers only need to focus on the program logic itself.
DSPy (Declarative Self-improving Python) was released by Stanford's NLP lab in 2023. Its core innovation is elevating Prompts from hand-written strings to optimizable program parameters. Developers describe task logic by defining Signatures (declarations of a task's inputs and outputs) and Modules (reasoning strategies such as ChainOfThought and ReAct), while DSPy's optimizers (called teleprompters in early versions) search for effective combinations of instructions and Few-shot examples using a small number of labeled samples and a metric. This means Prompts are no longer the product of manual tuning but are incorporated into a machine learning optimization loop, turning "Prompt engineering" into "program design" and letting the system find an effective expression on its own.
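The idea of treating a Prompt as a parameter to be optimized can be illustrated with a toy search. This sketch is not DSPy's API and is far simpler than what DSPy does: it only picks the best of a few hand-written candidate instructions against a tiny labeled set. `TRAINSET`, `CANDIDATES`, `fake_llm`, `accuracy`, and `compile_prompt` are hypothetical names, and the "model" is an offline stub.

```python
from typing import Callable

# Hypothetical labeled samples: (input text, expected label).
TRAINSET = [
    ("I love it", "positive"),
    ("Terrible service", "negative"),
    ("Great value", "positive"),
]

# Candidate instructions: the "parameter" being searched over.
CANDIDATES = [
    "Describe the text.",
    "Classify the sentiment as positive or negative.",
]


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a model: it only classifies when asked to classify."""
    instruction, _, text = prompt.partition("\nText: ")
    if "Classify" not in instruction:
        return "a short piece of text"
    return "negative" if "terrible" in text.lower() else "positive"


def accuracy(instruction: str, llm: Callable[[str], str], samples: list) -> float:
    """Metric: fraction of samples where the model output equals the label."""
    hits = sum(llm(f"{instruction}\nText: {text}") == label for text, label in samples)
    return hits / len(samples)


def compile_prompt(candidates: list, llm: Callable[[str], str], samples: list) -> str:
    """Pick the candidate instruction that scores best on the labeled data."""
    return max(candidates, key=lambda c: accuracy(c, llm, samples))


best = compile_prompt(CANDIDATES, fake_llm, TRAINSET)
print(best, accuracy(best, fake_llm, TRAINSET))
```

The program logic (inputs, outputs, metric) stays fixed while the wording is chosen by measurement, which is the shift the paragraph above describes.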
This represents an important trend in AI programming: we're moving from "Prompt Engineering" to "Agent Engineering." The former's core skill is writing good natural language instructions; the latter's core skill is designing good system architecture.
Practical Advice for Developers
If you're building AI Agents, here are some practices worth implementing immediately:
- Split your Prompts: Any Prompt exceeding 500 tokens should be decomposed into several small, single-responsibility Prompts
- Orchestrate flows with code: State transitions, conditional branches, and retry loops — this logic must live in code, not in Prompts
- Add validation at every node: Type checking, format validation, and result assertions — ensure each step's output meets expectations before proceeding to the next
- Leverage mature frameworks: LangGraph, DSPy, CrewAI, and other frameworks are all evolving toward control flow first — using the right tools multiplies your effectiveness
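The "add validation at every node" practice can be sketched as a parser that turns raw model output into a typed value or raises. This is one possible shape under stated assumptions: the node is assumed to be asked for JSON with exactly a `date` and an `amount` key, and `Extraction` and `parse_extraction` are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Extraction:
    day: date
    amount: float


def parse_extraction(raw: str) -> Extraction:
    """Validate one node's raw LLM output before the next node may run.

    Raises ValueError on any format, type, or range problem, so the caller
    can retry the node instead of passing bad data downstream.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    # Format validation: exactly the expected keys, nothing extra.
    if not isinstance(data, dict) or set(data) != {"date", "amount"}:
        raise ValueError("expected exactly the keys 'date' and 'amount'")
    # Type checking: bool is a subclass of int, so reject it explicitly.
    amount = data["amount"]
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("amount must be a number")
    if not isinstance(data["date"], str):
        raise ValueError("date must be a string")
    # Result assertion: a domain rule the output must satisfy.
    if amount < 0:
        raise ValueError("amount must be non-negative")
    day = date.fromisoformat(data["date"])  # raises ValueError on a malformed date
    return Extraction(day=day, amount=float(amount))


print(parse_extraction('{"date": "2024-03-05", "amount": 120.5}'))
```

Using a single exception type for every rejection keeps the orchestration code simple: it only has to catch `ValueError` and decide whether to retry.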
The Agent orchestration framework ecosystem is now relatively mature, and each framework has its own focus. LangGraph is graph-based and suits complex Agents that need fine-grained control over loops and conditional branches. CrewAI centers on multi-Agent collaboration, with built-in role assignment and task delegation, and suits scenarios that simulate team collaboration. AutoGen (Microsoft) focuses on multi-Agent conversation orchestration and supports human-in-the-loop intervention. The common trend across these frameworks is "control flow first": returning orchestration authority from the LLM to code, with the LLM completing cognitive tasks only within clearly defined boundaries. When choosing a framework, the core considerations should be its support for state persistence, error retry, and node-level observability.
The future of Agents isn't longer Prompts — it's smarter architecture. When you find yourself writing if...then...else inside a Prompt, stop — that's code's job.
Key Takeaways
- Natural language is a terrible programming language; embedding control flow in Prompts is the root cause of Agent unreliability
- The LLM's "Lost in the Middle" effect means critical instructions buried in an overly long Prompt are more likely to be diluted or lost
- Control flow first architecture treats LLMs as function calls, uses code to define state machines and error handling, with Prompts responsible only for single tasks
- This architectural shift can boost multi-step task reliability from 40% to over 90% while reducing per-step costs by 80x
- Frameworks like DSPy represent the paradigm shift from Prompt Engineering to Agent Engineering, where Prompts themselves become machine-optimizable parameters
- Developers should split long Prompts, orchestrate flows with code, and add type checking and validation at every node