From Traditional RAG to Agentic RAG: Core Principles and Implementation Guide

From traditional RAG to Agentic RAG: giving LLMs autonomous decision-making and multi-turn iterative retrieval
This article systematically explains the fixed pipeline of traditional RAG (Retrieval-Augmented Generation) and its limitations, then introduces how Agentic RAG encapsulates retrieval and file reading capabilities as tools to grant LLMs autonomous decision-making, multi-turn iteration, and dynamic adjustment abilities. Using the open-source project ChatBox as an example, it demonstrates a concrete Agentic RAG implementation based on the LangGraph framework with four core tools, achieving the technological leap from "retrieve and answer" to the "think-act-observe" agent paradigm.
Introduction: The Limitations of Traditional RAG
Have you ever encountered these scenarios: You spent significant time building a RAG system, only to have the model give irrelevant answers; retrieving a bunch of seemingly related but useless content; a user asks "what documents do you have" and the system crashes—because it can only search, not think; when retrieval fails to find an answer, it simply gives up rather than trying a different query or reviewing context for a retry.
The fixed pipeline of traditional RAG can no longer meet the demands of complex scenarios. This article starts from the fundamental principles of traditional RAG, progressively reveals how it evolves into Agentic RAG, and walks you through the core logic of this technological leap using the open-source project ChatBox's implementation.

How Traditional RAG Works
Offline Pipeline: Document Chunking → Vectorization → Storage
The first component of traditional RAG is the offline pipeline, which is independent of user interactions and primarily handles data preparation:
- Document Loading: Reading PDFs, Word documents, etc. into memory
- Text Chunking: Since documents can contain tens of thousands of characters and cannot be fed entirely into an LLM at once, they need to be split into fixed-length segments (e.g., 256 characters) with some overlap between segments to maintain semantic coherence. The choice of chunking strategy directly impacts retrieval quality—chunks that are too large introduce noise, while chunks that are too small may lose contextual semantics. A common overlap ratio is between 10%-20%.
- Vectorization: Using an Embedding model to convert each segment into a fixed-dimensional vector. Embedding models are neural networks that transform text into numerical representations in high-dimensional vector space. Common ones include OpenAI's text-embedding-ada-002 (outputting 1536-dimensional vectors) and the open-source BGE series. These models are pre-trained on large-scale corpora so that semantically similar texts are closer in vector space, enabling semantic retrieval based on cosine similarity.
- Storage: Storing vectors in a vector database (e.g., Chroma DB). Vector databases are specifically optimized for approximate nearest neighbor (ANN) search on high-dimensional vectors, using indexing algorithms like HNSW and IVF to achieve millisecond-level retrieval across millions of records.
Online Pipeline: Retrieval → Assembly → Generation
When a user asks a question, the online pipeline kicks in:
- Query Rewriting: The user's original question may not be suitable for direct retrieval and needs optimization. For example, if a user asks "how to return an item," it might need to be rewritten as "return process return policy return conditions" to improve retrieval recall.
- Dual-Path Retrieval: Simultaneously using BM25 keyword retrieval and vector similarity retrieval, each returning a batch of relevant segments. BM25 is a classic information retrieval algorithm based on term frequency statistics that evaluates relevance by calculating term frequency (TF) and inverse document frequency (IDF) of query terms in documents. The dual-path design is based on complementarity—vector retrieval excels at understanding synonyms and semantic relationships (e.g., "return" and "exchange merchandise"), while BM25 is more reliable for exact term matching (e.g., product model numbers, proper nouns).
- Merging and Reranking: Merging results from both paths and performing reranking to select the most relevant Top-K segments. Reranking typically uses Cross-Encoder models that concatenate the query and document segment for joint encoding, capturing finer-grained semantic interactions compared to the bi-encoder architecture of vector retrieval.
- Prompt Assembly and Generation: Injecting the retrieved segments as context into a prompt template and passing it to the LLM to generate the final answer.
The entire process is unidirectional, fixed, and one-shot—starting from the user's question, going through retrieval, assembly, and generation in a straight line from start to finish.
The Fatal Flaws of Traditional RAG
When the first round of retrieval fails to obtain useful information, traditional RAG is helpless. The model might want to re-retrieve, try a different retrieval approach, or use a tool to gather more context, but the fixed pipeline doesn't allow these operations. For example, when a user asks "what documents are in the knowledge base," traditional RAG simply cannot answer—because its essence is to retrieve partial content and have the model answer based on that, not to understand the full picture of the entire knowledge base.
Agentic RAG: From Pipeline to Intelligent Closed Loop
Core Concept: Tool-Based Autonomous Decision Making
Agentic RAG is a fundamental upgrade to the entire RAG workflow. It encapsulates each component of RAG—query rewriting, vector retrieval, keyword search, file reading, etc.—as callable tools, and allows the LLM to make autonomous decisions, perform multi-turn invocations, and dynamically adjust before generating an answer.
The "tool calling" here relies on the LLM's Function Calling capability, first introduced by OpenAI in June 2023. During inference, the model can output structured function call requests (including function names and parameters), which are executed by external systems and the results returned to the model. Not all models support this feature—currently, models with Function Calling support include the GPT-4 series, Claude series, Qwen, GLM-4, and others. This is why ChatBox needs to first determine whether the model supports tool calling.
In short, Agentic RAG is no longer a straight line from start to finish, but rather a cyclical intelligent agent behavior loop: think → call tools → observe results → think again → act again, until the final answer is generated.
Three Core Capabilities
Implementing Agentic RAG requires the model to possess three types of core capabilities:
- Planning Ability: Manifested in the Chain-of-Thought reasoning process, where the model can plan execution steps and reflect on results. Chain-of-Thought is a prompting technique proposed by Wei et al. in 2022 that guides models to reason step-by-step rather than giving direct answers, significantly improving performance on complex tasks.
- Tool Calling Ability: The model can autonomously decide which tools to invoke, including multi-agent collaboration.
- Multi-Step Iteration Ability: Before providing the final answer, the model can perform multiple tool calls—this is the core of the Agentic concept. Unlike traditional single-turn interactions, the model can execute 5-10 or even more tool calls within a single user request, with each call's results appended to the conversation context for the model's reference.
Workflow Comparison
Traditional RAG: User query → Vector retrieval → Retrieval results → Generate answer (completed in one pass)
Agentic RAG:
- First search round: Performs semantic search using query search, finds low query hit rate
- Observation and rewriting: Model determines the query needs to be rewritten
- Second search round: Re-retrieves using the rewritten query, hits relevant chunks
- Third supplementary round: Retrieves adjacent context via read_file_chunk
- Final generation: Completes the answer based on thoroughly organized information
ChatBox's Agentic RAG Implementation Analysis
Architecture Design: Trading Time for Intelligence
The open-source project ChatBox has a representative implementation logic. When a user question enters the system:
- Determine if the model supports tool calling
- If not supported: Use a prompt to determine whether the question requires retrieval. If not needed, respond directly; if needed, perform semantic search then generate an answer.
- If supported: Register all tools with the model and let it autonomously decide which tools to call.
This design is superior to telling the model in a prompt to "ignore irrelevant context," because it uses two models for decision-making, which theoretically yields better results. This also means Agentic RAG consumes more tokens and inference time—this is exactly what "trading time for intelligence" means: achieving more accurate answers at higher computational cost.
Four Core Tools
ChatBox features four distinctive tools:
| Tool Name | Function | Value |
|---|---|---|
| query_search | Basic semantic retrieval | Fundamental RAG capability |
| list_files | List knowledge base files | Fallback capability for answering "what documents exist" type questions |
| read_file_chunk | Read segments precisely by document ID | Independent of semantic retrieval, can proactively read adjacent chunks for context |
| get_file_metadata | Read file metadata | Obtain file attribute information |
list_files solves the pain point where traditional RAG cannot answer "what documents are in the knowledge base"; read_file_chunk allows the model to proactively read adjacent context when it finds information incomplete, breaking free from sole dependence on semantic similarity retrieval. This design resembles how humans read documents—first searching to find the approximate location, then flipping through surrounding pages to get complete information.
Core Code Implementation
ChatBox implements Agentic RAG based on LangGraph's create_react_agent, with only three core parameters:
- LLM instance: Responsible for reasoning and decision-making
- Tool list: Registers the four core tools
- System prompt: Guides the model on how to use tools and execute step by step
agent = create_react_agent(
model=llm,
tools=[query_search, list_files, read_file_chunk, get_file_metadata],
system_prompt=system_prompt
)
result = agent.invoke({"messages": [user_query]})
LangGraph is a graph-structured orchestration framework developed by the LangChain team. create_react_agent encapsulates the ReAct (Reasoning + Acting) loop logic—a paradigm proposed by Yao et al. in 2022 that enables LLMs to alternate between reasoning and acting. Under the hood, it builds a state graph: the model node is responsible for reasoning and deciding whether to call tools, the tool node executes and returns results, and the two cycle until the model decides to output the final answer.
The code appears simple, yet it grants the model autonomous decision-making and dynamic adjustment capabilities. As the video author noted: "Many wrapper applications in China essentially implement their core logic this way."
Key Differences Summary
| Dimension | Traditional RAG | Agentic RAG |
|---|---|---|
| Workflow | Unidirectional and fixed, one-shot | Iterative loop, multi-step execution |
| Model Role | Used only in the generation phase | Participates in decision-making from the start |
| Retrieval Failure Handling | Directly returns "I don't know" | Rewrites query and retries, supplements context |
| Capability Boundary | Can only answer based on retrieved content | Can answer file listings, precise reading, etc. |
| Core Philosophy | Retrieval + Generation | Planning + Invocation + Reflection + Iteration |
| Cost | Low (single retrieval + generation) | High (multi-turn reasoning + multiple tool calls) |
Conclusion
Traditional RAG embodies the linear thinking of "retrieve and answer," while Agentic RAG represents the intelligent agent paradigm of "think-act-observe." Tools provide capability, but intelligence lies in choice. True Agentic RAG begins with retrieval and succeeds through decision-making.
For LLM application developers, understanding this evolutionary path is crucial. The core work lies not only in how to chunk documents and optimize embeddings, but more importantly in how to design a reasonable toolset that enables the model to autonomously plan and flexibly respond in complex scenarios. It's worth noting that Agentic RAG is not a silver bullet—it introduces higher latency and cost, and traditional RAG may still be the better choice for simple Q&A scenarios. True engineering wisdom lies in selecting the appropriate approach based on scenario complexity.
Related articles
TutorialsCursor + Codex Dual-IDE Collaboration: A Practical Methodology for Open-Source Project Customization
A complete methodology for open-source project customization based on real-world experience, detailing the Cursor+Codex dual-IDE workflow, seven-stage process, MVP validation, and AI source code reading techniques.
TutorialsCursor Multi-Agent in Practice: Building a Full-Stack Next.js Blog in 50 Minutes
Build a full-stack blog in 50 minutes using Cursor IDE's multi-Agent mode with Next.js, Clerk auth, and Supabase. Learn the 4-phase AI Agent workflow and key integration pitfalls.
TutorialsBuilding an AI Software Factory from Scratch: A Cursor Engineer's Hands-On Experience with Multi-Agent Collaboration
Cursor engineer Eric shares practical insights on building an AI software factory: automation levels, guardrail design, parallel Agent management, and scaling to 1000+ Agents for 24/7 development.