From Traditional RAG to Agentic RAG: Core Principles and Implementation Guide

Introduction: The Limitations of Traditional RAG

Have you ever encountered these scenarios: You spent significant time building a RAG system, only to have the model give irrelevant answers; retrieving a bunch of seemingly related but useless content; a user asks "what documents do you have" and the system crashes—because it can only search, not think; when retrieval fails to find an answer, it simply gives up rather than trying a different query or reviewing context for a retry.

The fixed pipeline of traditional RAG can no longer meet the demands of complex scenarios. This article starts from the fundamental principles of traditional RAG, progressively reveals how it evolves into Agentic RAG, and walks you through the core logic of this technological leap using the open-source project ChatBox's implementation.

Agentic RAG Technical Evolution

How Traditional RAG Works

Offline Pipeline: Document Chunking → Vectorization → Storage

The first component of traditional RAG is the offline pipeline, which is independent of user interactions and primarily handles data preparation:

Document Loading: Reading PDFs, Word documents, etc. into memory
Text Chunking: Since documents can contain tens of thousands of characters and cannot be fed entirely into an LLM at once, they need to be split into fixed-length segments (e.g., 256 characters) with some overlap between segments to maintain semantic coherence. The choice of chunking strategy directly impacts retrieval quality—chunks that are too large introduce noise, while chunks that are too small may lose contextual semantics. A common overlap ratio is between 10%-20%.
Vectorization: Using an Embedding model to convert each segment into a fixed-dimensional vector. Embedding models are neural networks that transform text into numerical representations in high-dimensional vector space. Common ones include OpenAI's text-embedding-ada-002 (outputting 1536-dimensional vectors) and the open-source BGE series. These models are pre-trained on large-scale corpora so that semantically similar texts are closer in vector space, enabling semantic retrieval based on cosine similarity.
Storage: Storing vectors in a vector database (e.g., Chroma DB). Vector databases are specifically optimized for approximate nearest neighbor (ANN) search on high-dimensional vectors, using indexing algorithms like HNSW and IVF to achieve millisecond-level retrieval across millions of records.

Online Pipeline: Retrieval → Assembly → Generation

When a user asks a question, the online pipeline kicks in:

Query Rewriting: The user's original question may not be suitable for direct retrieval and needs optimization. For example, if a user asks "how to return an item," it might need to be rewritten as "return process return policy return conditions" to improve retrieval recall.
Dual-Path Retrieval: Simultaneously using BM25 keyword retrieval and vector similarity retrieval, each returning a batch of relevant segments. BM25 is a classic information retrieval algorithm based on term frequency statistics that evaluates relevance by calculating term frequency (TF) and inverse document frequency (IDF) of query terms in documents. The dual-path design is based on complementarity—vector retrieval excels at understanding synonyms and semantic relationships (e.g., "return" and "exchange merchandise"), while BM25 is more reliable for exact term matching (e.g., product model numbers, proper nouns).
Merging and Reranking: Merging results from both paths and performing reranking to select the most relevant Top-K segments. Reranking typically uses Cross-Encoder models that concatenate the query and document segment for joint encoding, capturing finer-grained semantic interactions compared to the bi-encoder architecture of vector retrieval.
Prompt Assembly and Generation: Injecting the retrieved segments as context into a prompt template and passing it to the LLM to generate the final answer.

The entire process is unidirectional, fixed, and one-shot—starting from the user's question, going through retrieval, assembly, and generation in a straight line from start to finish.

The Fatal Flaws of Traditional RAG

When the first round of retrieval fails to obtain useful information, traditional RAG is helpless. The model might want to re-retrieve, try a different retrieval approach, or use a tool to gather more context, but the fixed pipeline doesn't allow these operations. For example, when a user asks "what documents are in the knowledge base," traditional RAG simply cannot answer—because its essence is to retrieve partial content and have the model answer based on that, not to understand the full picture of the entire knowledge base.

Agentic RAG: From Pipeline to Intelligent Closed Loop

Core Concept: Tool-Based Autonomous Decision Making

Agentic RAG is a fundamental upgrade to the entire RAG workflow. It encapsulates each component of RAG—query rewriting, vector retrieval, keyword search, file reading, etc.—as callable tools, and allows the LLM to make autonomous decisions, perform multi-turn invocations, and dynamically adjust before generating an answer.

The "tool calling" here relies on the LLM's Function Calling capability, first introduced by OpenAI in June 2023. During inference, the model can output structured function call requests (including function names and parameters), which are executed by external systems and the results returned to the model. Not all models support this feature—currently, models with Function Calling support include the GPT-4 series, Claude series, Qwen, GLM-4, and others. This is why ChatBox needs to first determine whether the model supports tool calling.

In short, Agentic RAG is no longer a straight line from start to finish, but rather a cyclical intelligent agent behavior loop: think → call tools → observe results → think again → act again, until the final answer is generated.

Three Core Capabilities

Implementing Agentic RAG requires the model to possess three types of core capabilities:

Planning Ability: Manifested in the Chain-of-Thought reasoning process, where the model can plan execution steps and reflect on results. Chain-of-Thought is a prompting technique proposed by Wei et al. in 2022 that guides models to reason step-by-step rather than giving direct answers, significantly improving performance on complex tasks.
Tool Calling Ability: The model can autonomously decide which tools to invoke, including multi-agent collaboration.
Multi-Step Iteration Ability: Before providing the final answer, the model can perform multiple tool calls—this is the core of the Agentic concept. Unlike traditional single-turn interactions, the model can execute 5-10 or even more tool calls within a single user request, with each call's results appended to the conversation context for the model's reference.

Workflow Comparison

Traditional RAG: User query → Vector retrieval → Retrieval results → Generate answer (completed in one pass)

Agentic RAG:

First search round: Performs semantic search using query search, finds low query hit rate
Observation and rewriting: Model determines the query needs to be rewritten
Second search round: Re-retrieves using the rewritten query, hits relevant chunks
Third supplementary round: Retrieves adjacent context via read_file_chunk
Final generation: Completes the answer based on thoroughly organized information

ChatBox's Agentic RAG Implementation Analysis

Architecture Design: Trading Time for Intelligence

The open-source project ChatBox has a representative implementation logic. When a user question enters the system:

Determine if the model supports tool calling
- If not supported: Use a prompt to determine whether the question requires retrieval. If not needed, respond directly; if needed, perform semantic search then generate an answer.
- If supported: Register all tools with the model and let it autonomously decide which tools to call.

This design is superior to telling the model in a prompt to "ignore irrelevant context," because it uses two models for decision-making, which theoretically yields better results. This also means Agentic RAG consumes more tokens and inference time—this is exactly what "trading time for intelligence" means: achieving more accurate answers at higher computational cost.

Four Core Tools

ChatBox features four distinctive tools:

Tool Name	Function	Value
query_search	Basic semantic retrieval	Fundamental RAG capability
list_files	List knowledge base files	Fallback capability for answering "what documents exist" type questions
read_file_chunk	Read segments precisely by document ID	Independent of semantic retrieval, can proactively read adjacent chunks for context
get_file_metadata	Read file metadata	Obtain file attribute information

list_files solves the pain point where traditional RAG cannot answer "what documents are in the knowledge base"; read_file_chunk allows the model to proactively read adjacent context when it finds information incomplete, breaking free from sole dependence on semantic similarity retrieval. This design resembles how humans read documents—first searching to find the approximate location, then flipping through surrounding pages to get complete information.

Core Code Implementation

ChatBox implements Agentic RAG based on LangGraph's create_react_agent, with only three core parameters:

LLM instance: Responsible for reasoning and decision-making
Tool list: Registers the four core tools
System prompt: Guides the model on how to use tools and execute step by step

agent = create_react_agent(
    model=llm,
    tools=[query_search, list_files, read_file_chunk, get_file_metadata],
    system_prompt=system_prompt
)
result = agent.invoke({"messages": [user_query]})

LangGraph is a graph-structured orchestration framework developed by the LangChain team. create_react_agent encapsulates the ReAct (Reasoning + Acting) loop logic—a paradigm proposed by Yao et al. in 2022 that enables LLMs to alternate between reasoning and acting. Under the hood, it builds a state graph: the model node is responsible for reasoning and deciding whether to call tools, the tool node executes and returns results, and the two cycle until the model decides to output the final answer.

The code appears simple, yet it grants the model autonomous decision-making and dynamic adjustment capabilities. As the video author noted: "Many wrapper applications in China essentially implement their core logic this way."

Key Differences Summary

Dimension	Traditional RAG	Agentic RAG
Workflow	Unidirectional and fixed, one-shot	Iterative loop, multi-step execution
Model Role	Used only in the generation phase	Participates in decision-making from the start
Retrieval Failure Handling	Directly returns "I don't know"	Rewrites query and retries, supplements context
Capability Boundary	Can only answer based on retrieved content	Can answer file listings, precise reading, etc.
Core Philosophy	Retrieval + Generation	Planning + Invocation + Reflection + Iteration
Cost	Low (single retrieval + generation)	High (multi-turn reasoning + multiple tool calls)

Conclusion

Traditional RAG embodies the linear thinking of "retrieve and answer," while Agentic RAG represents the intelligent agent paradigm of "think-act-observe." Tools provide capability, but intelligence lies in choice. True Agentic RAG begins with retrieval and succeeds through decision-making.

For LLM application developers, understanding this evolutionary path is crucial. The core work lies not only in how to chunk documents and optimize embeddings, but more importantly in how to design a reasonable toolset that enables the model to autonomously plan and flexibly respond in complex scenarios. It's worth noting that Agentic RAG is not a silver bullet—it introduces higher latency and cost, and traditional RAG may still be the better choice for simple Q&A scenarios. True engineering wisdom lies in selecting the appropriate approach based on scenario complexity.