Getting Started with RAG: A Complete Guide from LLM Hallucinations to Retrieval-Augmented Generation
RAG improves LLM answers by retrieving external knowledge, mitigating hallucination and stale-knowledge issues.
This article systematically introduces RAG (Retrieval-Augmented Generation). LLMs suffer from three major pain points: hallucination, outdated training data, and insufficient domain depth. RAG addresses these by retrieving relevant information from an external knowledge base before answering and submitting it alongside the user's question, essentially turning the task into an "open-book exam" for the model. Its tech stack covers five key components: document chunking, Embedding, vector databases, retrieval and ranking, and prompt assembly. Graph RAG is widely regarded as one of the more effective advanced approaches, especially for questions that need multi-hop reasoning.
Why Do We Need RAG? Three Major Pain Points of Large Language Models
Before diving into RAG, we need to understand the core problems facing current Large Language Models (LLMs). Despite the ever-growing capabilities of models like GPT and DeepSeek, they still have three unavoidable shortcomings in practical applications.
Hallucination: Confidently Making Things Up
"Hallucination" is one of the most criticized problems with large models. Mainstream LLMs are built on the Transformer architecture and trained to predict the next token, so they are fundamentally probabilistic generation models: their goal is to produce text that "sounds human," not to guarantee factual accuracy.
Transformer Architecture and Probabilistic Generation Mechanism
Since Google's 2017 paper Attention Is All You Need, the Transformer architecture has become the cornerstone of modern large language models. Its core mechanism, "Self-Attention," lets the model dynamically weigh the importance of every other token in the context when processing each token, capturing long-range semantic dependencies. Mainstream models like GPT, BERT, and DeepSeek all build on this architecture. During generation, a model such as GPT or DeepSeek is essentially performing "next token prediction": given the preceding context, it computes a probability distribution over the vocabulary and picks the next token from it, either by taking the most probable token or by sampling. This mechanism rewards plausible, fluent continuations rather than "truthfulness," which is a fundamental source of hallucinations.
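To make "next token prediction" concrete, here is a minimal sketch using only the Python standard library. The vocabulary and the scores (logits) are invented for illustration and do not come from any real model; the point is only that the selection step looks at probability, never at truth.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical scores after the prompt
# "Lin Daiyu uproots the willow tree in chapter ..."
vocab = ["seven", "twelve", "[that never happened]"]
logits = [2.0, 1.5, 0.1]

probs = softmax(logits)
greedy = vocab[probs.index(max(probs))]          # always the top-scoring token
sampled = sample_next_token(vocab, logits, rng=random.Random(0))
print(greedy, sampled)
```

In this toy setup the factually correct continuation has the lowest score, so both greedy decoding and sampling will usually produce a fluent but fabricated chapter number.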
A classic example: if you ask an LLM "In which chapter does Lin Daiyu uproot the willow tree?" (this is actually a story about Lu Zhishen from Water Margin, not Lin Daiyu from Dream of the Red Chamber), the model won't correct your mistake. Instead, it will vividly fabricate a specific chapter and plot description.
Hallucinations stem from two main causes:
- Data noise: Training data contains vast amounts of unverified information, including rumors and fictional content. For instance, if a fan fiction happens to describe "Lin Daiyu uprooting a willow tree," the model might learn it as factual knowledge.
- Probability-first: The model is trained and decoded to produce likely continuations of the text, so it favors what is statistically plausible over what is factually correct.
Slow Data Updates: Knowledge Always Has an Expiration Date
Every large model's training data has a cutoff date. If a model's training data ends at a certain month, it knows nothing about events that occurred afterward. For scenarios requiring high timeliness like stock prices and news, this limitation is nearly fatal.
Limited Domain Expertise: Broad but Not Deep
Large models prioritize knowledge breadth over depth during training. For highly specialized fields like medicine and law, a model might handle most general questions (roughly 80%, as a rule of thumb), but the remaining portion that requires deep professional understanding is often insufficiently covered.

What Is RAG? An "Open-Book Exam"
RAG stands for Retrieval-Augmented Generation, an acronym formed from the first letters of each word. Its core idea is highly intuitive: before the LLM answers a question, it first retrieves relevant information from an external knowledge base, then submits the retrieved results along with the user's question to the LLM, allowing the model to generate more accurate answers based on these "reference materials."
RAG Workflow Explained
Let's illustrate with a simple scenario: suppose you ask an LLM "How many vacation days does our company offer?" — the LLM obviously cannot know this kind of internal company information.
RAG's solution involves four steps:
- Build a knowledge base: Pre-process and store company documents, policies, and other materials in a dedicated knowledge base.
- Retrieve and match: When a user asks a question, the system first searches the knowledge base for content related to "company vacation days," potentially matching multiple results like "The company offers 8 vacation days, increasing to 12 days after 5 years of service."
- Augmented generation: Send the matched knowledge snippets along with the user's original question to the LLM. The model now has two information sources — the user's question and the reference materials from the knowledge base.
- Output the answer: The LLM generates an accurate, evidence-based response using this information.
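The four steps above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the knowledge base entries are hypothetical, retrieval is plain word overlap rather than vector search, and the final LLM call is left out because it depends on whichever model API you use.

```python
import re

# Step 1: a toy knowledge base (hypothetical policy snippets).
KNOWLEDGE_BASE = [
    "The company offers 8 vacation days, increasing to 12 days after 5 years of service.",
    "Expense reports must be submitted within 30 days of purchase.",
    "The office is open from 9 am to 6 pm on weekdays.",
]

def tokenize(text):
    """Lowercase the text and return its set of words and numbers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Step 2: rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, snippets):
    """Step 3: combine the retrieved snippets with the original question."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer using only the reference material below.\n"
        f"Reference material:\n{context}\n"
        f"Question: {question}"
    )

question = "How many vacation days does our company offer?"
snippets = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, snippets)
print(prompt)  # Step 4 would send this prompt to an LLM of your choice
```

Word overlap is used here only to keep the sketch dependency-free; a production system would normally swap `retrieve` for an embedding-based search.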
To put it figuratively: RAG transforms the LLM's "closed-book exam" into an "open-book exam." When encountering a question it can't answer, it can look up an "encyclopedia" to find the answer, then formulate a response in its own words.
Three Core Values of RAG
RAG primarily addresses three key objectives:
- Enhanced factuality: By incorporating external knowledge, answers become verifiable
- Improved timeliness: The knowledge base can be updated at any time, unrestricted by the model's training data cutoff
- Reduced hallucinations: The model answers based on retrieved real data rather than fabricating based on "probability"
It's important to emphasize that RAG cannot 100% eliminate hallucinations and knowledge limitations — its role is to reduce and mitigate these problems. This is a pragmatic understanding and a direction that requires continuous optimization in engineering practice.
Technical Implementation of RAG: From Paper to Production
The RAG concept originates from the paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Published by Facebook AI Research (now Meta AI) in 2020, this paper addressed "knowledge-intensive NLP tasks" (such as open-domain QA and fact verification) by proposing a paradigm that combines parametric memory (knowledge stored in model weights) with non-parametric memory (external document retrieval). The original paper used DPR (Dense Passage Retrieval) as the retriever and BART as the generator, surpassing pure parametric models on multiple knowledge-intensive benchmarks and establishing RAG as the mainstream approach for LLM knowledge augmentation. The core architecture proposed in the paper remains the foundational paradigm for RAG systems today.
Five Key Technical Components
To implement a complete RAG system, you need to master the following core technologies:
- Document Processing and Chunking: Splitting raw documents into appropriately sized text segments for subsequent retrieval. Chunking strategy is one of the key determinants of RAG system performance: chunks that are too large dilute the relevant content, while chunks that are too small may break complete semantic units. Common strategies include fixed character count chunking, paragraph-based chunking, recursive character chunking (the splitter LangChain recommends for generic text), and semantic chunking based on Embedding similarity. In practice, chunk_size is typically set between 256 and 1024 tokens, with approximately 10% to 20% overlap to prevent critical information from being lost at chunk boundaries.
- Embedding (Vector Embedding): Converting text into high-dimensional vector representations that capture its meaning. The core idea is that semantically similar texts sit close together in vector space. For example, the vector distance between "Apple phone" and "iPhone" would be much smaller than between "Apple phone" and "vacation policy." Modern Embedding models (such as OpenAI's text-embedding-ada-002 and BAAI's BGE series) output floating-point vectors with several hundred to a few thousand dimensions (1536 for text-embedding-ada-002, for instance), supporting semantic-level fuzzy matching rather than simple keyword search.
- Vector Database: Systems specifically designed for storing and retrieving high-dimensional vectors, serving as the core infrastructure of the RAG tech stack. Unlike traditional relational databases that query by exact values, the core operation of a vector database is "Approximate Nearest Neighbor" (ANN) search, which uses indexing algorithms like HNSW and IVF to bring similarity search down to millisecond-level response times. Common options include Chroma (lightweight, suitable for local development), FAISS (an open-source similarity-search library from Meta, known for high performance), Milvus (enterprise-grade distributed deployment), and Pinecone (cloud-native managed service).
- Retrieval and Ranking: Finding the most relevant knowledge snippets for the user's question. Advanced approaches include multi-path recall (combining several retrieval methods), reranking, and other optimization strategies.
- Prompt Assembly: Combining retrieved results with the user's question into a complete prompt for the LLM to generate an answer.
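The five components can be wired together in a small, dependency-free sketch. Everything here is a simplifying assumption made for illustration: chunk sizes are counted in words rather than tokens, `embed` is a word-count stand-in for a real Embedding model, and the "vector database" is a Python list searched exhaustively instead of with an ANN index such as HNSW.

```python
import math
import re
from collections import Counter

def chunk_text(text, chunk_size=8, overlap=2):
    """Split text into windows of chunk_size words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, index, top_k=2):
    """Exact nearest-neighbor search; real vector databases approximate this."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

document = (
    "Employees receive 8 vacation days per year. "
    "After 5 years of service the allowance rises to 12 days. "
    "Expense reports must be filed within 30 days of purchase."
)
chunks = chunk_text(document)
index = [(c, embed(c)) for c in chunks]  # the in-memory "vector database"
hits = search("vacation days per year", index, top_k=1)
prompt = f"Context:\n{hits[0]}\n\nQuestion: How many vacation days do employees get?"
print(prompt)
```

Note how consecutive chunks repeat their boundary words: that overlap is what keeps a fact from being cut in half when it straddles two chunks.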
RAG vs. Knowledge Graphs
Interestingly, the "knowledge base" in RAG is a completely different concept from a "knowledge graph." A knowledge graph is a structured knowledge representation that builds semantic networks through entities and relationships, while RAG's knowledge base is more of an unstructured or semi-structured document collection.
However, combining knowledge graphs with RAG (Graph RAG) is widely regarded as one of the more effective advanced RAG approaches. Graph RAG was popularized by Microsoft Research in 2024, and its core advantage is handling complex questions that require multi-hop reasoning. For example, "Which university did Company A's CEO graduate from?" requires first finding who the CEO is and then querying that person's educational background, which is difficult to accomplish with a single vector retrieval. Graph RAG pre-builds an entity-relationship graph and performs multi-hop traversal along graph paths during retrieval, capturing deeper semantic associations. It typically relies on a graph database such as Neo4j for storage.
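The multi-hop idea can be shown with a tiny in-memory graph. This is only an illustration of traversing relations hop by hop, not Microsoft's GraphRAG pipeline or a Neo4j query; the entities, the relation names, and the `multi_hop` helper are all hypothetical.

```python
# Hypothetical entity-relationship graph: (subject, relation) -> object.
GRAPH = {
    ("Company A", "ceo"): "Alice Zhang",
    ("Alice Zhang", "graduated_from"): "Example University",
    ("Company B", "ceo"): "Bob Li",
}

def multi_hop(start, relations, graph):
    """Follow a chain of relations from a start entity and return the path taken."""
    path = [start]
    current = start
    for relation in relations:
        current = graph.get((current, relation))
        if current is None:
            return None  # the chain breaks: the graph lacks this edge
        path.append(current)
    return path

# "Which university did Company A's CEO graduate from?" needs two hops.
path = multi_hop("Company A", ["ceo", "graduated_from"], GRAPH)
print(" -> ".join(path))
```

Each hop uses the result of the previous one as its starting point, which is exactly what a single similarity search over text chunks cannot do on its own.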
Learning Path and Advanced Directions
Starting from RAG's basic concepts, the subsequent learning path typically includes the following stages:
- LangChain Framework Introduction: The most widely used framework in LLM application development, providing a complete toolchain for RAG
- Advanced RAG Optimization Techniques: Including multi-path recall, reranking, query rewriting, and other optimization strategies
- Graph RAG Practice: RAG based on knowledge graphs, improving retrieval precision through structured knowledge
- Application Platform Hands-on: Using low-code AI application platforms like Dify
- Enterprise Project Deployment: Complete practice from small demos to production-grade RAG systems
While RAG's logic is simple, doing it well in real projects requires meticulous refinement at every stage — document processing, retrieval strategies, prompt engineering, and more. This is why RAG is both an essential path for getting started with LLM application development and a technical direction worth continuous deep investment.
Key Takeaways
- LLMs have three core problems — hallucination, slow data updates, and limited domain expertise — which gave rise to RAG technology
- RAG (Retrieval-Augmented Generation) builds external knowledge bases, retrieves relevant information before answering, and submits retrieved results along with the question to the LLM, significantly improving answer accuracy and timeliness
- RAG's core tech stack includes document chunking, Embedding, vector databases, retrieval and ranking, and prompt assembly, with rich engineering optimization opportunities at each stage
- RAG can reduce but cannot 100% eliminate LLM hallucinations, requiring continuous optimization in engineering practice
- Graph RAG, which combines knowledge graphs with RAG, is regarded as one of the more effective advanced RAG approaches, particularly for complex knowledge QA scenarios requiring multi-hop reasoning