RAG Recall Rate Optimization: A Full-Pipeline Funnel Engineering Breakdown from Data Ingestion to Reranking

A full-pipeline funnel engineering approach to optimizing RAG recall from data ingestion to reranking.
This article systematically breaks down RAG recall optimization into four layers: data ingestion (semantic chunking, metadata enrichment, domain fine-tuning), query processing (multi-path rewriting, HyDE, query decomposition), retrieval strategy (hybrid search, parent-child documents, Graph RAG), and reranking/filtering (Cross-Encoder, context compression, evaluation metrics). It provides a structured interview framework for answering RAG optimization questions.
When asked "What do you do when RAG system recall is too low?" in an interview, many people's first instinct is to swap in a stronger vector model. But if that's your only answer, the interviewer will likely mentally label you as someone who "only knows how to call APIs."
True recall rate optimization is a full-pipeline funnel engineering effort covering data ingestion, query processing, retrieval strategy, and reranking/filtering. In his LLM interview series on Bilibili, creator Peng Yu walked through a complete RAG architecture diagram and broke this problem down into four layers with clear logic and strong practical relevance. This article provides a systematic summary and in-depth interpretation based on that video.
Data Ingestion Layer: Without a Solid Foundation, Recall Is Impossible
The most painful problem in RAG development often isn't that the LLM can't generate text—it's that it "talks nonsense with its eyes closed." The root cause is that the reference documents fed to the model are simply wrong—like taking an open-book exam where you flip through the entire book but can't find the page with the answer. Naturally, whatever you write will be incorrect.

There are three key optimization points at the data ingestion layer:
Replace Fixed-Character Chunking with Intelligent Semantic Chunking
Many developers still chunk documents by fixed character count, which cuts off sentences mid-thought and destroys semantic integrity. The correct approach is to use intelligent semantic chunking, or employ a sliding window strategy with overlapping context regions, ensuring each document chunk carries complete semantic information.
Metadata Enrichment for Better Filtering Efficiency
Don't just store vectors. Tag documents with rich metadata—publication date, author, business module, etc. During retrieval, perform a round of metadata filtering first, which immediately eliminates a large volume of irrelevant information. According to the video, this step alone can help the model exclude approximately 90% of irrelevant documents, naturally boosting recall.
Domain Fine-tuning of Vector Models
If your business is full of specialized terminology, general-purpose embedding models will inevitably struggle. You need to perform Contrastive Learning Fine-tuning using domain-specific corpora, enabling the model to truly understand your industry's professional vocabulary.
Query Processing Layer: Is the User's Question Actually Searchable?
Users tend to be "lazy"—their questions might be just a few words and often don't express what they actually mean. If you search directly with the raw query, the results will be predictably poor.

Multi-Path Query Rewriting to Expand Recall Coverage
Have the LLM rewrite the user's original question into 3-5 synonymous but differently phrased versions, then run concurrent retrieval on all paths. As long as one path hits the target document, overall recall succeeds. This is a classic "cast a wide net" strategy to combat semantic ambiguity.
HyDE (Hypothetical Document Embeddings) to Bridge the Semantic Gap
This is an extremely clever technique: first have the LLM "blindly guess" an answer to the user's question. Although this answer may be inaccurate, its vector features are highly similar to the actual target document. Using this hypothetical answer's vector to retrieve real documents effectively bridges the semantic gap—the disconnect between how users phrase questions and how documents express information is elegantly bridged by this intermediate step.
Query Decomposition for Complex Questions
For complex questions like "Who is the legal representative of so-and-so's company?"—this actually involves two steps: first find the person's company, then find the company's legal representative. You must perform Query Decomposition, breaking one complex question into multiple sub-questions, retrieving and reasoning step by step.
Retrieval Strategy Layer: Vector Search Isn't a Silver Bullet
Many people have blind faith in vector search, believing "semantics is king." But in reality, if a user searches for a specific equipment model number (like "XJ-2024-A3"), vector search will likely return a bunch of results that "look similar" but are completely wrong.

Hybrid Search: Semantic Matching and Keyword Exact Matching in Parallel
You must implement Hybrid Search: vector search handles semantic understanding, while BM25 handles keyword exact matching. Combined, the system can both "understand what you mean" and "match your exact terms." This has been repeatedly validated in engineering practice as superior to any single retrieval method.
Parent-Child Document Architecture: Search Small, Recall Big
This is a strategy strongly recommended in the video: during retrieval, use finely-chunked child documents for matching to ensure retrieval precision; but when feeding the LLM, send the complete parent document to ensure contextual completeness. This is the so-called "search small, recall big" approach—both precise and comprehensive.
Graph RAG for Multi-Hop Reasoning
When relationships between documents are particularly complex, you need to introduce knowledge graphs. Through entity and relationship graph traversal, you can pull back related information hidden in corners. This is especially effective for multi-hop reasoning questions.
Reranking and Filtering Layer: The Final Gate of the Funnel
The previous layers may have recalled 50 documents to avoid missing the answer. But not all 50 are "good candidates"—they must pass through a final fine-grained screening.

Cross-Encoder Reranking
Use a high-precision cross-encoder model to rescore the recalled documents and select the top 5. Although cross-encoders are slow at inference, they perform deep interaction computation between query and document, achieving ranking precision far beyond bi-encoder models. This embodies the "wide entry, strict exit" funnel philosophy.
Dynamic Context Compression to Address Attention Loss
Even after obtaining the top 5 documents, you can't just stuff them all into the LLM. LLMs suffer from the Lost in the Middle problem—when input text is too long, the model's attention to middle sections drops significantly. Therefore, context compression is needed to remove noise from documents and retain only the most relevant essence.
Closed-Loop Evaluation Mechanisms Drive Continuous Optimization
The most critical point: don't judge by feeling that "recall seems better." You must establish a quantitative evaluation system, tracking metrics like Recall@K, MRR, and NDCG. Optimization without metrics is running blind.
Interview Answer Framework Summary
When an interviewer asks "What do you do when RAG recall is low?", never give just one point. You should explain that this is a full-pipeline funnel engineering problem:
- Early stage: Rely on data infrastructure (intelligent chunking, metadata enrichment, model fine-tuning) and query rewriting (multi-path rewriting, HyDE, query decomposition) as the baseline guarantee
- Middle stage: Rely on hybrid search and parent-child document architecture to boost efficiency
- Late stage: Rely on reranking models and context compression for precision filtering
Excellent engineers never put blind faith in any single model or technique—they use a combination of strategies to combat system uncertainty. This kind of systematic thinking is what interviewers truly want to assess.
Related articles

AI Programming Learning Roadmap: A Complete Six-Stage Guide from Beginner to Expert
A systematic breakdown of the six-stage AI programming learning roadmap, from zero-code start to mastering Cursor and professional tools, methodology frameworks, advanced patterns, and project practice.

Deep Dive into Devin's Background Agent Architecture: Behind the 80% AI-Committed Code
Deep analysis of Devin's background agent architecture: brain-sandbox separation, environment setup, MCP integration, memory systems, and multi-agent collaboration challenges.

Claude Code in Practice: An In-Depth Efficiency Comparison Between Claude and DeepSeek for Programming
A hands-on comparison of Claude vs DeepSeek V4 for AI programming: code quality, development efficiency, and cost differences. DeepSeek costs 1/6 to 1/10 of Claude but requires 1-2x more time.