RAG Recall Rate Optimization: A Full-Pipeline Funnel Engineering Breakdown from Data Ingestion to Reranking

When asked "What do you do when RAG system recall is too low?" in an interview, many people's first instinct is to swap in a stronger vector model. But if that's your only answer, the interviewer will likely mentally label you as someone who "only knows how to call APIs."

True recall rate optimization is a full-pipeline funnel engineering effort covering data ingestion, query processing, retrieval strategy, and reranking/filtering. In his LLM interview series on Bilibili, creator Peng Yu walked through a complete RAG architecture diagram and broke this problem down into four layers with clear logic and strong practical relevance. This article provides a systematic summary and in-depth interpretation based on that video.

Data Ingestion Layer: Without a Solid Foundation, Recall Is Impossible

The most painful problem in RAG development often isn't that the LLM can't generate text—it's that it "talks nonsense with its eyes closed." The root cause is that the reference documents fed to the model are simply wrong—like taking an open-book exam where you flip through the entire book but can't find the page with the answer. Naturally, whatever you write will be incorrect.

RAG System Full-Pipeline Architecture Diagram

There are three key optimization points at the data ingestion layer:

Replace Fixed-Character Chunking with Intelligent Semantic Chunking

Many developers still chunk documents by fixed character count, which cuts off sentences mid-thought and destroys semantic integrity. The correct approach is to use intelligent semantic chunking, or employ a sliding window strategy with overlapping context regions, ensuring each document chunk carries complete semantic information.

Metadata Enrichment for Better Filtering Efficiency

Don't just store vectors. Tag documents with rich metadata—publication date, author, business module, etc. During retrieval, perform a round of metadata filtering first, which immediately eliminates a large volume of irrelevant information. According to the video, this step alone can help the model exclude approximately 90% of irrelevant documents, naturally boosting recall.

Domain Fine-tuning of Vector Models

If your business is full of specialized terminology, general-purpose embedding models will inevitably struggle. You need to perform Contrastive Learning Fine-tuning using domain-specific corpora, enabling the model to truly understand your industry's professional vocabulary.

Query Processing Layer: Is the User's Question Actually Searchable?

Users tend to be "lazy"—their questions might be just a few words and often don't express what they actually mean. If you search directly with the raw query, the results will be predictably poor.

Query Processing Layer Architecture

Multi-Path Query Rewriting to Expand Recall Coverage

Have the LLM rewrite the user's original question into 3-5 synonymous but differently phrased versions, then run concurrent retrieval on all paths. As long as one path hits the target document, overall recall succeeds. This is a classic "cast a wide net" strategy to combat semantic ambiguity.

HyDE (Hypothetical Document Embeddings) to Bridge the Semantic Gap

This is an extremely clever technique: first have the LLM "blindly guess" an answer to the user's question. Although this answer may be inaccurate, its vector features are highly similar to the actual target document. Using this hypothetical answer's vector to retrieve real documents effectively bridges the semantic gap—the disconnect between how users phrase questions and how documents express information is elegantly bridged by this intermediate step.

Query Decomposition for Complex Questions

For complex questions like "Who is the legal representative of so-and-so's company?"—this actually involves two steps: first find the person's company, then find the company's legal representative. You must perform Query Decomposition, breaking one complex question into multiple sub-questions, retrieving and reasoning step by step.

Retrieval Strategy Layer: Vector Search Isn't a Silver Bullet

Many people have blind faith in vector search, believing "semantics is king." But in reality, if a user searches for a specific equipment model number (like "XJ-2024-A3"), vector search will likely return a bunch of results that "look similar" but are completely wrong.

Retrieval Strategy Layer Architecture

Hybrid Search: Semantic Matching and Keyword Exact Matching in Parallel

You must implement Hybrid Search: vector search handles semantic understanding, while BM25 handles keyword exact matching. Combined, the system can both "understand what you mean" and "match your exact terms." This has been repeatedly validated in engineering practice as superior to any single retrieval method.

Parent-Child Document Architecture: Search Small, Recall Big

This is a strategy strongly recommended in the video: during retrieval, use finely-chunked child documents for matching to ensure retrieval precision; but when feeding the LLM, send the complete parent document to ensure contextual completeness. This is the so-called "search small, recall big" approach—both precise and comprehensive.

Graph RAG for Multi-Hop Reasoning

When relationships between documents are particularly complex, you need to introduce knowledge graphs. Through entity and relationship graph traversal, you can pull back related information hidden in corners. This is especially effective for multi-hop reasoning questions.

Reranking and Filtering Layer: The Final Gate of the Funnel

The previous layers may have recalled 50 documents to avoid missing the answer. But not all 50 are "good candidates"—they must pass through a final fine-grained screening.

Reranking and Filtering Layer Architecture

Cross-Encoder Reranking

Use a high-precision cross-encoder model to rescore the recalled documents and select the top 5. Although cross-encoders are slow at inference, they perform deep interaction computation between query and document, achieving ranking precision far beyond bi-encoder models. This embodies the "wide entry, strict exit" funnel philosophy.

Dynamic Context Compression to Address Attention Loss

Even after obtaining the top 5 documents, you can't just stuff them all into the LLM. LLMs suffer from the Lost in the Middle problem—when input text is too long, the model's attention to middle sections drops significantly. Therefore, context compression is needed to remove noise from documents and retain only the most relevant essence.

Closed-Loop Evaluation Mechanisms Drive Continuous Optimization

The most critical point: don't judge by feeling that "recall seems better." You must establish a quantitative evaluation system, tracking metrics like Recall@K, MRR, and NDCG. Optimization without metrics is running blind.

Interview Answer Framework Summary

When an interviewer asks "What do you do when RAG recall is low?", never give just one point. You should explain that this is a full-pipeline funnel engineering problem:

Early stage: Rely on data infrastructure (intelligent chunking, metadata enrichment, model fine-tuning) and query rewriting (multi-path rewriting, HyDE, query decomposition) as the baseline guarantee
Middle stage: Rely on hybrid search and parent-child document architecture to boost efficiency
Late stage: Rely on reranking models and context compression for precision filtering

Excellent engineers never put blind faith in any single model or technique—they use a combination of strategies to combat system uncertainty. This kind of systematic thinking is what interviewers truly want to assess.