Why Your RAG Pipeline Probably Underperforms
You've built a Retrieval-Augmented Generation (RAG) system. You've chunked your documents, created embeddings, and connected everything to an LLM. But the responses feel... mediocre. The model hallucinates despite having relevant information in your knowledge base.
The problem isn't your LLM. It's what you're feeding it.
Basic vector search retrieves documents based on embedding similarity. That sounds reasonable, but similarity scores don't always equal relevance. A chunk might score highly because it contains related vocabulary without actually answering the user's question.
This is where RAG reranking comes in. Adding a second stage that evaluates retrieved documents more carefully makes it far more likely that only the most relevant information reaches your LLM.
If you're new to this space, understanding retrieval augmented generation basics will help contextualize why reranking matters so much.
What Is Retrieval Reranking?
Reranking is a technique that takes your initial search results and reorders them based on actual relevance to the query. Think of it as a quality filter between your vector database and your LLM.
Here's the typical flow:
- User submits a query
- Your retriever pulls the top 50 to 100 candidate documents from a vector store
- A reranker model evaluates each candidate against the query
- The reranker assigns relevance scores and reorders results
- Only the top 3 to 10 most relevant documents reach the LLM
This two-stage approach has been standard practice in search engineering for years. What's changed is the availability of transformer-based rerankers that understand semantic relationships, not just keyword overlap.
The core insight? Your initial retrieval needs to be fast and cast a wide net. Your reranker needs to be accurate and selective. By splitting these responsibilities, you get both speed and quality.
The Problem with Embedding-Only Retrieval
To understand why reranking works, you need to understand why embeddings alone fall short.
Embedding models (bi-encoders) convert text into vectors. When you search, you compare your query vector against document vectors using similarity metrics like cosine similarity. This is fast because document embeddings are pre-computed.
But there's a fundamental limitation.
Bi-encoders must compress all possible meanings of a document into a single vector, typically 768 or 1536 dimensions. That's a lot of information loss. Additionally, the embedding is created without any knowledge of what users will eventually ask.
Consider this example: A user asks "What are the side effects of medication X?" Your knowledge base contains a document that mentions medication X briefly while focusing on medication Y's side effects. The embedding similarity might be high because both discuss side effects and medications, but the retrieved chunk doesn't actually answer the question.
The technical explanation involves how embeddings enable semantic search. Embeddings capture general semantic meaning, but they can't model the specific relationship between a particular query and a particular document. They're optimized for recall (finding potentially relevant items) rather than precision (ranking truly relevant items first).
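The bi-encoder search described in this section can be sketched in a few lines. This is a minimal illustration, not a production retriever: the three-dimensional vectors, the document IDs, and the helper names `cosine_similarity` and `top_k_by_similarity` are all made up for the example, and a real system would get its vectors from an embedding model and its neighbors from an ANN index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(query_vec, doc_vecs, k):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values); real ones have hundreds of dimensions.
doc_vecs = {
    "med_x_side_effects": [0.9, 0.1, 0.3],
    "med_y_side_effects": [0.8, 0.2, 0.4],
    "pricing_page": [0.0, 1.0, 0.1],
}
query_vec = [1.0, 0.0, 0.3]  # stands in for "side effects of medication X"

results = top_k_by_similarity(query_vec, doc_vecs, k=2)
```

In this toy setup the medication Y document scores almost as high as the medication X document, which is the failure mode described above: nearby in vector space, but not an answer to the question.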
How Two-Stage Retrieval Works
Two-stage retrieval separates the "find candidates" problem from the "rank by relevance" problem.
Stage 1: Fast Retrieval (Bi-Encoder)
The first stage uses an embedding model to find candidate documents quickly. You retrieve a larger set than you actually need, maybe 50 to 100 chunks, accepting that some won't be perfectly relevant. The goal here is recall, meaning you don't want to miss anything important.
This stage runs in milliseconds because:
- Document embeddings are pre-computed and indexed
- Approximate nearest neighbor algorithms handle billions of vectors efficiently
- You're not evaluating document content at query time
Stage 2: Precise Reranking (Cross-Encoder)
The second stage takes those candidates and evaluates them properly. A cross-encoder reranking model reads the query and each document together, producing a relevance score that reflects how well that specific document answers that specific question.
This stage is slower per document, but you're only processing 50 to 100 items, not millions. The trade-off makes sense.
The result? You get the scalability of embedding search with the accuracy of deep semantic analysis. Your LLM receives a much cleaner context window, leading to better responses.
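As a rough sketch of the two stages working together, the function below takes any cheap scorer for stage 1 and any expensive scorer for stage 2. Everything here is illustrative: `two_stage_retrieve`, `word_overlap`, and `toy_precise` are hypothetical names, word overlap stands in for embedding similarity, and a hand-assigned relevance table stands in for a real cross-encoder.

```python
import re
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    corpus: List[str],
    fast_score: Scorer,
    precise_score: Scorer,
    candidate_count: int = 50,
    final_count: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap scoring over the whole corpus, keep a wide candidate set.
    candidates = sorted(
        corpus, key=lambda doc: fast_score(query, doc), reverse=True
    )[:candidate_count]
    # Stage 2: expensive scoring, but only on the candidates.
    reranked = sorted(
        ((doc, precise_score(query, doc)) for doc in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:final_count]


def tokenize(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(query: str, doc: str) -> float:
    """Stage 1 stand-in: number of distinct words shared by query and doc."""
    return float(len(tokenize(query) & tokenize(doc)))


corpus = [
    "Side effects of medication Y include nausea; medication X is mentioned only briefly.",
    "Medication X commonly causes headache and dizziness.",
    "Our office hours are nine to five.",
]

# Stage 2 stand-in: hand-assigned relevance instead of a real cross-encoder.
toy_relevance = {corpus[0]: 0.30, corpus[1]: 0.95, corpus[2]: 0.01}


def toy_precise(query: str, doc: str) -> float:
    return toy_relevance[doc]


top = two_stage_retrieve(
    "side effects of medication X",
    corpus,
    word_overlap,
    toy_precise,
    candidate_count=2,
    final_count=1,
)
```

Stage 1 ranks the medication Y document first because it shares more words with the query. Stage 2 only has to look at two candidates to put the document that actually answers the question on top.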
For a complete picture of how these components fit together, check out our RAG architecture complete overview.
What Makes Cross-Encoder Reranking So Effective?
Cross-encoders are the backbone of most modern reranking systems. Understanding how they differ from bi-encoders explains their accuracy advantage.
Bi-Encoder Processing:
- Encodes query and document separately
- Creates independent vector representations
- Compares vectors using similarity metrics
- No direct interaction between query and document
Cross-Encoder Processing:
- Concatenates query and document as a single input
- Processes both through a transformer together
- Attention mechanism operates across query and document tokens
- Outputs a direct relevance score
The attention mechanism is the key difference. When a cross-encoder processes "[QUERY] What is RLHF? [DOC] Reinforcement Learning from Human Feedback is a technique...", every token in the query can attend to every token in the document. This captures nuanced relationships that separate encodings simply cannot represent.
Reported results point the same way, though the exact figures depend on dataset and setup. MIT studies reportedly show cross-encoder reranking improving RAG accuracy by 33% on average. Databricks found up to 48% improvement in retrieval quality. ZeroEntropy's testing showed 28% NDCG@10 improvements with corresponding reductions in LLM hallucinations.
The trade-off is computational cost. Cross-encoders can't pre-compute document representations because the representation depends on the query. Every query requires running inference on all candidate documents. This is why reranking only makes sense for small candidate sets, not million-document searches.
Popular Reranking Models Worth Knowing
The reranking landscape has matured significantly. Here are the leading options:
Cohere Rerank 4.0
Cohere's latest model represents the commercial state of the art. It supports 100+ languages, handles semi-structured data like JSON and tables, and processes context up to 4,096 tokens. The API integration takes just a few lines of code.
Cohere Rerank excels in enterprise settings where accuracy matters more than cost. It's particularly strong in financial, legal, and healthcare domains where subtle distinctions determine relevance.
BGE Reranker Family
BAAI's BGE rerankers are popular open-source options. The family includes:
- bge-reranker-base (278M parameters)
- bge-reranker-large (560M parameters)
- bge-reranker-v2.5-gemma2-lightweight (built on Gemma 2)
These models work well for self-hosted deployments where you control infrastructure costs. They're integrated into LangChain, LlamaIndex, and most RAG frameworks.
ColBERT (Late Interaction)
ColBERT takes a different approach called late interaction. Instead of full cross-encoding, it stores one embedding per document token and, at query time, matches each query token embedding against them in a fine-grained comparison.
This achieves better speed than pure cross-encoders while maintaining strong accuracy. Jina-ColBERT extends this to 8,000 token contexts, useful for long documents.
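The late interaction score is usually described as a "MaxSim" sum: for each query token, take its best match among the document tokens, then add those maxima up. The sketch below shows that operation on toy data. It assumes token vectors are already L2-normalized so a dot product equals cosine similarity; the two-dimensional vectors and the function names are illustrative, not ColBERT's actual code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_token_vecs, doc_token_vecs):
    """Sum over query tokens of the best-matching document token similarity."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Toy unit-length token embeddings (hypothetical values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0]]   # matches both query tokens exactly
doc_b_tokens = [[1.0, 0.0], [0.6, 0.8]]   # matches the second token only partly

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
```

Because the document token embeddings can be computed ahead of time, only the cheap max-and-sum step runs per query, which is where the speed advantage over a full cross-encoder comes from.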
Mixedbread mxbai-rerank
Mixedbread's v2 models (base and large) claim state-of-the-art performance on BEIR benchmarks. They're fully open-source under Apache 2.0, which makes them attractive for commercial applications where licensing terms are a concern.
The 1.5B parameter large model offers top accuracy, while the 0.5B base model provides a balance of speed and quality.
FlashRank
For CPU-only deployments, FlashRank provides ONNX-optimized rerankers that run without GPUs. Performance is lower than larger models, but latency can be under 60ms for 100 candidates.
When selecting a model, consider your constraints around latency, cost, privacy, and language support. Many teams start with Cohere's API for prototyping and move to self-hosted models for production.
How Reranking Improves RAG Accuracy
Let's get concrete about the improvements reranking delivers.
Reduced Hallucinations
When your LLM receives irrelevant context, it may generate plausible-sounding but incorrect responses. Reranking filters out noise, ensuring the model reasons from relevant information. Databricks testing found reranked results reduce LLM hallucinations by 35% compared to raw embedding similarity.
Better Precision Without Sacrificing Recall
Initial retrieval casts a wide net. Reranking narrows that net to the best results. You maintain the recall from fetching many candidates while achieving precision in what reaches the LLM.
Improved Performance on Complex Queries
Simple factual queries might work fine with basic retrieval. But nuanced questions with multiple aspects, or queries requiring reasoning across multiple facts, benefit enormously from reranking. The cross-encoder can identify documents that address the full query intent, not just surface-level term overlap.
Consistent Relevance Thresholds
Cross-encoder scores tend to be more comparable across queries than bi-encoder similarities, especially when the model outputs a normalized score between 0 and 1. A score of 0.8 then represents roughly similar relevance regardless of query type, which makes a single quality threshold workable once you've validated it on your own data. Bi-encoder similarity scores vary more by query, making consistent filtering difficult.
For systematic measurement, a framework for evaluating RAG accuracy and performance gives you the metrics to quantify gains.
Implementing Reranking in Your Pipeline
Adding reranking to an existing RAG system is straightforward. Here's a practical approach:
Step 1: Increase Initial Retrieval Count
Change your retriever to fetch more candidates than before. If you were retrieving 5 documents, try retrieving 25 to 50. The reranker will filter down to your final count.
Step 2: Add a Reranker
Choose an API or model based on your requirements. With Cohere:
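from cohere import Client

co = Client(api_key="your-key")

# candidate_docs: list of chunk strings returned by your first-stage retriever
results = co.rerank(
    query="your user query",
    documents=candidate_docs,
    top_n=5,  # how many reranked documents to return
    model="rerank-v3.5",  # swap in the model ID for the Rerank version you use
)
# Each item in results.results has the original index and a relevance_score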
With BGE (self-hosted):
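from FlagEmbedding import FlagReranker

# use_fp16 speeds up inference at a small cost in precision
reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)
# One [query, document] pair per candidate; returns one score per pair
scores = reranker.compute_score([['query', 'doc1'], ['query', 'doc2']])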
Step 3: Pass Reranked Results to LLM
Sort by reranker scores and take the top N. These form your context for generation.
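If your reranker returns a plain list of scores in the same order as the candidates, the sort-and-truncate step might look like the sketch below. The helper names `select_top_n` and `build_context` are hypothetical, as is the numbered-passage layout of the context string; adapt both to whatever your prompt template expects.

```python
def select_top_n(docs, scores, n):
    """Pair each document with its score, sort descending, keep the best n."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:n]]


def build_context(docs):
    """Join the selected documents into one numbered context string."""
    return "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))


candidate_docs = ["chunk about pricing", "chunk about side effects", "chunk about hours"]
scores = [0.1, 2.3, -1.0]  # raw reranker scores, higher means more relevant

top_docs = select_top_n(candidate_docs, scores, n=2)
context = build_context(top_docs)
```

Numbering the passages is optional, but it gives the LLM something to cite and makes it easier to trace an answer back to a specific chunk.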
Step 4: Monitor and Tune
Track relevance metrics over time. Experiment with different retrieval counts and top_n values. The optimal settings depend on your data and query patterns.
Optimal document chunking approaches also affect reranking effectiveness. Well-chunked documents give rerankers cleaner candidates to evaluate.
Combining Reranking with Hybrid Search
Reranking works well alongside hybrid search strategies. Hybrid retrieval merges lexical (BM25) and semantic (vector) search for broader coverage.
The pattern looks like:
- Run BM25 search for keyword matches
- Run vector search for semantic matches
- Merge and deduplicate results
- Rerank the combined candidate set
- Take top results for the LLM
This three-stage approach handles queries that need exact term matching (product codes, names, acronyms) alongside queries requiring semantic understanding. The reranker then selects the best results regardless of which retrieval method found them.
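The merge-and-deduplicate step can be as simple as the sketch below. It assumes each retriever returns `(doc_id, text)` pairs and that the same chunk carries the same ID in both systems; the function name and the sample data are made up for illustration.

```python
def merge_candidates(*result_lists):
    """Merge ranked result lists, keeping the first occurrence of each doc_id."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc_id, text in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, text))
    return merged


# Hypothetical outputs from the two retrievers for the same query.
bm25_results = [("d1", "Spec sheet for SKU-4411"), ("d2", "Returns policy")]
vector_results = [("d2", "Returns policy"), ("d3", "How to pick the right model")]

candidates = merge_candidates(bm25_results, vector_results)
# candidates now goes to the reranker, which decides the final order.
```

The merge deliberately ignores the retrievers' own scores. BM25 scores and vector similarities live on different scales, so the sketch leaves all ordering decisions to the reranker rather than trying to reconcile them.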
Research from Pinecone shows hybrid plus reranking pipelines achieve 48% improvement in retrieval quality compared to single-method approaches.
Best Practices for Production Reranking
Several patterns emerge from working with reranking systems in production:
Right-size Your Candidate Set
Retrieve enough candidates that relevant documents are likely included, but not so many that reranking becomes slow. For most applications, 50 to 75 documents hits the sweet spot. Going beyond 100 rarely improves final results but increases latency and cost.
Set Score Thresholds
Don't pass everything to your LLM. If no documents score above a certain relevance threshold, it may be better to acknowledge uncertainty than to hallucinate from weak context.
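A threshold check might look like the following sketch. It assumes scores on a 0 to 1 scale; since many rerankers emit raw logits, a sigmoid helper is included to map them into that range. The 0.5 cutoff, the function names, and the choice to return `None` when nothing qualifies are all illustrative assumptions, so tune the threshold against your own labeled queries.

```python
import math


def sigmoid(logit):
    """Map a raw reranker logit to the 0..1 range (numerically stable)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def select_context(scored_docs, threshold=0.5, max_docs=5):
    """Keep documents at or above the threshold, best first, capped at max_docs."""
    kept = [(doc, score) for doc, score in scored_docs if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_docs]


def context_or_none(scored_docs, threshold=0.5):
    """Return a context string, or None when no document is relevant enough."""
    kept = select_context(scored_docs, threshold)
    if not kept:
        return None  # caller should answer "I don't know" instead of guessing
    return "\n\n".join(doc for doc, _ in kept)


scored = [("chunk a", 0.9), ("chunk b", 0.2), ("chunk c", 0.6)]
strong = select_context(scored)
weak = context_or_none([("chunk b", 0.2)])
```

Returning `None` instead of an empty string makes the "nothing relevant found" case explicit, so the calling code can route to a fallback response rather than sending an empty context to the LLM.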
Consider Latency Budgets
Cross-encoder reranking adds 50 to 200ms depending on model size and candidate count. For real-time chat applications, this matters. For batch processing or offline analysis, it's negligible.
Monitor for Drift
Reranker performance can degrade as your document corpus changes. Regularly evaluate against test queries to catch quality degradation. Background on AI model benchmarks and evaluation provides context on how evaluation methodologies work.
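One way to track this is to compute NDCG@k over a fixed set of test queries and watch the average over time. The sketch below uses the linear-gain form of DCG and computes the ideal ordering from the judged results only, which assumes every relevant document for a query appears in the list being scored. The relevance grades are invented for the example.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(per_query_relevances, k):
    """Average NDCG@k across test queries."""
    if not per_query_relevances:
        return 0.0
    return sum(ndcg_at_k(r, k) for r in per_query_relevances) / len(per_query_relevances)


# Graded relevance of each returned document, in ranked order (hypothetical labels).
test_runs = [
    [3, 2, 1],  # already in the best order
    [0, 0, 3],  # the only relevant document is ranked last
]
average = mean_ndcg(test_runs, k=3)
```

Running the same test queries after each corpus update or model change, and alerting when the average drops, turns drift into something you can see rather than something users report.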
Domain-Specific Fine-Tuning
Off-the-shelf rerankers work well for general content. For specialized domains (medical, legal, technical), fine-tuning on domain-specific pairs can improve relevance significantly.
When Not to Use Reranking
Reranking isn't always necessary. Consider skipping it when:
- Your document set is small (under 1,000 chunks) and well-curated
- Latency requirements are extremely tight (under 100ms total)
- Your queries are simple lookups rather than complex questions
- Cost constraints are severe and you can't afford per-query model inference
For simple FAQ systems or small knowledge bases, basic retrieval may suffice. Reranking shines when you have large, varied document collections and complex user queries.
The Future of Retrieval in RAG
Reranking represents current best practice, but the field continues advancing.
Late interaction models like ColBERT offer promising middle ground between bi-encoder speed and cross-encoder accuracy. Multi-modal rerankers that handle images and text together are emerging. And LLM-based rerankers (using models like GPT-4 to score relevance) show competitive results, though at higher cost.
For teams building AI research assistance tools or complex RAG applications, staying current on retrieval innovations matters as much as keeping up with LLM releases.
The core principle remains constant: better retrieval means better generation. Cross-encoder reranking is currently one of the most reliable ways to improve RAG accuracy, and it's worth implementing in any serious production system.
Wrapping Up
Two-stage retrieval with reranking transforms RAG from "sometimes works" to "reliably excellent." The technique is straightforward: retrieve broadly with embeddings, refine precisely with cross-encoders.
If your RAG system disappoints, try adding a reranking step before investing in larger LLMs or more elaborate prompting. You might be surprised how much better responses become when the LLM actually receives relevant context.
Understanding retrieval deeply, including measuring similarity for retrieval, positions you to build systems that genuinely work.
Start with a hosted API like Cohere to validate the improvement, then move to self-hosted options if cost or privacy requires it. Either way, your users will notice the difference.



