What is the difference between bi-encoders and cross-encoders in RAG reranking?

Bi-encoders create separate embeddings for queries and documents, enabling fast similarity search. Cross-encoders process query and document together, capturing interactions between them for more accurate relevance scoring. Cross-encoders are slower but significantly more precise.

How much does reranking improve RAG accuracy?

Reported improvements range from 28% to 48% in retrieval quality metrics like NDCG@10. In practice, this tends to translate to fewer hallucinations and more accurate LLM responses. The exact improvement depends on your data and baseline system.

What is two-stage retrieval?

Two-stage retrieval separates candidate generation (fast, broad search) from ranking (slow, precise evaluation). The first stage retrieves many candidates using embedding similarity. The second stage reranks these candidates using a cross-encoder to identify the most relevant results.

Which reranking model should I use?

For quick implementation and high accuracy, Cohere Rerank API is an excellent choice. For self-hosted deployments with good accuracy, BGE Reranker or Mixedbread's mxbai-rerank work well. For long documents, Jina-ColBERT handles 8,000 token contexts. Choose based on your latency, cost, and privacy requirements.

How many documents should I retrieve before reranking?

Retrieve 50 to 75 candidates for most applications. This provides enough coverage to catch relevant documents while keeping reranking fast. Going beyond 100 candidates rarely improves results but increases latency and cost significantly.

RAG Reranking: Two-Stage Retrieval for Better Results

Why Your RAG Pipeline Probably Underperforms

You've built a Retrieval Augmented Generation system. You've chunked your documents, created embeddings, and connected everything to an LLM. But the responses feel... mediocre. The model hallucinates despite having relevant information in your knowledge base.

The problem isn't your LLM. It's what you're feeding it.

Basic vector search retrieves documents based on embedding similarity. That sounds reasonable, but similarity scores don't always equal relevance. A chunk might score highly because it contains related vocabulary without actually answering the user's question.

This is where RAG reranking changes everything. By adding a second stage that evaluates retrieved documents more carefully, you ensure only the most relevant information reaches your LLM.

If you're new to this space, a grounding in the basics of retrieval augmented generation will help explain why reranking matters so much.

What Is Retrieval Reranking?

Reranking is a technique that takes your initial search results and reorders them based on actual relevance to the query. Think of it as a quality filter between your vector database and your LLM.

Here's the typical flow:

User submits a query
Your retriever pulls the top 50 to 100 candidate documents from a vector store
A reranker model evaluates each candidate against the query
The reranker assigns relevance scores and reorders results
Only the top 3 to 10 most relevant documents reach the LLM

This two-stage approach has been standard practice in search engineering for years. What's changed is the availability of transformer-based rerankers that understand semantic relationships, not just keyword overlap.

The core insight? Your initial retrieval needs to be fast and cast a wide net. Your reranker needs to be accurate and selective. By splitting these responsibilities, you get both speed and quality.

The Problem with Embedding-Only Retrieval

To understand why reranking works, you need to understand why embeddings alone fall short.

Embedding models (bi-encoders) convert text into vectors. When you search, you compare your query vector against document vectors using similarity metrics like cosine similarity. This is fast because document embeddings are pre-computed.

But there's a fundamental limitation.

Bi-encoders must compress all possible meanings of a document into a single vector, typically 768 or 1536 dimensions. That's a lot of information loss. Additionally, the embedding is created without any knowledge of what users will eventually ask.

Consider this example: A user asks "What are the side effects of medication X?" Your knowledge base contains a document that mentions medication X briefly while focusing on medication Y's side effects. The embedding similarity might be high because both discuss side effects and medications, but the retrieved chunk doesn't actually answer the question.

The technical explanation lies in how embeddings enable semantic search. Embeddings capture general semantic meaning, but they can't model the specific relationship between a particular query and a particular document. They're optimized for recall (finding potentially relevant items) rather than precision (ranking truly relevant items first).
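The medication example can be reproduced with a toy model. The sketch below uses word counts as a crude stand-in for embeddings (real embedding models behave differently, and the two documents are invented for illustration), but it shows the same failure mode: shared vocabulary drives the score up even when the document is about the wrong drug.

```python
import math
import re
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "What are the side effects of medication X?"
# Hypothetical chunks: the first is mostly about Y, the second answers the question.
off_topic = "Unlike medication X, the side effects of medication Y are nausea and headache."
on_topic = "Patients taking X commonly report dizziness and fatigue."

q = bow_vector(query)
score_off = cosine(q, bow_vector(off_topic))
score_on = cosine(q, bow_vector(on_topic))
print(f"off-topic chunk: {score_off:.2f}, on-topic chunk: {score_on:.2f}")
```

The chunk about medication Y wins on similarity because it repeats the query's vocabulary. A reranker is there to catch exactly this kind of mismatch.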

How Two-Stage Retrieval Works

Two-stage retrieval separates the "find candidates" problem from the "rank by relevance" problem.

Stage 1: Fast Retrieval (Bi-Encoder)

The first stage uses an embedding model to find candidate documents quickly. You retrieve a larger set than you actually need, maybe 50 to 100 chunks, accepting that some won't be perfectly relevant. The goal here is recall, meaning you don't want to miss anything important.

This stage runs in milliseconds because:

Document embeddings are pre-computed and indexed
Approximate nearest neighbor algorithms handle billions of vectors efficiently
You're not evaluating document content at query time

Stage 2: Precise Reranking (Cross-Encoder)

The second stage takes those candidates and evaluates them properly. A cross-encoder reranking model reads the query and each document together, producing a relevance score that reflects how well that specific document answers that specific question.

This stage is slower per document, but you're only processing 50 to 100 items, not millions. The trade-off makes sense.

The result? You get the scalability of embedding search with the accuracy of deep semantic analysis. Your LLM receives a much cleaner context window, leading to better responses.
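As a rough sketch, the two stages can be written as one function with pluggable scorers. Everything here is illustrative: `term_count_score` and `bigram_overlap_score` are hypothetical toy stand-ins for a bi-encoder and a cross-encoder, chosen only so the example runs without a model.

```python
import re
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]

def tokens(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())

def term_count_score(query: str, doc: str) -> float:
    """Cheap stage-1 stand-in: how many doc tokens are query terms."""
    query_terms = set(tokens(query))
    return float(sum(1 for t in tokens(doc) if t in query_terms))

def bigram_overlap_score(query: str, doc: str) -> float:
    """Costlier stage-2 stand-in: number of shared adjacent word pairs."""
    q, d = tokens(query), tokens(doc)
    return float(len(set(zip(q, q[1:])) & set(zip(d, d[1:]))))

def two_stage_search(
    query: str,
    corpus: List[str],
    fast_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 50,
    top_n: int = 5,
) -> List[Tuple[float, str]]:
    # Stage 1: cast a wide net with the cheap scorer.
    candidates = sorted(corpus, key=lambda d: fast_score(query, d), reverse=True)
    candidates = candidates[:n_candidates]
    # Stage 2: score only the candidates with the expensive scorer.
    scored = [(precise_score(query, d), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

corpus = [
    "How to reset your password in the account settings",
    "Password tips: a strong password beats a weak password, and how to store it",
    "Reset the router to factory settings",
    "Billing and invoices overview",
]
results = two_stage_search(
    "how to reset password", corpus,
    term_count_score, bigram_overlap_score,
    n_candidates=3, top_n=2,
)
for score, doc in results:
    print(score, doc)
```

In this toy run the first stage ranks the "password tips" chunk highest because it repeats the word "password", and the second stage moves the actual reset instructions to the top. In a real pipeline you would swap in your vector store query and a reranker call for the two scorers.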

For a complete picture of how these components fit together, see our complete overview of RAG architecture.

What Makes Cross-Encoder Reranking So Effective?

Cross-encoders are the backbone of most modern reranking systems. Understanding how they differ from bi-encoders explains their accuracy advantage.

Bi-Encoder Processing:

Encodes query and document separately
Creates independent vector representations
Compares vectors using similarity metrics
No direct interaction between query and document

Cross-Encoder Processing:

Concatenates query and document as a single input
Processes both through a transformer together
Attention mechanism operates across query and document tokens
Outputs a direct relevance score

The attention mechanism is the key difference. When a cross-encoder processes "[QUERY] What is RLHF? [DOC] Reinforcement Learning from Human Feedback is a technique...", every word in the query can attend to every word in the document. This captures nuanced relationships that separate encodings simply cannot represent.

Several sources report accuracy gains. MIT studies reportedly show cross-encoder reranking improves RAG accuracy by 33% on average. Databricks found up to 48% improvement in retrieval quality. ZeroEntropy's testing showed 28% NDCG@10 improvements with corresponding reductions in LLM hallucinations. Treat these as indicative figures rather than guarantees, since results depend heavily on the dataset and the baseline.

The trade-off is computational cost. Cross-encoders can't pre-compute document representations because the representation depends on the query. Every query requires running inference on all candidate documents. This is why reranking only makes sense for small candidate sets, not million-document searches.
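A quick back-of-the-envelope count makes the trade-off concrete. The sketch below only counts model forward passes per query and assumes document embeddings were computed once at indexing time. It ignores batching, model size, and vector search cost, so treat the numbers as relative, not as latency estimates.

```python
def forward_passes_per_query(corpus_size: int, n_candidates: int) -> dict:
    """Count model forward passes needed to answer a single query."""
    return {
        # Bi-encoder: encode the query once, then compare against stored vectors.
        "bi_encoder_only": 1,
        # Two-stage: encode the query, then score each candidate pair.
        "two_stage": 1 + n_candidates,
        # Cross-encoder over the whole corpus: one pass per (query, doc) pair.
        "cross_encoder_everything": corpus_size,
    }

costs = forward_passes_per_query(corpus_size=1_000_000, n_candidates=100)
for strategy, passes in costs.items():
    print(f"{strategy}: {passes:,} forward passes")
```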

Popular Reranking Models Worth Knowing

The reranking landscape has matured significantly. Here are the leading options:

Cohere Rerank 4.0

Cohere's latest model represents the commercial state of the art. It supports 100+ languages, handles semi-structured data like JSON and tables, and processes context up to 4,096 tokens. The API integration takes just a few lines of code.

Cohere Rerank excels in enterprise settings where accuracy matters more than cost. It's particularly strong in financial, legal, and healthcare domains where subtle distinctions determine relevance.

BGE Reranker Family

BAAI's BGE rerankers are popular open-source options. The family includes:

bge-reranker-base (278M parameters)
bge-reranker-large (560M parameters)
bge-reranker-v2.5-gemma2-lightweight (built on Gemma 2)

These models work well for self-hosted deployments where you control infrastructure costs. They're integrated into LangChain, LlamaIndex, and most RAG frameworks.

ColBERT (Late Interaction)

ColBERT takes a different approach called late interaction. Instead of full cross-encoding, it generates multiple embeddings per document and performs fine-grained token comparisons at query time.

This achieves better speed than pure cross-encoders while maintaining strong accuracy. Jina-ColBERT extends this to 8,000 token contexts, useful for long documents.
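Late interaction scoring is commonly described as "MaxSim": for each query token embedding, take its best match among the document's token embeddings, then sum those maxima. Here is a minimal sketch using hand-written 2-dimensional vectors. The vectors are made up for illustration, and a real ColBERT model produces normalized, higher-dimensional token embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]

def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query tokens, each pointing along a different axis.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
# doc_a has a good match for both query tokens.
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# doc_b only matches the first query token well.
doc_b = [[1.0, 0.0], [0.9, 0.1]]

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
print(score_a, score_b)
```

Because the document token vectors can be stored ahead of time, only the query needs encoding at search time. That is where the speed advantage over a full cross-encoder comes from.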

Mixedbread mxbai-rerank

Mixedbread's v2 models (base and large) claim state-of-the-art performance on BEIR benchmarks. They're fully open-source under Apache 2.0, making them attractive for commercial applications where licensing is a concern.

The 1.5B parameter large model offers top accuracy, while the 0.5B base model provides a balance of speed and quality.

FlashRank

For CPU-only deployments, FlashRank provides ONNX-optimized rerankers that run without GPUs. Accuracy is lower than with larger models, but latency can be under 60ms for 100 candidates.

When selecting a model, consider your constraints around latency, cost, privacy, and language support. Many teams start with Cohere's API for prototyping and move to self-hosted models for production.

How Reranking Improves RAG Accuracy

Let's get concrete about the improvements reranking delivers.

Reduced Hallucinations

When your LLM receives irrelevant context, it may generate plausible-sounding but incorrect responses. Reranking filters out noise, ensuring the model reasons from relevant information. Databricks testing found reranked results reduce LLM hallucinations by 35% compared to raw embedding similarity.

Better Precision Without Sacrificing Recall

Initial retrieval casts a wide net. Reranking narrows that net to the best results. You maintain the recall from fetching many candidates while achieving precision in what reaches the LLM.

Improved Performance on Complex Queries

Simple factual queries might work fine with basic retrieval. But nuanced questions with multiple aspects, or queries requiring reasoning across multiple facts, benefit enormously from reranking. The cross-encoder can identify documents that address the full query intent, not just surface-level term overlap.

Consistent Relevance Thresholds

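Cross-encoder scores are usually easier to threshold than bi-encoder similarities, because the model is trained to output a relevance judgment for each query-document pair rather than a distance in embedding space. Some rerankers normalize scores to a 0 to 1 range, but a score of 0.8 does not always mean the same thing across queries or domains, so validate any cutoff on your own data before treating it as universal. Bi-encoder similarity scores vary even more by query, making consistent filtering difficult.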

To measure improvements systematically, learn how to evaluate RAG accuracy and performance; that gives you the metrics framework to quantify gains.
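Since NDCG@10 is the metric quoted most often for reranking gains, here is a small sketch of how it can be computed. It uses the linear-gain form of DCG and derives the ideal ordering from the same list of judged results, which is a simplification: a full evaluation would use every relevant document known for the query. The relevance labels are invented for illustration.

```python
import math
from typing import List

def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain: later positions count for less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: List[float], k: int = 10) -> float:
    """DCG divided by the DCG of the best possible ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# 1 = relevant, 0 = not relevant, listed in ranked order.
before_rerank = [0, 1, 0, 0, 1]
after_rerank = [1, 1, 0, 0, 0]

ndcg_before = ndcg_at_k(before_rerank)
ndcg_after = ndcg_at_k(after_rerank)
print(f"before: {ndcg_before:.3f}, after: {ndcg_after:.3f}")
```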

Implementing Reranking in Your Pipeline

Adding reranking to an existing RAG system is straightforward. Here's a practical approach:

Step 1: Increase Initial Retrieval Count

Change your retriever to fetch more candidates than before. If you were retrieving 5 documents, try retrieving 25 to 50 as a starting point, then experiment with larger sets. The reranker will filter down to your final count.

Step 2: Add a Reranker

Choose an API or model based on your requirements. With Cohere:

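from cohere import Client

# Candidate chunks returned by your first-stage retriever (placeholder values)
candidate_docs = ["first retrieved chunk", "second retrieved chunk"]

co = Client(api_key="your-key")
results = co.rerank(
    query="your user query",
    documents=candidate_docs,
    top_n=5,
    model="rerank-v3.5"
)

# Each result carries the candidate's original index and its relevance score
for r in results.results:
    print(r.index, r.relevance_score, candidate_docs[r.index])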

With BGE (self-hosted):

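from FlagEmbedding import FlagReranker

# use_fp16=True speeds up inference at a small cost in precision
reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)
# Each pair is [query, document]; one score comes back per pair, in the same order
pairs = [['your user query', doc] for doc in candidate_docs]
scores = reranker.compute_score(pairs)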

Step 3: Pass Reranked Results to LLM

Sort by reranker scores and take the top N. These form your context for generation.
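A minimal sketch of this step, assuming you have a list of candidate chunks and a parallel list of reranker scores (for example, the raw scores a self-hosted model returns). The helper names and the numbered context format are illustrative choices, not part of any library.

```python
from typing import List, Sequence

def top_n_by_score(docs: Sequence[str], scores: Sequence[float], n: int = 5) -> List[str]:
    """Return the n highest-scoring docs, best first."""
    if len(docs) != len(scores):
        raise ValueError("docs and scores must have the same length")
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:n]]

def build_context(docs: Sequence[str]) -> str:
    """Join the selected chunks into a numbered context block for the prompt."""
    return "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs, start=1))

docs = ["chunk A", "chunk B", "chunk C"]
scores = [-1.2, 3.4, 0.7]  # raw scores can be negative; only the order matters here

selected = top_n_by_score(docs, scores, n=2)
context = build_context(selected)
print(context)
```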

Step 4: Monitor and Tune

Track relevance metrics over time. Experiment with different retrieval counts and top_n values. The optimal settings depend on your data and query patterns.

Your document chunking approach also affects reranking effectiveness. Well-chunked documents give rerankers cleaner candidates to evaluate.

Combining Reranking with Hybrid Search

Reranking works well alongside hybrid search strategies. Hybrid retrieval merges lexical (BM25) and semantic (vector) search for broader coverage.

The pattern looks like:

Run BM25 search for keyword matches
Run vector search for semantic matches
Merge and deduplicate results
Rerank the combined candidate set
Take top results for the LLM

This pattern handles queries that need exact term matching (product codes, names, acronyms) alongside queries requiring semantic understanding. The reranker then selects the best results regardless of which retrieval method found them.
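The merge-and-deduplicate step can be as simple as interleaving the two result lists and skipping repeats. The sketch below takes that approach. It is one possible choice (rank fusion methods are another), and the document IDs are invented. Because the reranker scores everything afterwards, the merge order mostly matters if you truncate the merged list.

```python
from itertools import zip_longest
from typing import List, Optional

def merge_candidates(
    bm25_ids: List[str],
    vector_ids: List[str],
    limit: Optional[int] = None,
) -> List[str]:
    """Interleave two ranked ID lists, keeping the first occurrence of each ID."""
    seen = set()
    merged = []
    for pair in zip_longest(bm25_ids, vector_ids):
        for doc_id in pair:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged if limit is None else merged[:limit]

bm25_hits = ["sku-123", "faq-7", "doc-2"]    # exact-term matches
vector_hits = ["doc-2", "doc-9", "faq-7"]    # semantic matches

candidates = merge_candidates(bm25_hits, vector_hits)
print(candidates)  # this combined list is what gets passed to the reranker
```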

Research from Pinecone shows hybrid plus reranking pipelines achieve 48% improvement in retrieval quality compared to single-method approaches.

Best Practices for Production Reranking

After working with reranking systems, several patterns emerge:

Right-size Your Candidate Set

Retrieve enough candidates that relevant documents are likely included, but not so many that reranking becomes slow. For most applications, 50 to 75 documents hits the sweet spot. Going beyond 100 rarely improves final results but increases latency and cost.

Set Score Thresholds

Don't pass everything to your LLM. If no documents score above a certain relevance threshold, it may be better to acknowledge uncertainty than to hallucinate from weak context.
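One way to apply a threshold is sketched below. The cutoff of 0.5 is an arbitrary placeholder that assumes scores normalized to a 0 to 1 range, so pick your own value from evaluation data. The fallback message is likewise only an example.

```python
from typing import List, Tuple

def select_context(
    scored_docs: List[Tuple[float, str]],
    threshold: float = 0.5,
    top_n: int = 5,
) -> List[str]:
    """Keep docs scoring at or above the threshold, best first, capped at top_n."""
    kept = sorted(
        (pair for pair in scored_docs if pair[0] >= threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [doc for _, doc in kept[:top_n]]

scored = [
    (0.91, "Refund policy for annual plans"),
    (0.12, "Office holiday schedule"),
    (0.64, "How to request a refund"),
]
context_docs = select_context(scored)

weak_docs = select_context([(0.2, "Unrelated chunk"), (0.1, "Another unrelated chunk")])
if not weak_docs:
    # Nothing cleared the bar: better to say so than to generate from weak context.
    answer = "I couldn't find relevant information for that question."
    print(answer)
```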

Consider Latency Budgets

Cross-encoder reranking adds 50 to 200ms depending on model size and candidate count. For real-time chat applications, this matters. For batch processing or offline analysis, it's negligible.
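If you have a latency budget, you can work backwards to a candidate count. The sketch below assumes reranking cost grows roughly linearly with the number of candidates, which is only an approximation once batching is involved. The per-document cost and retrieval time are hypothetical inputs you would measure on your own hardware.

```python
import math

def max_candidates_for_budget(
    budget_ms: float,
    retrieval_ms: float,
    rerank_ms_per_doc: float,
) -> int:
    """Largest candidate count whose estimated rerank time fits the remaining budget."""
    if rerank_ms_per_doc <= 0:
        raise ValueError("rerank_ms_per_doc must be positive")
    remaining = budget_ms - retrieval_ms
    if remaining <= 0:
        return 0
    return math.floor(remaining / rerank_ms_per_doc)

# Example: 200ms total budget, 20ms vector search, about 2ms per reranked doc.
n = max_candidates_for_budget(budget_ms=200, retrieval_ms=20, rerank_ms_per_doc=2.0)
print(n)
```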

Monitor for Drift

Reranker performance can degrade as your document corpus changes. Regularly evaluate against a fixed set of test queries to catch quality degradation. Background on AI model benchmarks and evaluation provides context on how evaluation methodologies work.
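A lightweight drift check might look like the following sketch. It assumes a small set of test queries, each with one known-good document ID, and compares the current hit rate against a stored baseline. The metric choice, the tolerance of 0.05, and all the IDs are illustrative assumptions.

```python
from typing import Dict, List

def hit_rate_at_k(
    results: Dict[str, List[str]],
    expected: Dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of test queries whose expected doc appears in the top k results."""
    if not expected:
        return 0.0
    hits = sum(1 for q, doc_id in expected.items() if doc_id in results.get(q, [])[:k])
    return hits / len(expected)

def has_drifted(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Flag a drop larger than the tolerance relative to the stored baseline."""
    return (baseline - current) > tolerance

expected = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
# Reranked doc IDs per test query from the current pipeline (made-up data).
current_results = {
    "q1": ["a", "x"],
    "q2": ["x", "b"],
    "q3": ["y", "z"],
    "q4": ["d"],
}

current = hit_rate_at_k(current_results, expected, k=2)
drifted = has_drifted(current, baseline=0.9)
print(f"hit rate@2: {current:.2f}, drifted: {drifted}")
```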

Domain-Specific Fine-Tuning

Off-the-shelf rerankers work well for general content. For specialized domains (medical, legal, technical), fine-tuning on domain-specific pairs can improve relevance significantly.

When Not to Use Reranking

Reranking isn't always necessary. Consider skipping it when:

Your document set is small (under 1,000 chunks) and well-curated
Latency requirements are extremely tight (under 100ms total)
Your queries are simple lookups rather than complex questions
Cost constraints are severe and you can't afford per-query model inference

For simple FAQ systems or small knowledge bases, basic retrieval may suffice. Reranking shines when you have large, varied document collections and complex user queries.

The Future of Retrieval in RAG

Reranking represents current best practice, but the field continues advancing.

Late interaction models like ColBERT offer promising middle ground between bi-encoder speed and cross-encoder accuracy. Multi-modal rerankers that handle images and text together are emerging. And LLM-based rerankers (using models like GPT-4 to score relevance) show competitive results, though at higher cost.

For teams building AI research assistance tools or complex RAG applications, staying current on retrieval innovations matters as much as keeping up with LLM releases.

The core principle remains constant: better retrieval means better generation. Cross-encoder reranking is currently the most reliable way to improve RAG accuracy, and it's worth implementing in any serious production system.

Wrapping Up

Two-stage retrieval with reranking transforms RAG from "sometimes works" to "reliably excellent." The technique is straightforward: retrieve broadly with embeddings, refine precisely with cross-encoders.

If your RAG system disappoints, try adding a reranking step before investing in larger LLMs or more elaborate prompting. You might be surprised how much better responses become when the LLM actually receives relevant context.

Understanding retrieval deeply, including how similarity is measured for retrieval, positions you to build systems that genuinely work.

Start with a hosted API like Cohere to validate the improvement, then move to self-hosted options if cost or privacy requires it. Either way, your users will notice the difference.

Retrieval and Reranking in RAG Systems

Key takeaways