Improving BM25 Retrieval with Query Expansion: A Practical Evaluation Using Precision, Recall, and MRR
Published March 2026 · 6 min read · RAG · Retrieval · Query Expansion

One of the most effective techniques used in modern RAG systems is query expansion. It is a reliable method that engineers can apply to significantly improve retriever performance.

In this article, we walk through an experiment where we build a BM25 retrieval system, apply query expansion, and evaluate its impact using Precision, Recall, and MRR.

What is Query Expansion?

Query expansion is a simple yet powerful technique widely used in modern AI systems. The idea is to expand a single query into multiple variations that retain the same meaning but use different wording.

Instead of directly tokenizing the original query and performing a search, we introduce a query expansion layer on top of the retrieval process. In simple terms, we generate multiple semantically similar queries, increasing the chances of retrieving more relevant results.

Example

Original Query: "late payment fee"

Without query expansion, a retriever like BM25 will search for the exact terms “late”, “payment”, and “fee”. This means it may miss relevant documents containing related terms such as “charges”, “penalty”, or “overdue”.

To address this, we expand the original query into a richer version:

Expanded Query: "late payment fee penalty charges overdue"

Now, the query includes multiple semantically similar terms. This increases coverage and helps the retriever match a broader set of relevant documents. Given the lexical nature of BM25, this often leads to better retrieval performance.

While query expansion can also benefit dense retrievers, in this experiment we will focus on BM25. We will evaluate its performance using metrics such as Precision@5, Recall@5, and MRR@5 to determine whether query expansion leads to measurable improvements.

Before diving into the experiment results, let’s first look at some of the most widely used query expansion techniques. The Query Expander shown above is a generic component that can be implemented using different techniques, such as dictionary-based methods, LLM-based expansion, or PRF.

1. Dictionary-Based (Rule-Based) Expansion

In this approach, we use a predefined mapping of words to their synonyms.

Example:

Query: "late payment fee"

Expanded: "late payment fee penalty charges"

We maintain a domain-specific dictionary that maps key terms to their related words:

    "fee": ["charge", "penalty"]
    "lost": ["stolen", "missing"]
    

When a query is received, we look up each term in the dictionary, retrieve its synonyms, and expand the query accordingly. This improves the effectiveness of lexical search methods like BM25.

Pros

Cons

When to Use

2. LLM-Based Expansion (Modern Approach)

In this approach, we use a Large Language Model (LLM) to generate multiple variations of a query using prompt-based instructions.

The idea is to pass the original query to the LLM and ask it to generate 3–5 semantically similar queries with different wording.

Example:

Query: "minimum amount due"

Expanded queries:

    minimum payment required
    outstanding payment obligation
    due balance amount
    

This method is powerful because it goes beyond simple synonym replacement. The model understands the intent of the query and generates variations based on meaning rather than just keywords.

Pros

Cons

When to Use

3. Pseudo-Relevance Feedback (PRF) / Top-K Expansion

This is a data-driven technique that does not require any external knowledge source.

The process works as follows:

Example:

Query: "credit limit"

Top results may contain terms like:

    maximum credit
    available limit
    card account
    

Expanded query:

    credit limit maximum credit available limit card account
    

Pros

Cons

When to Use

Implementation & Results

Objective: The goal of this experiment is to build a retrieval system that fetches relevant terms and conditions sections based on user queries and uses those retrieved chunks to generate answers.

In this article, we focus only on the retriever component, not the generation step.

Experimental Setup

Metrics used:

The objective is to evaluate whether query expansion improves retrieval performance.

Baseline (Without Query Expansion)

Precision@50.168
Recall@50.4683
MRR0.4885

Dictionary-Based Query Expansion

We expanded queries using a predefined dictionary to improve lexical coverage.

Precision@50.1780
Recall@50.4967
MRR0.5042

Observation: There is a noticeable improvement in both Recall and MRR. While the gains are modest, they demonstrate that even simple rule-based query expansion can enhance retriever performance.

LLM-Based Query Expansion

For this approach, we used a Gemini model to generate multiple query variations. (You can also use open-source models available on Hugging Face.)

Precision@50.2020
Recall@50.5567
MRR0.5685

Observation: This significantly improves performance across all metrics. Recall improved by ~19.6% compared to the baseline MRR also shows a strong increase, indicating better ranking of relevant results

Query expansion improves recall more than precision. This is expected because expanding the query increases coverage, allowing the retriever to match more documents. However, this can also introduce noise, which is why precision improvements are smaller compared to recall.

Conclusion

From the experimental results, it is clear that query expansion improves retrieval performance when implemented correctly. Both dictionary-based and LLM-based approaches show measurable gains over the baseline, with LLM-based expansion delivering the most significant improvements.

However, choosing the right approach depends on the trade-off between latency and accuracy:

If accuracy is the priority, LLM-based query expansion is the better choice due to its ability to capture semantic meaning.

If latency and cost are critical, simpler methods like dictionary-based or Top-K (PRF) expansion are more suitable.

In practice, designing an effective retrieval system requires carefully balancing these trade-offs based on the specific use case.

Until next time, happy learning!

Full code repo: View on GitHub

LinkedIn profile: Connect on LinkedIn

← Back to Articles