Key metrics for RAG evaluation teams

Evaluating Embedding Quality: Hit Rate, MRR, and NDCG Explained

This guide explains Hit Rate, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG), three primary metrics for assessing embedding quality in retrieval-augmented generation (RAG) systems. It aims to help evaluation teams understand the strengths and limitations of each metric so they can make informed decisions about embedding model selection and tuning.

In this guide · 5 steps

01 Hit Rate: Measuring Retrieval Recall
02 Mean Reciprocal Rank (MRR): Accounting for Rank Position
03 Normalized Discounted Cumulative Gain (NDCG): Weighted Relevance Across Ranks
04 Choosing the Right Metric for Your Evaluation
05 Common Pitfalls and Best Practices

Retrieval-augmented generation (RAG) systems rely heavily on embedding models to fetch relevant documents or passages. Evaluating embedding quality objectively requires clear metrics that reflect how well the model ranks relevant items above irrelevant ones. Hit Rate, MRR, and NDCG are three of the most widely used metrics in research and enterprise evaluations.

1. Hit Rate: Measuring Retrieval Recall

Hit Rate@K measures whether at least one relevant document appears within the top K retrieved results. It is sometimes reported as Recall@K, but the two coincide only when each query has exactly one relevant document; otherwise Recall@K measures the fraction of all relevant documents that were retrieved. For example, Hit Rate@10 reports the percentage of queries for which a relevant document is among the first 10 returned. This metric reflects the embedding model's ability to get relevant documents into a short candidate list for downstream tasks.

Hit Rate is straightforward and interpretable, making it popular for benchmarking. However, it lacks granularity beyond presence or absence of relevant items. It treats a relevant item at rank 1 the same as at rank 10 and ignores how many relevant items are retrieved.
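The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name hit_rate_at_k, the dictionary-based input shape, and the toy document ids are all assumptions made for the example.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    if not retrieved:
        return 0.0
    hits = 0
    for query_id, ranked_ids in retrieved.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-k ids is labeled relevant.
        # The position inside the top k does not matter.
        if any(doc_id in gold for doc_id in ranked_ids[:k]):
            hits += 1
    return hits / len(retrieved)


# Hypothetical toy data: two queries, three retrieved ids each.
retrieved = {
    "q1": ["d3", "d7", "d1"],  # relevant doc d1 sits at rank 3
    "q2": ["d9", "d4", "d2"],  # relevant doc d5 was never retrieved
}
relevant = {"q1": {"d1"}, "q2": {"d5"}}

print(hit_rate_at_k(retrieved, relevant, k=3))  # 0.5
print(hit_rate_at_k(retrieved, relevant, k=2))  # 0.0
```

Note that q1 scores the same at K=3 whether d1 is ranked first or third, which is the lack of granularity described above.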

Enterprises commonly track Hit Rate@5 or Hit Rate@10 in production scenarios. In 2023, Gartner observed that 73% of evaluated RAG proofs of concept reported Hit Rate as a baseline metric.

2. Mean Reciprocal Rank (MRR): Accounting for Rank Position

Mean Reciprocal Rank (MRR) addresses some limitations of Hit Rate by accounting for the position of the first relevant document in the results. For each query, the reciprocal rank is 1 divided by the rank of the first relevant item (e.g., 1 if rank 1, 0.1 if rank 10), and by convention 0 if no relevant item is retrieved. MRR averages this reciprocal rank across all queries.
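A short sketch of that calculation follows. The function names, input shape, and toy data are assumptions for illustration; queries with no relevant item retrieved are scored 0, which is a common convention rather than something the metric's name dictates.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], gold: Set[str]) -> float:
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
) -> float:
    """Average of the per-query reciprocal ranks."""
    if not retrieved:
        return 0.0
    total = sum(
        reciprocal_rank(ranked_ids, relevant.get(query_id, set()))
        for query_id, ranked_ids in retrieved.items()
    )
    return total / len(retrieved)


# Hypothetical toy data: first relevant doc at rank 1, rank 2, and missing.
retrieved = {
    "q1": ["d1", "d2"],
    "q2": ["d4", "d5"],
    "q3": ["d7", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d9"}}

# Per-query reciprocal ranks are 1.0, 0.5, and 0.0.
print(mean_reciprocal_rank(retrieved, relevant))  # 0.5
```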

MRR rewards embedding models that not only retrieve relevant documents but rank them higher. This is crucial for user-facing systems where top results drive user experience. MRR is relatively simple to compute and widely adopted in academic benchmarks such as MS MARCO (v2).

A limitation is that MRR focuses only on the first relevant item, ignoring the presence and ranks of additional relevant documents. For use cases requiring multiple relevant results, other metrics might be better suited.

3. Normalized Discounted Cumulative Gain (NDCG): Weighted Relevance Across Ranks

Normalized Discounted Cumulative Gain (NDCG) measures the usefulness of a ranked list based on the graded relevance of retrieved items, discounted by their position. Each item's relevance score is divided by a logarithmic discount, commonly log2(rank + 1), so items contribute less the lower they appear in the ranking, and the discounted scores are summed to give DCG. The final score is DCG divided by the DCG of an ideal ranking, so it ranges from 0 to 1.
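The computation can be sketched as follows, assuming the common log2(rank + 1) discount and using the raw relevance grade as the gain (some implementations use 2^grade - 1 instead). The function names and the 0 to 3 grading scale are assumptions for the example. The ideal ranking is built from all judged documents for the query, not only the retrieved ones, so a missed relevant document lowers the score.

```python
import math
from typing import List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Sum of the top-k gains, each divided by log2(rank + 1), ranks from 1."""
    return sum(
        gain / math.log2(rank + 1)
        for rank, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(ranked_gains: List[float], judged_gains: List[float], k: int) -> float:
    """DCG of the list as ranked, divided by the DCG of the ideal ordering.

    ranked_gains: relevance grades of the retrieved docs, in ranked order.
    judged_gains: grades of every judged doc for the query, in any order.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical query with three judged docs graded 3, 2, and 1.
# The system returned the grade-3 doc, an irrelevant doc, then the grade-2 doc.
ranked = [3, 0, 2]
judged = [3, 2, 1]

print(round(ndcg_at_k(ranked, judged, k=3), 2))   # about 0.84
print(ndcg_at_k([3, 2, 1], judged, k=3))          # 1.0 for the ideal order
```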

NDCG is more comprehensive than Hit Rate or MRR. It accounts for multiple relevant items, their varied importance, and their exact positions. This makes it especially useful for enterprise RAG systems that integrate user feedback or graded relevance labels.

NDCG can be computed from binary labels, but its advantage over Hit Rate and MRR depends on relevance judgments with graded scores, which add annotation overhead. It is also harder to interpret than Hit Rate or MRR.

4. Choosing the Right Metric for Your Evaluation

For initial embedding model screening, Hit Rate@K offers clarity and simplicity. It is especially suited for tasks where finding any relevant document quickly is critical.

MRR is appropriate when the position of the first relevant document directly affects user experience or downstream generation quality. For example, answering short, single-turn questions benefits from maximizing MRR.

NDCG should be used when the application demands nuanced relevance ordering and multiple results contribute value, such as multi-document synthesis or conversational history retrieval.

In practice, many evaluation teams apply multiple metrics in parallel to gain a fuller picture of embedding effectiveness. Benchmark suites such as BEIR report rank-based metrics like NDCG@K as standard outputs, and many retrieval datasets are distributed through the Hugging Face Hub.
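As a rough sketch of what a side-by-side report might look like, the function below computes all three metrics in one pass over the same query set. The name evaluate, the input shape (per-query relevance grades in ranked order, with 0 meaning irrelevant), and the toy data are assumptions for illustration; MRR is truncated at K here, and NDCG assumes a log2(rank + 1) discount.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Gains discounted by log2(rank + 1), ranks starting at 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def evaluate(
    runs: Dict[str, List[int]],
    judged: Dict[str, List[int]],
    k: int,
) -> Dict[str, float]:
    """Hit Rate@k, MRR@k, and NDCG@k averaged over all queries in runs.

    runs:   grades of the retrieved docs per query, in ranked order.
    judged: grades of every judged doc per query, in any order.
    """
    hit = mrr = ndcg = 0.0
    for query_id, grades in runs.items():
        top = grades[:k]
        # Rank of the first doc with a non-zero grade, if any.
        first = next((r for r, g in enumerate(top, start=1) if g > 0), None)
        if first is not None:
            hit += 1.0
            mrr += 1.0 / first
        ideal = dcg(sorted(judged.get(query_id, []), reverse=True)[:k])
        if ideal > 0:
            ndcg += dcg(top) / ideal
    n = len(runs) or 1  # avoid division by zero on an empty run
    return {"hit_rate": hit / n, "mrr": mrr / n, "ndcg": ndcg / n}


# Hypothetical toy run: q1 finds both relevant docs but ranks them 2nd and 3rd;
# q2 retrieves nothing relevant.
runs = {"q1": [0, 2, 1], "q2": [0, 0, 0]}
judged = {"q1": [2, 1], "q2": [1]}

report = evaluate(runs, judged, k=3)
print(report)
```

On this toy data the three numbers diverge (Hit Rate 0.5, MRR 0.25, NDCG roughly 0.33), which illustrates why a single metric can hide ranking problems.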

5. Common Pitfalls and Best Practices

Ensure the relevance judgments you label reflect what counts as relevant in your use case. Binary labels suffice for Hit Rate and MRR. NDCG accepts binary labels too, but it needs graded labels to distinguish degrees of relevance.
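The effect of label granularity on NDCG can be seen in a small example. This sketch assumes a log2(rank + 1) discount, a hypothetical 0 to 3 grading scale, and that every judged document was retrieved, so the ideal ordering is just a sort of the same list.

```python
import math
from typing import List


def dcg(gains: List[float]) -> float:
    """Gains discounted by log2(rank + 1), ranks starting at 1."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains: List[float]) -> float:
    """NDCG when all judged docs are in the list (simplifying assumption)."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


def binarize(gains: List[float]) -> List[float]:
    """Collapse graded labels to relevant (1) or irrelevant (0)."""
    return [1.0 if g > 0 else 0.0 for g in gains]


# Two rankings of the same two docs: one highly relevant (3), one marginal (1).
best_first = [3, 1]
best_last = [1, 3]

# With graded labels, putting the marginal doc first is penalized.
print(ndcg(best_first), ndcg(best_last))

# With binary labels, both orderings look perfect.
print(ndcg(binarize(best_first)), ndcg(binarize(best_last)))
```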

Avoid tuning embeddings solely to optimize one metric, as it may degrade other aspects of retrieval quality. Balanced evaluation across Hit Rate, MRR, and NDCG is advisable.

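Consider the retrieval corpus size and diversity. Scores on all three metrics tend to fall as the corpus grows, because more distractor documents compete for the top K positions. NDCG's normalization adjusts for how many relevant items each query has, not for corpus size, so compare scores only across runs that use the same corpus and query set.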

Checklist for Embedding Quality Evaluation

Define relevant documents and their labeling schema clearly before evaluation.
Select Hit Rate@K to measure recall of relevant results in top candidates.
Use MRR to prioritize ranking of the first relevant document.
Apply NDCG when multiple graded relevance labels exist and ranking quality matters beyond the first result.
Combine multiple metrics to cover different facets of embedding quality.
Regularly validate metrics’ alignment with end-user or application KPIs.
Be cautious of metric sensitivity to corpus size and query distribution.