The ContextualRecallMetric is a reference-based RAG evaluation metric designed to measure how completely your retrieval context covers the information needed to generate the ideal answer. It uses LLM-as-a-judge to assess whether statements in the expected_output are attributable to the retrieved context. This metric is especially useful for evaluating the completeness of a retriever in your RAG pipeline.
To use this metric, create an `LLMTestCase` instance with the following fields:

- `input`: The user's query.
- `actual_output`: The LLM-generated response (not used in the score computation, but required).
- `expected_output`: The ideal response used to extract ground-truth statements.
- `retrieval_context`: A list of strings representing retrieved context chunks.

The metric accepts the following arguments:

| Argument | Type | Description | Default |
|---|---|---|---|
| `threshold` | `float` | Minimum score to be considered a "pass". | `0.5` |
| `model` | `str` | LLM to use for evaluation (e.g., `'gpt-4o'`, or a custom DeepEval-compatible model). | `'gpt-4o'` |
| `include_reason` | `bool` | If `True`, includes an explanation for the evaluation score. | `True` |
| `strict_mode` | `bool` | Binary scoring mode: `1` for a full match, `0` otherwise. | `False` |
| `async_mode` | `bool` | Enables concurrent scoring for faster evaluations. | `True` |
| `verbose_mode` | `bool` | If `True`, logs detailed evaluation steps to the console. | `False` |
| `evaluation_template` | `ContextualRecallTemplate` | Optional custom prompt template class. | Default internal template |
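A minimal usage sketch is shown below. It assumes the standard `deepeval` imports; the query, outputs, and context chunks are placeholder strings for illustration only.

```python
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# Placeholder test case: swap in the real query, generation, ideal answer,
# and retrieved chunks from your own RAG pipeline.
test_case = LLMTestCase(
    input="What are the side effects of aspirin?",
    actual_output="Aspirin can cause stomach upset and, rarely, bleeding.",
    expected_output=(
        "Common side effects of aspirin include stomach upset, heartburn, "
        "and an increased risk of bleeding."
    ),
    retrieval_context=[
        "Aspirin commonly causes gastrointestinal upset and heartburn.",
        "Because aspirin inhibits platelet aggregation, it raises bleeding risk.",
    ],
)

# Arguments mirror the table above; all have sensible defaults.
metric = ContextualRecallMetric(
    threshold=0.5,
    model="gpt-4o",
    include_reason=True,
)

metric.measure(test_case)
print(metric.score)   # float in [0, 1]
print(metric.reason)  # judge's explanation (when include_reason=True)
```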
The final score reflects the proportion of ground-truth statements in the `expected_output` that can be attributed to the `retrieval_context`.
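Conceptually, the score reduces to a simple ratio; the sketch below illustrates only that arithmetic. The statement extraction and attribution judgments themselves are made by the evaluation LLM, not by this code.

```python
def contextual_recall_score(attributable: int, total: int) -> float:
    """Illustrative only: fraction of expected_output statements the judge
    found attributable to the retrieval_context."""
    return attributable / total if total else 0.0

# e.g. 3 of 4 ground-truth statements supported by the retrieved chunks -> 0.75
print(contextual_recall_score(3, 4))
```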