Required Arguments

Each test case should be a ModelTestCase instance with the following fields:

  • input: The original user query.
  • actual_output: The LLM-generated response based on retrieved context.
  • expected_output: The ideal response based on the context.
  • retrieval_context: A list of strings representing the retrieved context chunks used by the LLM.

Optional Arguments

| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| threshold | float | Minimum score to be considered a “pass”. | 0.5 |
| model | str | The LLM to use for evaluation (e.g. 'gpt-4o', or a custom LLM). | 'gpt-4o' |
| include_reason | bool | If True, includes the reasoning behind the metric score. | True |
| strict_mode | bool | Enforces a binary score (1 for perfect relevance order, 0 otherwise). | False |
| async_mode | bool | Enables concurrent processing for faster evaluation. | True |
| verbose_mode | bool | Logs intermediate steps to the console. | False |
| evaluation_template | ContextualPrecisionTemplate | Optional custom prompt template class for model evaluation. | Default internal template |
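
All of these are passed as keyword arguments when constructing the metric. A minimal sketch that spells out the documented options with their defaults (evaluation_template is omitted here, since it requires a custom ContextualPrecisionTemplate subclass):

from agensight.eval.metrics import ContextualPrecisionMetric

# Sketch: the documented optional arguments with their default values made explicit.
metric = ContextualPrecisionMetric(
    threshold=0.5,        # minimum score that counts as a "pass"
    model="gpt-4o",       # evaluation LLM (a custom LLM object is also accepted)
    include_reason=True,  # attach a natural-language reason to the score
    strict_mode=False,    # True forces a binary 1/0 score
    async_mode=True,      # evaluate concurrently for speed
    verbose_mode=False,   # True logs intermediate steps to the console
)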

Usage Example


from agensight.eval.test_case import ModelTestCase
from agensight.eval.metrics import ContextualPrecisionMetric

# Example: Refund policy case
test_case = ModelTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    expected_output="You are eligible for a 30 day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."]
)

# Initialize the metric
metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4o",
    include_reason=True
)

# Run the evaluation
metric.measure(test_case)
print(metric.score)   # e.g. 1.0
print(metric.reason)  # e.g. "The context was directly relevant and well-ranked."
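
Because include_reason=True, metric.reason carries a short explanation of the ranking judgment. A run counts as a pass when the score meets the configured threshold; a minimal check, assuming the metric exposes the threshold it was constructed with:

# Pass/fail check against the threshold configured above (0.7);
# the `threshold` attribute is assumed to mirror the constructor argument.
print("PASS" if metric.score >= metric.threshold else "FAIL")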

How It Works

The metric calculates a weighted contextual precision score based on:

  • Whether each node in retrieval_context is relevant to the input and expected output.
  • The ranking of relevant nodes: relevant nodes that appear earlier in retrieval_context raise the score (see the sketch below).
  • LLM-judged relevance, which keeps the evaluation aligned with human judgment.
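
The exact prompts and weighting are internal to the metric, but weighted contextual precision is commonly computed from the per-node relevance verdicts as sketched below; the function name and the binary-verdict input are illustrative assumptions, not the library's API:

from typing import List

def weighted_contextual_precision(verdicts: List[bool]) -> float:
    """Sketch of a weighted precision over ranked retrieval_context nodes.

    verdicts[k] is True when the node at rank k+1 was judged relevant to the
    input and expected_output. Relevant nodes ranked earlier contribute a
    larger precision@k term, so ordering matters as much as relevance.
    """
    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            weighted_sum += relevant_so_far / rank  # precision@k, counted only at relevant ranks
    return weighted_sum / relevant_so_far if relevant_so_far else 0.0

# Same relevant nodes, different order:
print(weighted_contextual_precision([True, False, True]))   # (1/1 + 2/3) / 2 ≈ 0.83
print(weighted_contextual_precision([False, True, True]))   # (1/2 + 2/3) / 2 ≈ 0.58

With strict_mode=True, this is collapsed to a binary outcome: 1 only when every relevant node is ranked above every irrelevant one, and 0 otherwise.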