The MultimodalContextualRecallMetric evaluates the recall of contextual information in multimodal test cases. It measures the extent to which the retrieval context contains the information needed to reproduce the expected output, serving as a proxy for retriever quality in pipelines that generate text from multimodal inputs.
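If the metric follows the standard deepeval-style contextual recall computation (an assumption, not confirmed by this page), the score is the fraction of statements in the expected output that an LLM judge can attribute to the retrieval context. A minimal sketch of that scoring step:

```python
# Sketch of a deepeval-style contextual recall score (assumed, not taken
# from agensight's source): an LLM judge marks each statement in the
# expected output as attributable to the retrieval context or not, and
# the score is the attributable fraction.
def contextual_recall_score(verdicts: list[bool]) -> float:
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# e.g. 3 of 4 expected-output statements found in the retrieval context
print(contextual_recall_score([True, True, True, False]))  # 0.75
```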
threshold: A float representing the minimum passing threshold. Defaulted to 0.5.
model: A string specifying which of OpenAI's GPT models to use, or any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
include_reason: A boolean which, when set to True, includes a reason for its evaluation score. Defaulted to True.
strict_mode: A boolean which, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
async_mode: A boolean which, when set to True, enables concurrent execution within the measure() method. Defaulted to True.
verbose_mode: A boolean which, when set to True, prints the intermediate steps used to calculate the metric to the console. Defaulted to False.
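Taken together, these parameters can be passed at construction time. A minimal sketch using the defaults documented above (the values shown are illustrative):

```python
from agensight.eval.metrics import MultimodalContextualRecallMetric

metric = MultimodalContextualRecallMetric(
    model="gpt-4o",        # OpenAI model name or a DeepEvalBaseLLM instance
    threshold=0.5,         # minimum passing score
    include_reason=True,   # attach an explanation to the score
    strict_mode=False,     # binary 1/0 scoring when True (threshold forced to 1)
    async_mode=True,       # run intermediate steps concurrently in measure()
    verbose_mode=False,    # print intermediate steps to the console
)
```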
```python
from agensight.eval.metrics import MultimodalContextualRecallMetric
from agensight.eval.test_case import MLLMTestCase, MLLMImage

# Test-case components: contextual recall compares the expected output
# against the retrieval context.
input_data = ["Describe the image."]
actual_output = ["A cat is sitting on a windowsill."]
expected_output = ["A cat is sitting on a windowsill, looking outside."]
retrieval_context = ["A photo of a cat on a windowsill."]

metric = MultimodalContextualRecallMetric(model="gpt-4o", threshold=0.5)
test_case = MLLMTestCase(
    input=input_data,
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context,
)

metric.measure(test_case)
print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")
```
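Since the test case is multimodal, the input and retrieval_context lists can also carry images via MLLMImage. A sketch, assuming MLLMImage accepts a url keyword as in the deepeval-style API (the URL below is a placeholder):

```python
from agensight.eval.metrics import MultimodalContextualRecallMetric
from agensight.eval.test_case import MLLMTestCase, MLLMImage

# Hypothetical: interleave text and an image in the input list.
# MLLMImage(url=...) is assumed to match the deepeval-style signature,
# and the URL is a placeholder.
image_test_case = MLLMTestCase(
    input=["Describe the image.", MLLMImage(url="https://example.com/cat.jpg")],
    actual_output=["A cat is sitting on a windowsill."],
    expected_output=["A cat is sitting on a windowsill, looking outside."],
    retrieval_context=["A photo of a cat on a windowsill."],
)

metric = MultimodalContextualRecallMetric(model="gpt-4o", threshold=0.5)
metric.measure(image_test_case)
```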