The MultimodalToolCorrectnessMetric evaluates whether the correct tools were used to generate a multimodal output. It compares the tools a model actually called against the tools it was expected to call, serving as a proxy for the performance of models that rely on tool calls to produce text or images.
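
Intuitively, the score reflects how well the tools that were called match the tools that were expected. As a hedged illustration (not necessarily the library's exact formula), an overlap-based score for two expected tools of which only one was called would be:

expected = {"draw_image", "describe_image"}    # tools the task requires
called = {"draw_image"}                        # tools the model actually used
score = len(expected & called) / len(expected) # 0.5 in this illustration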

Required Arguments

  • input: A list containing the prompt or task description.

  • actual_output: A list containing the generated output or description.

  • tools_called: A list of ToolCall instances representing the tools that were actually called. Each ToolCall requires:

      • name: The name of the tool that was called.

  • expected_tools: A list of ToolCall instances representing the tools that were expected to be called. Each ToolCall requires:

      • name: The name of the expected tool.

Optional Arguments

  • threshold: A float representing the minimum passing threshold. Defaulted to 0.5.

  • model: A string specifying which of OpenAI’s GPT models to use, or any custom LLM of type DeepEvalBaseLLM. Defaulted to ‘gpt-4o’.

  • include_reason: A boolean which, when set to True, includes a reason for its evaluation score. Defaulted to True.

  • strict_mode: A boolean which, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.

  • async_mode: A boolean which, when set to True, enables concurrent execution within the measure() method. Defaulted to True.

  • verbose_mode: A boolean which, when set to True, prints the intermediate steps used to calculate the metric to the console. Defaulted to False.
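
The sketch below constructs the metric with non-default options, using only the optional arguments documented above:

from agensight.eval.metrics import MultimodalToolCorrectnessMetric

metric = MultimodalToolCorrectnessMetric(
    threshold=0.7,        # require a score of at least 0.7 to pass
    model="gpt-4o",       # OpenAI model used for evaluation
    include_reason=True,  # attach a written justification to the score
    strict_mode=False,    # keep graded (non-binary) scoring
    verbose_mode=True,    # print intermediate steps to the console
)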

Usage Example

from agensight.eval.metrics import MultimodalToolCorrectnessMetric
from agensight.eval.test_case import ModelTestCase, ToolCall

# The task, the model's reply, and the tool calls to compare.
input_data = ["Draw a cat and then describe it."]
actual_output = ["Image generated and described."]
tools_called = [ToolCall(name="draw_image"), ToolCall(name="describe_image")]
expected_tools = [ToolCall(name="draw_image"), ToolCall(name="describe_image")]

metric = MultimodalToolCorrectnessMetric()
test_case = ModelTestCase(
    input=input_data,
    actual_output=actual_output,
    tools_called=tools_called,
    expected_tools=expected_tools,
)

metric.measure(test_case)
print(f"Score: {metric.score}")
print(f"Reason: {metric.reason}")