Multimodal Tool Correctness
The MultimodalToolCorrectnessMetric is a multimodal metric that evaluates the correctness of tool usage during output generation. It compares the tools a model actually called against the tools it was expected to call, serving as a proxy for the performance of models that rely on tool calls to generate text or images.
Required Arguments
- input: A list containing the prompt or task description.
- actual_output: A list containing the generated output or description.
- tools_called: A list of ToolCall instances representing the tools that were actually called. Each ToolCall requires:
  - name: The name of the tool called.
- expected_tools: A list of ToolCall instances representing the tools that were expected to be called. Each ToolCall requires:
  - name: The name of the expected tool.
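A minimal sketch of how these required arguments fit together, assuming the usual DeepEval package layout where MLLMTestCase and ToolCall live in deepeval.test_case and the metric in deepeval.metrics (the prompt and tool name "ImageGenerator" are illustrative placeholders, not part of the library):

```python
from deepeval.test_case import MLLMTestCase, ToolCall
from deepeval.metrics import MultimodalToolCorrectnessMetric

# Build a multimodal test case with the four required arguments.
test_case = MLLMTestCase(
    input=["Generate an image of a red bicycle."],
    actual_output=["Here is the generated image of a red bicycle."],
    # Tools the model actually invoked during generation.
    tools_called=[ToolCall(name="ImageGenerator")],
    # Tools the model was expected to invoke.
    expected_tools=[ToolCall(name="ImageGenerator")],
)

metric = MultimodalToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score)   # float between 0 and 1
print(metric.reason)  # populated because include_reason defaults to True
```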
Optional Arguments
- threshold: A float representing the minimum passing threshold. Defaulted to 0.5.
- model: A string specifying which of OpenAI's GPT models to use, or any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4o'.
- include_reason: A boolean which, when set to True, includes a reason for its evaluation score. Defaulted to True.
- strict_mode: A boolean which, when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
- async_mode: A boolean which, when set to True, enables concurrent execution within the measure() method. Defaulted to True.
- verbose_mode: A boolean which, when set to True, prints the intermediate steps used to calculate the metric to the console. Defaulted to False.
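Putting the optional arguments together, a configuration sketch (parameter names follow the list above; the chosen values are illustrative):

```python
from deepeval.metrics import MultimodalToolCorrectnessMetric

metric = MultimodalToolCorrectnessMetric(
    threshold=0.7,        # raise the passing bar from the 0.5 default
    model="gpt-4o",       # or any custom DeepEvalBaseLLM instance
    include_reason=True,  # attach a written justification to the score
    strict_mode=False,    # True forces a binary 0/1 score and a threshold of 1
    async_mode=True,      # evaluate concurrently inside measure()
    verbose_mode=False,   # True prints intermediate steps to the console
)
```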