The ToolCorrectnessMetric evaluates the accuracy of an AI model's tool usage: it checks whether the tools the model called match the expected tools, and can also compare input parameters and outputs. This metric is crucial for ensuring that models invoke the correct tools to achieve the desired outcome.
threshold: A float representing the minimum passing threshold. Defaults to 0.5.
evaluation_params: A list of ToolCallParams controlling the strictness of the correctness criteria. Options include ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT.
include_reason: A boolean indicating whether to include a reason for the evaluation score. Defaults to True.
strict_mode: Enforces a binary metric score: 1 for a perfect match, 0 otherwise. Overrides the current threshold and sets it to 1. Defaults to False.
verbose_mode: Prints the intermediate steps used to calculate the metric to the console. Defaults to False.
should_consider_ordering: Takes into account the order in which tools were called. Defaults to False.
should_exact_match: Requires tools_called and expected_tools to match exactly. Defaults to False. The stricter options above are illustrated in the sketch that follows this list.
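For stricter evaluations, the flags above can be combined. The following is a minimal configuration sketch; the keyword arguments follow the parameter names in the list, but verify them against the agensight API, and treat the values as illustrative only:

```python
from agensight.eval.metrics import ToolCorrectnessMetric

# A configuration sketch using the parameters described above.
strict_tool_metric = ToolCorrectnessMetric(
    strict_mode=True,               # binary score: 1 for a perfect match, 0 otherwise
    should_consider_ordering=True,  # tools must be called in the expected order
    should_exact_match=True,        # tools_called must equal expected_tools exactly
    verbose_mode=True,              # print intermediate calculation steps
)
```

Note that with strict_mode=True, any threshold you pass is overridden and set to 1.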
Here’s how you can use the ToolCorrectnessMetric in your evaluation system:
```python
from agensight.eval.metrics import ToolCorrectnessMetric
from agensight.eval.test_case import ModelTestCase, ToolCall

# Define the metric
tool_metric = ToolCorrectnessMetric(
    threshold=0.7,
    include_reason=True
)

# Create a test case
test_case = ModelTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    tools_called=[
        ToolCall(name="WebSearch"),
        ToolCall(name="ToolQuery")
    ],
    expected_tools=[
        ToolCall(name="WebSearch")
    ]
)

# Run the evaluation
tool_metric.measure(test_case)
print(tool_metric.score, tool_metric.reason)
```
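When evaluation_params includes ToolCallParams.INPUT_PARAMETERS, the comparison presumably extends beyond tool names to the arguments each tool was called with. Below is a sketch under that assumption; the import path for ToolCallParams and the input_parameters field on ToolCall are assumptions, so check both against the agensight API:

```python
from agensight.eval.metrics import ToolCorrectnessMetric, ToolCallParams
from agensight.eval.test_case import ModelTestCase, ToolCall

# Assumption: ToolCallParams is importable as shown and ToolCall
# accepts an input_parameters dict; verify before relying on this.
param_metric = ToolCorrectnessMetric(
    evaluation_params=[ToolCallParams.INPUT_PARAMETERS],
)

test_case = ModelTestCase(
    input="What's the weather in Paris?",
    actual_output="It is currently 18°C and sunny in Paris.",
    tools_called=[
        ToolCall(name="WeatherAPI", input_parameters={"city": "Paris"})
    ],
    expected_tools=[
        ToolCall(name="WeatherAPI", input_parameters={"city": "Paris"})
    ],
)

param_metric.measure(test_case)
print(param_metric.score, param_metric.reason)
```

Here the metric would pass only if both the tool name and its arguments line up with the expectation, which is useful when a model can call the right tool with the wrong inputs.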