The ToolCorrectnessMetric evaluates the accuracy of an AI model's tool usage: it checks whether the tools the model called match the expected tools, and can also compare input parameters and outputs. This metric is crucial for ensuring that models invoke the correct tools to achieve the desired outcome.
threshold: A float representing the minimum passing threshold. Defaults to 0.5.
evaluation_params: A list of ToolCallParams controlling the strictness of the correctness criteria. Options include ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT.
include_reason: A boolean indicating whether to include a reason for the evaluation score. Defaults to True.
strict_mode: Enforces a binary metric score: 1 for a perfect match, 0 otherwise. Overrides the current threshold and sets it to 1. Defaults to False.
verbose_mode: Prints the intermediate steps used to calculate the metric to the console. Defaults to False.
should_consider_ordering: Takes into account the order in which tools were called. Defaults to False.
should_exact_match: Requires tools_called and expected_tools to match exactly. Defaults to False. The stricter options above are illustrated in the sketch that follows this list.
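For stricter evaluations, the flags above can be combined. The following is a minimal configuration sketch; the keyword arguments follow the parameter names in the list, but verify them against the agensight API, and treat the values as illustrative only:

```python
from agensight.eval.metrics import ToolCorrectnessMetric

# A configuration sketch using the parameters described above.
strict_tool_metric = ToolCorrectnessMetric(
    strict_mode=True,               # binary score: 1 for a perfect match, 0 otherwise
    should_consider_ordering=True,  # tools must be called in the expected order
    should_exact_match=True,        # tools_called must equal expected_tools exactly
    verbose_mode=True,              # print intermediate calculation steps
)
```

Note that with strict_mode=True, any threshold you pass is overridden and set to 1.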
Here’s how you can use the ToolCorrectnessMetric in your evaluation system:
```python
from agensight.eval.metrics import ToolCorrectnessMetric
from agensight.eval.test_case import ModelTestCase, ToolCall

# Define the metric
tool_metric = ToolCorrectnessMetric(
    threshold=0.7,
    include_reason=True
)

# Create a test case
test_case = ModelTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    tools_called=[
        ToolCall(name="WebSearch"),
        ToolCall(name="ToolQuery")
    ],
    expected_tools=[
        ToolCall(name="WebSearch")
    ]
)

# Run the evaluation
tool_metric.measure(test_case)
print(tool_metric.score, tool_metric.reason)
```
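When evaluation_params includes ToolCallParams.INPUT_PARAMETERS, the comparison presumably extends beyond tool names to the arguments each tool was called with. Below is a sketch under that assumption; the import path for ToolCallParams and the input_parameters field on ToolCall are assumptions, so check both against the agensight API:

```python
from agensight.eval.metrics import ToolCorrectnessMetric, ToolCallParams
from agensight.eval.test_case import ModelTestCase, ToolCall

# Assumption: ToolCallParams is importable as shown and ToolCall
# accepts an input_parameters dict; verify before relying on this.
param_metric = ToolCorrectnessMetric(
    evaluation_params=[ToolCallParams.INPUT_PARAMETERS],
)

test_case = ModelTestCase(
    input="What's the weather in Paris?",
    actual_output="It is currently 18°C and sunny in Paris.",
    tools_called=[
        ToolCall(name="WeatherAPI", input_parameters={"city": "Paris"})
    ],
    expected_tools=[
        ToolCall(name="WeatherAPI", input_parameters={"city": "Paris"})
    ],
)

param_metric.measure(test_case)
print(param_metric.score, param_metric.reason)
```

Here the metric would pass only if both the tool name and its arguments line up with the expectation, which is useful when a model can call the right tool with the wrong inputs.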