LLM Evaluations
Tool Correctness
The ToolCorrectnessMetric evaluates the accuracy of an AI model's tool usage. It checks whether the tools the model called match the expected tools, optionally also comparing input parameters and outputs. This makes it useful for verifying that a model invokes the right tools to accomplish a given task.
Required Arguments
To use the ToolCorrectnessMetric, provide the following arguments when creating a ModelTestCase (a minimal sketch follows the list):
- input: The task or goal the user wants the model to perform.
- actual_output: The output generated by the model.
- tools_called: The tools or actions the model used to accomplish the task.
- expected_tools: The tools the model is expected to use to accomplish the task.
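Below is a minimal sketch of a test case with only the required arguments. The import path and the ToolCall wrapper used to represent individual tool invocations are assumptions; substitute whatever your evaluation library actually provides.

```python
# Hypothetical import path -- adjust to match your evaluation library.
from my_evals.test_case import ModelTestCase, ToolCall

test_case = ModelTestCase(
    # The task the user asked the model to perform.
    input="Book a table for two at an Italian restaurant tonight.",
    # The output the model actually produced.
    actual_output="I've booked a table for two at Trattoria Roma at 7pm.",
    # The tools the model actually invoked while answering.
    tools_called=[ToolCall(name="restaurant_search"), ToolCall(name="make_reservation")],
    # The tools we expected the model to invoke.
    expected_tools=[ToolCall(name="restaurant_search"), ToolCall(name="make_reservation")],
)
```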
Optional Parameters
- threshold: A float specifying the minimum passing score. Defaults to 0.5.
- evaluation_params: A list of ToolCallParams specifying which attributes of each tool call, beyond the tool name, must match. Options include ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT.
- include_reason: A boolean indicating whether to include a reason alongside the evaluation score. Defaults to True.
- strict_mode: Enforces a binary metric score: 1 for a perfect match, 0 otherwise. Overrides the current threshold and sets it to 1. Defaults to False.
- verbose_mode: Prints the intermediate steps used to calculate the metric to the console. Defaults to False.
- should_consider_ordering: Takes into account the order in which the tools were called. Defaults to False.
- should_exact_match: Requires tools_called and expected_tools to be exactly the same. Defaults to False. See the configuration sketch below.
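As a sketch of how these parameters fit together when constructing the metric (import paths again assumed):

```python
# Hypothetical import paths -- adjust to match your evaluation library.
from my_evals.metrics import ToolCorrectnessMetric
from my_evals.test_case import ToolCallParams

metric = ToolCorrectnessMetric(
    threshold=0.8,  # raise the passing bar above the 0.5 default
    evaluation_params=[
        ToolCallParams.INPUT_PARAMETERS,  # require matching input parameters,
        ToolCallParams.OUTPUT,            # and matching outputs, not just names
    ],
    include_reason=True,            # attach an explanation to the score
    should_consider_ordering=True,  # tools must be called in the expected order
)
```

Note that setting strict_mode=True would override the threshold shown above and force it to 1, and should_exact_match=True is stricter still, requiring the two tool lists to be identical.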
Usage Example
Here’s how you can use the ToolCorrectnessMetric in your evaluation system:
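A minimal end-to-end sketch follows, assuming the class names above; the import paths, the measure call, and the score/reason attributes follow a common pattern for metrics like this one and may differ in your library.

```python
# Hypothetical import paths -- adjust to match your evaluation library.
from my_evals.metrics import ToolCorrectnessMetric
from my_evals.test_case import ModelTestCase, ToolCall

# Build the test case from a real interaction with your model.
test_case = ModelTestCase(
    input="What's the weather in Paris tomorrow?",
    actual_output="Tomorrow in Paris: 18°C and partly cloudy.",
    tools_called=[ToolCall(name="weather_api")],
    expected_tools=[ToolCall(name="weather_api")],
)

# With no evaluation_params, only the tool names are compared.
metric = ToolCorrectnessMetric(threshold=0.5, include_reason=True)
metric.measure(test_case)

print(metric.score)   # e.g. 1.0 when every expected tool was called
print(metric.reason)  # human-readable justification (include_reason=True)
```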