The TaskCompletionMetric evaluates how effectively an AI model completes a specified task. It assesses how well the model's output, together with any tools called along the way, satisfies the task's requirements. Use it when end-to-end task success, rather than output quality alone, is what you need to measure.

Required Arguments

To use the TaskCompletionMetric, you need to provide the following arguments when creating a ModelTestCase:

  • input: The task or goal the user wants the model to perform.
  • actual_output: The output generated by the model.
  • tools_called: The tools or actions the model used to accomplish the task.

Optional Parameters

  • threshold: A float representing the minimum passing score. Defaults to 0.5.
  • model: A string specifying which model to use for evaluation. Defaults to "gpt-4o-mini".
  • include_reason: A boolean indicating whether to include a reason alongside the evaluation score. Defaults to True.
  • strict_mode: Enforces a binary metric score: 1 for perfection, 0 otherwise. Overrides the current threshold and sets it to 1. Defaults to False.
  • async_mode: Enables concurrent execution within the measure() method. Defaults to True.
  • verbose_mode: Prints the intermediate steps used to calculate the metric to the console. Defaults to False.
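
Several of these flags change how the metric behaves at runtime rather than what it measures. As a rough sketch (assuming the constructor accepts these options as keyword arguments, just as in the usage example below), a strict, verbose, synchronous configuration might look like this:

from agensight.eval.metrics import TaskCompletionMetric

# Sketch of a stricter configuration using the documented optional parameters.
# strict_mode forces binary scoring and overrides the threshold to 1,
# verbose_mode prints intermediate steps to the console, and
# async_mode=False makes measure() run synchronously.
strict_metric = TaskCompletionMetric(
    model="gpt-4o-mini",
    strict_mode=True,
    verbose_mode=True,
    async_mode=False
)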

Usage Example

Here’s how you can use the TaskCompletionMetric in your evaluation system:

from agensight.eval.metrics import TaskCompletionMetric
from agensight.eval.test_case import ModelTestCase, ToolCall

# Define the metric
task_metric = TaskCompletionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

# Create a test case
test_case = ModelTestCase(
    input="Develop a Python script to automate data entry tasks.",
    actual_output="The script automates data entry using pandas and openpyxl.",
    tools_called=[
        ToolCall(
            name="DataEntryBot",
            description="Automates data entry tasks using Python libraries.",
            input_parameters={"library": "pandas", "task": "data entry"},
            output=["Data entry automated using pandas and openpyxl."]
        )
    ]
)

# Run the evaluation
task_metric.measure(test_case)
print(task_metric.score, task_metric.reason)
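
If you need a simple pass/fail decision from this run, one option is to compare the score against the threshold you configured (0.7 above). This is only a sketch and assumes score is populated as a float once measure() returns, as in the print statement above:

# Sketch: treat the run as passing when the score meets the configured
# threshold (0.7 in the example above).
if task_metric.score is not None and task_metric.score >= 0.7:
    print("Passed:", task_metric.reason)
else:
    print("Failed:", task_metric.reason)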