G-Eval is a versatile evaluation framework that uses large language models (LLMs) as judges to assess outputs against custom criteria. It follows a chain-of-thought (CoT) approach to evaluate outputs with human-like accuracy, making it suitable for a wide range of use cases.
evaluation_steps: A list of strings outlining the exact steps the LLM should take for evaluation. If not provided, G-Eval will generate steps based on the criteria.
rubric: A list of rubrics used to constrain the range of the final metric score.
model: The specific model to use for evaluation, defaulting to gpt-4o.
strict_mode: Enforces a binary metric score: 1 for perfection, 0 otherwise.
async_mode: Enables concurrent execution within the measure() method.
verbose_mode: Prints intermediate steps used to calculate the metric.
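Taken together, these options can be passed as keyword arguments when constructing the metric. The sketch below is illustrative only: the metric name and criteria are made up, and the keyword arguments assume the constructor mirrors the parameter list above, so the exact signature may differ in your version of agensight.

from agensight.eval.metrics import GEvalEvaluator

# Illustrative sketch only: "Answer Relevance" and its criteria are hypothetical.
relevance_metric = GEvalEvaluator(
    name="Answer Relevance",
    criteria="Evaluate whether the answer directly addresses the question that was asked.",
    model="gpt-4o",       # evaluation model (the default)
    threshold=0.7,        # minimum score required to pass
    strict_mode=False,    # set to True for a binary 1/0 score
    async_mode=True,      # allow concurrent execution inside measure()
    verbose_mode=True     # print the intermediate evaluation steps
)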
Here’s how you can use the GEvalEvaluator in your evaluation system:
from agensight.eval.metrics import GEvalEvaluator
from agensight.eval.test_case import ModelTestCase

# Define the metric
correctness_metric = GEvalEvaluator(
    name="Code Correctness",
    criteria="Evaluate whether the generated code correctly implements the specified requirements.",
    threshold=0.8
)

# Create a test case
test_case = ModelTestCase(
    input="Write a function to add two numbers.",
    actual_output="def add(a, b): return a + b",
    expected_output="A function that correctly adds two numbers."
)

# Run the evaluation
correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)
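If you have several outputs to grade, the same metric instance can be applied to each test case in turn. This is a minimal sketch, assuming measure() can be called repeatedly on one GEvalEvaluator and that score and reason always reflect the most recent call; the additional test cases are hypothetical.

# Hypothetical test cases for illustration; only the measure(), score, and reason
# usage from the example above is relied on here.
test_cases = [
    ModelTestCase(
        input="Write a function to multiply two numbers.",
        actual_output="def multiply(a, b): return a * b",
        expected_output="A function that correctly multiplies two numbers."
    ),
    ModelTestCase(
        input="Write a function to subtract two numbers.",
        actual_output="def subtract(a, b): return a - b",
        expected_output="A function that correctly subtracts two numbers."
    ),
]

for case in test_cases:
    correctness_metric.measure(case)
    print(correctness_metric.score, correctness_metric.reason)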
If you want more control over the evaluation process, you can provide evaluation_steps:
correctness_metric = GEvalEvaluator(
    name="Code Correctness",
    evaluation_steps=[
        "Check whether the code correctly adds two numbers.",
        "Ensure the function handles edge cases like negative numbers.",
        "Verify that the function is well-documented."
    ],
    threshold=0.8
)
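Because explicit evaluation_steps are provided here, G-Eval uses them directly instead of generating steps from the criteria. You can then run the metric exactly as before, for example against the test case defined above:

correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)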