G-Eval is a versatile evaluation framework that uses large language models (LLMs) as judges to assess outputs against custom criteria. It applies a chain-of-thought (CoT) approach to evaluate outputs with human-like accuracy, making it suitable for a wide range of use cases.

Required Arguments

To use the GEvalEvaluator, you need to provide the following arguments:

  • name: A descriptive name for the metric.
  • criteria: A description outlining the specific evaluation aspects for each test case.
  • threshold: The minimum score required for a passing evaluation.

Optional Arguments

  • evaluation_steps: A list of strings outlining the exact steps the LLM should follow during evaluation. If not provided, G-Eval generates steps from the criteria.
  • rubric: A list of rubrics that constrain the range of the final metric score.
  • model: The specific model to use for evaluation, defaulting to ‘gpt-4o’.
  • strict_mode: Enforces a binary metric score: 1 for a perfect output, 0 otherwise.
  • async_mode: Enables concurrent execution within the measure() method.
  • verbose_mode: Prints the intermediate steps used to calculate the metric. The sketch below shows how these options fit together.
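As a minimal sketch, assuming the constructor accepts the keyword arguments listed above, the optional arguments can be combined like this:

from agensight.eval.metrics import GEvalEvaluator

# Sketch: combining the optional arguments listed above
correctness_metric = GEvalEvaluator(
    name="Code Correctness",
    criteria="Evaluate whether the generated code correctly implements the specified requirements.",
    threshold=0.8,
    model="gpt-4o",        # evaluation model (the documented default)
    strict_mode=False,     # set True to force a binary 0/1 score
    async_mode=True,       # allow concurrent execution inside measure()
    verbose_mode=True      # print intermediate evaluation steps
)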

Usage Example

Here’s how you can use the GEvalEvaluator in your evaluation system:

from agensight.eval.metrics import GEvalEvaluator
from agensight.eval.test_case import ModelTestCase

# Define the metric
correctness_metric = GEvalEvaluator(
    name="Code Correctness",
    criteria="Evaluate whether the generated code correctly implements the specified requirements.",
    threshold=0.8
)

# Create a test case
test_case = ModelTestCase(
    input="Write a function to add two numbers.",
    actual_output="def add(a, b): return a + b",
    expected_output="A function that correctly adds two numbers."
)

# Run the evaluation
correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)
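The evaluation passes when the score meets the threshold you configured. As a rough sketch, assuming the score is a float on a 0–1 scale and using the 0.8 threshold set above, you can check the outcome directly:

# Compare the score against the threshold configured above (0.8)
passed = correctness_metric.score >= 0.8
print(f"Score: {correctness_metric.score:.2f} | Passed: {passed}")
print(f"Reason: {correctness_metric.reason}")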

Evaluation Steps

If you want more control over the evaluation process, you can provide evaluation_steps:

correctness_metric = GEvalEvaluator(
    name="Code Correctness",
    criteria="Evaluate whether the generated code correctly implements the specified requirements.",
    evaluation_steps=[
        "Check whether the code correctly adds two numbers.",
        "Ensure the function handles edge cases like negative numbers.",
        "Verify that the function is well-documented."
    ],
    threshold=0.8
)
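The steps-based metric is used exactly like the earlier one; for example, you can re-run it against the same test case:

# Evaluate the same test case with the steps-based metric
correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)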

Best Practices

  • Define Clear Criteria: Ensure your criteria are specific and measurable.
  • Use Evaluation Steps: For more reliable results, provide detailed evaluation steps, as shown in the sketch after this list.
  • Leverage Rubrics: Use rubrics to standardize scoring and provide clear feedback.
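Putting the first two practices together, a metric can pair specific, measurable criteria with explicit evaluation steps. This is a sketch using only the constructor arguments documented above; the metric name and steps are illustrative:

from agensight.eval.metrics import GEvalEvaluator

# Sketch: clear, measurable criteria paired with detailed evaluation steps
requirements_metric = GEvalEvaluator(
    name="Requirement Coverage",
    criteria="Evaluate whether the generated code satisfies every requirement stated in the input.",
    evaluation_steps=[
        "List each requirement stated in the input.",
        "Check whether the actual output addresses every listed requirement.",
        "Penalize missing or incorrect behavior proportionally."
    ],
    threshold=0.8
)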