LLM Evaluations
GEval
G-Eval is a versatile evaluation framework that uses large language models (LLMs) as judges to assess outputs against custom criteria. It applies a chain-of-thought (CoT) approach to evaluate outputs with human-like accuracy, making it suitable for a wide range of use cases.
Required Arguments
To use the GEvalEvaluator, you need to provide the following arguments:
- name: A descriptive name for the metric.
- criteria: A description outlining the specific evaluation aspects for each test case.
- threshold: The minimum score required for a passing evaluation.
Optional Arguments
- evaluation_steps: A list of strings outlining the exact steps the LLM should take for evaluation. If not provided, G-Eval will generate steps based on the criteria.
- rubric: A list of rubrics that constrain the final metric score to defined ranges.
- model: The model to use as the judge, defaulting to 'gpt-4o'.
- strict_mode: Enforces a binary metric score: 1 for a perfect output, 0 otherwise.
- async_mode: Enables concurrent execution within the measure() method.
- verbose_mode: Prints intermediate steps used to calculate the metric.
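As an illustration, these optional flags might be combined as in the sketch below. The import path is hypothetical; the constructor simply mirrors the arguments documented in this section.

```python
from my_evals import GEvalEvaluator  # hypothetical import path

coherence = GEvalEvaluator(
    name="Coherence",
    criteria="Evaluate whether the response is logically structured and easy to follow.",
    threshold=0.7,
    model="gpt-4o",      # judge model; the documented default
    strict_mode=False,   # set to True to force a binary 1/0 score
    async_mode=True,     # run evaluation concurrently inside measure()
    verbose_mode=True,   # print the intermediate steps used to compute the score
)
```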
Usage Example
Here’s how you can use the GEvalEvaluator in your evaluation system:
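The snippet below is a minimal sketch. The import path and the test-case class (an LLMTestCase with input and actual_output fields) are assumptions based on the arguments documented above; adjust them to match your library's actual API.

```python
# Sketch only: the import path and the LLMTestCase fields are assumptions.
from my_evals import GEvalEvaluator, LLMTestCase

correctness = GEvalEvaluator(
    name="Correctness",  # descriptive name for the metric
    criteria=(
        "Determine whether the actual output is factually consistent "
        "with the expected output."
    ),
    threshold=0.5,  # minimum score required to pass
)

test_case = LLMTestCase(
    input="What is the boiling point of water at sea level?",
    actual_output="Water boils at 100 degrees Celsius at sea level.",
)

correctness.measure(test_case)  # runs the LLM-as-judge evaluation
print(correctness.score)        # numeric score produced by the judge
print(correctness.reason)       # judge's explanation, if the evaluator exposes one
```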
Evaluation Steps
If you want more control over the evaluation process, you can provide evaluation_steps:
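For example (same hypothetical import as above; the steps themselves are illustrative):

```python
from my_evals import GEvalEvaluator  # hypothetical import path, as above

correctness = GEvalEvaluator(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct.",
    evaluation_steps=[
        # Explicit steps replace the auto-generated ones derived from the criteria.
        "Check whether facts in the actual output contradict the expected output.",
        "Penalize omission of details that the expected output treats as important.",
        "Do not penalize vague language or stylistic differences; penalize factual errors.",
    ],
    threshold=0.5,
)
```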
Best Practices
- Define Clear Criteria: Ensure your criteria are specific and measurable.
- Use Evaluation Steps: For more reliable results, provide detailed evaluation steps.
- Leverage Rubrics: Use rubrics to standardize scoring and provide clear feedback.
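For instance, a rubric-constrained evaluator might look like the following sketch. The Rubric class, its score_range and expected_outcome fields, and the import path are assumptions; substitute your library's equivalents.

```python
from my_evals import GEvalEvaluator, Rubric  # hypothetical imports

answer_quality = GEvalEvaluator(
    name="Answer Quality",
    criteria="Assess how completely and accurately the answer addresses the question.",
    threshold=0.6,
    rubric=[
        # Each rubric entry ties a score range to a concrete, observable outcome,
        # which standardizes scoring across test cases.
        Rubric(score_range=(0, 2), expected_outcome="Irrelevant or factually wrong answer."),
        Rubric(score_range=(3, 6), expected_outcome="Partially correct but incomplete answer."),
        Rubric(score_range=(7, 10), expected_outcome="Accurate, complete, and well-structured answer."),
    ],
)
```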