Welcome to Agensight Evaluation, a comprehensive suite for evaluating LLM (Large Language Model) applications. The framework supports a wide range of evaluation metrics, including those for Retrieval-Augmented Generation (RAG) and multimodal applications. Agensight builds on the open-source library DeepEval to provide robust and flexible evaluation metrics.
Comprehensive Metric Support: Seamlessly integrate 20+ research-backed metrics, built on DeepEval, into your evaluation workflows.
Customizable Evaluations: Easily tailor metrics to your specific evaluation needs, whether for end-to-end or component-level assessments (see the sketch after this list).
Cloud and Local Evaluations: Run evaluations locally or leverage cloud platforms to manage and analyze your evaluation results.
Security and Robustness: Conduct red teaming and safety scans to ensure your LLM applications are secure and robust against adversarial attacks.
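In practice, tailoring a metric usually comes down to writing your own criteria and choosing a threshold. The sketch below is a minimal illustration that reuses the GEvalEvaluator class from the quick-start example further down; the metric names, criteria strings, and thresholds are illustrative assumptions, not built-in defaults.

from agensight.eval.metrics import GEvalEvaluator

# End-to-end check: judges the final answer as a whole.
# Criteria text and threshold are illustrative assumptions.
answer_quality = GEvalEvaluator(
    name="Answer Quality",
    criteria="Evaluate whether the final answer fully and accurately addresses the user's question.",
    threshold=0.8,
)

# Component-level check: judges a single step (here, retrieval) in isolation.
retrieval_relevance = GEvalEvaluator(
    name="Retrieval Relevance",
    criteria="Evaluate whether the retrieved passages are relevant to the user's question.",
    threshold=0.7,
)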
Create a test file and define your test cases using the metrics provided by Agensight. Here’s a quick example:
from agensight.eval.metrics import GEvalEvaluator
from agensight.eval.test_case import ModelTestCase

# Define the metric
correctness_metric = GEvalEvaluator(
    name="Code Correctness",
    criteria="Evaluate whether the generated code correctly implements the specified requirements.",
    threshold=0.8
)

# Create a test case
test_case = ModelTestCase(
    input="Write a function to add two numbers.",
    actual_output="def add(a, b): return a + b",
    expected_output="A function that correctly adds two numbers."
)

# Run the evaluation
correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)
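The same pattern extends to several test cases: construct multiple ModelTestCase objects and run the metric over each one. The loop below is a minimal sketch that uses only the classes and attributes shown above; the hypothetical second test case, the manual pass/fail comparison against the chosen threshold, and the reuse of a single metric instance across cases are assumptions for illustration, not prescribed behavior.

from agensight.eval.metrics import GEvalEvaluator
from agensight.eval.test_case import ModelTestCase

correctness_metric = GEvalEvaluator(
    name="Code Correctness",
    criteria="Evaluate whether the generated code correctly implements the specified requirements.",
    threshold=0.8
)

# Hypothetical test cases for illustration only.
test_cases = [
    ModelTestCase(
        input="Write a function to add two numbers.",
        actual_output="def add(a, b): return a + b",
        expected_output="A function that correctly adds two numbers."
    ),
    ModelTestCase(
        input="Write a function to reverse a string.",
        actual_output="def reverse(s): return s[::-1]",
        expected_output="A function that correctly reverses a string."
    ),
]

for i, case in enumerate(test_cases, start=1):
    correctness_metric.measure(case)
    # Manual gate against the same 0.8 threshold chosen above.
    passed = correctness_metric.score >= 0.8
    print(f"case {i}: score={correctness_metric.score}, passed={passed}")
    print(f"  reason: {correctness_metric.reason}")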
Agensight Evaluation is built on open-source principles, and we welcome community contributions that enhance its capabilities. If you find Agensight useful, consider starring the repository on GitHub and contributing to its development.