When and How to Use Evals?
Evals should be incorporated throughout the lifecycle of a GenAI application, from initial development to production deployment. Key use cases include:
Prompt Optimization: Testing different prompt structures to determine the most effective phrasing.
Model Comparison: Benchmarking different AI models to select the best-performing option.
Regression Testing: Ensuring new changes do not degrade existing performance.
Bias and Safety Checks: Detecting unwanted biases or potentially harmful outputs.
Input Sanitization: Ensuring that only valid, permitted inputs are actually processed.
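The regression-testing use case above can be sketched as a small automated eval: a fixed set of prompt/expectation pairs is run on every change, and any drop in the pass rate flags a regression. The `generate` function here is a hypothetical stand-in for a real model call.

```python
def generate(prompt: str) -> str:
    # Stub model for illustration; a real implementation would call
    # an LLM API here.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return canned.get(prompt, "I don't know.")

# Each case pairs a prompt with a substring the answer must contain.
CASES = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

def run_regression(cases):
    # Returns (prompt, passed) for every case so failures are inspectable.
    results = []
    for prompt, expected in cases:
        output = generate(prompt)
        results.append((prompt, expected.lower() in output.lower()))
    return results

if __name__ == "__main__":
    results = run_regression(CASES)
    passed = sum(ok for _, ok in results)
    print(f"{passed}/{len(results)} cases passed")
```

Run after every prompt or model change; a pass rate below the previous baseline signals degraded behavior before it reaches production.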
Using Evals effectively involves:
Defining the evaluation criteria and desired outcomes.
Selecting the appropriate type of Eval (Automated, HITL, or LLM-as-a-Judge).
Running evaluations on sample queries and reviewing the results.
Iterating on prompts and model configurations based on findings.
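The four steps above can be sketched end to end with an LLM-as-a-Judge-style loop: criteria are stated explicitly, a judge scores each sample query, and the aggregate score guides iteration. Both `call_judge` and the length-based scoring rule are hypothetical simplifications; a real setup would send the criteria plus each question/answer pair to a grading model.

```python
# Step 1: define the evaluation criteria.
CRITERIA = "Answer must be concise and factually grounded."

# Step 2: choose the Eval type -- here, a stubbed LLM-as-a-Judge.
def call_judge(question: str, answer: str, criteria: str) -> float:
    # Stub judge returning a score in [0, 1]. A real implementation
    # would prompt a second model with `criteria` and the Q/A pair.
    return 1.0 if answer and len(answer) < 200 else 0.0

# Step 3: run the evaluation over sample queries.
def evaluate(samples):
    scores = [call_judge(q, a, CRITERIA) for q, a in samples]
    return sum(scores) / len(scores)

SAMPLES = [
    ("What is the capital of France?", "Paris."),
    ("Explain photosynthesis.",
     "Plants convert light, water, and CO2 into glucose and oxygen."),
]

if __name__ == "__main__":
    # Step 4: review the aggregate score and iterate on prompts or
    # model configuration if it falls short of the target.
    print(f"Mean judge score: {evaluate(SAMPLES):.2f}")
```

In practice the reviewed results would feed back into prompt or configuration changes, and the loop would be rerun until the score meets the desired outcome defined in step 1.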