|
|
--- |
|
|
title: "Part 2: Basic Evaluation Workflow with Ragas" |
|
|
date: 2025-04-26T19:00:00-06:00 |
|
|
layout: blog |
|
|
description: "Learn how to set up a basic evaluation workflow for LLM applications using Ragas. This guide walks you through data preparation, metric selection, and result analysis." |
|
|
categories: ["AI", "RAG", "Evaluation", "Ragas"] |
|
|
coverImage: "https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" |
|
|
readingTime: 8 |
|
|
published: true |
|
|
--- |
|
|
|
|
|
In our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let's dive into the practical aspects of setting up your first evaluation pipeline. |
|
|
|
|
|
## Understanding the Evaluation Workflow |
|
|
|
|
|
A typical Ragas evaluation workflow consists of four key steps: |
|
|
|
|
|
1. **Prepare your data**: Collect queries, contexts, responses, and reference answers |
|
|
2. **Select appropriate metrics**: Choose metrics that align with what you want to evaluate |
|
|
3. **Run the evaluation**: Process your data through the selected metrics |
|
|
4. **Analyze the results**: Interpret scores and identify areas for improvement |
|
|
|
|
|
Let's walk through each step with practical examples. |
|
|
|
|
|
## Step 1: Setting Up Your Environment |
|
|
|
|
|
First, ensure you have Ragas installed: |
|
|
|
|
|
```bash |
|
|
uv add ragas |
|
|
``` |
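If you're not using `uv`, a standard pip install works just as well:

```bash
pip install ragas
```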
|
|
|
|
|
Next, import the necessary components: |
|
|
|
|
|
```python |
|
|
import pandas as pd |
|
|
from ragas import EvaluationDataset |
|
|
from ragas import evaluate
|
|
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity |
|
|
``` |
|
|
|
|
|
## Step 2: Preparing Your Evaluation Data |
|
|
|
|
|
For a RAG system evaluation, you'll need: |
|
|
|
|
|
- **Questions**: User queries to your system |
|
|
- **Contexts**: Documents or chunks retrieved by your system |
|
|
- **Responses**: Answers generated by your system |
|
|
- **Ground truth** (optional): Reference answers for comparison (the `reference` column in the example below)
|
|
|
|
|
Here's how to organize this data: |
|
|
|
|
|
```python |
|
|
# Sample data |
|
|
data = { |
|
|
"user_input": [ |
|
|
"What are the main symptoms of COVID-19?", |
|
|
"How does machine learning differ from deep learning?" |
|
|
], |
|
|
"retrieved_contexts": [ |
|
|
[ |
|
|
"Common symptoms of COVID-19 include fever, cough, and fatigue. Some patients also report loss of taste or smell, body aches, and difficulty breathing.", |
|
|
"COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets." |
|
|
], |
|
|
[ |
|
|
"Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed.", |
|
|
"Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks)." |
|
|
] |
|
|
], |
|
|
"response": [ |
|
|
"The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties.", |
|
|
"Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers." |
|
|
], |
|
|
"reference": [ |
|
|
"COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing.", |
|
|
"Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data." |
|
|
] |
|
|
} |
|
|
|
|
|
eval_data = pd.DataFrame(data) |
|
|
|
|
|
# Convert to a format Ragas can use |
|
|
evaluation_dataset = EvaluationDataset.from_pandas(eval_data) |
|
|
evaluation_dataset |
|
|
|
|
|
``` |
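Before running any metrics, it's worth a quick sanity check that the conversion kept every column. A minimal sketch, assuming your Ragas version exposes `to_pandas()` on the dataset:

```python
# Round-trip back to pandas and confirm the expected columns survived
df_check = evaluation_dataset.to_pandas()
expected = {"user_input", "retrieved_contexts", "response", "reference"}
assert expected <= set(df_check.columns), f"missing: {expected - set(df_check.columns)}"
print(f"{len(df_check)} samples ready for evaluation")
```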
|
|
|
|
|
## Step 3: Selecting and Configuring Metrics |
|
|
|
|
|
Ragas offers various metrics to evaluate different aspects of your system: |
|
|
|
|
|
### Core RAG Metrics: |
|
|
|
|
|
- **Faithfulness**: Measures if the response is factually consistent with the provided context. |
|
|
- **Factual Correctness**: Assesses the factual accuracy of the response against the reference answer.
|
|
- **Response Relevancy**: Evaluates if the response directly addresses the user query. |
|
|
- **Context Entity Recall**: Measures how well the retrieved context captures relevant entities from the ground truth. |
|
|
- **Noise Sensitivity**: Measures how prone the system is to errors when irrelevant or noisy context is present (unlike the other metrics here, lower is better).
|
|
- **LLM Context Recall**: Uses an LLM to judge how much of the reference answer is supported by the retrieved context.
|
|
|
|
|
For metrics that require an LLM (like faithfulness), you need to configure the LLM provider: |
|
|
|
|
|
```python |
|
|
# Configure LLM for evaluation |
|
|
from langchain_openai import ChatOpenAI |
|
|
from ragas.llms import LangchainLLMWrapper |
|
|
|
|
|
# Initialize the evaluator LLM (requires an OpenAI API key; see below)
|
|
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o")) |
|
|
|
|
|
# Define metrics to use |
|
|
metrics = [ |
|
|
Faithfulness(), |
|
|
FactualCorrectness(), |
|
|
ResponseRelevancy(), |
|
|
ContextEntityRecall(), |
|
|
NoiseSensitivity(), |
|
|
LLMContextRecall() |
|
|
] |
|
|
``` |
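The `ChatOpenAI` client reads your key from the `OPENAI_API_KEY` environment variable. One common way to set it in a notebook without hard-coding the secret:

```python
import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```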
|
|
|
|
|
## Step 4: Running the Evaluation |
|
|
|
|
|
Now, run the evaluation with your selected metrics: |
|
|
|
|
|
```python |
|
|
# Run evaluation |
|
|
results = evaluate( |
|
|
evaluation_dataset, |
|
|
metrics=metrics, |
|
|
llm=evaluator_llm # Required for LLM-based metrics |
|
|
) |
|
|
|
|
|
# View results |
|
|
print(results) |
|
|
``` |
|
|
### Output: |
|
|
|
|
|
*Values will vary based on your data and LLM performance.* |
|
|
|
|
|
```python |
|
|
{ |
|
|
"faithfulness": 1.0000, |
|
|
"factual_correctness": 0.6750, |
|
|
"answer_relevancy": 0.9897, |
|
|
"context_entity_recall": 0.8889, |
|
|
"noise_sensitivity_relevant": 0.1667, |
|
|
"context_recall": 0.5000 |
|
|
} |
|
|
``` |
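Aggregate numbers hide which samples struggled. The result object also exposes per-row scores; a quick inspection sketch, assuming the metric column names from the output above:

```python
# One row per question, one column per metric
df = results.to_pandas()
print(df[["user_input", "faithfulness", "factual_correctness", "context_recall"]])
```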
|
|
|
|
|
|
|
|
## Step 5: Interpreting Results |
|
|
|
|
|
Most Ragas metrics return scores between 0 and 1, and for every metric above except Noise Sensitivity, higher is better. The sketch after the ranges below shows one way to apply these bands programmatically:
|
|
|
|
|
### Understanding Score Ranges: |
|
|
|
|
|
- **0.8-1.0**: Excellent performance |
|
|
- **0.6-0.8**: Good performance |
|
|
- **0.4-0.6**: Moderate performance, needs improvement |
|
|
- **Below 0.4**: Poor performance, requires significant attention
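Here is a minimal sketch (reusing the `results` object from Step 4) that buckets each metric's mean score into these bands; remember to read Noise Sensitivity inversely:

```python
# Average each metric across all samples and assign it a band
means = results.to_pandas().mean(numeric_only=True)

def band(value: float) -> str:
    if value >= 0.8:
        return "excellent"
    if value >= 0.6:
        return "good"
    if value >= 0.4:
        return "moderate"
    return "poor"

for metric, value in means.items():
    print(f"{metric}: {value:.3f} ({band(value)})")
```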
|
|
|
|
|
## Advanced Use: Custom Evaluation for Specific Examples |
|
|
|
|
|
For more detailed analysis of specific examples: |
|
|
|
|
|
```python |
|
|
from ragas import SingleTurnSample |
|
|
from ragas.metrics import AspectCritic |
|
|
|
|
|
# Define a specific test case |
|
|
test_data = { |
|
|
"user_input": "What are quantum computers?", |
|
|
"response": "Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits that can only be 0 or 1.", |
|
|
"retrieved_contexts": ["Quantum computing is a type of computation that harnesses quantum mechanical phenomena."] |
|
|
} |
|
|
|
|
|
# Create a custom evaluation metric |
|
|
custom_metric = AspectCritic( |
|
|
name="quantum_accuracy", |
|
|
llm=evaluator_llm,  # the evaluator LLM configured in Step 3
|
|
definition="Verify if the explanation of quantum computing is accurate and complete." |
|
|
) |
|
|
|
|
|
# Score the sample |
|
|
sample = SingleTurnSample(**test_data) |
|
|
score = await custom_metric.single_turn_ascore(sample) |
|
|
print(f"Quantum accuracy score: {score}") |
|
|
``` |
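The bare `await` works in a Jupyter notebook, which already runs an event loop. In a plain Python script, wrap the call instead:

```python
import asyncio

score = asyncio.run(custom_metric.single_turn_ascore(sample))
```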
|
|
> 💡 **Try it yourself:** |
|
|
> Explore the hands-on notebook for this workflow: |
|
|
> [02_Basic_Evaluation_Workflow_with_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/02_Basic_Evaluation_Workflow_with_Ragas.ipynb) |
|
|
|
|
|
## Common Evaluation Patterns and Metrics |
|
|
|
|
|
Below is a matrix mapping evaluation patterns to the metrics used, along with definitions for each metric: |
|
|
|
|
|
| **Metric** | **Comprehensive RAG Evaluation** | **Content Quality Evaluation** | **Retrieval Quality Evaluation** | |
|
|
|-----------------------------|----------------------------------|---------------------------------|-----------------------------------| |
|
|
| **Faithfulness** | ✓ | ✓ | | |
|
|
| **Answer Relevancy** | ✓ | ✓ | | |
|
|
| **Context Recall** | ✓ | | ✓ | |
|
|
| **Context Precision** | ✓ | | ✓ | |
|
|
| **Harmfulness** | | ✓ | | |
|
|
| **Coherence** | | ✓ | | |
|
|
| **Context Relevancy** | | | ✓ | |
|
|
|
|
|
### Metric Definitions |
|
|
|
|
|
- **Faithfulness**: Measures if the response is factually consistent with the provided context. |
|
|
- **Answer Relevancy**: Assesses if the response addresses the question. |
|
|
- **Context Recall**: Measures how well the retrieved context covers the information in the ground truth. |
|
|
- **Context Precision**: Evaluates the proportion of relevant information in the retrieved context. |
|
|
- **Harmfulness**: Evaluates if the response contains harmful or inappropriate content. |
|
|
- **Coherence**: Measures the logical flow and clarity of the response. |
|
|
- **Context Relevancy**: Evaluates if the retrieved context is relevant to the question. |
|
|
|
|
|
This matrix provides a clear overview of which metrics to use for specific evaluation patterns and their respective definitions. |
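To make the matrix concrete, here is a sketch of how one of these groupings might look in code. It reuses `evaluator_llm` and `evaluation_dataset` from the earlier steps; Harmfulness and Coherence are expressed with `AspectCritic`, one way to define aspect-style critiques in recent Ragas versions:

```python
from ragas.metrics import Faithfulness, ResponseRelevancy, AspectCritic

# Content quality pattern: faithfulness, relevancy, plus two aspect critiques
content_quality_metrics = [
    Faithfulness(),
    ResponseRelevancy(),
    AspectCritic(
        name="harmfulness",
        llm=evaluator_llm,
        definition="Does the response contain harmful or inappropriate content?",
    ),
    AspectCritic(
        name="coherence",
        llm=evaluator_llm,
        definition="Is the response logically structured and easy to follow?",
    ),
]

results = evaluate(evaluation_dataset, metrics=content_quality_metrics, llm=evaluator_llm)
```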
|
|
|
|
|
## Best Practices for Ragas Evaluation |
|
|
|
|
|
1. **Start simple**: Begin with core metrics before adding more specialized ones |
|
|
2. **Use diverse test cases**: Include a variety of questions, from simple to complex |
|
|
3. **Consider edge cases**: Test with queries that might challenge your system |
|
|
4. **Compare versions**: Track metrics across different versions of your application (see the sketch after this list)
|
|
5. **Combine with human evaluation**: Use Ragas alongside human feedback for a comprehensive assessment |
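For point 4, a sketch of comparing two runs: it assumes you have built a second dataset (`new_evaluation_dataset`, a hypothetical name here) from your updated application's outputs:

```python
import pandas as pd

# Evaluate the baseline and the candidate version with the same metrics
baseline = evaluate(evaluation_dataset, metrics=metrics, llm=evaluator_llm)
candidate = evaluate(new_evaluation_dataset, metrics=metrics, llm=evaluator_llm)

# Diff the aggregate scores to see which metrics moved
comparison = pd.DataFrame({
    "baseline": baseline.to_pandas().mean(numeric_only=True),
    "candidate": candidate.to_pandas().mean(numeric_only=True),
})
comparison["delta"] = comparison["candidate"] - comparison["baseline"]
print(comparison.sort_values("delta"))
```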
|
|
|
|
|
## Conclusion |
|
|
|
|
|
Setting up a basic evaluation workflow with Ragas is straightforward yet powerful. By systematically evaluating your LLM applications, you gain objective insights into their performance and clear directions for improvement. |
|
|
|
|
|
In our next post, we'll delve deeper into specialized evaluation techniques for RAG systems, exploring advanced metrics and evaluation strategies for retrieval-augmented generation applications. |
|
|
|
|
|
--- |
|
|
|
|
|
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)** |
|
|
**Part 2: Basic Evaluation Workflow — _You are here_** |
|
|
*Next up in the series:* |
|
|
**[Part 3: Evaluating RAG Systems](/blog/evaluating-rag-systems-with-ragas/)** |
|
|
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)** |
|
|
**[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas/)**
|
|
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)** |
|
|
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)** |
|
|
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)** |
|
|
|
|
|
|
|
|
*Have you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!* |