|
|
--- |
|
|
title: "Part 2: Basic Evaluation Workflow with Ragas" |
|
|
date: 2025-04-26T19:00:00-06:00 |
|
|
layout: blog |
|
|
description: "Learn how to set up a basic evaluation workflow for LLM applications using Ragas. This guide walks you through data preparation, metric selection, and result analysis." |
|
|
categories: ["AI", "RAG", "Evaluation", "Ragas"] |
|
|
coverImage: "https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" |
|
|
readingTime: 8 |
|
|
published: true |
|
|
--- |
|
|
|
|
|
In our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let's dive into the practical aspects of setting up your first evaluation pipeline. |
|
|
|
|
|
## Understanding the Evaluation Workflow |
|
|
|
|
|
A typical Ragas evaluation workflow consists of four key steps: |
|
|
|
|
|
1. **Prepare your data**: Collect queries, contexts, responses, and reference answers |
|
|
2. **Select appropriate metrics**: Choose metrics that align with what you want to evaluate |
|
|
3. **Run the evaluation**: Process your data through the selected metrics |
|
|
4. **Analyze the results**: Interpret scores and identify areas for improvement |
|
|
|
|
|
Let's walk through each step with practical examples. |
|
|
|
|
|
## Step 1: Setting Up Your Environment |
|
|
|
|
|
First, ensure you have Ragas installed: |
|
|
|
|
|
```bash |
|
|
uv add ragas |
|
|
``` |
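If you're not using `uv`, a standard pip install works just as well:

```bash
pip install ragas
```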
|
|
|
|
|
Next, import the necessary components: |
|
|
|
|
|
```python |
|
|
import pandas as pd |
|
|
from ragas import EvaluationDataset |
|
|
from ragas import evaluate
|
|
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity |
|
|
``` |
|
|
|
|
|
## Step 2: Preparing Your Evaluation Data |
|
|
|
|
|
For a RAG system evaluation, you'll need: |
|
|
|
|
|
- **Questions**: User queries to your system |
|
|
- **Contexts**: Documents or chunks retrieved by your system |
|
|
- **Responses**: Answers generated by your system |
|
|
- **Ground truth** (optional): Reference answers for comparison (the `reference` column in the example below)
|
|
|
|
|
Here's how to organize this data: |
|
|
|
|
|
```python |
|
|
# Sample data |
|
|
data = { |
|
|
"user_input": [ |
|
|
"What are the main symptoms of COVID-19?", |
|
|
"How does machine learning differ from deep learning?" |
|
|
], |
|
|
"retrieved_contexts": [ |
|
|
[ |
|
|
"Common symptoms of COVID-19 include fever, cough, and fatigue. Some patients also report loss of taste or smell, body aches, and difficulty breathing.", |
|
|
"COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets." |
|
|
], |
|
|
[ |
|
|
"Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed.", |
|
|
"Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks)." |
|
|
] |
|
|
], |
|
|
"response": [ |
|
|
"The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties.", |
|
|
"Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers." |
|
|
], |
|
|
"reference": [ |
|
|
"COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing.", |
|
|
"Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data." |
|
|
] |
|
|
} |
|
|
|
|
|
eval_data = pd.DataFrame(data) |
|
|
|
|
|
# Convert to a format Ragas can use |
|
|
evaluation_dataset = EvaluationDataset.from_pandas(eval_data) |
|
|
evaluation_dataset |
|
|
|
|
|
``` |
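Before running any metrics, it's worth a quick sanity check that the conversion kept every column. A minimal sketch, assuming your Ragas version exposes `to_pandas()` on the dataset:

```python
# Round-trip back to pandas and confirm the expected columns survived
df_check = evaluation_dataset.to_pandas()
expected = {"user_input", "retrieved_contexts", "response", "reference"}
assert expected <= set(df_check.columns), f"missing: {expected - set(df_check.columns)}"
print(f"{len(df_check)} samples ready for evaluation")
```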
|
|
|
|
|
## Step 3: Selecting and Configuring Metrics |
|
|
|
|
|
Ragas offers various metrics to evaluate different aspects of your system: |
|
|
|
|
|
### Core RAG Metrics: |
|
|
|
|
|
- **Faithfulness**: Measures if the response is factually consistent with the provided context. |
|
|
- **Factual Correctness**: Assesses the factual accuracy of the response against the reference answer.
|
|
- **Response Relevancy**: Evaluates if the response directly addresses the user query. |
|
|
- **Context Entity Recall**: Measures how well the retrieved context captures relevant entities from the ground truth. |
|
|
- **Noise Sensitivity**: Measures how prone the system is to errors when irrelevant or noisy context is present (unlike the other metrics here, lower is better).
|
|
- **LLM Context Recall**: Uses an LLM to judge how much of the reference answer is supported by the retrieved context.
|
|
|
|
|
For metrics that require an LLM (like faithfulness), you need to configure the LLM provider: |
|
|
|
|
|
```python |
|
|
# Configure LLM for evaluation |
|
|
from langchain_openai import ChatOpenAI |
|
|
from ragas.llms import LangchainLLMWrapper |
|
|
|
|
|
# Initialize the evaluator LLM (requires an OpenAI API key; see below)
|
|
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o")) |
|
|
|
|
|
# Define metrics to use |
|
|
metrics = [ |
|
|
Faithfulness(), |
|
|
FactualCorrectness(), |
|
|
ResponseRelevancy(), |
|
|
ContextEntityRecall(), |
|
|
NoiseSensitivity(), |
|
|
LLMContextRecall() |
|
|
] |
|
|
``` |
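The `ChatOpenAI` client reads your key from the `OPENAI_API_KEY` environment variable. One common way to set it in a notebook without hard-coding the secret:

```python
import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```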
|
|
|
|
|
## Step 4: Running the Evaluation |
|
|
|
|
|
Now, run the evaluation with your selected metrics: |
|
|
|
|
|
```python |
|
|
# Run evaluation |
|
|
results = evaluate( |
|
|
evaluation_dataset, |
|
|
metrics=metrics, |
|
|
llm=evaluator_llm # Required for LLM-based metrics |
|
|
) |
|
|
|
|
|
# View results |
|
|
print(results) |
|
|
``` |
|
|
### Output: |
|
|
|
|
|
*Values will vary based on your data and LLM performance.* |
|
|
|
|
|
```python |
|
|
{ |
|
|
"faithfulness": 1.0000, |
|
|
"factual_correctness": 0.6750, |
|
|
"answer_relevancy": 0.9897, |
|
|
"context_entity_recall": 0.8889, |
|
|
"noise_sensitivity_relevant": 0.1667, |
|
|
"context_recall": 0.5000 |
|
|
} |
|
|
``` |
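Aggregate numbers hide which samples struggled. The result object also exposes per-row scores; a quick inspection sketch, assuming the metric column names from the output above:

```python
# One row per question, one column per metric
df = results.to_pandas()
print(df[["user_input", "faithfulness", "factual_correctness", "context_recall"]])
```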
|
|
|
|
|
|
|
|
## Step 5: Interpreting Results |
|
|
|
|
|
Most Ragas metrics return scores between 0 and 1, and for every metric above except Noise Sensitivity, higher is better. The sketch after the ranges below shows one way to apply these bands programmatically:
|
|
|
|
|
### Understanding Score Ranges: |
|
|
|
|
|
- **0.8-1.0**: Excellent performance |
|
|
- **0.6-0.8**: Good performance |
|
|
- **0.4-0.6**: Moderate performance, needs improvement |
|
|
- **Below 0.4**: Poor performance, requires significant attention
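Here is a minimal sketch (reusing the `results` object from Step 4) that buckets each metric's mean score into these bands; remember to read Noise Sensitivity inversely:

```python
# Average each metric across all samples and assign it a band
means = results.to_pandas().mean(numeric_only=True)

def band(value: float) -> str:
    if value >= 0.8:
        return "excellent"
    if value >= 0.6:
        return "good"
    if value >= 0.4:
        return "moderate"
    return "poor"

for metric, value in means.items():
    print(f"{metric}: {value:.3f} ({band(value)})")
```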
|
|
|
|
|
## Advanced Use: Custom Evaluation for Specific Examples |
|
|
|
|
|
For more detailed analysis of specific examples: |
|
|
|
|
|
```python |
|
|
from ragas import SingleTurnSample |
|
|
from ragas.metrics import AspectCritic |
|
|
|
|
|
# Define a specific test case |
|
|
test_data = { |
|
|
"user_input": "What are quantum computers?", |
|
|
"response": "Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits that can only be 0 or 1.", |
|
|
"retrieved_contexts": ["Quantum computing is a type of computation that harnesses quantum mechanical phenomena."] |
|
|
} |
|
|
|
|
|
# Create a custom evaluation metric |
|
|
custom_metric = AspectCritic( |
|
|
name="quantum_accuracy", |
|
|
llm=evaluator_llm,  # the evaluator LLM configured in Step 3
|
|
definition="Verify if the explanation of quantum computing is accurate and complete." |
|
|
) |
|
|
|
|
|
# Score the sample |
|
|
sample = SingleTurnSample(**test_data) |
|
|
score = await custom_metric.single_turn_ascore(sample) |
|
|
print(f"Quantum accuracy score: {score}") |
|
|
``` |
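The bare `await` works in a Jupyter notebook, which already runs an event loop. In a plain Python script, wrap the call instead:

```python
import asyncio

score = asyncio.run(custom_metric.single_turn_ascore(sample))
```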
|
|
> 💡 **Try it yourself:** |
|
|
> Explore the hands-on notebook for this workflow: |
|
|
> [02_Basic_Evaluation_Workflow_with_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/02_Basic_Evaluation_Workflow_with_Ragas.ipynb) |
|
|
|
|
|
## Common Evaluation Patterns and Metrics |
|
|
|
|
|
Below is a matrix mapping evaluation patterns to the metrics used, along with definitions for each metric: |
|
|
|
|
|
| **Metric** | **Comprehensive RAG Evaluation** | **Content Quality Evaluation** | **Retrieval Quality Evaluation** | |
|
|
|-----------------------------|----------------------------------|---------------------------------|-----------------------------------| |
|
|
| **Faithfulness** | ✓ | ✓ | | |
|
|
| **Answer Relevancy** | ✓ | ✓ | | |
|
|
| **Context Recall** | ✓ | | ✓ | |
|
|
| **Context Precision** | ✓ | | ✓ | |
|
|
| **Harmfulness** | | ✓ | | |
|
|
| **Coherence** | | ✓ | | |
|
|
| **Context Relevancy** | | | ✓ | |
|
|
|
|
|
### Metric Definitions |
|
|
|
|
|
- **Faithfulness**: Measures if the response is factually consistent with the provided context. |
|
|
- **Answer Relevancy**: Assesses if the response addresses the question. |
|
|
- **Context Recall**: Measures how well the retrieved context covers the information in the ground truth. |
|
|
- **Context Precision**: Evaluates the proportion of relevant information in the retrieved context. |
|
|
- **Harmfulness**: Evaluates if the response contains harmful or inappropriate content. |
|
|
- **Coherence**: Measures the logical flow and clarity of the response. |
|
|
- **Context Relevancy**: Evaluates if the retrieved context is relevant to the question. |
|
|
|
|
|
This matrix provides a clear overview of which metrics to use for specific evaluation patterns and their respective definitions. |
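To make the matrix concrete, here is a sketch of how one of these groupings might look in code. It reuses `evaluator_llm` and `evaluation_dataset` from the earlier steps; Harmfulness and Coherence are expressed with `AspectCritic`, one way to define aspect-style critiques in recent Ragas versions:

```python
from ragas.metrics import Faithfulness, ResponseRelevancy, AspectCritic

# Content quality pattern: faithfulness, relevancy, plus two aspect critiques
content_quality_metrics = [
    Faithfulness(),
    ResponseRelevancy(),
    AspectCritic(
        name="harmfulness",
        llm=evaluator_llm,
        definition="Does the response contain harmful or inappropriate content?",
    ),
    AspectCritic(
        name="coherence",
        llm=evaluator_llm,
        definition="Is the response logically structured and easy to follow?",
    ),
]

results = evaluate(evaluation_dataset, metrics=content_quality_metrics, llm=evaluator_llm)
```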
|
|
|
|
|
## Best Practices for Ragas Evaluation |
|
|
|
|
|
1. **Start simple**: Begin with core metrics before adding more specialized ones |
|
|
2. **Use diverse test cases**: Include a variety of questions, from simple to complex |
|
|
3. **Consider edge cases**: Test with queries that might challenge your system |
|
|
4. **Compare versions**: Track metrics across different versions of your application (see the sketch after this list)
|
|
5. **Combine with human evaluation**: Use Ragas alongside human feedback for a comprehensive assessment |
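For point 4, a sketch of comparing two runs: it assumes you have built a second dataset (`new_evaluation_dataset`, a hypothetical name here) from your updated application's outputs:

```python
import pandas as pd

# Evaluate the baseline and the candidate version with the same metrics
baseline = evaluate(evaluation_dataset, metrics=metrics, llm=evaluator_llm)
candidate = evaluate(new_evaluation_dataset, metrics=metrics, llm=evaluator_llm)

# Diff the aggregate scores to see which metrics moved
comparison = pd.DataFrame({
    "baseline": baseline.to_pandas().mean(numeric_only=True),
    "candidate": candidate.to_pandas().mean(numeric_only=True),
})
comparison["delta"] = comparison["candidate"] - comparison["baseline"]
print(comparison.sort_values("delta"))
```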
|
|
|
|
|
## Conclusion |
|
|
|
|
|
Setting up a basic evaluation workflow with Ragas is straightforward yet powerful. By systematically evaluating your LLM applications, you gain objective insights into their performance and clear directions for improvement. |
|
|
|
|
|
In our next post, we'll delve deeper into specialized evaluation techniques for RAG systems, exploring advanced metrics and evaluation strategies for retrieval-augmented generation applications. |
|
|
|
|
|
--- |
|
|
|
|
|
**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)** |
|
|
**Part 2: Basic Evaluation Workflow — _You are here_** |
|
|
*Next up in the series:* |
|
|
**[Part 3: Evaluating RAG Systems](/blog/evaluating-rag-systems-with-ragas/)** |
|
|
**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)** |
|
|
**[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas/)**
|
|
**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)** |
|
|
**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)** |
|
|
**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)** |
|
|
|
|
|
|
|
|
*Have you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you’re facing specific evaluation hurdles, don’t hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/)—we’d love to help!* |