---
title: "Part 2: Basic Evaluation Workflow with Ragas"
date: 2025-04-26T19:00:00-06:00
layout: blog
description: "Learn how to set up a basic evaluation workflow for LLM applications using Ragas. This guide walks you through data preparation, metric selection, and result analysis."
categories: ["AI", "RAG", "Evaluation", "Ragas"]
coverImage: "https://images.unsplash.com/photo-1600132806370-bf17e65e942f?q=80&w=1988&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
readingTime: 8
published: true
---

In our previous post, we introduced Ragas as a powerful framework for evaluating LLM applications. Now, let's dive into the practical aspects of setting up your first evaluation pipeline.

## Understanding the Evaluation Workflow

A typical Ragas evaluation workflow consists of four key steps:

1. **Prepare your data**: Collect queries, contexts, responses, and reference answers
2. **Select appropriate metrics**: Choose metrics that align with what you want to evaluate
3. **Run the evaluation**: Process your data through the selected metrics
4. **Analyze the results**: Interpret scores and identify areas for improvement

Let's walk through each step with practical examples.

## Step 1: Setting Up Your Environment

First, ensure you have Ragas installed:

```bash
uv add ragas
```

Next, import the necessary components:

```python
import pandas as pd
from ragas import EvaluationDataset
from ragas import evaluate, RunConfig
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)
```

## Step 2: Preparing Your Evaluation Data

For a RAG system evaluation, you'll need:

- **Questions**: User queries to your system
- **Contexts**: Documents or chunks retrieved by your system
- **Responses**: Answers generated by your system
- **Ground truth** (optional): Reference answers or documents for comparison

Here's how to organize this data:

```python
# Sample data
data = {
    "user_input": [
        "What are the main symptoms of COVID-19?",
        "How does machine learning differ from deep learning?"
    ],
    "retrieved_contexts": [
        [
            "Common symptoms of COVID-19 include fever, cough, and fatigue. Some patients also report loss of taste or smell, body aches, and difficulty breathing.",
            "COVID-19 is caused by the SARS-CoV-2 virus and spreads primarily through respiratory droplets."
        ],
        [
            "Machine learning is a subset of AI focused on algorithms that learn from data without being explicitly programmed.",
            "Deep learning is a specialized form of machine learning using neural networks with many layers (deep neural networks)."
        ]
    ],
    "response": [
        "The main symptoms of COVID-19 include fever, cough, fatigue, and sometimes loss of taste or smell, body aches, and breathing difficulties.",
        "Machine learning is a subset of AI that focuses on algorithms learning from data, while deep learning is a specialized form of machine learning that uses deep neural networks with multiple layers."
    ],
    "reference": [
        "COVID-19 symptoms commonly include fever, dry cough, fatigue, loss of taste or smell, body aches, sore throat, and in severe cases, difficulty breathing.",
        "Machine learning is a branch of AI where systems learn from data, identify patterns, and make decisions with minimal human intervention. Deep learning is a subset of machine learning that uses neural networks with multiple layers (deep neural networks) to analyze various factors of data."
    ]
}

eval_data = pd.DataFrame(data)

# Convert to a format Ragas can use
evaluation_dataset = EvaluationDataset.from_pandas(eval_data)
evaluation_dataset
```

## Step 3: Selecting and Configuring Metrics

Ragas offers various metrics to evaluate different aspects of your system:

### Core RAG Metrics:

- **Faithfulness**: Measures if the response is factually consistent with the provided context.
- **Factual Correctness**: Compares the response against the reference answer to check that it is accurate and free from factual errors.
- **Response Relevancy**: Evaluates if the response directly addresses the user query.
- **Context Entity Recall**: Measures how many of the entities in the ground truth also appear in the retrieved context.
- **Noise Sensitivity**: Assesses how often irrelevant or noisy context leads the response into incorrect claims (lower is better).
- **LLM Context Recall**: Uses an LLM to judge how much of the information in the reference answer is covered by the retrieved context.

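Several of these metrics boil down to the same idea: break a text into individual claims, check each claim against some evidence, and report the supported share. The toy sketch below illustrates only that ratio. It is not how Ragas is implemented (Ragas uses an LLM to extract and judge claims), and the helper name `supported_fraction` and the hand-labeled verdict lists are assumptions made up for this illustration.

```python
# Toy illustration only: Ragas relies on an LLM to extract claims and judge
# whether each one is supported. The verdicts below are hand-labeled
# placeholders, and supported_fraction is a hypothetical helper.

def supported_fraction(verdicts: list[bool]) -> float:
    """Return the share of claims marked as supported (0.0 if there are none)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# Faithfulness-style ratio: claims taken from the *response*,
# each checked against the retrieved context.
response_claim_verdicts = [True, True, True, False]

# Context-recall-style ratio: claims taken from the *reference*,
# each checked against the retrieved context.
reference_claim_verdicts = [True, False]

faithfulness_like = supported_fraction(response_claim_verdicts)
context_recall_like = supported_fraction(reference_claim_verdicts)

print(f"faithfulness-like score: {faithfulness_like:.2f}")
print(f"context-recall-like score: {context_recall_like:.2f}")
```

The difference between the two ratios is where the claims come from: the response for faithfulness, the reference answer for context recall. That is why context recall needs a `reference` column while faithfulness does not.
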
For metrics that require an LLM (like faithfulness), you need to configure the LLM provider:

```python
# Configure LLM for evaluation
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

# Initialize the LLM; you will need an OpenAI API key (set OPENAI_API_KEY)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Define metrics to use
metrics = [
    Faithfulness(),
    FactualCorrectness(),
    ResponseRelevancy(),
    ContextEntityRecall(),
    NoiseSensitivity(),
    LLMContextRecall()
]
```

## Step 4: Running the Evaluation

Now, run the evaluation with your selected metrics:

```python
# Run evaluation
results = evaluate(
    evaluation_dataset,
    metrics=metrics,
    llm=evaluator_llm  # Required for LLM-based metrics
)

# View results
print(results)
```

### Output:

*Values will vary based on your data and LLM performance.*

```python
{
    "faithfulness": 1.0000,
    "factual_correctness": 0.6750,
    "answer_relevancy": 0.9897,
    "context_entity_recall": 0.8889,
    "noise_sensitivity_relevant": 0.1667,
    "context_recall": 0.5000
}
```

## Step 5: Interpreting Results

Most Ragas metrics return scores between 0 and 1, where higher is better. Noise Sensitivity is the exception: it counts incorrect claims, so lower is better.

### Understanding Score Ranges:

As a rough guide for the higher-is-better metrics:

- **0.8-1.0**: Excellent performance
- **0.6-0.8**: Good performance
- **0.4-0.6**: Moderate performance, needs improvement
- **Below 0.4**: Poor performance, requires significant attention

## Advanced Use: Custom Evaluation for Specific Examples

For more detailed analysis of specific examples:

```python
from ragas import SingleTurnSample
from ragas.metrics import AspectCritic

# Define a specific test case
test_data = {
    "user_input": "What are quantum computers?",
    "response": "Quantum computers use quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits that can only be 0 or 1.",
    "retrieved_contexts": ["Quantum computing is a type of computation that harnesses quantum mechanical phenomena."]
}

# Create a custom evaluation metric
custom_metric = AspectCritic(
    name="quantum_accuracy",
    llm=evaluator_llm,  # Reuse the evaluator LLM configured in Step 3
    definition="Verify if the explanation of quantum computing is accurate and complete."
)

# Score the sample
sample = SingleTurnSample(**test_data)
# Top-level await works in a notebook; in a script, wrap this call in asyncio.run(...)
score = await custom_metric.single_turn_ascore(sample)
print(f"Quantum accuracy score: {score}")
```

> 💡 **Try it yourself:**
> Explore the hands-on notebook for this workflow:
> [02_Basic_Evaluation_Workflow_with_Ragas](https://github.com/mafzaal/intro-to-ragas/blob/master/02_Basic_Evaluation_Workflow_with_Ragas.ipynb)

## Common Evaluation Patterns and Metrics

Below is a matrix mapping evaluation patterns to the metrics used, followed by definitions for each metric:

| **Metric**            | **Comprehensive RAG Evaluation** | **Content Quality Evaluation** | **Retrieval Quality Evaluation** |
|-----------------------|----------------------------------|--------------------------------|----------------------------------|
| **Faithfulness**      | ✓                                | ✓                              |                                  |
| **Answer Relevancy**  | ✓                                | ✓                              |                                  |
| **Context Recall**    | ✓                                |                                | ✓                                |
| **Context Precision** | ✓                                |                                | ✓                                |
| **Harmfulness**       |                                  | ✓                              |                                  |
| **Coherence**         |                                  | ✓                              |                                  |
| **Context Relevancy** |                                  |                                | ✓                                |

### Metric Definitions

- **Faithfulness**: Measures if the response is factually consistent with the provided context.
- **Answer Relevancy**: Assesses if the response addresses the question.
- **Context Recall**: Measures how well the retrieved context covers the information in the ground truth.
- **Context Precision**: Evaluates the proportion of relevant information in the retrieved context.
- **Harmfulness**: Evaluates if the response contains harmful or inappropriate content.
- **Coherence**: Measures the logical flow and clarity of the response.
- **Context Relevancy**: Evaluates if the retrieved context is relevant to the question.

Use the matrix to pick the metrics that match the evaluation pattern you care about, and the definitions to check what each score actually measures.

## Best Practices for Ragas Evaluation

1. **Start simple**: Begin with core metrics before adding more specialized ones
2. **Use diverse test cases**: Include a variety of questions, from simple to complex
3. **Consider edge cases**: Test with queries that might challenge your system
4. **Compare versions**: Track metrics across different versions of your application
5. **Combine with human evaluation**: Use Ragas alongside human feedback for a comprehensive assessment

## Conclusion

Setting up a basic evaluation workflow with Ragas is straightforward yet powerful. By systematically evaluating your LLM applications, you gain objective insights into their performance and clear directions for improvement.

In our next post, we'll delve deeper into specialized evaluation techniques for RAG systems, exploring advanced metrics and evaluation strategies for retrieval-augmented generation applications.

---

**[Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications](/blog/introduction-to-ragas/)**

**Part 2: Basic Evaluation Workflow - _You are here_**

*Next up in the series:*

**[Part 3: Evaluating RAG Systems](/blog/evaluating-rag-systems-with-ragas/)**

**[Part 4: Test Data Generation](/blog/generating-test-data-with-ragas/)**

**[Part 5: Advanced Evaluation Techniques](/blog/advanced-metrics-and-customization-with-ragas)**

**[Part 6: Evaluating AI Agents](/blog/evaluating-ai-agents-with-ragas/)**

**[Part 7: Integrations and Observability](/blog/integrations-and-observability-with-ragas/)**

**[Part 8: Building Feedback Loops](/blog/building-feedback-loops-with-ragas/)**

*Have you set up your first Ragas evaluation? What aspects of your LLM application are you most interested in measuring? If you're facing specific evaluation hurdles, don't hesitate to [reach out](https://www.linkedin.com/in/muhammadafzaal/). We'd love to help!*