# RAG Capstone Project - Evaluation System Guide

## Overview

The RAG Capstone Project uses the **TRACe evaluation framework** (from the RAGBench paper: arXiv:2407.11005) to assess the quality of Retrieval-Augmented Generation (RAG) responses.

TRACe is a 4-metric framework that evaluates both the retriever and generator components:

- **T** — u**T**ilization (Context Utilization): how much of the retrieved context the generator actually uses to produce the response
- **R** — **R**elevance (Context Relevance): how much of the retrieved context is relevant to the query
- **A** — **A**dherence (Faithfulness/Groundedness/Attribution): whether the response is grounded in and supported by the provided context (no hallucinations)
- **C** — **C**ompleteness: how much of the relevant information in the context is actually covered by the response

Together, these 4 metrics provide a comprehensive evaluation of RAG system quality, covering retriever performance (Relevance), generator quality (Adherence, Completeness), and effective resource utilization (Utilization).

---

## Evaluation Architecture

### 1. **High-Level Flow**

```
User selects dataset + samples
        ↓
Load test data from dataset
        ↓
For each test sample:
├─ Query the RAG system with the question
├─ Get response + retrieved documents
└─ Store as test case
        ↓
Run TRACe metrics on all test cases
        ↓
Aggregate results + display metrics
```

---

## 2. **TRACe Metrics Explained (Per RAGBench Paper)**

### **T — uTilization (Context Utilization)**

**What it measures:** The fraction of the retrieved context that the generator actually uses to produce the response. Identifies whether the LLM effectively leverages the provided documents.
**Paper Definition:**

$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

Where:

- $U_i$ = the set of utilized (used) spans/tokens in document $d_i$
- $d_i$ = the full document $i$
- $\text{Len}()$ = length of the span (sentence, token, or character level)

**Interpretation:**

- **Low Utilization + Low Relevance** → greedy retriever returning irrelevant docs
- **Low Utilization alone** → weak generator fails to leverage good context
- **High Utilization** → generator efficiently uses the provided context

---

### **R — Relevance (Context Relevance)**

**What it measures:** The fraction of the retrieved context that is actually relevant to answering the query. Evaluates retriever quality: does it return useful documents?

**Paper Definition:**

$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

Where:

- $R_i$ = the set of relevant (useful) spans/tokens in document $d_i$
- $d_i$ = the full document $i$

**Interpretation:**

- **High Relevance** → retriever returned mostly relevant documents
- **Low Relevance** → retriever returned many irrelevant/noisy documents
- **High Relevance but Low Utilization** → good docs retrieved, but the generator doesn't use them

---

### **A — Adherence (Faithfulness / Groundedness / Attribution)**

**What it measures:** Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations: claims made without evidence in the documents.

**Paper Definition:**

- Example-level: **Boolean** — true if all response sentences are supported by the context; false if any part of the response is unsupported/hallucinated
- Span/sentence-level: can also annotate which specific response sentences or spans are grounded
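Under the span-based definitions above, the ratio metrics reduce to simple length arithmetic once spans are annotated. A minimal sketch using character-level spans — the span annotations themselves are assumed to come from an upstream judge (an LLM or human labels), not computed here:

```python
# Sketch of the TRACe ratio metrics from character-level span annotations.
# Spans are (start, end) offsets into each document.

def span_len(spans):
    """Total character length covered by a list of (start, end) spans."""
    return sum(end - start for start, end in spans)

def trace_ratios(docs, utilized, relevant):
    """docs: list of document strings; utilized/relevant: per-doc span lists."""
    total_doc_len = sum(len(d) for d in docs)
    util_len = sum(span_len(u) for u in utilized)
    rel_len = sum(span_len(r) for r in relevant)
    # Completeness numerator: characters that are both relevant AND utilized.
    overlap = 0
    for u, r in zip(utilized, relevant):
        used = {i for s, e in u for i in range(s, e)}
        rel = {i for s, e in r for i in range(s, e)}
        overlap += len(used & rel)
    return {
        "utilization": util_len / total_doc_len,
        "relevance": rel_len / total_doc_len,
        "completeness": overlap / rel_len if rel_len else 0.0,
    }

docs = ["Einstein was a theoretical physicist born in 1879."]
utilized = [[(0, 34)]]  # generator drew on the first 34 characters
relevant = [[(0, 49)]]  # judge marked nearly the whole document relevant
print(trace_ratios(docs, utilized, relevant))
```

Adherence is not a ratio in this scheme: at the example level it is a boolean over whether every response sentence is supported, so it needs a judgment per sentence rather than span lengths.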
**Interpretation:**

- **High Adherence (1.0)** → response fully grounded, no hallucinations ✅
- **Low Adherence (0.0)** → response contains unsupported claims ❌
- **Mid Adherence** → partially grounded response (some claims supported, others not)

---

### **C — Completeness**

**What it measures:** How much of the relevant information in the context is actually covered/incorporated by the response. Identifies missing information.

**Paper Definition:**

$$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$

Where:

- $R_i \cap U_i$ = the intersection of relevant AND utilized spans (information that is both relevant and used)
- $R_i$ = all relevant spans
- Extended to the example level by aggregating across all documents

**Interpretation:**

- **High Completeness** → generator covers all relevant information from the context
- **Low Completeness + High Utilization** → generator uses the context but misses key relevant facts
- **High Relevance + High Utilization + High Completeness** → ideal RAG system ✅

---

## 3. **Evaluation Workflow in the Application**

### **Step 1: Configuration (Sidebar)**

```
User inputs:
├─ Groq API key
├─ Selects dataset (e.g., "wiki_qa", "hotpot_qa", etc.)
├─ Selects LLM for evaluation (can differ from the chat LLM)
└─ Clicks "Load Data & Create Collection"
```

### **Step 2: Test Data Loading**

```python
# In streamlit_app.py - run_evaluation()
loader = RAGBenchLoader()
test_data = loader.get_test_data(
    dataset_name="wiki_qa",  # Selected dataset
    num_samples=10           # Number of samples to evaluate
)
# Returns: [{"question": "...", "answer": "..."}, ...]
```

**Available Datasets:**

- wiki_qa
- hotpot_qa
- nq_open
- ...and 9 more from RAGBench

### **Step 3: Test Case Preparation**

```python
# For each test sample:
for sample in test_data:
    # Query the RAG system
    result = rag_pipeline.query(
        sample["question"],
        n_results=5  # Retrieve the top 5 documents
    )

    # Create a test case
    test_case = {
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [doc["document"] for doc in result["retrieved_documents"]],
        "ground_truth": sample.get("answer", "")
    }
```

**What happens in `rag_pipeline.query()`:**

1. **Retrieval Phase:**

```python
retrieved_docs = vector_store.get_retrieved_documents(query, n_results=5)
# Returns: the top 5 most relevant documents from ChromaDB
```

2. **Generation Phase:**

```python
response = llm.generate_with_context(query, doc_texts, max_tokens=1024)
# Uses the Groq LLM with context to generate a response
```

3. **Result:**

```python
{
    "query": "What is X?",
    "response": "Generated answer based on docs...",
    "retrieved_documents": [
        {
            "document": "doc content",
            "distance": 0.123,
            "metadata": {...}
        },
        ...
    ]
}
```

### **Step 4: TRACe Evaluation**

```python
# In trace_evaluator.py
evaluator = TRACEEvaluator()
results = evaluator.evaluate_batch(test_cases)

# For each test case:
for test_case in test_cases:
    scores = evaluator.evaluate(
        query=test_case["query"],
        response=test_case["response"],
        retrieved_documents=test_case["retrieved_documents"],
        ground_truth=test_case["ground_truth"]
    )
    # Returns TRACEScores with the 4 metrics
```

### **Step 5: Aggregation**

```python
# Average scores across all test cases
{
    "utilization": 0.75,    # Average utilization across samples
    "relevance": 0.82,      # Average relevance across samples
    "adherence": 0.79,      # Average adherence across samples
    "completeness": 0.88,   # Average completeness across samples
    "average": 0.81,        # Overall TRACe score
    "num_samples": 10,      # Number of samples evaluated
    "individual_scores": [  # Per-sample scores
        {
            "utilization": 0.70,
            "relevance": 0.85,
            "adherence": 0.75,
            "completeness": 0.90,
            "average": 0.80
        },
        ...
    ]
}
```

---

## 4. **Results Display**

### **In Streamlit UI:**

```
📊 Evaluation Results:
┌────────────────────────────────────────────┐
│ 📊 Utilization:  0.751                     │
│ 🎯 Relevance:    0.823                     │
│ ✅ Adherence:    0.789                     │
│ 📝 Completeness: 0.881                     │
│ ⭐ Average:      0.811                     │
└────────────────────────────────────────────┘

📋 Detailed Results:
[Expandable table with individual scores]

💾 Download Results (JSON)
[Export button for results]
```

---

## 5. **Logging During Evaluation**

The application provides real-time logging:

```
📋 Evaluation Logs:
⏱️ Evaluation started at 2025-12-18 10:30:45
📊 Dataset: wiki_qa
📈 Total samples: 10
🤖 LLM Model: llama-3.1-8b
🔗 Vector Store: wiki_qa_dense_all_mpnet
🧠 Embedding Model: all-mpnet-base-v2
📥 Loading test data...
✅ Loaded 10 test samples
🔍 Processing samples...
✓ Processed 10/10 samples
📊 Running TRACE evaluation metrics...
✅ Evaluation completed successfully!
• Utilization: 75.10%
• Relevance: 82.34%
• Adherence: 78.91%
• Completeness: 88.12%
⏱️ Evaluation completed at 2025-12-18 10:31:30
```

---

## 6. **Key Components**

### **trace_evaluator.py**

**Main Classes:**

- `TRACEScores`: dataclass holding the 4 metric scores
- `TRACEEvaluator`: main evaluator class

**Key Methods:**

```python
evaluate()               # Evaluate a single test case
evaluate_batch()         # Evaluate multiple test cases
_compute_utilization()   # Metric: utilization
_compute_relevance()     # Metric: relevance
_compute_adherence()     # Metric: adherence
_compute_completeness()  # Metric: completeness
```

### **dataset_loader.py**

**Key Methods:**

```python
get_test_data(dataset_name, num_samples)  # Load test samples
get_test_data_size(dataset_name)          # Get max available samples
```

### **llm_client.py - RAGPipeline**

**Key Method:**

```python
query(query_str, n_results=5)  # Query the RAG system
# Returns: {"query", "response", "retrieved_documents"}
```

---

## 7. **Performance Considerations**

### **Time Complexity**

- Loading 10 samples: ~5-10 seconds
- Processing per sample: ~2-5 seconds (LLM generation)
- TRACe evaluation per sample: ~100-500 ms
- **Total for 10 samples: ~3-7 minutes** (depending on the LLM)

### **Optimization Tips**

1. Start with smaller sample sizes (5-10) for testing
2. Use faster LLM models for initial evaluation
3. Results are cached in session state
4. Evaluation results can be downloaded and reused

---

## 8. **Interpreting Scores**

### **Score Ranges:**

| Range | Interpretation |
|-----------|----------------|
| 0.80-1.00 | Excellent ✅ |
| 0.60-0.79 | Good 👍 |
| 0.40-0.59 | Fair ⚠️ |
| 0.00-0.39 | Poor ❌ |

### **What Each Metric Tells You:**

| Metric | Indicates | Action if Low |
|--------|-----------|---------------|
| Utilization | Are docs used? | Add more relevant docs, improve retrieval |
| Relevance | Are retrieved docs relevant? | Improve the embedding model or retrieval strategy |
| Adherence | Is the response grounded? | Add guardrails to prevent hallucination |
| Completeness | Is the response complete? | Increase response length or improve generation |

---

## 9. **Example Evaluation Scenario**

### **Scenario: Evaluating the "wiki_qa" Dataset**

```
1. User Action:
   - Selects the "wiki_qa" dataset
   - Selects the "llama-3.1-8b" LLM
   - Sets 10 test samples
   - Clicks "Run Evaluation"

2. System Processing:
   - Loads 10 test questions from wiki_qa
   - For each question:
     a) Retrieves the top 5 relevant Wikipedia articles
     b) Generates an answer using the LLM + context
   - Runs TRACe metrics on all 10 Q&A pairs

3. Results:
   Sample 1: "Who is Albert Einstein?"
   - Retrieved: Einstein biography article
   - Generated: "Albert Einstein was a theoretical physicist..."
   - Utilization: 0.85 ✅ (uses doc content)
   - Relevance: 0.92 ✅ (doc is about Einstein)
   - Adherence: 0.88 ✅ (response stays within the docs)
   - Completeness: 0.90 ✅ (answers completely)
   - Average: 0.89

   Sample 2: "What did Einstein discover?"
   - Retrieved: articles on relativity, quantum theory
   - Generated: "Einstein discovered the theory of relativity..."
   - Utilization: 0.78 ✅
   - Relevance: 0.85 ✅
   - Adherence: 0.82 ✅
   - Completeness: 0.85 ✅
   - Average: 0.82

   [Samples 3-10 evaluated similarly]

4. Final Results:
   - Average Utilization: 0.82
   - Average Relevance: 0.88
   - Average Adherence: 0.85
   - Average Completeness: 0.87
   - Overall TRACe Score: 0.855 (Excellent! ✅)
```

---

## 10. **Troubleshooting**

### **Common Issues:**

1. **Error: "No attribute dataset_name"**
   - Solution: load a collection first (sidebar config)
2. **Evaluation very slow**
   - Solution: reduce the sample size or use a faster LLM
3. **All scores near 0.5**
   - Solution: check whether retrieval is working properly
4. **High variance in scores**
   - Solution: normal for diverse datasets; try more samples

---

## 11. **Advanced Usage**

### **Comparing Different Configurations**

You can evaluate the same dataset with different:

- Embedding models
- Chunking strategies
- LLM models

Then compare the results to find the optimal configuration.
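One way to do this is to run the evaluation once per configuration, keep each aggregated result (shaped like the Step 5 output), and rank the runs by their overall TRACe average. A minimal, self-contained sketch — the configuration labels and score values are illustrative:

```python
# Rank evaluation runs by their overall TRACe average.
# Each value is shaped like the Step 5 aggregation output.
runs = {
    "all-mpnet-base-v2 / 512-token chunks": {
        "utilization": 0.82, "relevance": 0.88,
        "adherence": 0.85, "completeness": 0.87, "average": 0.855,
    },
    "all-MiniLM-L6-v2 / 256-token chunks": {
        "utilization": 0.74, "relevance": 0.80,
        "adherence": 0.81, "completeness": 0.83, "average": 0.795,
    },
}

def rank_runs(runs):
    """Return (config_name, results) pairs sorted by average, best first."""
    return sorted(runs.items(), key=lambda kv: kv[1]["average"], reverse=True)

for name, scores in rank_runs(runs):
    print(f"{name}: average={scores['average']:.3f}, "
          f"relevance={scores['relevance']:.2f}, adherence={scores['adherence']:.2f}")
```

Ranking on the per-metric scores instead of the average can be more informative when two configurations tie overall but fail in different ways (e.g., one retrieves poorly, the other hallucinates).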
### **Exporting Results**

```json
{
    "utilization": 0.82,
    "relevance": 0.88,
    "adherence": 0.85,
    "completeness": 0.87,
    "average": 0.855,
    "num_samples": 10,
    "individual_scores": [...]
}
```

Save the results and track them over time to measure improvements.

---

## Summary

The evaluation system provides a comprehensive framework for assessing RAG application quality across 4 key dimensions. By understanding the TRACe metrics, you can identify bottlenecks and optimize your RAG system for better performance.

**Key Takeaway:** TRACe evaluation helps you objectively measure and improve your RAG system! 🎯
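To track improvements over time, exported result dicts can be appended to a simple JSON history file. A minimal sketch, assuming the `eval_history.json` file name and the `config` label field (both are illustrative, not part of the application):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def append_result(history_path, results, config_label):
    """Append one exported results dict to a JSON history file."""
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config": config_label,
        "results": results,
    })
    path.write_text(json.dumps(history, indent=2))
    return history

# Example: record one run, then inspect the trend of average scores.
entry = {"utilization": 0.82, "relevance": 0.88, "adherence": 0.85,
         "completeness": 0.87, "average": 0.855, "num_samples": 10}
history = append_result("eval_history.json", entry, "wiki_qa / all-mpnet-base-v2")
print([h["results"]["average"] for h in history])
```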