| # RAG Capstone Project - Evaluation System Guide | |
| ## Overview | |
| The RAG Capstone Project uses the **TRACe evaluation framework** (from the RAGBench paper: arXiv:2407.11005) to assess the quality of Retrieval-Augmented Generation (RAG) responses. TRACe is a 4-metric framework that evaluates both the retriever and generator components: | |
| - **T** - u**T**ilization (Context Utilization): How much of the retrieved context the generator actually uses to produce the response | |
| - **R** - **R**elevance (Context Relevance): How much of the retrieved context is relevant to the query | |
| - **A** - **A**dherence (Faithfulness/Groundedness/Attribution): Whether the response is grounded in and supported by the provided context (no hallucinations) | |
| - **C** - **C**ompleteness: How much of the relevant information in the context is actually covered by the response | |
| These 4 metrics provide comprehensive evaluation of RAG system quality, examining retriever performance (Relevance), generator quality (Adherence, Completeness), and effective resource utilization (Utilization). | |
| --- | |
| ## Evaluation Architecture | |
| ### 1. **High-Level Flow** | |
| ``` | |
| User selects dataset + samples | |
| ↓ | |
| Load test data from dataset | |
| ↓ | |
| For each test sample: | |
| ├─ Query the RAG system with question | |
| ├─ Get response + retrieved documents | |
| └─ Store as test case | |
| ↓ | |
| Run TRACe metrics on all test cases | |
| ↓ | |
| Aggregate results + Display metrics | |
| ``` | |
| --- | |
| ## 2. **TRACe Metrics Explained (Per RAGBench Paper)** | |
| ### **T - uTilization (Context Utilization)** | |
| **What it measures:** | |
| The fraction of the retrieved context that the generator actually uses to produce the response. Identifies if the LLM effectively leverages the provided documents. | |
| **Paper Definition:** | |
| $$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$ | |
| Where: | |
| - $U_i$ = set of utilized (used) spans/tokens in document $d_i$ | |
| - $d_i$ = the full document $i$ | |
| - $\text{Len}()$ = length of the span (sentence, token, or character level) | |
| **Interpretation:** | |
| - **Low Utilization + Low Relevance** → Greedy retriever returning irrelevant docs | |
| - **Low Utilization alone** → Weak generator fails to leverage good context | |
| - **High Utilization** → Generator efficiently uses provided context | |
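As a rough illustration of the formula, the sketch below computes utilization at sentence level. It assumes each document has already been split into sentences and that an annotator (human or LLM) has marked which sentences the response drew on. The `Sentence` type and the `utilization` function are hypothetical names for this example, not the code in `trace_evaluator.py`.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    utilized: bool  # assumed annotation: did the response draw on this sentence?


def utilization(documents: list[list[Sentence]]) -> float:
    """Sum of utilized sentence lengths over total context length, in characters."""
    total = sum(len(s.text) for doc in documents for s in doc)
    used = sum(len(s.text) for doc in documents for s in doc if s.utilized)
    # Assumption: an empty context yields 0.0 rather than a division error.
    return used / total if total else 0.0


docs = [
    [Sentence("Paris is the capital of France.", True),
     Sentence("It hosts many museums.", False)],
    [Sentence("Berlin is in Germany.", False)],
]
print(f"Utilization: {utilization(docs):.3f}")
```

Character length is used here for simplicity; the same function works with token counts if `len(s.text)` is swapped for a tokenizer-based length.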
| --- | |
| ### **R - Relevance (Context Relevance)** | |
| **What it measures:** | |
| The fraction of the retrieved context that is actually relevant to answering the query. Evaluates retriever quality: does it return useful documents? | |
| **Paper Definition:** | |
| $$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$ | |
| Where: | |
| - $R_i$ = set of relevant (useful) spans/tokens in document $d_i$ | |
| - $d_i$ = the full document $i$ | |
| **Interpretation:** | |
| - **High Relevance** → Retriever returned mostly relevant documents | |
| - **Low Relevance** → Retriever returned many irrelevant/noisy documents | |
| - **High Relevance but Low Utilization** → Good docs retrieved, but generator doesn't use them | |
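The Relevance and Utilization interpretations listed above can be combined into a small diagnostic helper. This is only a sketch: the 0.5 cut-off and the `diagnose_retrieval` name are assumptions chosen for illustration, not values from the paper or the application.

```python
def diagnose_retrieval(relevance: float, utilization: float, threshold: float = 0.5) -> str:
    """Map a (relevance, utilization) pair to a coarse diagnosis.

    The threshold is a hypothetical cut-off; tune it for your own data.
    """
    high_rel = relevance >= threshold
    high_util = utilization >= threshold
    if high_rel and high_util:
        return "Retriever returns relevant documents and the generator uses them"
    if high_rel:
        return "Good docs retrieved, but the generator does not use them"
    if not high_util:
        return "Retriever returns many irrelevant/noisy documents"
    # Low relevance with high utilization is not covered by the guide;
    # this reading is an assumption.
    return "Generator leans on context that is mostly irrelevant"
```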
| --- | |
| ### **A - Adherence (Faithfulness / Groundedness / Attribution)** | |
| **What it measures:** | |
| Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations: claims made without evidence in the documents. | |
| **Paper Definition:** | |
| Example-level: **Boolean**. True if all response sentences are supported by the context; False if any part of the response is unsupported/hallucinated | |
| Span/Sentence-level: Can also annotate which specific response sentences or spans are grounded. | |
| **Interpretation:** | |
| - **High Adherence (1.0)** → Response fully grounded, no hallucinations | |
| - **Low Adherence (0.0)** → Response contains unsupported claims | |
| - **Mid Adherence** → Partially grounded response (some claims supported, others not) | |
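To make the link between the Boolean example-level definition and the fractional "mid" scores concrete, here is a minimal sketch. It assumes a per-sentence support flag is already available (for example from an LLM judge); the `adherence` function is a hypothetical helper, not the application's `_compute_adherence()`.

```python
def adherence(sentence_supported: list[bool]) -> tuple[bool, float]:
    """Return (example-level adherence, fraction of supported response sentences)."""
    if not sentence_supported:
        # Assumption: an empty response makes no claims, so it counts as grounded.
        return True, 1.0
    fully_grounded = all(sentence_supported)  # Boolean, as in the paper definition
    fraction = sum(sentence_supported) / len(sentence_supported)
    return fully_grounded, fraction


# Three of four response sentences are supported by the context.
print(adherence([True, True, False, True]))
```

The Boolean is the strict example-level label; the fraction is one plausible way to obtain the partial scores described in the interpretation list.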
| --- | |
| ### **C - Completeness** | |
| **What it measures:** | |
| How much of the relevant information in the context is actually covered/incorporated by the response. Identifies missing information. | |
| **Paper Definition:** | |
| $$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$ | |
| Where: | |
| - $R_i \cap U_i$ = intersection of relevant AND utilized spans (info that is both relevant and used) | |
| - $R_i$ = all relevant spans | |
| - Extended to example-level by aggregating across all documents | |
| **Interpretation:** | |
| - **High Completeness** → Generator covers all relevant information from context | |
| - **Low Completeness + High Utilization** → Generator uses context but misses key relevant facts | |
| - **High Relevance + High Utilization + High Completeness** → Ideal RAG system | |
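The intersection in the Completeness formula maps naturally onto set operations. The sketch below assumes the context has been split into sentences, with relevant and utilized sentences identified by index; the function name and its arguments are hypothetical.

```python
def completeness(sentence_lengths: list[int], relevant: set[int], utilized: set[int]) -> float:
    """Len(R ∩ U) / Len(R), with R and U given as sets of sentence indices."""
    relevant_len = sum(sentence_lengths[i] for i in relevant)
    if relevant_len == 0:
        # Assumption: with no relevant content there is nothing to cover.
        return 0.0
    covered_len = sum(sentence_lengths[i] for i in relevant & utilized)
    return covered_len / relevant_len


# Sentences 0 and 1 are relevant; the response used sentences 1 and 2.
print(completeness([10, 20, 30], relevant={0, 1}, utilized={1, 2}))
```

Note that sentence 2 is utilized but not relevant, so it raises Utilization without contributing to Completeness.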
| --- | |
| ## 3. **Evaluation Workflow in the Application** | |
| ### **Step 1: Configuration (Sidebar)** | |
| ``` | |
| User inputs: | |
| ├─ Groq API Key | |
| ├─ Selects dataset (e.g., "wiki_qa", "hotpot_qa", etc.) | |
| ├─ Selects LLM for evaluation (can differ from chat LLM) | |
| └─ Clicks "Load Data & Create Collection" | |
| ``` | |
| ### **Step 2: Test Data Loading** | |
| ```python | |
| # In streamlit_app.py - run_evaluation() | |
| loader = RAGBenchLoader() | |
| test_data = loader.get_test_data( | |
|     dataset_name="wiki_qa",  # Selected dataset | |
|     num_samples=10           # Number to evaluate | |
| ) | |
| # Returns: [{"question": "...", "answer": "..."}, ...] | |
| ``` | |
| **Available Datasets:** | |
| - wiki_qa | |
| - hotpot_qa | |
| - nq_open | |
| - And 9 more from RAGBench | |
| ### **Step 3: Test Case Preparation** | |
| ```python | |
| test_cases = []  # Collected here and passed to the evaluator in Step 4 | |
| # For each test sample: | |
| for sample in test_data: | |
|     # Query RAG system | |
|     result = rag_pipeline.query( | |
|         sample["question"], | |
|         n_results=5  # Retrieve top 5 documents | |
|     ) | |
|     # Create test case | |
|     test_case = { | |
|         "query": sample["question"], | |
|         "response": result["response"], | |
|         "retrieved_documents": [doc["document"] for doc in result["retrieved_documents"]], | |
|         "ground_truth": sample.get("answer", "") | |
|     } | |
|     test_cases.append(test_case) | |
| ``` | |
| **What happens in `rag_pipeline.query()`:** | |
| 1. **Retrieval Phase:** | |
| ```python | |
| retrieved_docs = vector_store.get_retrieved_documents(query, n_results=5) | |
| # Returns: Top 5 most relevant documents from ChromaDB | |
| ``` | |
| 2. **Generation Phase:** | |
| ```python | |
| response = llm.generate_with_context(query, doc_texts, max_tokens=1024) | |
| # Uses Groq LLM with context to generate response | |
| ``` | |
| 3. **Result:** | |
| ```python | |
| { | |
|     "query": "What is X?", | |
|     "response": "Generated answer based on docs...", | |
|     "retrieved_documents": [ | |
|         { | |
|             "document": "doc content", | |
|             "distance": 0.123, | |
|             "metadata": {...} | |
|         }, | |
|         ... | |
|     ] | |
| } | |
| ``` | |
| ### **Step 4: TRACe Evaluation** | |
| ```python | |
| # In trace_evaluator.py | |
| evaluator = TRACEEvaluator() | |
| results = evaluator.evaluate_batch(test_cases) | |
| # evaluate_batch() scores each test case in turn, conceptually like this: | |
| for test_case in test_cases: | |
|     scores = evaluator.evaluate( | |
|         query=test_case["query"], | |
|         response=test_case["response"], | |
|         retrieved_documents=test_case["retrieved_documents"], | |
|         ground_truth=test_case["ground_truth"] | |
|     ) | |
|     # Returns TRACEScores with 4 metrics | |
| ``` | |
| ### **Step 5: Aggregation** | |
| ```python | |
| # Average scores across all test cases | |
| { | |
|     "utilization": 0.75,       # Average utilization across samples | |
|     "relevance": 0.82,         # Average relevance across samples | |
|     "adherence": 0.79,         # Average adherence across samples | |
|     "completeness": 0.88,      # Average completeness across samples | |
|     "average": 0.81,           # Overall TRACe score | |
|     "num_samples": 10,         # Number of samples evaluated | |
|     "individual_scores": [     # Per-sample scores | |
|         { | |
|             "utilization": 0.70, | |
|             "relevance": 0.85, | |
|             "adherence": 0.75, | |
|             "completeness": 0.90, | |
|             "average": 0.80 | |
|         }, | |
|         ... | |
|     ] | |
| } | |
| ``` | |
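The aggregation step amounts to averaging each metric over the per-sample scores. The sketch below reproduces the shape of the result shown above; the `aggregate` function is a hypothetical stand-in, and it assumes the overall score is the unweighted mean of the four metric averages, which matches the example figures (0.75, 0.82, 0.79, 0.88 → 0.81).

```python
from statistics import mean

METRICS = ("utilization", "relevance", "adherence", "completeness")


def aggregate(individual_scores: list[dict]) -> dict:
    """Average each metric across samples and attach the per-sample scores."""
    # statistics.mean raises StatisticsError on empty input, so callers
    # should guard against an empty list of samples.
    summary = {m: mean(s[m] for s in individual_scores) for m in METRICS}
    summary["average"] = mean(summary[m] for m in METRICS)
    summary["num_samples"] = len(individual_scores)
    summary["individual_scores"] = individual_scores
    return summary


samples = [
    {"utilization": 0.70, "relevance": 0.85, "adherence": 0.75, "completeness": 0.90},
    {"utilization": 0.80, "relevance": 0.79, "adherence": 0.83, "completeness": 0.86},
]
print(aggregate(samples)["average"])
```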
| --- | |
| ## 4. **Results Display** | |
| ### **In Streamlit UI:** | |
| ``` | |
| Evaluation Results: | |
| ┌──────────────────────────┐ | |
| │ Utilization:   0.751     │ | |
| │ Relevance:     0.823     │ | |
| │ Adherence:     0.789     │ | |
| │ Completeness:  0.881     │ | |
| │ Average:       0.811     │ | |
| └──────────────────────────┘ | |
| Detailed Results: | |
| [Expandable table with individual scores] | |
| Download Results (JSON) | |
| [Export button for results] | |
| ``` | |
| --- | |
| ## 5. **Logging During Evaluation** | |
| The application provides real-time logging: | |
| ``` | |
| Evaluation Logs: | |
| Evaluation started at 2025-12-18 10:30:45 | |
| Dataset: wiki_qa | |
| Total samples: 10 | |
| LLM Model: llama-3.1-8b | |
| Vector Store: wiki_qa_dense_all_mpnet | |
| Embedding Model: all-mpnet-base-v2 | |
| Loading test data... | |
| Loaded 10 test samples | |
| Processing samples... | |
| Processed 10/10 samples | |
| Running TRACe evaluation metrics... | |
| Evaluation completed successfully! | |
| • Utilization: 75.10% | |
| • Relevance: 82.34% | |
| • Adherence: 78.91% | |
| • Completeness: 88.12% | |
| Evaluation completed at 2025-12-18 10:31:30 | |
| ``` | |
| --- | |
| ## 6. **Key Components** | |
| ### **trace_evaluator.py** | |
| **Main Classes:** | |
| - `TRACEScores`: Dataclass holding 4 metric scores | |
| - `TRACEEvaluator`: Main evaluator class | |
| **Key Methods:** | |
| ```python | |
| evaluate() # Evaluate single test case | |
| evaluate_batch() # Evaluate multiple test cases | |
| _compute_utilization() # Metric: utilization | |
| _compute_relevance() # Metric: relevance | |
| _compute_adherence() # Metric: adherence | |
| _compute_completeness() # Metric: completeness | |
| ``` | |
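The guide does not show the definition of `TRACEScores`, so the following is only a guess at its general shape: a dataclass with four float fields plus a derived average. It is deliberately named `TRACEScoresSketch` to make clear that it is a hypothetical illustration, not the class in `trace_evaluator.py`.

```python
from dataclasses import asdict, dataclass


@dataclass
class TRACEScoresSketch:
    """Hypothetical container for the four TRACe metrics of one test case."""
    utilization: float
    relevance: float
    adherence: float
    completeness: float

    def average(self) -> float:
        # Assumption: the per-sample average is the unweighted mean of the four metrics.
        return (self.utilization + self.relevance + self.adherence + self.completeness) / 4

    def to_dict(self) -> dict:
        """Dictionary form matching the per-sample entries in the aggregated results."""
        d = asdict(self)
        d["average"] = self.average()
        return d


scores = TRACEScoresSketch(utilization=0.70, relevance=0.85, adherence=0.75, completeness=0.90)
print(scores.to_dict())
```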
| ### **dataset_loader.py** | |
| **Key Methods:** | |
| ```python | |
| get_test_data(dataset_name, num_samples) # Load test samples | |
| get_test_data_size(dataset_name) # Get max available samples | |
| ``` | |
| ### **llm_client.py - RAGPipeline** | |
| **Key Method:** | |
| ```python | |
| query(query_str, n_results=5) # Query RAG system | |
| # Returns: {"query", "response", "retrieved_documents"} | |
| ``` | |
| --- | |
| ## 7. **Performance Considerations** | |
| ### **Runtime Estimates** | |
| - Loading 10 samples: ~5-10 seconds | |
| - Processing per sample: ~2-5 seconds (LLM generation) | |
| - TRACe evaluation per sample: ~100-500 ms | |
| - **Total for 10 samples:** the stages above sum to roughly 0.5-1 minute; budget up to ~3-7 minutes depending on the LLM | |
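Summing the per-stage figures listed above gives a quick sanity check on the total. The helper below is a hypothetical back-of-the-envelope calculation using only those figures; any larger total implies overhead that is not itemized here (for instance a slower LLM or an LLM-based evaluator, which is an assumption about where the extra time would go).

```python
def estimate_seconds(num_samples, load=(5, 10), generate=(2, 5), evaluate=(0.1, 0.5)):
    """Return (low, high) runtime in seconds from per-stage (low, high) estimates."""
    low = load[0] + num_samples * (generate[0] + evaluate[0])
    high = load[1] + num_samples * (generate[1] + evaluate[1])
    return low, high


low, high = estimate_seconds(10)
print(f"{low:.0f}-{high:.0f} seconds")  # 26-65 seconds
```

This lines up with the sample log in the logging section, where a 10-sample run starts at 10:30:45 and finishes at 10:31:30 (45 seconds).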
| ### **Optimization Tips** | |
| 1. Start with smaller sample sizes (5-10) for testing | |
| 2. Use faster LLM models for initial evaluation | |
| 3. Results are cached in session state | |
| 4. Can download and reuse evaluation results | |
| --- | |
| ## 8. **Interpreting Scores** | |
| ### **Score Ranges:** | |
| | Range | Interpretation | | |
| |-------|-----------------| | |
| | 0.80-1.00 | Excellent | | |
| | 0.60-0.79 | Good | | |
| | 0.40-0.59 | Fair | | |
| | 0.00-0.39 | Poor | | |
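The score bands can be applied programmatically. The sketch below treats each band's lower bound as a threshold, so a value such as 0.795, which falls between the two-decimal ranges in the table, lands in "Good". The `interpret_score` name is hypothetical.

```python
def interpret_score(score: float) -> str:
    """Map a score in [0, 1] to the interpretation bands from the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= 0.80:
        return "Excellent"
    if score >= 0.60:
        return "Good"
    if score >= 0.40:
        return "Fair"
    return "Poor"


for value in (0.855, 0.795, 0.41, 0.12):
    print(value, interpret_score(value))
```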
| ### **What Each Metric Tells You:** | |
| | Metric | Indicates | Action if Low | | |
| |--------|-----------|---------------| | |
| | Utilization | Are docs used? | Improve retrieval precision if Relevance is also low; otherwise improve the generator or prompt | | |
| | Relevance | Are retrieved docs relevant? | Improve embedding model or retrieval strategy | | |
| | Adherence | Is response grounded? | Add guardrails to prevent hallucination | | |
| | Completeness | Is response complete? | Increase response length or improve generation | | |
| --- | |
| ## 9. **Example Evaluation Scenario** | |
| ### **Scenario: Evaluating "wiki_qa" Dataset** | |
| ``` | |
| 1. User Action: | |
|    - Selects "wiki_qa" dataset | |
|    - Selects "llama-3.1-8b" LLM | |
|    - Sets 10 test samples | |
|    - Clicks "Run Evaluation" | |
| 2. System Processing: | |
|    - Loads 10 test questions from wiki_qa | |
|    - For each question: | |
|      a) Retrieves top 5 relevant Wikipedia articles | |
|      b) Generates answer using LLM + context | |
|    - Runs TRACe metrics on all 10 Q&A pairs | |
| 3. Results: | |
|    Sample 1: "Who is Albert Einstein?" | |
|    - Retrieved: Einstein biography article | |
|    - Generated: "Albert Einstein was a theoretical physicist..." | |
|    - Utilization: 0.85 (uses doc content) | |
|    - Relevance: 0.92 (doc is about Einstein) | |
|    - Adherence: 0.88 (response stays in doc) | |
|    - Completeness: 0.90 (answers completely) | |
|    - Average: 0.89 | |
|    Sample 2: "What did Einstein discover?" | |
|    - Retrieved: Articles on relativity, quantum theory | |
|    - Generated: "Einstein discovered the theory of relativity..." | |
|    - Utilization: 0.78 | |
|    - Relevance: 0.85 | |
|    - Adherence: 0.82 | |
|    - Completeness: 0.85 | |
|    - Average: 0.82 | |
|    [Samples 3-10 evaluated similarly] | |
| 4. Final Results: | |
|    - Average Utilization: 0.82 | |
|    - Average Relevance: 0.88 | |
|    - Average Adherence: 0.85 | |
|    - Average Completeness: 0.87 | |
|    - Overall TRACe Score: 0.855 (Excellent!) | |
| ``` | |
| --- | |
| ## 10. **Troubleshooting** | |
| ### **Common Issues:** | |
| 1. **Error: "No attribute dataset_name"** | |
| - Solution: Load a collection first (sidebar config) | |
| 2. **Evaluation very slow** | |
| - Solution: Reduce sample size or use faster LLM | |
| 3. **All scores near 0.5** | |
| - Solution: Check if retrieval is working properly | |
| 4. **High variance in scores** | |
| - Solution: Normal for diverse datasets; try more samples | |
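To judge whether the variance in issue 4 is actually high, it helps to look at the spread of a metric across samples rather than only its mean. The sketch below assumes per-sample scores in the `individual_scores` format shown in the aggregation step; `metric_spread` is a hypothetical helper.

```python
from statistics import mean, pstdev


def metric_spread(individual_scores: list[dict], metric: str) -> tuple[float, float]:
    """Return (mean, population standard deviation) of one metric across samples."""
    values = [s[metric] for s in individual_scores]
    return mean(values), pstdev(values)


scores = [{"adherence": 0.5}, {"adherence": 1.0}, {"adherence": 0.75}]
avg, spread = metric_spread(scores, "adherence")
print(f"adherence mean={avg:.2f} stdev={spread:.2f}")
```

A large standard deviation on a small sample is a hint to evaluate more samples before drawing conclusions.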
| --- | |
| ## 11. **Advanced Usage** | |
| ### **Comparing Different Configurations** | |
| You can evaluate the same dataset with different: | |
| - Embedding models | |
| - Chunking strategies | |
| - LLM models | |
| Then compare results to find optimal configuration. | |
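A comparison between two runs can be as simple as a per-metric difference. The sketch below assumes both runs were exported in the aggregated result format used in this guide; `compare_runs` and the two example dictionaries are hypothetical.

```python
METRIC_KEYS = ("utilization", "relevance", "adherence", "completeness", "average")


def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Per-metric change (candidate minus baseline); positive means improvement."""
    return {m: round(candidate[m] - baseline[m], 4) for m in METRIC_KEYS}


# Illustrative numbers only, e.g. two different embedding models.
run_a = {"utilization": 0.75, "relevance": 0.82, "adherence": 0.79,
         "completeness": 0.88, "average": 0.81}
run_b = {"utilization": 0.82, "relevance": 0.88, "adherence": 0.85,
         "completeness": 0.87, "average": 0.855}

for metric, delta in compare_runs(run_a, run_b).items():
    print(f"{metric:13s} {delta:+.3f}")
```

For a fair comparison, both runs should use the same dataset and the same test samples.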
| ### **Exporting Results** | |
| ```json | |
| { | |
| "utilization": 0.82, | |
| "relevance": 0.88, | |
| "adherence": 0.85, | |
| "completeness": 0.87, | |
| "average": 0.855, | |
| "num_samples": 10, | |
| "individual_scores": [...] | |
| } | |
| ``` | |
| Save and track over time to measure improvements! | |
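One way to track results over time is to write each exported dictionary to a timestamped JSON file. The sketch below uses only the Python standard library; the `save_results` function, the `eval_runs` directory, and the file naming scheme are assumptions for illustration, not features of the application.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_results(results: dict, directory: str = "eval_runs", label: str = "run") -> Path:
    """Write results to <directory>/<label>_<UTC timestamp>.json and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = out_dir / f"{label}_{stamp}.json"
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return path
```

Calling `save_results(results, label="wiki_qa")` after each evaluation leaves a chronological series of files that can be loaded later and compared.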
| --- | |
| ## Summary | |
| The evaluation system provides a comprehensive framework for assessing RAG application quality across 4 key dimensions. By understanding TRACe metrics, you can identify bottlenecks and optimize your RAG system for better performance. | |
| **Key Takeaway:** TRACe evaluation helps you objectively measure and improve your RAG system! | |