# RAG Capstone Project - Evaluation System Guide
## Overview
The RAG Capstone Project uses the **TRACe evaluation framework** (from the RAGBench paper: arXiv:2407.11005) to assess the quality of Retrieval-Augmented Generation (RAG) responses. TRACe is a 4-metric framework that evaluates both the retriever and generator components:
- **T** - u**T**ilization (Context Utilization): How much of the retrieved context the generator actually uses to produce the response
- **R** - **R**elevance (Context Relevance): How much of the retrieved context is relevant to the query
- **A** - **A**dherence (Faithfulness/Groundedness/Attribution): Whether the response is grounded in and supported by the provided context (no hallucinations)
- **C** - **C**ompleteness: How much of the relevant information in the context is actually covered by the response
These 4 metrics provide comprehensive evaluation of RAG system quality, examining retriever performance (Relevance), generator quality (Adherence, Completeness), and effective resource utilization (Utilization).
---
## Evaluation Architecture
### 1. **High-Level Flow**
```
User selects dataset + samples
↓
Load test data from dataset
↓
For each test sample:
  ├─ Query the RAG system with question
  ├─ Get response + retrieved documents
  └─ Store as test case
↓
Run TRACE metrics on all test cases
↓
Aggregate results + Display metrics
```
---
## 2. **TRACe Metrics Explained (Per RAGBench Paper)**
### **T - uTilization (Context Utilization)**
**What it measures:**
The fraction of the retrieved context that the generator actually uses to produce the response. Identifies if the LLM effectively leverages the provided documents.
**Paper Definition:**
$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $U_i$ = set of utilized (used) spans/tokens in document $d_i$
- $d_i$ = the full document $i$
- $\text{Len}()$ = length of the span (sentence, token, or character level)
**Interpretation:**
- **Low Utilization + Low Relevance** → Greedy retriever returning irrelevant docs
- **Low Utilization alone** → Weak generator fails to leverage good context
- **High Utilization** → Generator efficiently uses provided context
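As a minimal sketch of the formula above, the snippet below assumes whitespace tokens as the length unit and a hypothetical per-document list of utilized spans. The names `length` and `utilization` are illustrative and are not taken from `trace_evaluator.py`.

```python
from typing import List


def length(text: str) -> int:
    """Length unit for this sketch: number of whitespace-separated tokens."""
    return len(text.split())


def utilization(documents: List[str], utilized_spans: List[List[str]]) -> float:
    """Sum of utilized span lengths divided by the sum of document lengths."""
    total = sum(length(doc) for doc in documents)
    if total == 0:
        return 0.0  # assumption: no retrieved context scores 0
    used = sum(length(span) for spans in utilized_spans for span in spans)
    return used / total


documents = [
    "Einstein was born in Ulm. He developed the theory of relativity.",
    "Ulm is a city in Germany.",
]
# Hypothetical annotation: the spans of each document the response drew on.
utilized_spans = [["He developed the theory of relativity."], []]

score = utilization(documents, utilized_spans)  # 6 used tokens out of 17
```

Here only one sentence of the first document is used, so roughly a third of the retrieved tokens count as utilized.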
---
### **R - Relevance (Context Relevance)**
**What it measures:**
The fraction of the retrieved context that is actually relevant to answering the query. It evaluates retriever quality: does the retriever return useful documents?
**Paper Definition:**
$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $R_i$ = set of relevant (useful) spans/tokens in document $d_i$
- $d_i$ = the full document $i$
**Interpretation:**
- **High Relevance** → Retriever returned mostly relevant documents
- **Low Relevance** → Retriever returned many irrelevant/noisy documents
- **High Relevance but Low Utilization** → Good docs retrieved, but generator doesn't use them
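The relevance ratio can be sketched at sentence level. This example assumes each retrieved document has already been split into sentences carrying a Boolean relevance label, and it uses character count as the length unit. The `LabeledDoc` type and the `relevance` function are hypothetical names for illustration.

```python
from typing import List, Tuple

# Each retrieved document is a list of (sentence, is_relevant) pairs.
LabeledDoc = List[Tuple[str, bool]]


def relevance(docs: List[LabeledDoc]) -> float:
    """Characters in relevant sentences divided by characters in all sentences."""
    total = sum(len(sentence) for doc in docs for sentence, _ in doc)
    if total == 0:
        return 0.0  # assumption: empty context scores 0
    relevant = sum(len(sentence) for doc in docs for sentence, is_rel in doc if is_rel)
    return relevant / total


docs = [
    [("Paris is the capital of France.", True), ("It hosts the Louvre.", False)],
    [("Berlin is the capital of Germany.", False)],
]
score = relevance(docs)  # 31 relevant characters out of 84
```

A low value here points at the retriever rather than the generator, since the labels depend only on the query and the retrieved text.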
---
### **A - Adherence (Faithfulness / Groundedness / Attribution)**
**What it measures:**
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations, i.e. claims made without evidence in the documents.
**Paper Definition:**
Example-level: **Boolean**. True if all response sentences are supported by the context; False if any part of the response is unsupported/hallucinated.
Span/Sentence-level: Can also annotate which specific response sentences or spans are grounded.
**Interpretation:**
- **High Adherence (1.0)** → Response fully grounded, no hallucinations ✅
- **Low Adherence (0.0)** → Response contains unsupported claims ❌
- **Mid Adherence** → Partially grounded response (some claims supported, others not); intermediate values arise from sentence-level scoring or from averaging the Boolean over samples
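To illustrate how a Boolean metric ends up with values between 0 and 1, the sketch below assumes sentence-level support labels are already available for each response. The function names are hypothetical; the actual `_compute_adherence()` may work differently.

```python
from typing import List


def example_adherence(sentence_supported: List[bool]) -> bool:
    """True only if every response sentence is supported by the context.

    Note: all([]) is True, so an empty response counts as adherent here.
    """
    return all(sentence_supported)


def mean_adherence(examples: List[List[bool]]) -> float:
    """Average the example-level Booleans over a set of responses."""
    if not examples:
        return 0.0
    return sum(example_adherence(labels) for labels in examples) / len(examples)


# Hypothetical sentence-level support labels for three responses.
labels = [[True, True], [True, False], [True]]
per_example = [example_adherence(e) for e in labels]  # [True, False, True]
dataset_score = mean_adherence(labels)                # 2 of 3 responses adhere
```

A single unsupported sentence flips the whole example to False, which is why one hallucinated claim costs as much as several.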
---
### **C - Completeness**
**What it measures:**
How much of the relevant information in the context is actually covered/incorporated by the response. Identifies missing information.
**Paper Definition:**
$$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$
Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans (info that is both relevant and used)
- $R_i$ = all relevant spans
- Extended to example-level by aggregating across all documents
**Interpretation:**
- **High Completeness** → Generator covers all relevant information from context
- **Low Completeness + High Utilization** → Generator uses context but misses key relevant facts
- **High Relevance + High Utilization + High Completeness** → Ideal RAG system ✅
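The intersection in the completeness formula maps naturally onto Python sets. The sketch below assumes relevant and utilized tokens are given as sets of token indices per document. It aggregates to the example level by summing numerators and denominators across documents, which is one reasonable reading of "aggregating across all documents" rather than a confirmed detail.

```python
from typing import List, Set


def completeness(relevant: List[Set[int]], utilized: List[Set[int]]) -> float:
    """Size of (relevant AND utilized) divided by size of relevant, over all documents."""
    relevant_total = sum(len(r) for r in relevant)
    if relevant_total == 0:
        return 0.0  # assumption: nothing relevant to cover scores 0
    covered = sum(len(r & u) for r, u in zip(relevant, utilized))
    return covered / relevant_total


# Hypothetical token-index annotations for two retrieved documents.
relevant = [{0, 1, 2, 3}, {5, 6}]
utilized = [{2, 3, 4}, set()]

score = completeness(relevant, utilized)  # 2 covered tokens out of 6 relevant
```

Token 4 in the first document is utilized but not relevant, so it raises utilization without helping completeness.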
---
## 3. **Evaluation Workflow in the Application**
### **Step 1: Configuration (Sidebar)**
```
User inputs:
  ├─ Groq API Key
  ├─ Selects dataset (e.g., "wiki_qa", "hotpot_qa", etc.)
  ├─ Selects LLM for evaluation (can differ from chat LLM)
  └─ Clicks "Load Data & Create Collection"
```
### **Step 2: Test Data Loading**
```python
# In streamlit_app.py - run_evaluation()
loader = RAGBenchLoader()
test_data = loader.get_test_data(
    dataset_name="wiki_qa",  # Selected dataset
    num_samples=10           # Number to evaluate
)
# Returns: [{"question": "...", "answer": "..."}, ...]
```
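Since the loader also exposes `get_test_data_size()` (see Key Components), the requested sample count can be capped before loading. The helper below is a hypothetical sketch, not part of `dataset_loader.py`, and it only covers the clamping logic.

```python
def clamp_num_samples(requested: int, available: int) -> int:
    """Keep the requested sample count within [1, available]."""
    if available < 1:
        raise ValueError("dataset has no test samples")
    return max(1, min(requested, available))


# Intended use (assumes the loader shown above):
#   available = loader.get_test_data_size("wiki_qa")
#   num_samples = clamp_num_samples(10, available)
capped = clamp_num_samples(10, 4)  # more requested than available
```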
**Available Datasets:**
- wiki_qa
- hotpot_qa
- nq_open
- And 9 more from RAGBench
### **Step 3: Test Case Preparation**
```python
# For each test sample:
test_cases = []
for sample in test_data:
    # Query RAG system
    result = rag_pipeline.query(
        sample["question"],
        n_results=5  # Retrieve top 5 documents
    )
    # Create test case (keep only the document text for evaluation)
    test_case = {
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [doc["document"] for doc in result["retrieved_documents"]],
        "ground_truth": sample.get("answer", "")
    }
    test_cases.append(test_case)
```
**What happens in `rag_pipeline.query()`:**
1. **Retrieval Phase:**
```python
retrieved_docs = vector_store.get_retrieved_documents(query, n_results=5)
# Returns: Top 5 most relevant documents from ChromaDB
```
2. **Generation Phase:**
```python
response = llm.generate_with_context(query, doc_texts, max_tokens=1024)
# Uses Groq LLM with context to generate response
```
3. **Result:**
```python
{
    "query": "What is X?",
    "response": "Generated answer based on docs...",
    "retrieved_documents": [
        {
            "document": "doc content",
            "distance": 0.123,
            "metadata": {...}
        },
        ...
    ]
}
```
### **Step 4: TRACE Evaluation**
```python
# In trace_evaluator.py
evaluator = TRACEEvaluator()
results = evaluator.evaluate_batch(test_cases)
# For each test case:
for test_case in test_cases:
    scores = evaluator.evaluate(
        query=test_case["query"],
        response=test_case["response"],
        retrieved_documents=test_case["retrieved_documents"],
        ground_truth=test_case["ground_truth"]
    )
    # Returns TRACEScores with 4 metrics
```
### **Step 5: Aggregation**
```python
# Average scores across all test cases
{
    "utilization": 0.75,        # Average utilization across samples
    "relevance": 0.82,          # Average relevance across samples
    "adherence": 0.79,          # Average adherence across samples
    "completeness": 0.88,       # Average completeness across samples
    "average": 0.81,            # Overall TRACE score
    "num_samples": 10,          # Number of samples evaluated
    "individual_scores": [      # Per-sample scores
        {
            "utilization": 0.70,
            "relevance": 0.85,
            "adherence": 0.75,
            "completeness": 0.90,
            "average": 0.80
        },
        ...
    ]
}
```
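One way to produce a summary of this shape is to average each metric over the samples and then average the four means. This is a sketch under that assumption; the actual `evaluate_batch()` may aggregate differently, and the second sample's scores are made up for the example.

```python
from statistics import mean
from typing import Dict, List

METRICS = ("utilization", "relevance", "adherence", "completeness")


def aggregate(individual_scores: List[Dict[str, float]]) -> Dict[str, object]:
    """Average each metric over samples, then average the four metric means."""
    if not individual_scores:
        raise ValueError("no scores to aggregate")
    summary: Dict[str, object] = {
        m: mean(s[m] for s in individual_scores) for m in METRICS
    }
    summary["average"] = mean(summary[m] for m in METRICS)
    summary["num_samples"] = len(individual_scores)
    summary["individual_scores"] = individual_scores
    return summary


# Illustrative per-sample scores (not real evaluation output).
scores = [
    {"utilization": 0.70, "relevance": 0.85, "adherence": 0.75, "completeness": 0.90},
    {"utilization": 0.80, "relevance": 0.79, "adherence": 0.83, "completeness": 0.86},
]
summary = aggregate(scores)
```

Because every metric is averaged over the same samples, the overall score equals the mean of the per-sample averages as well.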
---
## 4. **Results Display**
### **In Streamlit UI:**
```
📊 Evaluation Results:
┌────────────────────────────────┐
│ 📊 Utilization:    0.751       │
│ 🎯 Relevance:      0.823       │
│ ✅ Adherence:      0.789       │
│ 📝 Completeness:   0.881       │
│ ⭐ Average:        0.811       │
└────────────────────────────────┘
📋 Detailed Results:
[Expandable table with individual scores]
💾 Download Results (JSON)
[Export button for results]
```
---
## 5. **Logging During Evaluation**
The application provides real-time logging:
```
📋 Evaluation Logs:
⏱️ Evaluation started at 2025-12-18 10:30:45
📊 Dataset: wiki_qa
📈 Total samples: 10
🤖 LLM Model: llama-3.1-8b
🔗 Vector Store: wiki_qa_dense_all_mpnet
🧠 Embedding Model: all-mpnet-base-v2
📥 Loading test data...
✅ Loaded 10 test samples
🔍 Processing samples...
✓ Processed 10/10 samples
📊 Running TRACE evaluation metrics...
✅ Evaluation completed successfully!
• Utilization: 75.10%
• Relevance: 82.34%
• Adherence: 78.91%
• Completeness: 88.12%
⏱️ Evaluation completed at 2025-12-18 10:31:30
```
---
## 6. **Key Components**
### **trace_evaluator.py**
**Main Classes:**
- `TRACEScores`: Dataclass holding 4 metric scores
- `TRACEEvaluator`: Main evaluator class
**Key Methods:**
```python
evaluate() # Evaluate single test case
evaluate_batch() # Evaluate multiple test cases
_compute_utilization() # Metric: utilization
_compute_relevance() # Metric: relevance
_compute_adherence() # Metric: adherence
_compute_completeness() # Metric: completeness
```
### **dataset_loader.py**
**Key Methods:**
```python
get_test_data(dataset_name, num_samples) # Load test samples
get_test_data_size(dataset_name) # Get max available samples
```
### **llm_client.py - RAGPipeline**
**Key Method:**
```python
query(query_str, n_results=5) # Query RAG system
# Returns: {"query", "response", "retrieved_documents"}
```
---
## 7. **Performance Considerations**
### **Time Complexity**
- Loading 10 samples: ~5-10 seconds
- Processing per sample: ~2-5 seconds (LLM generation)
- TRACE evaluation per sample: ~100-500ms
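- **Total for 10 samples: ~30-65 seconds** by the figures above (the logged run in section 5 takes 45 seconds); slower LLMs or API rate limits can stretch this to several minutes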
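Summing the per-step figures listed above gives a rough floor on the run time. The sketch assumes loading is a one-off cost and that samples are processed sequentially. Both are assumptions, and `estimate_seconds` is a hypothetical helper.

```python
from typing import Tuple

Range = Tuple[float, float]  # (low, high) in seconds


def estimate_seconds(
    num_samples: int,
    load: Range = (5, 10),        # one-off data loading
    generate: Range = (2, 5),     # LLM generation per sample
    score: Range = (0.1, 0.5),    # TRACE scoring per sample
) -> Range:
    """Return a (low, high) estimate from the per-step ranges."""
    low = load[0] + num_samples * (generate[0] + score[0])
    high = load[1] + num_samples * (generate[1] + score[1])
    return low, high


low, high = estimate_seconds(10)  # about 26 to 65 seconds
```

Network latency, retries, and API rate limits come on top of this estimate.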
### **Optimization Tips**
1. Start with smaller sample sizes (5-10) for testing
2. Use faster LLM models for initial evaluation
3. Results are cached in session state
4. Can download and reuse evaluation results
---
## 8. **Interpreting Scores**
### **Score Ranges:**
| Range | Interpretation |
|-------|-----------------|
| 0.80-1.00 | Excellent ✅ |
| 0.60-0.79 | Good 👍 |
| 0.40-0.59 | Fair ⚠️ |
| 0.00-0.39 | Poor ❌ |
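The table can be applied programmatically. The sketch below treats each band's lower bound as inclusive, so a score such as 0.795 falls into "Good"; this is an assumption, since the table leaves values between bands unspecified. The function name `interpret` is illustrative.

```python
def interpret(score: float) -> str:
    """Map a score in [0, 1] to the label from the score-range table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.80:
        return "Excellent"
    if score >= 0.60:
        return "Good"
    if score >= 0.40:
        return "Fair"
    return "Poor"


label = interpret(0.855)
```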
### **What Each Metric Tells You:**
| Metric | Indicates | Action if Low |
|--------|-----------|---------------|
| Utilization | Are docs used? | If Relevance is also low, improve retrieval; otherwise strengthen the prompt or generator so it draws on the context |
| Relevance | Are retrieved docs relevant? | Improve embedding model or retrieval strategy |
| Adherence | Is response grounded? | Add guardrails to prevent hallucination |
| Completeness | Is response complete? | Increase response length or improve generation |
---
## 9. **Example Evaluation Scenario**
### **Scenario: Evaluating "wiki_qa" Dataset**
```
1. User Action:
   - Selects "wiki_qa" dataset
   - Selects "llama-3.1-8b" LLM
   - Sets 10 test samples
   - Clicks "Run Evaluation"
2. System Processing:
   - Loads 10 test questions from wiki_qa
   - For each question:
     a) Retrieves top 5 relevant Wikipedia articles
     b) Generates answer using LLM + context
   - Runs TRACE metrics on all 10 Q&A pairs
3. Results:
   Sample 1: "Who is Albert Einstein?"
   - Retrieved: Einstein biography article
   - Generated: "Albert Einstein was a theoretical physicist..."
   - Utilization: 0.85 ✅ (uses doc content)
   - Relevance: 0.92 ✅ (doc is about Einstein)
   - Adherence: 0.88 ✅ (response stays in doc)
   - Completeness: 0.90 ✅ (answers completely)
   - Average: 0.89
   Sample 2: "What did Einstein discover?"
   - Retrieved: Articles on relativity, quantum theory
   - Generated: "Einstein discovered the theory of relativity..."
   - Utilization: 0.78 ✅
   - Relevance: 0.85 ✅
   - Adherence: 0.82 ✅
   - Completeness: 0.85 ✅
   - Average: 0.82
   [Samples 3-10 evaluated similarly]
4. Final Results:
   - Average Utilization: 0.82
   - Average Relevance: 0.88
   - Average Adherence: 0.85
   - Average Completeness: 0.87
   - Overall TRACE Score: 0.855 (Excellent! ✅)
```
---
## 10. **Troubleshooting**
### **Common Issues:**
1. **Error: "No attribute dataset_name"**
- Solution: Load a collection first (sidebar config)
2. **Evaluation very slow**
- Solution: Reduce sample size or use faster LLM
3. **All scores near 0.5**
- Solution: Check if retrieval is working properly
4. **High variance in scores**
- Solution: Normal for diverse datasets; try more samples
---
## 11. **Advanced Usage**
### **Comparing Different Configurations**
You can evaluate the same dataset with different:
- Embedding models
- Chunking strategies
- LLM models
Then compare the results to find the best configuration.
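A comparison like this can be scripted once each run's summary is available. The sketch below uses hypothetical configuration names and illustrative numbers, not real results, and picks the best run for a chosen metric.

```python
from typing import Dict


def best_configuration(
    results: Dict[str, Dict[str, float]], metric: str = "average"
) -> str:
    """Return the configuration name with the highest value for `metric`."""
    if not results:
        raise ValueError("no results to compare")
    return max(results, key=lambda name: results[name][metric])


# Hypothetical summaries from two evaluation runs (illustrative numbers).
results = {
    "config_a": {"average": 0.81, "adherence": 0.79},
    "config_b": {"average": 0.77, "adherence": 0.84},
}
best_overall = best_configuration(results)                       # "config_a"
best_grounded = best_configuration(results, metric="adherence")  # "config_b"
```

Ranking by a single metric can disagree with the overall average, as here, so it helps to decide up front which metric matters most for the use case.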
### **Exporting Results**
```json
{
"utilization": 0.82,
"relevance": 0.88,
"adherence": 0.85,
"completeness": 0.87,
"average": 0.855,
"num_samples": 10,
"individual_scores": [...]
}
```
Save and track over time to measure improvements!
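One way to track results over time is to write each exported summary to a timestamped file and reload the set later. The file naming scheme and the helpers `save_results` and `load_history` are assumptions made for this sketch.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def save_results(results: dict, directory: Path, label: str) -> Path:
    """Write one evaluation summary to <label>_<UTC timestamp>.json."""
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"{label}_{stamp}.json"
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return path


def load_history(directory: Path) -> List[dict]:
    """Load saved runs sorted by file name (oldest first within one label)."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(directory.glob("*.json"))
    ]


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_results({"average": 0.855, "num_samples": 10}, Path(tmp), "wiki_qa")
    history = load_history(Path(tmp))
```

The timestamp has one-second resolution, so two runs saved under the same label within the same second would overwrite each other.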
---
## Summary
The evaluation system provides a comprehensive framework for assessing RAG application quality across 4 key dimensions. By understanding TRACE metrics, you can identify bottlenecks and optimize your RAG system for better performance.
**Key Takeaway:** TRACE evaluation helps you objectively measure and improve your RAG system! 🎯