# RAG Capstone Project - Evaluation System Guide
## Overview
The RAG Capstone Project uses the **TRACe evaluation framework** (from the RAGBench paper: arXiv:2407.11005) to assess the quality of Retrieval-Augmented Generation (RAG) responses. TRACe is a 4-metric framework that evaluates both the retriever and generator components:
- **T** = u**T**ilization (Context Utilization): How much of the retrieved context the generator actually uses to produce the response
- **R** = **R**elevance (Context Relevance): How much of the retrieved context is relevant to the query
- **A** = **A**dherence (Faithfulness/Groundedness/Attribution): Whether the response is grounded in and supported by the provided context (no hallucinations)
- **C** = **C**ompleteness: How much of the relevant information in the context is actually covered by the response
These 4 metrics provide comprehensive evaluation of RAG system quality, examining retriever performance (Relevance), generator quality (Adherence, Completeness), and effective resource utilization (Utilization).
---
## Evaluation Architecture
### 1. **High-Level Flow**
```
User selects dataset + samples
        ↓
Load test data from dataset
        ↓
For each test sample:
  ├─ Query the RAG system with question
  ├─ Get response + retrieved documents
  └─ Store as test case
        ↓
Run TRACe metrics on all test cases
        ↓
Aggregate results + display metrics
```
---
## 2. **TRACe Metrics Explained (Per RAGBench Paper)**
### **T = uTilization (Context Utilization)**
**What it measures:**
The fraction of the retrieved context that the generator actually uses to produce the response. Identifies if the LLM effectively leverages the provided documents.
**Paper Definition:**
$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $U_i$ = set of utilized (used) spans/tokens in document $d_i$
- $d_i$ = the full document $i$
- $\text{Len}()$ = length of the span (sentence, token, or character level)
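As a concrete illustration, here is a minimal sketch of this span-fraction computation, assuming the utilized spans have already been annotated as character offsets (in RAGBench they come from annotation, typically by an LLM judge); `span_fraction` and the example data are illustrative, not the project's actual code:

```python
def span_fraction(spans_per_doc, docs):
    """Fraction of total context length covered by the given spans.

    spans_per_doc: one list of (start, end) character offsets per document.
    docs: the retrieved document texts.
    Assumes spans within a document do not overlap.
    """
    covered = sum(end - start
                  for spans in spans_per_doc
                  for start, end in spans)
    total = sum(len(doc) for doc in docs)
    return covered / total if total else 0.0


docs = [
    "Einstein was a theoretical physicist. He was born in Ulm.",
    "The Eiffel Tower is in Paris.",
]
utilized = [[(0, 37)], []]  # only the first sentence of doc 1 was used

print(span_fraction(utilized, docs))  # 37 / 86, roughly 0.43
```

Relevance (next section) is the same fraction computed over the relevant spans $R_i$ instead of the utilized ones.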
**Interpretation:**
- **Low Utilization + Low Relevance** → Greedy retriever returning irrelevant docs
- **Low Utilization alone** → Weak generator fails to leverage good context
- **High Utilization** → Generator efficiently uses provided context
---
### **R = Relevance (Context Relevance)**
**What it measures:**
The fraction of the retrieved context that is actually relevant to answering the query. Evaluates retriever quality: does it return useful documents?
**Paper Definition:**
$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $R_i$ = set of relevant (useful) spans/tokens in document $d_i$
- $d_i$ = the full document $i$
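As a toy worked example (the numbers are assumed, not from the paper): with two retrieved documents of 100 and 80 tokens, of which 40 and 10 tokens respectively are relevant,

$$\text{Relevance} = \frac{40 + 10}{100 + 80} = \frac{50}{180} \approx 0.28$$

so only about a quarter of the retrieved context actually helps answer the query.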
**Interpretation:**
- **High Relevance** → Retriever returned mostly relevant documents
- **Low Relevance** → Retriever returned many irrelevant/noisy documents
- **High Relevance but Low Utilization** → Good docs retrieved, but generator doesn't use them
---
### **A = Adherence (Faithfulness / Groundedness / Attribution)**
**What it measures:**
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations: claims made without evidence in the documents.
**Paper Definition:**
Example-level: **Boolean**. True if all response sentences are supported by the context; False if any part of the response is unsupported/hallucinated.
Span/sentence-level: annotations can also mark which specific response sentences or spans are grounded.
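A minimal sketch of both views, assuming per-sentence support flags already exist (in practice they would come from an NLI model or LLM judge); the function names are illustrative:

```python
def example_level_adherence(sentence_supported):
    """Boolean adherence: True only if every response sentence
    is supported by the retrieved context."""
    return all(sentence_supported)


def sentence_level_adherence(sentence_supported):
    """Soft variant: fraction of supported response sentences."""
    return sum(sentence_supported) / len(sentence_supported)


# Support flags for a 3-sentence response (third sentence unsupported)
flags = [True, True, False]
print(example_level_adherence(flags))   # False -> hallucination present
print(sentence_level_adherence(flags))  # ~0.67
```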
**Interpretation:**
- **High Adherence (1.0)** → Response fully grounded, no hallucinations ✅
- **Low Adherence (0.0)** → Response contains unsupported claims ❌
- **Mid Adherence** → Partially grounded response (some claims supported, others not)
---
### **C = Completeness**
**What it measures:**
How much of the relevant information in the context is actually covered/incorporated by the response. Identifies missing information.
**Paper Definition:**
$$\text{Completeness} = \frac{\sum_i \text{Len}(R_i \cap U_i)}{\sum_i \text{Len}(R_i)}$$
Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans in document $d_i$ (information that is both relevant and used)
- $R_i$ = all relevant spans in document $d_i$
- Summing over all documents extends the per-document ratio to the example level
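A minimal sketch under the same character-offset convention as the utilization example above; it assumes spans within each list do not overlap, and the helper names are illustrative:

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def completeness(relevant, utilized):
    """Len(R intersect U) / Len(R), with one span list per document.
    Assumes spans within each list do not overlap."""
    covered = sum(overlap(r, u)
                  for rs, us in zip(relevant, utilized)
                  for r in rs for u in us)
    total = sum(end - start for rs in relevant for start, end in rs)
    return covered / total if total else 0.0


relevant = [[(0, 40)], [(10, 30)]]  # 60 relevant characters in total
utilized = [[(0, 25)], []]          # only 25 of them were used
print(completeness(relevant, utilized))  # 25 / 60, roughly 0.42
```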
**Interpretation:**
- **High Completeness** → Generator covers all relevant information from context
- **Low Completeness + High Utilization** → Generator uses context but misses key relevant facts
- **High Relevance + High Utilization + High Completeness** → Ideal RAG system ✅
---
## 3. **Evaluation Workflow in the Application**
### **Step 1: Configuration (Sidebar)**
```
User inputs:
  ├─ Groq API Key
  ├─ Selects dataset (e.g., "wiki_qa", "hotpot_qa", etc.)
  ├─ Selects LLM for evaluation (can differ from chat LLM)
  └─ Clicks "Load Data & Create Collection"
```
### **Step 2: Test Data Loading**
```python
# In streamlit_app.py - run_evaluation()
loader = RAGBenchLoader()
test_data = loader.get_test_data(
    dataset_name="wiki_qa",  # Selected dataset
    num_samples=10           # Number of samples to evaluate
)
# Returns: [{"question": "...", "answer": "..."}, ...]
```
**Available Datasets:**
- wiki_qa
- hotpot_qa
- nq_open
- And 9 more from RAGBench
### **Step 3: Test Case Preparation**
```python
# Build one test case per sample
test_cases = []
for sample in test_data:
    # Query the RAG system
    result = rag_pipeline.query(
        sample["question"],
        n_results=5  # Retrieve top 5 documents
    )
    # Create and store the test case
    test_cases.append({
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [doc["document"] for doc in result["retrieved_documents"]],
        "ground_truth": sample.get("answer", "")
    })
```
**What happens in `rag_pipeline.query()`:**
1. **Retrieval Phase:**
```python
retrieved_docs = vector_store.get_retrieved_documents(query, n_results=5)
# Returns: Top 5 most relevant documents from ChromaDB
```
2. **Generation Phase:**
```python
response = llm.generate_with_context(query, doc_texts, max_tokens=1024)
# Uses Groq LLM with context to generate response
```
3. **Result:**
```python
{
"query": "What is X?",
"response": "Generated answer based on docs...",
"retrieved_documents": [
{
"document": "doc content",
"distance": 0.123,
"metadata": {...}
},
...
]
}
```
### **Step 4: TRACe Evaluation**
```python
# In trace_evaluator.py
evaluator = TRACEEvaluator()
results = evaluator.evaluate_batch(test_cases)

# Internally, evaluate_batch() scores each test case:
for test_case in test_cases:
    scores = evaluator.evaluate(
        query=test_case["query"],
        response=test_case["response"],
        retrieved_documents=test_case["retrieved_documents"],
        ground_truth=test_case["ground_truth"]
    )
    # Returns TRACEScores with the 4 metrics
```
### **Step 5: Aggregation**
```python
# Average scores across all test cases
{
    "utilization": 0.75,     # Average utilization across samples
    "relevance": 0.82,       # Average relevance across samples
    "adherence": 0.79,       # Average adherence across samples
    "completeness": 0.88,    # Average completeness across samples
    "average": 0.81,         # Overall TRACe score
    "num_samples": 10,       # Number of samples evaluated
    "individual_scores": [   # Per-sample scores
        {
            "utilization": 0.70,
            "relevance": 0.85,
            "adherence": 0.75,
            "completeness": 0.90,
            "average": 0.80
        },
        ...
    ]
}
```
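A plausible sketch of how this aggregation could be computed; the real `evaluate_batch()` in `trace_evaluator.py` may differ in its details:

```python
from statistics import mean

METRICS = ["utilization", "relevance", "adherence", "completeness"]


def aggregate(individual_scores):
    """Average each TRACe metric across per-sample score dicts,
    producing the summary shape shown above."""
    summary = {m: mean(s[m] for s in individual_scores) for m in METRICS}
    summary["average"] = mean(summary[m] for m in METRICS)
    summary["num_samples"] = len(individual_scores)
    summary["individual_scores"] = individual_scores
    return summary
```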
---
## 4. **Results Display**
### **In Streamlit UI:**
```
Evaluation Results:
┌──────────────────────────────┐
│ Utilization:   0.751         │
│ Relevance:     0.823         │
│ Adherence:     0.789         │
│ Completeness:  0.881         │
│ Average:       0.811         │
└──────────────────────────────┘

Detailed Results:
[Expandable table with individual scores]

Download Results (JSON)
[Export button for results]
```
---
## 5. **Logging During Evaluation**
The application provides real-time logging:
```
Evaluation Logs:

Evaluation started at 2025-12-18 10:30:45
Dataset: wiki_qa
Total samples: 10
LLM Model: llama-3.1-8b
Vector Store: wiki_qa_dense_all_mpnet
Embedding Model: all-mpnet-base-v2
Loading test data...
✅ Loaded 10 test samples
Processing samples...
✅ Processed 10/10 samples
Running TRACe evaluation metrics...
✅ Evaluation completed successfully!
  • Utilization: 75.10%
  • Relevance: 82.34%
  • Adherence: 78.91%
  • Completeness: 88.12%
Evaluation completed at 2025-12-18 10:31:30
```
---
## 6. **Key Components**
### **trace_evaluator.py**
**Main Classes:**
- `TRACEScores`: Dataclass holding 4 metric scores
- `TRACEEvaluator`: Main evaluator class
**Key Methods:**
```python
evaluate() # Evaluate single test case
evaluate_batch() # Evaluate multiple test cases
_compute_utilization() # Metric: utilization
_compute_relevance() # Metric: relevance
_compute_adherence() # Metric: adherence
_compute_completeness() # Metric: completeness
```
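For orientation, a hedged sketch of what `TRACEScores` likely looks like, based on the result dicts shown earlier (the actual definition lives in `trace_evaluator.py` and may differ):

```python
from dataclasses import dataclass


@dataclass
class TRACEScores:
    """Illustrative shape of the per-sample score container;
    field names follow the result dicts shown in Step 5."""
    utilization: float
    relevance: float
    adherence: float
    completeness: float

    @property
    def average(self) -> float:
        return (self.utilization + self.relevance
                + self.adherence + self.completeness) / 4
```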
### **dataset_loader.py**
**Key Methods:**
```python
get_test_data(dataset_name, num_samples) # Load test samples
get_test_data_size(dataset_name) # Get max available samples
```
### **llm_client.py - RAGPipeline**
**Key Method:**
```python
query(query_str, n_results=5) # Query RAG system
# Returns: {"query", "response", "retrieved_documents"}
```
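A short usage example based on the return shape documented in Step 3:

```python
result = rag_pipeline.query("Who is Albert Einstein?", n_results=5)
print(result["response"])
for doc in result["retrieved_documents"]:
    # distance: lower means more similar under ChromaDB's default metric
    print(f'{doc["distance"]:.3f}  {doc["document"][:80]}')
```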
---
## 7. **Performance Considerations**
### **Time Complexity**
- Loading 10 samples: ~5-10 seconds
- Processing per sample: ~2-5 seconds (LLM generation)
- TRACE evaluation per sample: ~100-500ms
- **Total for 10 samples: ~3-7 minutes** (depending on LLM)
### **Optimization Tips**
1. Start with smaller sample sizes (5-10) for testing
2. Use faster LLM models for initial evaluation
3. Results are cached in session state (see the sketch after this list)
4. Can download and reuse evaluation results
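A minimal Streamlit sketch of such session-state caching; the `eval_results` key is illustrative, not necessarily the key the app actually uses:

```python
import streamlit as st

# Run the expensive evaluation once per session and cache the result;
# "eval_results" is a hypothetical session-state key.
if "eval_results" not in st.session_state:
    st.session_state["eval_results"] = run_evaluation()
results = st.session_state["eval_results"]
```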
---
## 8. **Interpreting Scores**
### **Score Ranges:**
| Range | Interpretation |
|-------|----------------|
| 0.80-1.00 | Excellent ✅ |
| 0.60-0.79 | Good 👍 |
| 0.40-0.59 | Fair ⚠️ |
| 0.00-0.39 | Poor ❌ |
### **What Each Metric Tells You:**
| Metric | Indicates | Action if Low |
|--------|-----------|---------------|
| Utilization | Are docs used? | Add more relevant docs, improve retrieval |
| Relevance | Are retrieved docs relevant? | Improve embedding model or retrieval strategy |
| Adherence | Is response grounded? | Add guardrails to prevent hallucination |
| Completeness | Is response complete? | Increase response length or improve generation |
---
## 9. **Example Evaluation Scenario**
### **Scenario: Evaluating "wiki_qa" Dataset**
```
1. User Action:
   - Selects "wiki_qa" dataset
   - Selects "llama-3.1-8b" LLM
   - Sets 10 test samples
   - Clicks "Run Evaluation"

2. System Processing:
   - Loads 10 test questions from wiki_qa
   - For each question:
     a) Retrieves top 5 relevant Wikipedia articles
     b) Generates answer using LLM + context
   - Runs TRACe metrics on all 10 Q&A pairs

3. Results:
   Sample 1: "Who is Albert Einstein?"
   - Retrieved: Einstein biography article
   - Generated: "Albert Einstein was a theoretical physicist..."
   - Utilization: 0.85 ✅ (uses doc content)
   - Relevance: 0.92 ✅ (doc is about Einstein)
   - Adherence: 0.88 ✅ (response stays in doc)
   - Completeness: 0.90 ✅ (answers completely)
   - Average: 0.89

   Sample 2: "What did Einstein discover?"
   - Retrieved: Articles on relativity, quantum theory
   - Generated: "Einstein discovered the theory of relativity..."
   - Utilization: 0.78 ✅
   - Relevance: 0.85 ✅
   - Adherence: 0.82 ✅
   - Completeness: 0.85 ✅
   - Average: 0.82

   [Samples 3-10 evaluated similarly]

4. Final Results:
   - Average Utilization: 0.82
   - Average Relevance: 0.88
   - Average Adherence: 0.85
   - Average Completeness: 0.87
   - Overall TRACe Score: 0.855 (Excellent! ✅)
```
---
## 10. **Troubleshooting**
### **Common Issues:**
1. **Error: "No attribute dataset_name"**
- Solution: Load a collection first (sidebar config)
2. **Evaluation very slow**
- Solution: Reduce sample size or use faster LLM
3. **All scores near 0.5**
- Solution: Check if retrieval is working properly
4. **High variance in scores**
- Solution: Normal for diverse datasets; try more samples
---
## 11. **Advanced Usage**
### **Comparing Different Configurations**
You can evaluate the same dataset with different:
- Embedding models
- Chunking strategies
- LLM models
Then compare results to find optimal configuration.
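A hedged sketch of such a comparison loop; `build_pipeline` and `prepare_test_cases` are hypothetical helpers standing in for the setup shown in Steps 2-3, and the config values are examples only:

```python
configs = [
    {"embedding_model": "all-mpnet-base-v2", "chunk_size": 512},
    {"embedding_model": "all-MiniLM-L6-v2", "chunk_size": 256},
]

comparison = {}
for cfg in configs:
    pipeline = build_pipeline(**cfg)        # hypothetical factory
    cases = prepare_test_cases(pipeline)    # Steps 2-3 above
    comparison[cfg["embedding_model"]] = TRACEEvaluator().evaluate_batch(cases)

# Pick the configuration with the best overall TRACe score
best = max(comparison, key=lambda name: comparison[name]["average"])
print(best, comparison[best]["average"])
```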
### **Exporting Results**
```json
{
"utilization": 0.82,
"relevance": 0.88,
"adherence": 0.85,
"completeness": 0.87,
"average": 0.855,
"num_samples": 10,
"individual_scores": [...]
}
```
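For example, results can be persisted and compared across runs (the filename is illustrative):

```python
import json

# Save this run's results
with open("trace_results_wiki_qa.json", "w") as f:
    json.dump(results, f, indent=2)

# Later: load a previous run and compare overall scores
with open("trace_results_wiki_qa.json") as f:
    baseline = json.load(f)
print(f'Change vs. baseline: {results["average"] - baseline["average"]:+.3f}')
```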
Save and track over time to measure improvements!
---
## Summary
The evaluation system provides a comprehensive framework for assessing RAG application quality across 4 key dimensions. By understanding the TRACe metrics, you can identify bottlenecks and optimize your RAG system for better performance.
**Key Takeaway:** TRACe evaluation helps you objectively measure and improve your RAG system!