	# RAG Capstone Project - Evaluation System Guide

	## Overview

	The RAG Capstone Project uses the TRACe evaluation framework (from the RAGBench paper: arXiv:2407.11005) to assess the quality of Retrieval-Augmented Generation (RAG) responses. TRACe is a 4-metric framework that evaluates both the retriever and generator components:

	- T — uTilization (Context Utilization): How much of the retrieved context the generator actually uses to produce the response
	- R — Relevance (Context Relevance): How much of the retrieved context is relevant to the query
	- A — Adherence (Faithfulness/Groundedness/Attribution): Whether the response is grounded in and supported by the provided context (no hallucinations)
	- C — Completeness: How much of the relevant information in the context is actually covered by the response

	Together, these four metrics cover retriever performance (Relevance), generator quality (Adherence, Completeness), and how effectively the generator uses the retrieved context (Utilization).

	---

	## 1. Evaluation Architecture

	### High-Level Flow

	```
	User selects dataset + samples
	↓
	Load test data from dataset
	↓
	For each test sample:
	├─ Query the RAG system with question
	├─ Get response + retrieved documents
	└─ Store as test case
	↓
	Run TRACE metrics on all test cases
	↓
	Aggregate results + Display metrics
	```

	---

	## 2. TRACe Metrics Explained (Per RAGBench Paper)

	### T — uTilization (Context Utilization)

	What it measures:
	The fraction of the retrieved context that the generator actually uses to produce the response. Identifies if the LLM effectively leverages the provided documents.

	Paper Definition:
	$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

	Where:
	- $U_i$ = set of utilized (used) spans/tokens in document $d_i$
	- $d_i$ = the full document $i$
	- $\text{Len}()$ = length of the span (sentence, token, or character level)

	Interpretation:
	- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
	- Low Utilization + High Relevance → Weak generator that fails to leverage good context
	- High Utilization → Generator efficiently uses provided context
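
As a minimal sketch of the formula above, assuming sentence-level annotations and whitespace tokens as the length unit. The function name `utilization` and the data layout are illustrative and not taken from the project code:

```python
def utilization(documents, utilized):
    """Fraction of retrieved context (in tokens) that the response used.

    documents: list of documents, each a list of sentence strings.
    utilized:  list of sets; utilized[i] holds the indices of the
               sentences in documents[i] that the generator used (U_i).
    """
    def length(sentence):
        # Len() is taken to be the whitespace-token count (an assumption).
        return len(sentence.split())

    total = sum(length(s) for doc in documents for s in doc)
    used = sum(
        length(doc[j])
        for doc, idxs in zip(documents, utilized)
        for j in idxs
    )
    return used / total if total else 0.0


docs = [
    ["Paris is the capital of France.", "It lies on the Seine."],
    ["Berlin is the capital of Germany."],
]
used_spans = [{0}, set()]  # only the first sentence of doc 0 was used
print(round(utilization(docs, used_spans), 3))
```

Here 6 of the 17 retrieved tokens are used, so utilization is low even though the answer may be correct.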

	---

	### R — Relevance (Context Relevance)

	What it measures:
	The fraction of the retrieved context that is actually relevant to answering the query. Evaluates retriever quality—does it return useful documents?

	Paper Definition:
	$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

	Where:
	- $R_i$ = set of relevant (useful) spans/tokens in document $d_i$
	- $d_i$ = the full document $i$

	Interpretation:
	- High Relevance → Retriever returned mostly relevant documents
	- Low Relevance → Retriever returned many irrelevant/noisy documents
	- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them
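
The same ratio can be sketched at the character level. This example assumes relevant text is annotated as non-overlapping `(start, end)` character offsets per document; the function name and annotation format are hypothetical, chosen only for illustration:

```python
def relevance(documents, relevant_spans):
    """Fraction of retrieved characters that fall inside relevant spans.

    documents:      list of document strings (d_i).
    relevant_spans: per document, a list of (start, end) character
                    offsets marking relevant text (R_i); spans are
                    assumed not to overlap.
    """
    total = sum(len(doc) for doc in documents)
    relevant = sum(
        end - start
        for spans in relevant_spans
        for start, end in spans
    )
    return relevant / total if total else 0.0


docs = [
    "Einstein was born in 1879. He liked sailing.",
    "Bern is in Switzerland.",
]
# Mark only the first sentence of the first document as relevant.
first_sentence = docs[0].index(".") + 1
spans = [[(0, first_sentence)], []]
score = relevance(docs, spans)
print(f"{score:.3f}")
```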

	---

	### A — Adherence (Faithfulness / Groundedness / Attribution)

	What it measures:
	Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations—claims made without evidence in the documents.

	Paper Definition:
	Example-level: Boolean — True if all response sentences are supported by the context; False if any part of the response is unsupported/hallucinated

	Span/Sentence-level: Can also annotate which specific response sentences or spans are grounded.

	Interpretation:
	- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
	- Low Adherence (0.0) → Response contains unsupported claims ❌
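	- Mid Adherence (between 0 and 1) → Partially grounded: at the sentence/span level, only some claims are supported; as an average over examples, only some responses are fully grounded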
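
A short sketch of how the example-level Boolean relates to an averaged score, assuming each response has already been annotated with one supported/unsupported flag per sentence. How those flags are produced (for example by an LLM judge) is outside this sketch, and both function names are illustrative:

```python
def example_adherence(sentence_supported):
    """True only if every response sentence is supported by the context.

    Note: an empty list yields True (vacuously adherent).
    """
    return all(sentence_supported)


def mean_adherence(examples):
    """Share of examples whose responses are fully grounded."""
    flags = [example_adherence(e) for e in examples]
    return sum(flags) / len(flags) if flags else 0.0


# Three responses; the second contains one unsupported sentence.
examples = [[True, True], [True, False], [True]]
print(example_adherence(examples[1]))
print(round(mean_adherence(examples), 3))
```

A single unsupported sentence makes the whole example non-adherent, which is why the averaged score can drop quickly on long responses.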

	---

	### C — Completeness

	What it measures:
	How much of the relevant information in the context is actually covered/incorporated by the response. Identifies missing information.

	Paper Definition:
	$$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$

	Where:
	- $R_i \cap U_i$ = intersection of relevant AND utilized spans (info that is both relevant and used)
	- $R_i$ = all relevant spans
	- Extended to example-level by aggregating across all documents

	Interpretation:
	- High Completeness → Generator covers all relevant information from context
	- Low Completeness + High Utilization → Generator uses context but misses key relevant facts
	- High Relevance + High Utilization + High Completeness → Ideal RAG system ✅
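
The intersection in the formula maps naturally onto set operations. The sketch below assumes sentences are identified by `(doc_index, sentence_index)` pairs with a known length each; the names and the choice to return 0.0 when nothing is relevant are assumptions, not behavior confirmed by the project code:

```python
def completeness(relevant, utilized, sentence_len):
    """Share of relevant context (by length) that the response also used.

    relevant, utilized: sets of (doc_index, sentence_index) pairs,
                        pooled across all retrieved documents.
    sentence_len:       dict mapping each pair to its length.
    """
    covered = relevant & utilized  # R ∩ U
    denom = sum(sentence_len[s] for s in relevant)
    numer = sum(sentence_len[s] for s in covered)
    # Convention chosen for this sketch: no relevant content -> 0.0.
    return numer / denom if denom else 0.0


sentence_len = {(0, 0): 6, (0, 1): 5, (1, 0): 6}
relevant = {(0, 0), (0, 1)}   # both sentences of doc 0 are relevant
utilized = {(0, 0), (1, 0)}   # response used one relevant, one irrelevant
print(round(completeness(relevant, utilized, sentence_len), 3))
```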

	---

	## 3. Evaluation Workflow in the Application

	### Step 1: Configuration (Sidebar)

	```
	User inputs:
	├─ Groq API Key
	├─ Selects dataset (e.g., "wiki_qa", "hotpot_qa", etc.)
	├─ Selects LLM for evaluation (can differ from chat LLM)
	└─ Clicks "Load Data & Create Collection"
	```

	### Step 2: Test Data Loading

	```python
	# In streamlit_app.py - run_evaluation()
	from dataset_loader import RAGBenchLoader  # assumed import path (see Key Components)

	loader = RAGBenchLoader()
	test_data = loader.get_test_data(
	    dataset_name="wiki_qa",  # Selected dataset
	    num_samples=10,          # Number to evaluate
	)
	# Returns: [{"question": "...", "answer": "..."}, ...]
	```

	Available Datasets:
	- wiki_qa
	- hotpot_qa
	- nq_open
	- And 9 more from RAGBench

	### Step 3: Test Case Preparation

	```python
	test_cases = []

	# For each test sample:
	for sample in test_data:
	    # Query RAG system
	    result = rag_pipeline.query(
	        sample["question"],
	        n_results=5,  # Retrieve top 5 documents
	    )

	    # Create test case
	    test_case = {
	        "query": sample["question"],
	        "response": result["response"],
	        "retrieved_documents": [doc["document"] for doc in result["retrieved_documents"]],
	        "ground_truth": sample.get("answer", ""),
	    }
	    test_cases.append(test_case)  # Store for the TRACE evaluation step
	```

	What happens in `rag_pipeline.query()`:

	1. Retrieval Phase:
	```python
	retrieved_docs = vector_store.get_retrieved_documents(query, n_results=5)
	# Returns: Top 5 most relevant documents from ChromaDB
	```

	2. Generation Phase:
	```python
	response = llm.generate_with_context(query, doc_texts, max_tokens=1024)
	# Uses Groq LLM with context to generate response
	```

	3. Result:
	```python
	{
	    "query": "What is X?",
	    "response": "Generated answer based on docs...",
	    "retrieved_documents": [
	        {
	            "document": "doc content",
	            "distance": 0.123,
	            "metadata": {...}
	        },
	        ...
	    ]
	}
	```

	### Step 4: TRACE Evaluation

	```python
	# In trace_evaluator.py
	evaluator = TRACEEvaluator()
	results = evaluator.evaluate_batch(test_cases)

	# For each test case:
	for test_case in test_cases:
	    scores = evaluator.evaluate(
	        query=test_case["query"],
	        response=test_case["response"],
	        retrieved_documents=test_case["retrieved_documents"],
	        ground_truth=test_case["ground_truth"],
	    )
	    # Returns TRACEScores with 4 metrics
	```

	### Step 5: Aggregation

	```python
	# Average scores across all test cases
	{
	    "utilization": 0.75,       # Average utilization across samples
	    "relevance": 0.82,         # Average relevance across samples
	    "adherence": 0.79,         # Average adherence across samples
	    "completeness": 0.88,      # Average completeness across samples
	    "average": 0.81,           # Overall TRACE score
	    "num_samples": 10,         # Number of samples evaluated
	    "individual_scores": [     # Per-sample scores
	        {
	            "utilization": 0.70,
	            "relevance": 0.85,
	            "adherence": 0.75,
	            "completeness": 0.90,
	            "average": 0.80
	        },
	        ...
	    ]
	}
	```
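
A minimal sketch of how such a summary could be built from per-sample scores. It assumes each metric is averaged with equal weight and that the overall score is the mean of the four metric means; the `aggregate` function is illustrative and may differ from what `evaluate_batch()` actually does:

```python
from statistics import mean

METRICS = ("utilization", "relevance", "adherence", "completeness")


def aggregate(individual_scores):
    """Average per-sample metric dicts into one summary dict."""
    if not individual_scores:
        raise ValueError("need at least one scored sample")
    summary = {m: mean(s[m] for s in individual_scores) for m in METRICS}
    # Overall score: unweighted mean of the four metric means (assumption).
    summary["average"] = mean(summary[m] for m in METRICS)
    summary["num_samples"] = len(individual_scores)
    summary["individual_scores"] = individual_scores
    return summary


samples = [
    {"utilization": 0.70, "relevance": 0.85, "adherence": 0.75, "completeness": 0.90},
    {"utilization": 0.80, "relevance": 0.79, "adherence": 0.83, "completeness": 0.86},
]
summary = aggregate(samples)
print({m: round(summary[m], 3) for m in METRICS + ("average",)})
```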

	---

	## 4. Results Display

	### In Streamlit UI:

	```
	📊 Evaluation Results:
	┌────────────────────────────────────────────┐
	│ 📊 Utilization:  0.751                     │
	│ 🎯 Relevance:    0.823                     │
	│ ✅ Adherence:    0.789                     │
	│ 📝 Completeness: 0.881                     │
	│ ⭐ Average:      0.811                     │
	└────────────────────────────────────────────┘

	📋 Detailed Results:
	[Expandable table with individual scores]

	💾 Download Results (JSON)
	[Export button for results]
	```

	---

	## 5. Logging During Evaluation

	The application provides real-time logging:

	```
	📋 Evaluation Logs:
	⏱️ Evaluation started at 2025-12-18 10:30:45
	📊 Dataset: wiki_qa
	📈 Total samples: 10
	🤖 LLM Model: llama-3.1-8b
	🔗 Vector Store: wiki_qa_dense_all_mpnet
	🧠 Embedding Model: all-mpnet-base-v2
	📥 Loading test data...
	✅ Loaded 10 test samples
	🔍 Processing samples...
	✓ Processed 10/10 samples
	📊 Running TRACE evaluation metrics...
	✅ Evaluation completed successfully!
	• Utilization: 75.10%
	• Relevance: 82.34%
	• Adherence: 78.91%
	• Completeness: 88.12%
	⏱️ Evaluation completed at 2025-12-18 10:31:30
	```

	---

	## 6. Key Components

	### trace_evaluator.py

	Main Classes:
	- `TRACEScores`: Dataclass holding 4 metric scores
	- `TRACEEvaluator`: Main evaluator class

	Key Methods:
	```python
	evaluate() # Evaluate single test case
	evaluate_batch() # Evaluate multiple test cases
	_compute_utilization() # Metric: utilization
	_compute_relevance() # Metric: relevance
	_compute_adherence() # Metric: adherence
	_compute_completeness() # Metric: completeness
	```
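
For orientation, a dataclass holding the four scores might look roughly like the sketch below. This is a hypothetical stand-in (deliberately named `TRACEScoresSketch`), not the project's actual `TRACEScores` definition; the `average()` and `to_dict()` helpers are assumptions:

```python
from dataclasses import asdict, dataclass


@dataclass
class TRACEScoresSketch:
    """Hypothetical container for the four TRACe metrics (each in [0, 1])."""
    utilization: float
    relevance: float
    adherence: float
    completeness: float

    def average(self) -> float:
        # Unweighted mean of the four metrics.
        return (self.utilization + self.relevance
                + self.adherence + self.completeness) / 4

    def to_dict(self) -> dict:
        # Same shape as one entry of "individual_scores".
        data = asdict(self)
        data["average"] = self.average()
        return data


scores = TRACEScoresSketch(0.70, 0.85, 0.75, 0.90)
print(scores.to_dict())
```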

	### dataset_loader.py

	Key Methods:
	```python
	get_test_data(dataset_name, num_samples) # Load test samples
	get_test_data_size(dataset_name) # Get max available samples
	```

	### llm_client.py - RAGPipeline

	Key Method:
	```python
	query(query_str, n_results=5) # Query RAG system
	# Returns: {"query", "response", "retrieved_documents"}
	```

	---

	## 7. Performance Considerations

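	### Expected Runtime
	- Loading 10 samples: ~5-10 seconds
	- Processing per sample: ~2-5 seconds (LLM generation)
	- TRACE evaluation per sample: ~100-500 ms
	- Total for 10 samples: roughly 30-65 seconds from the figures above (the sample log in Section 5 shows 45 seconds); slower LLMs can stretch this to several minutes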
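
Summing the per-step figures listed above gives a rough floor for the total. The sketch below ignores network latency, rate limits, and retries, and it assumes the loading time stays fixed regardless of sample count; the function and its parameter names are illustrative:

```python
def estimate_runtime_seconds(
    num_samples,
    load=(5.0, 10.0),        # one-off data loading, seconds
    generate=(2.0, 5.0),     # LLM generation per sample, seconds
    evaluate=(0.1, 0.5),     # TRACE scoring per sample, seconds
):
    """Return a (low, high) runtime estimate in seconds."""
    low = load[0] + num_samples * (generate[0] + evaluate[0])
    high = load[1] + num_samples * (generate[1] + evaluate[1])
    return low, high


low, high = estimate_runtime_seconds(10)
print(f"10 samples: about {low:.0f}-{high:.0f} seconds")
```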

	### Optimization Tips
	1. Start with smaller sample sizes (5-10) for testing
	2. Use faster LLM models for initial evaluation
	3. Results are cached in session state, so they persist across reruns within the same session
	4. Download evaluation results and reuse them instead of re-running the evaluation

	---

	## 8. Interpreting Scores

	### Score Ranges:

	| Range | Interpretation |
	|-------|----------------|
	| 0.80-1.00 | Excellent ✅ |
	| 0.60-0.79 | Good 👍 |
	| 0.40-0.59 | Fair ⚠️ |
	| 0.00-0.39 | Poor ❌ |
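
The table can be expressed as a small lookup. This sketch treats each band as starting at its lower bound (so a score such as 0.795 counts as "Good"), which is an assumption about how the gaps between bands should be handled:

```python
def interpret(score):
    """Map a score in [0, 1] to the label from the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.80:
        return "Excellent"
    if score >= 0.60:
        return "Good"
    if score >= 0.40:
        return "Fair"
    return "Poor"


for value in (0.855, 0.79, 0.40, 0.39):
    print(value, interpret(value))
```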

	### What Each Metric Tells You:

	| Metric | Indicates | Action if Low |
	|--------|-----------|---------------|
	| Utilization | Are docs used? | Add more relevant docs, improve retrieval |
	| Relevance | Are retrieved docs relevant? | Improve embedding model or retrieval strategy |
	| Adherence | Is response grounded? | Add guardrails to prevent hallucination |
	| Completeness | Is response complete? | Increase response length or improve generation |
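
One way to apply this table automatically is to flag every metric that falls below a cutoff. The 0.60 threshold (the boundary between "Good" and "Fair") and the `suggest_actions` helper are assumptions made for this sketch:

```python
ACTIONS = {
    "utilization": "Add more relevant docs, improve retrieval",
    "relevance": "Improve embedding model or retrieval strategy",
    "adherence": "Add guardrails to prevent hallucination",
    "completeness": "Increase response length or improve generation",
}


def suggest_actions(scores, threshold=0.60):
    """Return {metric: action} for every metric scoring below threshold."""
    return {
        metric: action
        for metric, action in ACTIONS.items()
        if metric in scores and scores[metric] < threshold
    }


example = {"utilization": 0.45, "relevance": 0.82,
           "adherence": 0.55, "completeness": 0.90}
for metric, action in suggest_actions(example).items():
    print(f"{metric}: {action}")
```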

	---

	## 9. Example Evaluation Scenario

	### Scenario: Evaluating "wiki_qa" Dataset

	```
	1. User Action:
	   - Selects "wiki_qa" dataset
	   - Selects "llama-3.1-8b" LLM
	   - Sets 10 test samples
	   - Clicks "Run Evaluation"

	2. System Processing:
	   - Loads 10 test questions from wiki_qa
	   - For each question:
	     a) Retrieves top 5 relevant Wikipedia articles
	     b) Generates answer using LLM + context
	   - Runs TRACE metrics on all 10 Q&A pairs

	3. Results:
	   Sample 1: "Who is Albert Einstein?"
	   - Retrieved: Einstein biography article
	   - Generated: "Albert Einstein was a theoretical physicist..."
	   - Utilization: 0.85 ✅ (uses doc content)
	   - Relevance: 0.92 ✅ (doc is about Einstein)
	   - Adherence: 0.88 ✅ (response stays in doc)
	   - Completeness: 0.90 ✅ (answers completely)
	   - Average: 0.89

	   Sample 2: "What did Einstein discover?"
	   - Retrieved: Articles on relativity, quantum theory
	   - Generated: "Einstein discovered the theory of relativity..."
	   - Utilization: 0.78 ✅
	   - Relevance: 0.85 ✅
	   - Adherence: 0.82 ✅
	   - Completeness: 0.85 ✅
	   - Average: 0.82

	   [Samples 3-10 evaluated similarly]

	4. Final Results:
	   - Average Utilization: 0.82
	   - Average Relevance: 0.88
	   - Average Adherence: 0.85
	   - Average Completeness: 0.87
	   - Overall TRACE Score: 0.855 (Excellent! ✅)
	```

	---

	## 10. Troubleshooting

	### Common Issues:

	1. Error: "No attribute dataset_name"
	- Solution: Load a collection first (sidebar config)

	2. Evaluation very slow
	- Solution: Reduce sample size or use faster LLM

	3. All scores near 0.5
	- Solution: Check if retrieval is working properly

	4. High variance in scores
	- Solution: Normal for diverse datasets; try more samples
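
To check issues 3 and 4 with numbers rather than by eye, the per-sample scores can be summarized by mean and spread. A minimal sketch, assuming the `individual_scores` list shape shown in the aggregation step; the helper name is hypothetical:

```python
from statistics import mean, pstdev


def score_spread(individual_scores, metric):
    """Return (mean, population standard deviation) for one metric."""
    values = [sample[metric] for sample in individual_scores]
    return mean(values), pstdev(values)


individual_scores = [
    {"adherence": 0.70},
    {"adherence": 0.90},
]
avg, spread = score_spread(individual_scores, "adherence")
print(f"adherence: mean={avg:.2f}, std={spread:.2f}")
```

A mean close to 0.5 with a very small spread suggests the scores are not responding to the inputs, which is a hint to inspect retrieval first.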

	---

	## 11. Advanced Usage

	### Comparing Different Configurations

	You can evaluate the same dataset with different:
	- Embedding models
	- Chunking strategies
	- LLM models

	Then compare the results to find the optimal configuration.
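
A comparison can be as simple as diffing two result dictionaries. In the sketch below the configuration labels `config_a` and `config_b` are placeholders, the numbers are reused from the examples in this guide, and both helper functions are illustrative:

```python
METRICS = ("utilization", "relevance", "adherence", "completeness", "average")


def compare_runs(baseline, candidate):
    """Per-metric change (candidate minus baseline), rounded to 3 decimals."""
    return {m: round(candidate[m] - baseline[m], 3) for m in METRICS}


def best_run(runs):
    """Name of the run with the highest overall average."""
    return max(runs, key=lambda name: runs[name]["average"])


runs = {
    "config_a": {"utilization": 0.75, "relevance": 0.82, "adherence": 0.79,
                 "completeness": 0.88, "average": 0.81},
    "config_b": {"utilization": 0.82, "relevance": 0.88, "adherence": 0.85,
                 "completeness": 0.87, "average": 0.855},
}
delta = compare_runs(runs["config_a"], runs["config_b"])
print(delta)
print("best:", best_run(runs))
```

Looking at the per-metric deltas rather than only the average shows trade-offs, such as a configuration that improves utilization while slightly lowering completeness.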

	### Exporting Results

	```json
	{
	  "utilization": 0.82,
	  "relevance": 0.88,
	  "adherence": 0.85,
	  "completeness": 0.87,
	  "average": 0.855,
	  "num_samples": 10,
	  "individual_scores": [...]
	}
	```

	Save and track over time to measure improvements!
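
A minimal sketch for tracking exported results over time, assuming each run is saved as its own timestamped JSON file. The directory name, file naming scheme, and both helpers are assumptions and are not part of the application:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_results(results, directory="eval_runs"):
    """Write one results dict to a timestamped JSON file; return its path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = folder / f"trace_{stamp}.json"
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    return path


def load_history(directory="eval_runs"):
    """Load all saved runs, oldest first (file names sort by timestamp)."""
    files = sorted(Path(directory).glob("trace_*.json"))
    return [json.loads(f.read_text(encoding="utf-8")) for f in files]


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    save_results({"average": 0.855, "num_samples": 10}, directory=tmp)
    history = load_history(tmp)
    print([run["average"] for run in history])
```

Timestamps have one-second resolution here, so two runs saved within the same second would overwrite each other; add a counter or run label if that matters.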

	---

	## Summary

	The evaluation system provides a comprehensive framework for assessing RAG application quality across 4 key dimensions. By understanding TRACE metrics, you can identify bottlenecks and optimize your RAG system for better performance.

	Key Takeaway: TRACE evaluation helps you objectively measure and improve your RAG system! 🎯