RAG Capstone Project - Evaluation System Guide
Overview
The RAG Capstone Project uses the TRACe evaluation framework (from the RAGBench paper: arXiv:2407.11005) to assess the quality of Retrieval-Augmented Generation (RAG) responses. TRACe is a 4-metric framework that evaluates both the retriever and generator components:
- T – uTilization (Context Utilization): How much of the retrieved context the generator actually uses to produce the response
- R – Relevance (Context Relevance): How much of the retrieved context is relevant to the query
- A – Adherence (Faithfulness/Groundedness/Attribution): Whether the response is grounded in and supported by the provided context (no hallucinations)
- C – Completeness: How much of the relevant information in the context is actually covered by the response
These 4 metrics provide comprehensive evaluation of RAG system quality, examining retriever performance (Relevance), generator quality (Adherence, Completeness), and effective resource utilization (Utilization).
Evaluation Architecture
1. High-Level Flow
User selects dataset + samples
        ↓
Load test data from dataset
        ↓
For each test sample:
  ├─ Query the RAG system with question
  ├─ Get response + retrieved documents
  └─ Store as test case
        ↓
Run TRACE metrics on all test cases
        ↓
Aggregate results + Display metrics
2. TRACe Metrics Explained (Per RAGBench Paper)
T – uTilization (Context Utilization)
What it measures:
The fraction of the retrieved context that the generator actually uses to produce the response. Identifies if the LLM effectively leverages the provided documents.
Paper Definition:

$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $U_i$ = set of utilized (used) spans/tokens in document $d_i$
- $d_i$ = the full document $i$
- $\text{Len}()$ = length of the span (sentence, token, or character level)
Interpretation:
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context
R – Relevance (Context Relevance)
What it measures:
The fraction of the retrieved context that is actually relevant to answering the query. Evaluates retriever qualityβdoes it return useful documents?
Paper Definition:

$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $R_i$ = set of relevant (useful) spans/tokens in document $d_i$
- $d_i$ = the full document $i$
Interpretation:
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them
A – Adherence (Faithfulness / Groundedness / Attribution)
What it measures:
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinationsβclaims made without evidence in the documents.
Paper Definition:
Example-level: Boolean. True if all response sentences are supported by the context; False if any part of the response is unsupported/hallucinated.
Span/Sentence-level: Can also annotate which specific response sentences or spans are grounded.
Interpretation:
- High Adherence (1.0) → Response fully grounded, no hallucinations
- Low Adherence (0.0) → Response contains unsupported claims
- Mid Adherence → Partially grounded response (some claims supported, others not)
C β Completeness
What it measures:
How much of the relevant information in the context is actually covered/incorporated by the response. Identifies missing information.
Paper Definition:

$$\text{Completeness} = \frac{\sum_i \text{Len}(R_i \cap U_i)}{\sum_i \text{Len}(R_i)}$$

Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans (info that is both relevant and used)
- $R_i$ = all relevant spans
- Extended to example-level by aggregating across all documents
Interpretation:
- High Completeness → Generator covers all relevant information from context
- Low Completeness + High Utilization → Generator uses context but misses key relevant facts
- High Relevance + High Utilization + High Completeness → Ideal RAG system
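Taken together, the three ratio metrics above reduce to a few lines of arithmetic. The sketch below is illustrative only, assuming span annotations are available as sets of token indices per document; the names `docs`, `utilized`, `relevant`, and `adherent` are hypothetical and not the app's API:

```python
def trace_scores(docs, utilized, relevant, adherent):
    """Compute the four TRACe metrics at token level.

    docs:     list of token lists (one per retrieved document)
    utilized: list of sets of utilized token indices, one set per doc
    relevant: list of sets of relevant token indices, one set per doc
    adherent: example-level boolean adherence judgment
    """
    total_len = sum(len(d) for d in docs)
    # Utilization = utilized tokens / all retrieved tokens
    util = sum(len(u) for u in utilized) / total_len
    # Relevance = relevant tokens / all retrieved tokens
    rel = sum(len(r) for r in relevant) / total_len
    # Completeness = (relevant AND utilized) / relevant
    rel_total = sum(len(r) for r in relevant)
    comp = (sum(len(r & u) for r, u in zip(relevant, utilized)) / rel_total
            if rel_total else 0.0)
    return {"utilization": util, "relevance": rel,
            "adherence": 1.0 if adherent else 0.0, "completeness": comp}
```

For example, with one 10-token document where 4 tokens are utilized and 5 are relevant (2 of them overlapping), utilization is 0.4, relevance 0.5, and completeness 0.4.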
3. Evaluation Workflow in the Application
Step 1: Configuration (Sidebar)
User inputs:
  ├─ Groq API Key
  ├─ Selects dataset (e.g., "wiki_qa", "hotpot_qa", etc.)
  ├─ Selects LLM for evaluation (can differ from chat LLM)
  └─ Clicks "Load Data & Create Collection"
Step 2: Test Data Loading
```python
# In streamlit_app.py - run_evaluation()
loader = RAGBenchLoader()
test_data = loader.get_test_data(
    dataset_name="wiki_qa",  # Selected dataset
    num_samples=10           # Number to evaluate
)
# Returns: [{"question": "...", "answer": "..."}, ...]
```
Available Datasets:
- wiki_qa
- hotpot_qa
- nq_open
- And 9 more from RAGBench
Step 3: Test Case Preparation
```python
# For each test sample:
for sample in test_data:
    # Query RAG system
    result = rag_pipeline.query(
        sample["question"],
        n_results=5  # Retrieve top 5 documents
    )
    # Create test case
    test_case = {
        "query": sample["question"],
        "response": result["response"],
        "retrieved_documents": [doc["document"] for doc in result["retrieved_documents"]],
        "ground_truth": sample.get("answer", "")
    }
```
What happens in rag_pipeline.query():

Retrieval Phase:
```python
retrieved_docs = vector_store.get_retrieved_documents(query, n_results=5)
# Returns: Top 5 most relevant documents from ChromaDB
```

Generation Phase:
```python
response = llm.generate_with_context(query, doc_texts, max_tokens=1024)
# Uses Groq LLM with context to generate response
```

Result:
```python
{
    "query": "What is X?",
    "response": "Generated answer based on docs...",
    "retrieved_documents": [
        {
            "document": "doc content",
            "distance": 0.123,
            "metadata": {...}
        },
        ...
    ]
}
```
Step 4: TRACE Evaluation
```python
# In trace_evaluator.py
evaluator = TRACEEvaluator()
results = evaluator.evaluate_batch(test_cases)

# For each test case:
for test_case in test_cases:
    scores = evaluator.evaluate(
        query=test_case["query"],
        response=test_case["response"],
        retrieved_documents=test_case["retrieved_documents"],
        ground_truth=test_case["ground_truth"]
    )
    # Returns TRACEScores with 4 metrics
```
Step 5: Aggregation
```python
# Average scores across all test cases
{
    "utilization": 0.75,     # Average utilization across samples
    "relevance": 0.82,       # Average relevance across samples
    "adherence": 0.79,       # Average adherence across samples
    "completeness": 0.88,    # Average completeness across samples
    "average": 0.81,         # Overall TRACE score
    "num_samples": 10,       # Number of samples evaluated
    "individual_scores": [   # Per-sample scores
        {
            "utilization": 0.70,
            "relevance": 0.85,
            "adherence": 0.75,
            "completeness": 0.90,
            "average": 0.80
        },
        ...
    ]
}
```
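The aggregation step amounts to a per-metric average across samples. A minimal sketch, assuming each per-sample dict uses the field names shown above (`aggregate` is a hypothetical helper, not the app's function):

```python
def aggregate(individual_scores):
    """Average per-metric scores across samples and build the summary dict."""
    metrics = ["utilization", "relevance", "adherence", "completeness"]
    n = len(individual_scores)
    # Mean of each metric over all samples
    agg = {m: sum(s[m] for s in individual_scores) / n for m in metrics}
    # Overall TRACE score = mean of the four metric averages
    agg["average"] = sum(agg[m] for m in metrics) / len(metrics)
    agg["num_samples"] = n
    agg["individual_scores"] = individual_scores
    return agg
```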
4. Results Display
In Streamlit UI:
```
Evaluation Results:
┌──────────────────────────────┐
│ Utilization:   0.751         │
│ Relevance:     0.823         │
│ Adherence:     0.789         │
│ Completeness:  0.881         │
│ Average:       0.811         │
└──────────────────────────────┘

Detailed Results:
[Expandable table with individual scores]

Download Results (JSON)
[Export button for results]
```
5. Logging During Evaluation
The application provides real-time logging:
Evaluation Logs:
```
Evaluation started at 2025-12-18 10:30:45
Dataset: wiki_qa
Total samples: 10
LLM Model: llama-3.1-8b
Vector Store: wiki_qa_dense_all_mpnet
Embedding Model: all-mpnet-base-v2
Loading test data...
Loaded 10 test samples
Processing samples...
Processed 10/10 samples
Running TRACE evaluation metrics...
Evaluation completed successfully!
  • Utilization: 75.10%
  • Relevance: 82.34%
  • Adherence: 78.91%
  • Completeness: 88.12%
Evaluation completed at 2025-12-18 10:31:30
```
6. Key Components
trace_evaluator.py
Main Classes:
- TRACEScores: Dataclass holding 4 metric scores
- TRACEEvaluator: Main evaluator class
Key Methods:
evaluate() # Evaluate single test case
evaluate_batch() # Evaluate multiple test cases
_compute_utilization() # Metric: utilization
_compute_relevance() # Metric: relevance
_compute_adherence() # Metric: adherence
_compute_completeness() # Metric: completeness
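As a rough sketch of the shape TRACEScores might take (the actual fields and methods in trace_evaluator.py may differ):

```python
from dataclasses import dataclass

@dataclass
class TRACEScores:
    """Holds the four TRACe metric scores for one test case."""
    utilization: float
    relevance: float
    adherence: float
    completeness: float

    @property
    def average(self) -> float:
        # Overall TRACE score: unweighted mean of the four metrics
        return (self.utilization + self.relevance
                + self.adherence + self.completeness) / 4
```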
dataset_loader.py
Key Methods:
get_test_data(dataset_name, num_samples) # Load test samples
get_test_data_size(dataset_name) # Get max available samples
llm_client.py - RAGPipeline
Key Method:
query(query_str, n_results=5) # Query RAG system
# Returns: {"query", "response", "retrieved_documents"}
7. Performance Considerations
Time Complexity
- Loading 10 samples: ~5-10 seconds
- Processing per sample: ~2-5 seconds (LLM generation)
- TRACE evaluation per sample: ~100-500ms
- Total for 10 samples: ~3-7 minutes (depending on LLM)
Optimization Tips
- Start with smaller sample sizes (5-10) for testing
- Use faster LLM models for initial evaluation
- Results are cached in session state
- Can download and reuse evaluation results
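The session-state caching tip can be sketched as a generic get-or-compute pattern; Streamlit's `st.session_state` behaves like a dict, so the same pattern applies there (`get_or_run` is a hypothetical helper, not part of the app):

```python
def get_or_run(cache, config_key, run_fn):
    """Return a cached evaluation result for this configuration,
    running the (expensive) evaluation only on a cache miss."""
    if config_key not in cache:
        cache[config_key] = run_fn()
    return cache[config_key]
```

Keying the cache on (dataset, LLM, sample count) means re-selecting the same configuration skips the slow LLM calls entirely.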
8. Interpreting Scores
Score Ranges:
| Range | Interpretation |
|---|---|
| 0.80-1.00 | Excellent |
| 0.60-0.79 | Good |
| 0.40-0.59 | Fair |
| 0.00-0.39 | Poor |
What Each Metric Tells You:
| Metric | Indicates | Action if Low |
|---|---|---|
| Utilization | Are docs used? | Add more relevant docs, improve retrieval |
| Relevance | Are retrieved docs relevant? | Improve embedding model or retrieval strategy |
| Adherence | Is response grounded? | Add guardrails to prevent hallucination |
| Completeness | Is response complete? | Increase response length or improve generation |
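The score ranges above map directly onto a small helper (illustrative only; `score_label` is not part of the app):

```python
def score_label(score):
    """Map a TRACE score to its interpretation band from the table above."""
    if score >= 0.80:
        return "Excellent"
    if score >= 0.60:
        return "Good"
    if score >= 0.40:
        return "Fair"
    return "Poor"
```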
9. Example Evaluation Scenario
Scenario: Evaluating "wiki_qa" Dataset
1. User Action:
- Selects "wiki_qa" dataset
- Selects "llama-3.1-8b" LLM
- Sets 10 test samples
- Clicks "Run Evaluation"
2. System Processing:
- Loads 10 test questions from wiki_qa
- For each question:
a) Retrieves top 5 relevant Wikipedia articles
b) Generates answer using LLM + context
- Runs TRACE metrics on all 10 Q&A pairs
3. Results:
Sample 1: "Who is Albert Einstein?"
- Retrieved: Einstein biography article
- Generated: "Albert Einstein was a theoretical physicist..."
- Utilization: 0.85 (uses doc content)
- Relevance: 0.92 (doc is about Einstein)
- Adherence: 0.88 (response stays in doc)
- Completeness: 0.90 (answers completely)
- Average: 0.89
Sample 2: "What did Einstein discover?"
- Retrieved: Articles on relativity, quantum theory
- Generated: "Einstein discovered the theory of relativity..."
- Utilization: 0.78
- Relevance: 0.85
- Adherence: 0.82
- Completeness: 0.85
- Average: 0.82
[Samples 3-10 evaluated similarly]
4. Final Results:
- Average Utilization: 0.82
- Average Relevance: 0.88
- Average Adherence: 0.85
- Average Completeness: 0.87
- Overall TRACE Score: 0.855 (Excellent!)
10. Troubleshooting
Common Issues:
Error: "No attribute dataset_name"
- Solution: Load a collection first (sidebar config)
Evaluation very slow
- Solution: Reduce sample size or use faster LLM
All scores near 0.5
- Solution: Check if retrieval is working properly
High variance in scores
- Solution: Normal for diverse datasets; try more samples
11. Advanced Usage
Comparing Different Configurations
You can evaluate the same dataset with different:
- Embedding models
- Chunking strategies
- LLM models
Then compare results to find optimal configuration.
Exporting Results
```json
{
  "utilization": 0.82,
  "relevance": 0.88,
  "adherence": 0.85,
  "completeness": 0.87,
  "average": 0.855,
  "num_samples": 10,
  "individual_scores": [...]
}
```
Save and track over time to measure improvements!
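One way to save timestamped result files for tracking over time (a sketch; the filename convention and `save_results` helper are assumptions, not the app's API):

```python
import json
import time

def save_results(results, path=None):
    """Write evaluation results to a JSON file.

    If no path is given, a timestamped filename is generated so that
    repeated runs never overwrite each other.
    """
    path = path or f"eval_{int(time.time())}.json"
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return path
```

Comparing these files across runs (e.g. before and after swapping the embedding model) gives a concrete record of whether a change helped.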
Summary
The evaluation system provides a comprehensive framework for assessing RAG application quality across 4 key dimensions. By understanding TRACE metrics, you can identify bottlenecks and optimize your RAG system for better performance.
Key Takeaway: TRACE evaluation helps you objectively measure and improve your RAG system!