# TRACe Evaluation Framework - Alignment with RAGBench Paper
## Summary of Changes
This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the **TRACe framework** as defined in the RAGBench paper (arXiv:2407.11005).
---
## Key Clarifications
### The TRACe Framework Consists of **4 Metrics**, NOT 5
❌ **Incorrect**: T, R, A, C, **E** (with "E = Evaluation" as a separate metric)
✅ **Correct**: T, R, A, C (as defined in the RAGBench paper)
The lowercase "e" in "TRACe" is just a stylization of the acronym; there is no 5th "E" metric.
---
## The 4 TRACe Metrics (Per RAGBench Paper)
### 1. **T — uTilization (Context Utilization)**
**Definition:**
The fraction of retrieved context that the generator actually uses to produce the response.
**Formula:**
$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (sentence, token, or character level)
**Interpretation:**
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context
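The formula above can be sketched at the character level. This is a minimal illustration, assuming a hypothetical `{"text", "utilized_spans"}` document structure rather than the project's actual data model:

```python
def span_length(spans):
    """Total character length covered by a list of (start, end) spans."""
    return sum(end - start for start, end in spans)

def utilization(documents):
    """Fraction of retrieved context the generator actually used.

    Each document is a dict holding the full text and the character spans
    the generator drew on (hypothetical structure, for illustration only).
    """
    used = sum(span_length(d["utilized_spans"]) for d in documents)
    total = sum(len(d["text"]) for d in documents)
    return used / total if total else 0.0

docs = [
    {"text": "A" * 100, "utilized_spans": [(0, 30)]},  # 30 of 100 chars used
    {"text": "B" * 100, "utilized_spans": []},         # ignored entirely
]
print(utilization(docs))  # 30 / 200 = 0.15
```

The same span representation works at sentence or token granularity; only the `Len` function changes.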
---
### 2. **R — Relevance (Context Relevance)**
**Definition:**
The fraction of retrieved context that is actually relevant to answering the query.
**Formula:**
$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$
**Interpretation:**
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them
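Relevance is the same ratio computed over relevant spans. As a sketch (again assuming a hypothetical `(start, end)` character-span structure, with non-overlapping spans per list), it also shows the high-relevance/low-utilization diagnostic from the last bullet:

```python
def span_fraction(documents, key):
    """Fraction of total context length covered by the spans under `key`."""
    covered = sum(end - start for d in documents for start, end in d[key])
    total = sum(len(d["text"]) for d in documents)
    return covered / total if total else 0.0

docs = [{"text": "x" * 200,
         "relevant_spans": [(0, 150)],   # 75% of the context is relevant
         "utilized_spans": [(0, 20)]}]   # but only 10% was actually used

relevance = span_fraction(docs, "relevant_spans")
util = span_fraction(docs, "utilized_spans")
print(relevance, util)  # 0.75 0.1
if relevance > 0.5 and util < 0.2:
    print("good retrieval, but the generator under-uses the context")
```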
---
### 3. **A — Adherence (Faithfulness / Groundedness / Attribution)**
**Definition:**
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations.
**Paper Definition:**
- Example-level: **Boolean** — True if all response sentences are supported; False if any part is unsupported
- Span/Sentence-level: Can annotate which specific response sentences are grounded
**Interpretation:**
- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → Response contains unsupported claims ❌
- Mid Adherence → Partially grounded response
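The Boolean example-level definition reduces to an `all()` over per-sentence support labels. In practice those labels would come from a human annotator or an NLI-style judge model; that pipeline is assumed here, not shown:

```python
def adherence(sentence_supported):
    """Example-level adherence: True iff every response sentence is
    grounded in (supported by) the retrieved context."""
    return all(sentence_supported)

# Support flags per response sentence (produced by a judge, assumed here).
print(adherence([True, True, True]))   # fully grounded -> True
print(adherence([True, False, True]))  # one hallucinated sentence -> False
```

Averaging the per-sentence flags instead of taking `all()` gives the mid-range "partially grounded" reading mentioned above.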
---
### 4. **C — Completeness**
**Definition:**
How much of the relevant information in the context is actually covered/incorporated by the response.
**Formula:**
$$\text{Completeness} = \frac{\sum_i \text{Len}(R_i \cap U_i)}{\sum_i \text{Len}(R_i)}$$
Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans in document $d_i$
- $R_i$ = relevant spans in document $d_i$
- The sums aggregate across documents, yielding the example-level score
**Interpretation:**
- High Completeness → Generator covers all relevant information
- Low Completeness + High Utilization → Generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness
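A minimal sketch of the span-intersection computation, again assuming hypothetical `(start, end)` character spans with no overlaps within each list:

```python
def intersection_length(a_spans, b_spans):
    """Total character overlap between two span lists (each assumed
    internally non-overlapping, so pairwise overlaps don't double-count)."""
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a_spans
        for b_start, b_end in b_spans
    )

def completeness(documents):
    """Len(R ∩ U) / Len(R), aggregated across all documents."""
    covered = sum(
        intersection_length(d["relevant_spans"], d["utilized_spans"])
        for d in documents
    )
    relevant = sum(
        end - start for d in documents for start, end in d["relevant_spans"]
    )
    return covered / relevant if relevant else 0.0

docs = [{"relevant_spans": [(0, 100)], "utilized_spans": [(50, 120)]}]
print(completeness(docs))  # overlap 50 / relevant 100 = 0.5
```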
---
## Code Changes Made
### 1. **EVALUATION_GUIDE.md**
- ✅ Updated header to reference RAGBench paper and TRACe (not TRACE)
- ✅ Removed incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what it means for RAG systems
### 2. **trace_evaluator.py**
- ✅ Updated module docstring with paper reference and correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: Which chunking strategy was used
  - `embedding_model`: Which embedding model was used
  - `chunk_size`: Chunk size configuration
  - `chunk_overlap`: Chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include evaluation config in results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrapped with `float()` to ensure proper type)
### 3. **vector_store.py (ChromaDBManager)**
- ✅ Added instance attributes to track evaluation-related metadata:
  - `self.chunking_strategy`
  - `self.chunk_size`
  - `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures the same chunking/embedding config is used for all questions in a test run
### 4. **streamlit_app.py**
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs chunking strategy, chunk size, chunk overlap
  - Logs embedding model used
  - Passes this metadata to TRACEEvaluator for tracking
- ✅ Added new log entries in evaluation flow:
```
🔧 Retrieval Configuration:
  • Chunking Strategy: <strategy>
  • Chunk Size: <size>
  • Chunk Overlap: <overlap>
  • Embedding Model: <model>
```
---
## Benefits of These Changes
1. **Alignment with Paper**: Metrics now follow RAGBench paper definitions exactly
2. **Reproducibility**: Evaluation config (chunking, embedding) is stored and logged with results
3. **Consistency**: Same chunking/embedding used for all test questions per evaluation
4. **Clarity**: Clear distinction between 4 metrics (no misleading "5-metric" interpretation)
5. **Traceability**: Results can be audited to understand what retrieval config was used
---
## Usage Example
```python
from trace_evaluator import TRACEEvaluator
# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50
)
# Run evaluation
results = evaluator.evaluate_batch(test_cases)
# Results now include evaluation config
print(results["evaluation_config"])
# Output: {
#     "chunking_strategy": "dense",
#     "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#     "chunk_size": 512,
#     "chunk_overlap": 50
# }
```
---
## Future Improvements
1. Implement **span-level annotation** following RAGBench approach for ground truth metrics
2. Add **fine-tuned evaluator models** (e.g., DeBERTa) for more accurate metric computation
3. Store evaluation results with full metadata in persistent storage for historical tracking
4. Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores
---
## References
- **RAGBench Paper**: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems"
  - arXiv: 2407.11005v2
  - Dataset: https://huggingface.co/datasets/rungalileo/ragbench
  - GitHub: https://github.com/rungalileo/ragbench