# TRACe Evaluation Framework - Alignment with RAGBench Paper
## Summary of Changes
This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the **TRACe framework** as defined in the RAGBench paper (arXiv:2407.11005).
---
## Key Clarifications
### The TRACe Framework is **4 metrics**, NOT 5
❌ **Incorrect**: T, R, A, C, **E** (with "E = Evaluation" as a separate metric)

✅ **Correct**: T, R, A, C (as defined in the RAGBench paper)
The stylization "TRACe" is just how the acronym is capitalized; there is no 5th "E" metric.
---
## The 4 TRACe Metrics (Per RAGBench Paper)
### 1. **T – uTilization (Context Utilization)**
**Definition:**
The fraction of retrieved context that the generator actually uses to produce the response.
**Formula:**
$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (sentence, token, or character level)
**Interpretation:**
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context
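As a minimal sketch of the formula above (hypothetical helper, token-level `Len`), Utilization is just the fraction of retrieved tokens covered by the utilized spans:

```python
def span_fraction(marked_spans, documents):
    """Fraction of total document length covered by marked spans (token level).

    marked_spans: one list of marked tokens per document (e.g. utilized tokens U_i)
    documents:    the full retrieved documents d_i, as token lists
    """
    marked = sum(len(s) for s in marked_spans)
    total = sum(len(d) for d in documents)
    return marked / total if total else 0.0

# Example: the generator used 3 of the 8 retrieved tokens
docs = [["the", "sky", "is", "blue"], ["grass", "is", "green", "today"]]
used = [["sky", "is", "blue"], []]
utilization = span_fraction(used, docs)  # 3 / 8 = 0.375
```

The same helper computes Relevance by passing the relevant spans $R_i$ in place of the utilized spans $U_i$.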
---
### 2. **R – Relevance (Context Relevance)**
**Definition:**
The fraction of retrieved context that is actually relevant to answering the query.
**Formula:**
$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$
Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$
**Interpretation:**
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them
---
### 3. **A – Adherence (Faithfulness / Groundedness / Attribution)**
**Definition:**
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations.
**Paper Definition:**
- Example-level: **Boolean** – True if all response sentences are supported; False if any part is unsupported
- Span/Sentence-level: Can annotate which specific response sentences are grounded
**Interpretation:**
- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → Response contains unsupported claims ❌
- Mid Adherence → Partially grounded response
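At the example level, the boolean definition reduces to a conjunction over per-sentence support judgments. A sketch (the judgments themselves are assumed given, e.g. from a human annotator or an NLI-style judge model):

```python
def example_adherence(sentence_supported):
    """Example-level Adherence: True only when every response sentence is
    supported by the retrieved context (boolean, per the paper)."""
    return all(sentence_supported)

fully_grounded = example_adherence([True, True, True])   # no hallucinations
hallucinated = example_adherence([True, False, True])    # one unsupported claim
```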
---
### 4. **C – Completeness**
**Definition:**
How much of the relevant information in the context is actually covered/incorporated by the response.
**Formula:**
$$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$
Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans
- $R_i$ = all relevant spans
- Extended to example-level by aggregating across documents
**Interpretation:**
- High Completeness → Generator covers all relevant information
- Low Completeness + High Utilization → Generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness
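The intersection in the formula can be sketched with token-position sets, aggregated across documents as the paper describes (hypothetical helper):

```python
def completeness(relevant, utilized):
    """Completeness = Len(R ∩ U) / Len(R), aggregated over documents.

    relevant, utilized: one set of token positions per document.
    """
    covered = sum(len(r & u) for r, u in zip(relevant, utilized))
    total = sum(len(r) for r in relevant)
    return covered / total if total else 0.0

# Tokens 0-3 are relevant in doc 0, but the generator only drew on tokens 0-1
score = completeness([{0, 1, 2, 3}], [{0, 1}])  # 2 / 4 = 0.5
```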
---
## Code Changes Made
### 1. **EVALUATION_GUIDE.md**
- ✅ Updated header to reference RAGBench paper and TRACe (not TRACE)
- ✅ Removed incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what it means for RAG systems
### 2. **trace_evaluator.py**
- ✅ Updated module docstring with paper reference and correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: Which chunking strategy was used
  - `embedding_model`: Which embedding model was used
  - `chunk_size`: Chunk size configuration
  - `chunk_overlap`: Chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include evaluation config in results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrapped with `float()` to ensure proper Python types)
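The shape of these changes can be sketched as follows. This is illustrative only, not the actual `trace_evaluator.py`: the real class also computes the four TRACe metrics, and the `"scores"` key here is a placeholder.

```python
from typing import Optional


class TRACEEvaluator:
    """Sketch of the metadata-aware evaluator described above (illustrative only)."""

    def __init__(
        self,
        chunking_strategy: Optional[str] = None,
        embedding_model: Optional[str] = None,
        chunk_size: Optional[int] = None,
        chunk_overlap: Optional[int] = None,
    ):
        self.chunking_strategy = chunking_strategy
        self.embedding_model = embedding_model
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def evaluate_batch(self, test_cases):
        # Attach the evaluation config to the results for reproducibility;
        # per-case metric scores (wrapped with float()) would populate "scores"
        return {
            "evaluation_config": {
                "chunking_strategy": self.chunking_strategy,
                "embedding_model": self.embedding_model,
                "chunk_size": self.chunk_size,
                "chunk_overlap": self.chunk_overlap,
            },
            "scores": [],
        }
```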
### 3. **vector_store.py (ChromaDBManager)**
- ✅ Added instance attributes to track evaluation-related metadata:
  - `self.chunking_strategy`
  - `self.chunk_size`
  - `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures the same chunking/embedding config is used for all questions in a test
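The store/restore round trip can be illustrated without ChromaDB. This is a dependency-free, hypothetical sketch: in the real `ChromaDBManager`, `store_metadata()` would correspond to writing the collection's metadata dict and `restore_metadata()` to reading it back in `get_collection()`.

```python
class ChunkingMetadataTracker:
    """Illustrative only: tracks chunking config alongside a collection."""

    def __init__(self):
        self.chunking_strategy = None
        self.chunk_size = None
        self.chunk_overlap = None

    def store_metadata(self) -> dict:
        # Written into the collection's metadata when the dataset is loaded
        return {
            "chunking_strategy": self.chunking_strategy,
            "chunk_size": self.chunk_size,
            "chunk_overlap": self.chunk_overlap,
        }

    def restore_metadata(self, collection_metadata: dict) -> None:
        # Read back when re-opening an existing collection
        self.chunking_strategy = collection_metadata.get("chunking_strategy")
        self.chunk_size = collection_metadata.get("chunk_size")
        self.chunk_overlap = collection_metadata.get("chunk_overlap")
```

Round-tripping through the metadata dict guarantees that a reopened collection is queried with the same chunking config it was built with.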
### 4. **streamlit_app.py**
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs chunking strategy, chunk size, and chunk overlap
  - Logs the embedding model used
  - Passes this metadata to `TRACEEvaluator` for tracking
- ✅ Added new log entries in the evaluation flow:
```
🔧 Retrieval Configuration:
  • Chunking Strategy: <strategy>
  • Chunk Size: <size>
  • Chunk Overlap: <overlap>
  • Embedding Model: <model>
```
---
## Benefits of These Changes
1. **Alignment with Paper**: Metrics now follow RAGBench paper definitions exactly
2. **Reproducibility**: Evaluation config (chunking, embedding) is stored and logged with results
3. **Consistency**: Same chunking/embedding used for all test questions per evaluation
4. **Clarity**: Clear distinction between 4 metrics (no misleading "5-metric" interpretation)
5. **Traceability**: Results can be audited to understand what retrieval config was used
---
## Usage Example
```python
from trace_evaluator import TRACEEvaluator

# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50
)

# Run evaluation
results = evaluator.evaluate_batch(test_cases)

# Results now include the evaluation config
print(results["evaluation_config"])
# Output: {
#     "chunking_strategy": "dense",
#     "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#     "chunk_size": 512,
#     "chunk_overlap": 50
# }
```
---
## Future Improvements
1. Implement **span-level annotation** following RAGBench approach for ground truth metrics
2. Add **fine-tuned evaluator models** (e.g., DeBERTa) for more accurate metric computation
3. Store evaluation results with full metadata in persistent storage for historical tracking
4. Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores
---
## References
- **RAGBench Paper**: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems"
- arXiv: 2407.11005v2
- Dataset: https://huggingface.co/datasets/rungalileo/ragbench
- GitHub: https://github.com/rungalileo/ragbench