# TRACe Evaluation Framework - Alignment with RAGBench Paper

## Summary of Changes

This document outlines the updates made to align the RAG Capstone Project's evaluation metrics with the **TRACe framework** as defined in the RAGBench paper (arXiv:2407.11005).

---

## Key Clarifications

### The TRACe Framework is **4 metrics**, NOT 5

❌ **Incorrect**: T, R, A, C, **E** (with "E = Evaluation" as a separate metric)  
✅ **Correct**: T, R, A, C (as defined in the RAGBench paper)

The lowercase "e" in "TRACe" is purely a stylistic capitalization of the acronym; it does not denote a fifth metric.

---

## The 4 TRACe Metrics (Per RAGBench Paper)

### 1. **T β€” uTilization (Context Utilization)**

**Definition:**  
The fraction of retrieved context that the generator actually uses to produce the response.

**Formula:**
$$\text{Utilization} = \frac{\sum_i \text{Len}(U_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $U_i$ = utilized (used) spans/tokens in document $d_i$
- $d_i$ = full document $i$
- Len = length (sentence, token, or character level)

**Interpretation:**
- Low Utilization + Low Relevance → Greedy retriever returning irrelevant docs
- Low Utilization alone → Weak generator fails to leverage good context
- High Utilization → Generator efficiently uses provided context

---

### 2. **R β€” Relevance (Context Relevance)**

**Definition:**  
The fraction of retrieved context that is actually relevant to answering the query.

**Formula:**
$$\text{Relevance} = \frac{\sum_i \text{Len}(R_i)}{\sum_i \text{Len}(d_i)}$$

Where:
- $R_i$ = relevant (useful) spans/tokens in document $d_i$
- $d_i$ = full document $i$

**Interpretation:**
- High Relevance → Retriever returned mostly relevant documents
- Low Relevance → Retriever returned many irrelevant/noisy documents
- High Relevance but Low Utilization → Good docs retrieved, but generator doesn't use them
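
Both Utilization and Relevance are span-coverage fractions over the same denominator, so they can be sketched with one helper. This is an illustrative token-level computation, not code from `trace_evaluator.py`; the span inputs (`relevant`, `utilized`) are assumed to come from annotation or a span-extraction step.

```python
# Token-level sketch of Utilization and Relevance. Spans are given as sets
# of token indices per retrieved document; names are illustrative.

def span_fraction(spans_per_doc, doc_lengths):
    """Fraction of total context tokens covered by the given spans."""
    covered = sum(len(spans) for spans in spans_per_doc)
    total = sum(doc_lengths)
    return covered / total if total else 0.0

# Two retrieved documents of 10 tokens each.
doc_lengths = [10, 10]
relevant = [{0, 1, 2, 3}, {5, 6}]  # R_i: relevant token indices per doc
utilized = [{0, 1}, {5, 6}]        # U_i: tokens the generator actually used

relevance = span_fraction(relevant, doc_lengths)    # 6 / 20 = 0.3
utilization = span_fraction(utilized, doc_lengths)  # 4 / 20 = 0.2
```

In this toy example the retriever brought back mostly noise (Relevance 0.3) and the generator used only part of what was relevant (Utilization 0.2), matching the "greedy retriever" failure mode described above.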

---

### 3. **A β€” Adherence (Faithfulness / Groundedness / Attribution)**

**Definition:**  
Whether the response is grounded in and fully supported by the retrieved context. Detects hallucinations.

**Paper Definition:**
- Example-level: **Boolean** β€” True if all response sentences are supported; False if any part is unsupported
- Span/Sentence-level: Can annotate which specific response sentences are grounded

**Interpretation:**
- High Adherence (1.0) → Response fully grounded, no hallucinations ✅
- Low Adherence (0.0) → Response contains unsupported claims ❌
- Mid Adherence → Partially grounded response
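
At the example level, the boolean definition above reduces to a conjunction over per-sentence support labels. In this sketch the labels are assumed inputs; in practice they would come from an annotator or an NLI/entailment model (not shown):

```python
# Example-level Adherence per the boolean definition above: True only if
# every response sentence is supported by the retrieved context.

def example_adherence(sentence_supported):
    """All response sentences must be grounded for the example to adhere."""
    return all(sentence_supported)

# One unsupported sentence is enough to flag the whole response.
fully_grounded = example_adherence([True, True, True])    # True
hallucinated = example_adherence([True, False, True])     # False
```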

---

### 4. **C β€” Completeness**

**Definition:**  
How much of the relevant information in the context is actually covered/incorporated by the response.

**Formula:**
$$\text{Completeness} = \frac{\text{Len}(R_i \cap U_i)}{\text{Len}(R_i)}$$

Where:
- $R_i \cap U_i$ = intersection of relevant AND utilized spans
- $R_i$ = all relevant spans
- Extended to example-level by aggregating across documents

**Interpretation:**
- High Completeness → Generator covers all relevant information
- Low Completeness + High Utilization → Generator uses context but misses key facts
- Ideal RAG: High Relevance + High Utilization + High Completeness
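
A minimal sketch of the per-document formula above, with token index sets as illustrative inputs:

```python
# Per-document Completeness: the share of relevant tokens (R_i) that were
# also utilized (U_i) by the generator.

def completeness(relevant: set, utilized: set) -> float:
    """Len(R_i intersect U_i) / Len(R_i); defined as 0.0 when nothing is relevant."""
    if not relevant:
        return 0.0
    return len(relevant & utilized) / len(relevant)

score = completeness({0, 1, 2, 3}, {0, 1, 7})  # 2 of 4 relevant tokens used = 0.5
```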

---

## Code Changes Made

### 1. **EVALUATION_GUIDE.md**
- ✅ Updated header to reference RAGBench paper and TRACe (not TRACE)
- ✅ Removed incorrect "E = Evaluation" metric
- ✅ Added formal mathematical definitions for each metric per the paper
- ✅ Clarified when each metric is high/low and what it means for RAG systems

### 2. **trace_evaluator.py**
- ✅ Updated module docstring with paper reference and correct 4-metric framework
- ✅ Enhanced `TRACEEvaluator.__init__()` to accept metadata:
  - `chunking_strategy`: Which chunking strategy was used
  - `embedding_model`: Which embedding model was used
  - `chunk_size`: Chunk size configuration
  - `chunk_overlap`: Chunk overlap configuration
- ✅ Updated `evaluate_batch()` to include evaluation config in results dict for reproducibility
- ✅ Fixed type hints to use `Optional[str]` and `Optional[int]` for optional parameters
- ✅ Fixed numpy return types (wrap with `float()` to ensure proper type)
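
The constructor change might look roughly like the following. This is a hypothetical reconstruction for illustration; the actual `trace_evaluator.py` may differ in attribute names and internals:

```python
from typing import Optional

class TRACEEvaluator:
    """Hypothetical sketch of the metadata-accepting constructor described
    above; not copied from the real trace_evaluator.py."""

    def __init__(
        self,
        chunking_strategy: Optional[str] = None,
        embedding_model: Optional[str] = None,
        chunk_size: Optional[int] = None,
        chunk_overlap: Optional[int] = None,
    ) -> None:
        # Stored so evaluate_batch() can echo the config into its results
        # dict for reproducibility.
        self.evaluation_config = {
            "chunking_strategy": chunking_strategy,
            "embedding_model": embedding_model,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
        }
```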

### 3. **vector_store.py (ChromaDBManager)**
- ✅ Added instance attributes to track evaluation-related metadata:
  - `self.chunking_strategy`
  - `self.chunk_size`
  - `self.chunk_overlap`
- ✅ Updated `load_dataset_into_collection()` to store chunking metadata
- ✅ Updated `get_collection()` to restore chunking metadata from collection metadata when loading existing collections
- ✅ Ensures same chunking/embedding config is used for all questions in a test
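
The store/restore pattern can be sketched with plain dicts standing in for a real ChromaDB collection's metadata. The class and method names here are illustrative; the actual `ChromaDBManager` in `vector_store.py` may be structured differently:

```python
# Sketch of round-tripping chunking config through collection metadata, so
# that loading an existing collection restores the exact config it was
# built with. Names are illustrative, not from vector_store.py.

class ChunkingConfigHolder:
    def __init__(self):
        self.chunking_strategy = None
        self.chunk_size = None
        self.chunk_overlap = None

    def to_collection_metadata(self) -> dict:
        # Written into the collection's metadata at creation time.
        return {
            "chunking_strategy": self.chunking_strategy,
            "chunk_size": self.chunk_size,
            "chunk_overlap": self.chunk_overlap,
        }

    def restore_from_metadata(self, metadata: dict) -> None:
        # Read back when loading an existing collection, so every question
        # in a test run sees the same chunking config.
        self.chunking_strategy = metadata.get("chunking_strategy")
        self.chunk_size = metadata.get("chunk_size")
        self.chunk_overlap = metadata.get("chunk_overlap")
```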

### 4. **streamlit_app.py**
- ✅ Updated `run_evaluation()` to extract and log chunking/embedding metadata:
  - Logs chunking strategy, chunk size, chunk overlap
  - Logs embedding model used
  - Passes this metadata to TRACEEvaluator for tracking
- ✅ Added new log entries in evaluation flow:
  ```
  🔧 Retrieval Configuration:
    • Chunking Strategy: <strategy>
    • Chunk Size: <size>
    • Chunk Overlap: <overlap>
    • Embedding Model: <model>
  ```

---

## Benefits of These Changes

1. **Alignment with Paper**: Metrics now follow RAGBench paper definitions exactly
2. **Reproducibility**: Evaluation config (chunking, embedding) is stored and logged with results
3. **Consistency**: Same chunking/embedding used for all test questions per evaluation
4. **Clarity**: Clear distinction between 4 metrics (no misleading "5-metric" interpretation)
5. **Traceability**: Results can be audited to understand what retrieval config was used

---

## Usage Example

```python
from trace_evaluator import TRACEEvaluator

# Initialize with metadata
evaluator = TRACEEvaluator(
    chunking_strategy="dense",
    embedding_model="sentence-transformers/all-mpnet-base-v2",
    chunk_size=512,
    chunk_overlap=50
)

# Run evaluation
results = evaluator.evaluate_batch(test_cases)

# Results now include evaluation config
print(results["evaluation_config"])
# Output: {
#   "chunking_strategy": "dense",
#   "embedding_model": "sentence-transformers/all-mpnet-base-v2",
#   "chunk_size": 512,
#   "chunk_overlap": 50
# }
```

---

## Future Improvements

1. Implement **span-level annotation** following RAGBench approach for ground truth metrics
2. Add **fine-tuned evaluator models** (e.g., DeBERTa) for more accurate metric computation
3. Store evaluation results with full metadata in persistent storage for historical tracking
4. Add comparison tools to analyze how different chunking/embedding strategies affect TRACe scores

---

## References

- **RAGBench Paper**: "RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems"
  - arXiv: 2407.11005v2
  - Dataset: https://huggingface.co/datasets/rungalileo/ragbench
  - GitHub: https://github.com/rungalileo/ragbench