# TRACE Metrics Calculation - Visual Guide

## Step-by-Step Visualization

### STEP 1: Sentencization

```
DOCUMENTS                          RESPONSE
═══════════════════════════════    ══════════════════════════════
Doc 0:                             "Machine learning is AI that learns
"ML is AI. It learns from data.    from data. Deep learning uses neural
Algorithms improve through time."  networks. It's powerful for images."

↓ Split by sentence ends          ↓ Split by sentence ends

0a: "ML is AI."                   a: "Machine learning is AI that
0b: "It learns from data."           learns from data."
0c: "Algorithms improve            b: "Deep learning uses neural
     through time."                   networks."
                                   c: "It's powerful for images."
```
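
The split above can be sketched with a naive regex sentencizer; the `0a`/`0b` key scheme is the one used throughout this guide, while the function names are illustrative:

```python
import re
import string

def sentencize(text: str) -> list[str]:
    """Naive split on ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def key_documents(docs: list[str]) -> dict[str, str]:
    """Label each document sentence 0a, 0b, ..., 1a, ... (assumes <= 26 sentences per doc)."""
    keyed = {}
    for doc_idx, doc in enumerate(docs):
        for sent_idx, sent in enumerate(sentencize(doc)):
            keyed[f"{doc_idx}{string.ascii_lowercase[sent_idx]}"] = sent
    return keyed

docs = ["ML is AI. It learns from data. Algorithms improve through time."]
print(key_documents(docs))
# {'0a': 'ML is AI.', '0b': 'It learns from data.', '0c': 'Algorithms improve through time.'}
```

A production pipeline would typically use a proper sentence tokenizer (e.g. spaCy or NLTK) rather than a regex, but the keying logic is the same.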

### STEP 2: GPT Analysis

```
GPT MODEL PROCESSES:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                             β”‚
β”‚  INPUT: Sentencized docs + response + question             β”‚
β”‚                                                             β”‚
β”‚  ANALYSIS:                                                  β”‚
β”‚  βœ“ Which doc sentences are relevant to question?           β”‚
β”‚  βœ“ Which doc sentences does response use?                  β”‚
β”‚  βœ“ Is each response sentence fully/partially supported?    β”‚
β”‚                                                             β”‚
β”‚  OUTPUT: JSON with sentence keys and support mappings      β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
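
The JSON contract in the box can be pinned down as types. The field names match the output shown in Step 3; the `TypedDict` class names themselves are illustrative:

```python
from typing import TypedDict

class SentenceSupport(TypedDict):
    """Per-response-sentence verdict from the labeling model."""
    response_sentence_key: str  # e.g. "a", "b", "c"
    fully_supported: bool       # True only if every claim is backed by the docs

class TraceLabels(TypedDict):
    """Top-level labeling output consumed by the metric calculation."""
    all_relevant_sentence_keys: list[str]  # doc sentence keys relevant to the question
    all_utilized_sentence_keys: list[str]  # doc sentence keys the response draws on
    sentence_support_information: list[SentenceSupport]

labels: TraceLabels = {
    "all_relevant_sentence_keys": ["0a", "0b"],
    "all_utilized_sentence_keys": ["0a", "0b"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
```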

### STEP 3: Metric Calculation

```
GPT OUTPUT (SIMPLIFIED):
{
  "all_relevant_sentence_keys": ["0a", "0b"],
  "all_utilized_sentence_keys": ["0a", "0b"],
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": true},
    {"response_sentence_key": "c", "fully_supported": false}
  ]
}

                    ↓

METRIC CALCULATION:
β”œβ”€ Context Relevance = |relevant| / 20 = 2/20 = 0.10
β”œβ”€ Context Utilization = |utilized| / |relevant| = 2/2 = 1.0
β”œβ”€ Completeness = |relevant ∩ utilized| / |relevant| = 2/2 = 1.0
└─ Adherence = all_fully_supported? = false β†’ 0.0
```
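
The calculation above in code, using this guide's fixed ~20-sentence denominator for relevance:

```python
relevant = {"0a", "0b"}            # all_relevant_sentence_keys
utilized = {"0a", "0b"}            # all_utilized_sentence_keys
support = [True, True, False]      # fully_supported flags for a, b, c
TOTAL_SENTENCES = 20               # fixed denominator used in this guide

context_relevance = len(relevant) / TOTAL_SENTENCES       # 0.10
context_utilization = len(utilized) / len(relevant)       # 1.0
completeness = len(relevant & utilized) / len(relevant)   # 1.0
adherence = 1.0 if all(support) else 0.0                  # 0.0
```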

---

## Metric Formulas with Venn Diagrams

### Context Relevance (R)

```
ALL RETRIEVED SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          β”‚
β”‚  Total: ~20 sentences    β”‚
β”‚                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ RELEVANT:        β”‚    β”‚
β”‚  β”‚ ["0a", "0b"]     β”‚    β”‚
β”‚  β”‚ Count: 2         β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                          β”‚
β”‚  Irrelevant: 18          β”‚
β”‚                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: R = 2 / 20 = 0.10 (10%)
Interpretation: 10% of retrieved content is relevant to question
```

### Context Utilization (U)

```
RELEVANT SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RELEVANT: ["0a", "0b"]   β”‚
β”‚                          β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚ β”‚ UTILIZED:          β”‚   β”‚
β”‚ β”‚ ["0a", "0b"]       β”‚   β”‚
β”‚ β”‚ Count: 2           β”‚   β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                          β”‚
β”‚ NOT USED: 0              β”‚
β”‚                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: U = 2 / 2 = 1.0 (100%)
Interpretation: All relevant information was used
```

### Completeness (C)

```
        RELEVANT              UTILIZED
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ ["0a", "0b"] β”‚      β”‚ ["0a", "0b"] β”‚
   β”‚              β”‚      β”‚              β”‚
   β”‚   COUNT: 2   β”‚      β”‚   COUNT: 2   β”‚
   β”‚              β”‚      β”‚              β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                    β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
            OVERLAP: ["0a", "0b"]
            COUNT: 2

Formula: C = 2 / 2 = 1.0 (100%)
Interpretation: All relevant info is in response
```

### Adherence (A)

```
RESPONSE SENTENCES:            SUPPORT STATUS:
┌──────────────────┐           ┌──────────────────┐
│ a: "ML is AI..." │ ────────→ │ ✓ Fully          │
│                  │           │   Supported      │
│ b: "Deep..."     │ ────────→ │ ✓ Fully          │
│                  │           │   Supported      │
│ c: "Powerful..." │ ────────→ │ ✗ Not            │
│                  │           │   Supported      │
└──────────────────┘           └──────────────────┘

Formula: A = (all_supported) ? 1.0 : 0.0
       = (true AND true AND false) ? 1.0 : 0.0
       = 0.0 (a single unsupported sentence zeroes the metric)

Interpretation: Response contains hallucination (adherence fails)
```
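
The all-or-nothing behavior of the formula as a small helper; the function name is illustrative:

```python
def adherence(support_flags: list[bool]) -> float:
    """Binary adherence: 1.0 only if every response sentence is fully supported."""
    if not support_flags:  # no response sentences: treated as adherent
        return 1.0
    return 1.0 if all(support_flags) else 0.0

assert adherence([True, True, True]) == 1.0
assert adherence([True, True, False]) == 0.0  # one failure zeroes the metric
```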

---

## Complete Example Walkthrough

### Input

```
QUESTION:
"What makes machine learning different from traditional programming?"

RETRIEVED DOCUMENTS:
0: "Machine learning is a subset of AI. It learns patterns from data.
    Traditional programming requires explicit instructions."
1: "ML algorithms improve through experience. They adapt to new data.
    Rule-based systems are rigid and hard to maintain."

LLM RESPONSE:
"Machine learning differs because it learns from data rather than 
requiring explicit instructions. ML algorithms improve over time.
It's the future of all computing."
```

### Step 1: Sentencization

```
DOCUMENTS:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Traditional programming requires explicit instructions."
1a: "ML algorithms improve through experience."
1b: "They adapt to new data."
1c: "Rule-based systems are rigid and hard to maintain."

RESPONSE:
a: "Machine learning differs because it learns from data rather than
    requiring explicit instructions."
b: "ML algorithms improve over time."
c: "It's the future of all computing."
```

### Step 2: GPT Labeling

```
ANALYSIS BY GPT:

Question focus: Differences between ML and traditional programming
└─ "learns from data" vs "explicit instructions"
└─ "improves through experience"
└─ Adaptability

RELEVANT SENTENCES (to question):
β”œβ”€ 0a: "subset of AI" β†’ Partially relevant
β”œβ”€ 0b: "learns patterns from data" β†’ RELEVANT βœ“
β”œβ”€ 0c: "requires explicit instructions" β†’ RELEVANT βœ“
β”œβ”€ 1a: "improve through experience" β†’ RELEVANT βœ“
β”œβ”€ 1b: "adapt to new data" β†’ RELEVANT βœ“
└─ 1c: "rule-based systems rigid" β†’ Partially relevant

UTILIZED SENTENCES (used in response):
β”œβ”€ response_a uses: 0b, 0c β†’ Document references: [0b, 0c]
β”œβ”€ response_b uses: 1a β†’ Document references: [1a]
└─ response_c uses: NONE β†’ No support β†’ [hallucination]

FULLY SUPPORTED CHECK:
β”œβ”€ response_a "learns from data, not explicit" β†’ Supported by 0b, 0c βœ“
β”œβ”€ response_b "algorithms improve" β†’ Supported by 1a βœ“
└─ response_c "future of all computing" β†’ NOT in documents βœ—
```

### Step 3: Metric Calculation

```
EXTRACTED DATA:
all_relevant_sentence_keys = ["0b", "0c", "1a", "1b"]  (4 sentences)
all_utilized_sentence_keys = ["0b", "0c", "1a"]        (3 sentences)
sentence_support_information = [
  {key: "a", fully_supported: true},
  {key: "b", fully_supported: true},
  {key: "c", fully_supported: false}
]

CALCULATIONS:

1. Context Relevance
   = |relevant| / 20   (fixed 20-sentence denominator)
   = 4 / 20
   = 0.20 (20%)
   
2. Context Utilization
   = |utilized| / |relevant|
   = 3 / 4
   = 0.75 (75%)
   
3. Completeness
   = |relevant ∩ utilized| / |relevant|
   = |{0b, 0c, 1a}| / |{0b, 0c, 1a, 1b}|
   = 3 / 4
   = 0.75 (75%)
   
4. Adherence
   = all fully_supported?
   = true AND true AND false
   = FALSE β†’ 0.0 (0%)
```

### Results

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ TRACE METRICS RESULTS                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Context Relevance:  0.20 (20%)         β”‚
β”‚ Context Utilization: 0.75 (75%)        β”‚
β”‚ Completeness:       0.75 (75%)         β”‚
β”‚ Adherence:          0.0  (0%)          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Average:            0.425 (42.5%)      β”‚
β”‚ RMSE Aggregation:   0.437               β”‚
β”‚ Consistency Score:  0.563               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

INTERPRETATION:
βœ“ Good relevance targeting (20%)
βœ“ Decent information usage (75%)
βœ“ Good coverage of relevant info (75%)
βœ— Contains hallucination (0% adherence)

ACTION: Address the hallucination about "future of all computing"
```

---

## Calculation Pseudocode

```python
# INPUT: GPT labeled output
gpt_labels = {
    "all_relevant_sentence_keys": [...],
    "all_utilized_sentence_keys": [...],
    "sentence_support_information": [...]
}

# METRIC 1: Context Relevance
# (relevant sentence count over a fixed denominator of 20, capped at 1.0)
relevant_count = len(gpt_labels["all_relevant_sentence_keys"])
context_relevance = min(1.0, relevant_count / 20.0)

# METRIC 2: Context Utilization
utilized_count = len(gpt_labels["all_utilized_sentence_keys"])
if relevant_count == 0:
    context_utilization = 0.0
else:
    context_utilization = min(1.0, utilized_count / relevant_count)

# METRIC 3: Completeness
relevant_set = set(gpt_labels["all_relevant_sentence_keys"])
utilized_set = set(gpt_labels["all_utilized_sentence_keys"])
overlap_count = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
    completeness = 1.0 if len(utilized_set) == 0 else 0.0
else:
    completeness = overlap_count / len(relevant_set)

# METRIC 4: Adherence
fully_supported_count = sum(
    1 for sentence in gpt_labels["sentence_support_information"]
    if sentence["fully_supported"]
)
total_sentences = len(gpt_labels["sentence_support_information"])
if total_sentences == 0:
    adherence = 1.0
else:
    adherence = 1.0 if fully_supported_count == total_sentences else 0.0

# OUTPUT
scores = {
    "context_relevance": context_relevance,
    "context_utilization": context_utilization,
    "completeness": completeness,
    "adherence": adherence,
    "average": (context_relevance + context_utilization + 
               completeness + adherence) / 4
}
```
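
Wrapping the pseudocode in a function and feeding it the walkthrough's labels reproduces the results table (the function name and `total_sentences` parameter are illustrative):

```python
def trace_scores(gpt_labels: dict, total_sentences: int = 20) -> dict:
    """Compute the four TRACE metrics from a labeled output dict."""
    relevant = set(gpt_labels["all_relevant_sentence_keys"])
    utilized = set(gpt_labels["all_utilized_sentence_keys"])
    support = gpt_labels["sentence_support_information"]

    context_relevance = min(1.0, len(relevant) / total_sentences)
    context_utilization = min(1.0, len(utilized) / len(relevant)) if relevant else 0.0
    if relevant:
        completeness = len(relevant & utilized) / len(relevant)
    else:
        completeness = 1.0 if not utilized else 0.0
    # all() on an empty list is True, so zero response sentences count as adherent
    adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0

    scores = {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "completeness": completeness,
        "adherence": adherence,
    }
    scores["average"] = sum(scores.values()) / 4
    return scores

walkthrough_labels = {
    "all_relevant_sentence_keys": ["0b", "0c", "1a", "1b"],
    "all_utilized_sentence_keys": ["0b", "0c", "1a"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
scores = trace_scores(walkthrough_labels)
# R = 0.20, U = 0.75, C = 0.75, A = 0.0, average = 0.425
```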

---

## Key Takeaways

### 1. Each Metric Answers a Different Question

| Metric | Question | Data Source |
|--------|----------|-------------|
| **R** | Is retrieval good? | Relevant sentences |
| **U** | Does LLM use it? | Utilized sentences |
| **C** | Is response comprehensive? | Overlap |
| **A** | Is response truthful? | Support flags |

### 2. Metrics Are Independent

- Low R, high U is possible (the LLM ignores the irrelevant context)
- High R, low U is possible (retrieval is good, generation underuses it)
- Low C, high A is possible (limited coverage, but everything stated is correct)
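
The first two bullets in numbers (illustrative counts, using this guide's fixed 20-sentence denominator):

```python
TOTAL = 20  # fixed relevance denominator used in this guide

# Low R, high U: only 1 of 20 retrieved sentences is relevant,
# but the response uses that one sentence.
r_low, u_high = 1 / TOTAL, 1 / 1    # R = 0.05, U = 1.0

# High R, low U: 10 relevant sentences retrieved, response uses only 2.
r_high, u_low = 10 / TOTAL, 2 / 10  # R = 0.50, U = 0.20

assert r_low < r_high and u_low < u_high  # the two metrics move independently
```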

### 3. GPT Labeling is Sentence-Level

- Fine-grained sentence keys (0a, 0b, 1c, etc.)
- Exact mapping of support
- Transparent and verifiable

### 4. All Four Metrics Required for Full Picture

```
Relevance:    ← "Did we retrieve the right docs?"
Utilization:  ← "Did the LLM use them?"
Completeness: ← "Did it cover the information?"
Adherence:    ← "Is it accurate?"
```

All four needed to understand RAG quality.