# GPT Labeling Prompt → TRACE Metrics: Complete Explanation ✨

## 🎯 The Big Picture

Your RAG Capstone Project uses **an LLM (GPT) to evaluate RAG responses** instead of simple keyword matching. Here's how it works:

```
┌──────────────┐
│   Query      │
│ + Response   │
│ + Documents  │
└──────┬───────┘
       │
       ▼
┌──────────────────────────────┐
│ Sentencize (Add keys:        │
│ doc_0_s0, resp_s0, etc.)     │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Generate Structured GPT      │
│ Labeling Prompt              │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Call Groq LLM API            │
│ (llm_client.generate)        │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ LLM Returns JSON with:       │
│ - relevant_sentence_keys     │
│ - utilized_sentence_keys     │
│ - support_info               │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Extract and Calculate:       │
│ R (Relevance)   = 0.15       │
│ T (Utilization) = 0.67       │
│ C (Completeness)= 0.67       │
│ A (Adherence)   = 1.0        │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Return AdvancedTRACEScores   │
│ with all metrics + metadata  │
└──────────────────────────────┘
```

---

## 📋 What the GPT Prompt Asks

The GPT labeling prompt (in `advanced_rag_evaluator.py`, line 305) instructs the LLM to:

**"You are a Fact-Checking and Citation Specialist"**

1. **Identify Relevant Information**: Which document sentences are relevant to the question?
2. **Verify Support**: Which document sentences support each response sentence?
3. **Check Completeness**: Is all important information covered?
4. **Detect Hallucinations**: Are there any unsupported claims?
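The actual template lives in `advanced_rag_evaluator.py` (lines 305-350). As a hypothetical sketch, such a prompt might be assembled from the keyed sentences like this; the wording and function name below are illustrative, not the project's actual code:

```python
def build_labeling_prompt(question, doc_sents, resp_sents):
    """Assemble a fact-checking prompt from keyed sentences.

    doc_sents/resp_sents map keys like "doc_0_s0"/"resp_s0" to sentence text.
    The wording here is illustrative, not the project's exact template.
    """
    doc_block = "\n".join(f"{key}: {text}" for key, text in doc_sents.items())
    resp_block = "\n".join(f"{key}: {text}" for key, text in resp_sents.items())
    return (
        "You are a Fact-Checking and Citation Specialist.\n\n"
        f"Question:\n{question}\n\n"
        f"Document sentences:\n{doc_block}\n\n"
        f"Response sentences:\n{resp_block}\n\n"
        "Return JSON with the keys: all_relevant_sentence_keys, "
        "all_utilized_sentence_keys, sentence_support_information, "
        "overall_supported."
    )
```

Keying every sentence in the prompt is what lets the LLM answer with compact references instead of quoting text back.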

---

## πŸ” What the LLM Returns (JSON)

```json
{
  "relevance_explanation": "Documents 1-2 are relevant, document 3 is not",

  "all_relevant_sentence_keys": [
    "doc_0_s0",  ← Sentence 0 from document 0
    "doc_0_s1",  ← Sentence 1 from document 0
    "doc_1_s0"   ← Sentence 0 from document 1
  ],

  "sentence_support_information": [
    {
      "response_sentence_key": "resp_s0",
      "explanation": "Matches doc_0_s0 about COVID-19",
      "supporting_sentence_keys": ["doc_0_s0"],
      "fully_supported": true  ← ✓ No hallucination
    },
    {
      "response_sentence_key": "resp_s1",
      "explanation": "Matches doc_0_s1 about droplet spread",
      "supporting_sentence_keys": ["doc_0_s1"],
      "fully_supported": true  ← ✓ No hallucination
    }
  ],

  "all_utilized_sentence_keys": [
    "doc_0_s0",
    "doc_0_s1"
  ],

  "overall_supported": true  ← Response is fully grounded
}
```

(The `←` annotations are explanatory labels, not part of the JSON the LLM returns.)
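The evaluator has to parse this reply before any metric can be computed. A minimal validation sketch; the function name and error handling are illustrative, not the project's actual code:

```python
import json

# Fields the metric calculations below depend on
REQUIRED_KEYS = {
    "all_relevant_sentence_keys",
    "all_utilized_sentence_keys",
    "sentence_support_information",
}

def parse_labels(raw_reply: str) -> dict:
    """Parse the LLM's JSON reply and fail loudly if a field is missing."""
    labels = json.loads(raw_reply)
    missing = REQUIRED_KEYS - labels.keys()
    if missing:
        raise ValueError(f"LLM reply is missing keys: {sorted(missing)}")
    return labels
```

Validating up front keeps a malformed LLM reply from silently producing zero or nonsense scores downstream.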

---

## 📊 How Each TRACE Metric is Calculated

### **Metric 1: RELEVANCE (R)**

**Question Being Answered**: "How much of the retrieved documents are relevant to the question?"

**Code Location**: `advanced_rag_evaluator.py`, Lines 554-562

**Calculation**:
```python
R = len(all_relevant_sentence_keys) / 20
```

**From GPT Response**:
- Uses: `all_relevant_sentence_keys` count
- Example: `["doc_0_s0", "doc_0_s1", "doc_1_s0"]` → 3 keys
- Divided by 20 (normalized max)
- Result: 3/20 = **0.15** (15%)

**Interpretation**: Only 15% of the document context is relevant to the query; the rest is noise.
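The formula above can be sketched as a standalone function (the name is mine; the project computes this inline in `advanced_rag_evaluator.py`):

```python
def relevance_score(labels: dict, max_sentences: int = 20) -> float:
    """R: count of relevant document sentences, normalized by a fixed cap."""
    return len(labels["all_relevant_sentence_keys"]) / max_sentences
```

Note that with more than 20 relevant sentences this would exceed 1.0, so clamping with `min(1.0, ...)` may be worth considering.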

---

### **Metric 2: UTILIZATION (T)**

**Question Being Answered**: "Of the relevant information, how much did the LLM actually use?"

**Code Location**: `advanced_rag_evaluator.py`, Lines 564-576

**Calculation**:
```python
T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)
```

**From GPT Response**:
- Numerator: `all_utilized_sentence_keys` count (e.g., 2)
- Denominator: `all_relevant_sentence_keys` count (e.g., 3)
- Result: 2/3 = **0.67** (67%)

**Interpretation**: The LLM used 67% of the relevant information. It ignored one relevant sentence.
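A sketch of T with a guard for the case where nothing is relevant (the name and the zero-division guard are my assumptions, not confirmed project code):

```python
def utilization_score(labels: dict) -> float:
    """T: share of the relevant sentences that the response actually used."""
    relevant = labels["all_relevant_sentence_keys"]
    if not relevant:
        return 0.0  # avoid dividing by zero when no sentence is relevant
    return len(labels["all_utilized_sentence_keys"]) / len(relevant)
```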

---

### **Metric 3: COMPLETENESS (C)**

**Question Being Answered**: "Does the response cover all the relevant information?"

**Code Location**: `advanced_rag_evaluator.py`, Lines 577-591

**Calculation**:
```python
C = len(relevant_AND_utilized) / len(relevant)
```

**From GPT Response**:
- Find intersection of:
  - `all_relevant_sentence_keys` = {doc_0_s0, doc_0_s1, doc_1_s0}
  - `all_utilized_sentence_keys` = {doc_0_s0, doc_0_s1}
- Intersection = {doc_0_s0, doc_0_s1} → 2 items
- Result: 2/3 = **0.67** (67%)

**Interpretation**: The response covers 67% of the relevant information; doc_1_s0 is missing. (With these formulas, C equals T whenever every utilized sentence is also marked relevant; the two diverge only when the LLM utilizes sentences it did not flag as relevant.)
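A sketch of C using set intersection (illustrative name; the guard for an empty relevant set is my assumption):

```python
def completeness_score(labels: dict) -> float:
    """C: share of relevant sentences actually covered by the response."""
    relevant = set(labels["all_relevant_sentence_keys"])
    utilized = set(labels["all_utilized_sentence_keys"])
    if not relevant:
        return 0.0
    return len(relevant & utilized) / len(relevant)
```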

---

### **Metric 4: ADHERENCE (A) - Hallucination Detection**

**Question Being Answered**: "Does the response contain hallucinations? (Are all claims supported by documents?)"

**Code Location**: `advanced_rag_evaluator.py`, Lines 593-609

**Calculation**:
```python
# A is 1.0 only when every response sentence is fully supported
if all(s["fully_supported"] for s in sentence_support_information):
    A = 1.0
else:
    A = 0.0  # at least one unsupported claim (hallucination) found
```

**From GPT Response**:
- Check each item in `sentence_support_information`
- Look at the `fully_supported` field
- Example:
  ```
  resp_s0: fully_supported = true ✓
  resp_s1: fully_supported = true ✓
  ```
- All are true → Result: **1.0** (No hallucinations!)

- If any were false:
  ```
  resp_s0: fully_supported = true ✓
  resp_s1: fully_supported = false ✗ HALLUCINATION!
  ```
  Result: **0.0** (Contains hallucination)

**Interpretation**: 1.0 = Response is completely grounded in documents. 0.0 = Contains at least one unsupported claim.

---

## 📈 Real Example: Full Walkthrough

### **Input**:
```
Question:  "What is COVID-19?"
Response:  "COVID-19 is a respiratory disease. It spreads via droplets."

Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2. The virus spreads through respiratory droplets."
2. "Vaccines help prevent infection."
```

### **Step 1: Sentencize**
```
doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."

resp_s0: "COVID-19 is a respiratory disease."
resp_s1: "It spreads via droplets."
```
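A naive sentencizer producing these keys might look like the following. This is a sketch using a simple regex split; the project's actual splitter isn't shown here and may well use a proper NLP library:

```python
import re

# Split after sentence-ending punctuation followed by whitespace (naive)
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def sentencize_docs(docs):
    """Key every document sentence as doc_<d>_s<s>."""
    keyed = {}
    for d, doc in enumerate(docs):
        for s, sent in enumerate(_SENT_BOUNDARY.split(doc.strip())):
            keyed[f"doc_{d}_s{s}"] = sent
    return keyed

def sentencize_response(response):
    """Key every response sentence as resp_s<i>."""
    return {f"resp_s{i}": sent
            for i, sent in enumerate(_SENT_BOUNDARY.split(response.strip()))}
```

A regex split like this mishandles abbreviations ("Dr. Smith"), which is one reason real pipelines prefer trained sentence tokenizers.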

### **Step 2: Send to GPT Labeling Prompt**
GPT analyzes and returns:

```json
{
  "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "sentence_support_information": [
    {"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
    {"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
  ]
}
```

### **Step 3: Calculate TRACE Metrics**

**Relevance (R)**:
- Relevant keys: 2 (doc_0_s0, doc_0_s1)
- Formula: 2/20 = **0.10** (10%)
- Meaning: 10% of the documents are relevant

**Utilization (T)**:
- Used: 2, Relevant: 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Used all relevant information

**Completeness (C)**:
- Relevant ∩ Used = 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Response covers all relevant info

**Adherence (A)**:
- All sentences: fully_supported=true?
- YES → **1.0** (No hallucinations!)

**Average Score**:
- (0.10 + 1.00 + 1.00 + 1.00) / 4 = **0.775** (77.5% overall quality)
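The whole of Step 3 can be reproduced in a few lines. This is a sketch mirroring the formulas above; `trace_scores` is a name chosen here, not the project's API:

```python
def trace_scores(labels: dict, max_sentences: int = 20):
    """Compute (R, T, C, A) from a single set of GPT labels."""
    relevant = set(labels["all_relevant_sentence_keys"])
    utilized = set(labels["all_utilized_sentence_keys"])
    support = labels["sentence_support_information"]
    R = len(relevant) / max_sentences
    T = len(utilized) / len(relevant) if relevant else 0.0
    C = len(relevant & utilized) / len(relevant) if relevant else 0.0
    A = 1.0 if all(s["fully_supported"] for s in support) else 0.0
    return R, T, C, A

# The Step 2 labels from the walkthrough
labels = {
    "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
    "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
    "sentence_support_information": [
        {"response_sentence_key": "resp_s0", "fully_supported": True},
        {"response_sentence_key": "resp_s1", "fully_supported": True},
    ],
}
R, T, C, A = trace_scores(labels)
print(R, T, C, A, (R + T + C + A) / 4)  # 0.1 1.0 1.0 1.0 0.775
```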

---

## 🎓 Why This Is Better Than Simple Metrics

| Aspect | Simple Keywords | GPT Labeling |
|--------|-----------------|--------------|
| Understanding | ❌ Keyword matching | ✅ Semantic understanding |
| Hallucination Detection | ❌ Can't detect | ✅ Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | ✅ Understands meaning |
| Explainability | ❌ "Just a number" | ✅ Shows exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | ✅ Generalizes across domains |

---

## 🔑 Key Files to Reference

| File | Purpose | Key Lines |
|------|---------|-----------|
| `advanced_rag_evaluator.py` | Main evaluation engine | All calculations |
| `advanced_rag_evaluator.py` | Prompt template | Lines 305-350 |
| `advanced_rag_evaluator.py` | Get GPT response | Lines 470-552 |
| `advanced_rag_evaluator.py` | Calculate R metric | Lines 554-562 |
| `advanced_rag_evaluator.py` | Calculate T metric | Lines 564-576 |
| `advanced_rag_evaluator.py` | Calculate C metric | Lines 577-591 |
| `advanced_rag_evaluator.py` | Calculate A metric | Lines 593-609 |
| `llm_client.py` | Groq API calls | LLM integration |

---

## 💡 Key Insights

1. **All metrics come from ONE GPT response**: They're consistent and complementary
2. **Sentence keys enable traceability**: Can show exactly which doc supported which claim
3. **Adherence is binary**: Either fully supported (1.0) or not (0.0) - a single unsupported claim zeroes the score
4. **Relevance normalization**: Divided by a fixed cap of 20, which keeps the score in the 0-1 range as long as no more than 20 sentences are relevant
5. **LLM as Judge**: Semantic understanding without any code-based rule engineering

---

## 🎯 Summary in One Sentence

**GPT analyzes which document sentences support which response sentences, then metrics are calculated from this mapping to assess RAG quality.**

---

## 📚 Complete Documentation Available

1. **TRACE_METRICS_QUICK_REFERENCE.md** - Quick lookup
2. **TRACE_METRICS_EXPLANATION.md** - Detailed explanation
3. **TRACE_Metrics_Flow.png** - Visual process flow
4. **Sentence_Mapping_Example.png** - Sentence-level details
5. **RAG_Architecture_Diagram.png** - System overview
6. **RAG_Data_Flow_Diagram.png** - Complete pipeline
7. **RAG_Capstone_Project_Presentation.pptx** - Full presentation
8. **DOCUMENTATION_INDEX.md** - Navigation guide