# GPT Labeling Prompt → TRACE Metrics: Complete Explanation ✨
## 🎯 The Big Picture
Your RAG Capstone Project uses **GPT (LLM) to evaluate RAG responses** instead of simple keyword matching. Here's how it works:
```
┌──────────────┐
│ Query        │
│ + Response   │
│ + Documents  │
└──────┬───────┘
       │
       ▼
┌──────────────────────────────┐
│ Sentencize (Add keys:        │
│ doc_0_s0, resp_s0, etc.)     │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Generate Structured GPT      │
│ Labeling Prompt              │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Call Groq LLM API            │
│ (llm_client.generate)        │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ LLM Returns JSON with:       │
│ - relevant_sentence_keys     │
│ - utilized_sentence_keys     │
│ - support_info               │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Extract and Calculate:       │
│ R (Relevance)    = 0.15      │
│ T (Utilization)  = 0.67      │
│ C (Completeness) = 0.67      │
│ A (Adherence)    = 1.0       │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Return AdvancedTRACEScores   │
│ with all metrics + metadata  │
└──────────────────────────────┘
```
---
## πŸ“‹ What the GPT Prompt Asks
The GPT labeling prompt (in `advanced_rag_evaluator.py`, line 305) casts the LLM as a **"Fact-Checking and Citation Specialist"** and instructs it to:
1. **Identify Relevant Information**: Which document sentences are relevant to the question?
2. **Verify Support**: Which document sentences support each response sentence?
3. **Check Completeness**: Is all important information covered?
4. **Detect Hallucinations**: Are there any unsupported claims?
---
## πŸ” What the LLM Returns (JSON)
```json
{
"relevance_explanation": "Documents 1-2 are relevant, document 3 is not",
"all_relevant_sentence_keys": [
"doc_0_s0", ← Sentence 0 from document 0
"doc_0_s1", ← Sentence 1 from document 0
"doc_1_s0" ← Sentence 0 from document 1
],
"sentence_support_information": [
{
"response_sentence_key": "resp_s0",
"explanation": "Matches doc_0_s0 about COVID-19",
"supporting_sentence_keys": ["doc_0_s0"],
"fully_supported": true ← ✓ No hallucination
},
{
"response_sentence_key": "resp_s1",
"explanation": "Matches doc_0_s1 about droplet spread",
"supporting_sentence_keys": ["doc_0_s1"],
"fully_supported": true ← ✓ No hallucination
}
],
"all_utilized_sentence_keys": [
"doc_0_s0",
"doc_0_s1"
],
"overall_supported": true ← Response is fully grounded
}
```
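Before any metric can be computed, these fields have to be pulled out of the raw model output. A minimal parsing sketch (the helper name `parse_labeling_json` is illustrative, not the project's actual API); it also strips the markdown fences some models wrap around JSON:

```python
import json

def parse_labeling_json(raw: str) -> dict:
    """Strip optional markdown code fences, then parse the labeling JSON."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional "json" tag)...
        text = text.split("\n", 1)[1]
        # ...and everything from the closing fence onward.
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

raw = '```json\n{"all_relevant_sentence_keys": ["doc_0_s0"], "overall_supported": true}\n```'
labels = parse_labeling_json(raw)
print(labels["all_relevant_sentence_keys"])  # ['doc_0_s0']
```

If the model returns malformed JSON, `json.loads` raises `json.JSONDecodeError`, which the caller can catch to retry or fall back.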
---
## πŸ“Š How Each TRACE Metric is Calculated
### **Metric 1: RELEVANCE (R)**
**Question Being Answered**: "How much of the retrieved documents are relevant to the question?"
**Code Location**: `advanced_rag_evaluator.py`, Lines 554-562
**Calculation**:
```python
R = len(all_relevant_sentence_keys) / 20
```
**From GPT Response**:
- Uses: `all_relevant_sentence_keys` count
- Example: `["doc_0_s0", "doc_0_s1", "doc_1_s0"]` → 3 keys
- Divided by 20 (normalized max)
- Result: 3/20 = **0.15** (15%)
**Interpretation**: Only 15% of the document context is relevant to the query; the rest is noise.
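A minimal sketch of this formula (the constant name `MAX_SENTENCES` and the `min` cap are assumptions, added so the score cannot exceed 1.0 if more than 20 sentences are marked relevant):

```python
# Fixed normalization cap described above (name assumed for illustration).
MAX_SENTENCES = 20

def relevance(all_relevant_sentence_keys: list[str]) -> float:
    # Count of relevant sentence keys, normalized to the 0-1 range.
    return min(len(all_relevant_sentence_keys) / MAX_SENTENCES, 1.0)

print(relevance(["doc_0_s0", "doc_0_s1", "doc_1_s0"]))  # 0.15
```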
---
### **Metric 2: UTILIZATION (T)**
**Question Being Answered**: "Of the relevant information, how much did the LLM actually use?"
**Code Location**: `advanced_rag_evaluator.py`, Lines 564-576
**Calculation**:
```python
T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)
```
**From GPT Response**:
- Numerator: `all_utilized_sentence_keys` count (e.g., 2)
- Denominator: `all_relevant_sentence_keys` count (e.g., 3)
- Result: 2/3 = **0.67** (67%)
**Interpretation**: The LLM used 67% of the relevant information. It ignored one relevant sentence.
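Sketched in Python, with an added guard for an empty relevant set (function name illustrative):

```python
def utilization(utilized: list[str], relevant: list[str]) -> float:
    # Guard: if nothing is relevant, there is nothing to utilize.
    if not relevant:
        return 0.0
    # Fraction of relevant sentences the response actually used.
    return len(utilized) / len(relevant)

print(round(utilization(["doc_0_s0", "doc_0_s1"],
                        ["doc_0_s0", "doc_0_s1", "doc_1_s0"]), 2))  # 0.67
```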
---
### **Metric 3: COMPLETENESS (C)**
**Question Being Answered**: "Does the response cover all the relevant information?"
**Code Location**: `advanced_rag_evaluator.py`, Lines 577-591
**Calculation**:
```python
C = len(set(relevant) & set(utilized)) / len(relevant)
```
**From GPT Response**:
- Find intersection of:
- `all_relevant_sentence_keys` = {doc_0_s0, doc_0_s1, doc_1_s0}
- `all_utilized_sentence_keys` = {doc_0_s0, doc_0_s1}
- Intersection = {doc_0_s0, doc_0_s1} → 2 items
- Result: 2/3 = **0.67** (67%)
**Interpretation**: The response covers 67% of the relevant information. Missing doc_1_s0.
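The intersection step can be sketched with Python sets (function name illustrative, with a guard for an empty relevant set):

```python
def completeness(relevant: list[str], utilized: list[str]) -> float:
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    # Relevant sentences the response actually drew on.
    covered = relevant_set & set(utilized)
    return len(covered) / len(relevant_set)

print(round(completeness(["doc_0_s0", "doc_0_s1", "doc_1_s0"],
                         ["doc_0_s0", "doc_0_s1"]), 2))  # 0.67
```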
---
### **Metric 4: ADHERENCE (A) - Hallucination Detection**
**Question Being Answered**: "Does the response contain hallucinations? (Are all claims supported by documents?)"
**Code Location**: `advanced_rag_evaluator.py`, Lines 593-609
**Calculation**:
```python
# A is binary: 1.0 only if every response sentence is fully supported
if all(s["fully_supported"] for s in sentence_support_information):
    A = 1.0
else:
    A = 0.0  # at least one hallucination found!
```
**From GPT Response**:
- Check each item in `sentence_support_information`
- Look at the `fully_supported` field
- Example:
```
resp_s0: fully_supported = true ✓
resp_s1: fully_supported = true ✓
```
- All are true → Result: **1.0** (No hallucinations!)
- If any were false:
```
resp_s0: fully_supported = true ✓
resp_s1: fully_supported = false ✗ HALLUCINATION!
```
Result: **0.0** (Contains hallucination)
**Interpretation**: 1.0 = Response is completely grounded in documents. 0.0 = Contains at least one unsupported claim.
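The binary check can be sketched with `all()` (function name illustrative; treating a missing `fully_supported` field as unsupported is an added assumption):

```python
def adherence(sentence_support_information: list[dict]) -> float:
    # 1.0 only if every response sentence is fully supported, else 0.0.
    supported = all(item.get("fully_supported", False)
                    for item in sentence_support_information)
    return 1.0 if supported else 0.0

print(adherence([{"fully_supported": True}, {"fully_supported": True}]))   # 1.0
print(adherence([{"fully_supported": True}, {"fully_supported": False}]))  # 0.0
```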
---
## πŸ“ˆ Real Example: Full Walkthrough
### **Input**:
```
Question: "What is COVID-19?"
Response: "COVID-19 is a respiratory disease. It spreads via droplets."
Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2. The virus spreads through respiratory droplets."
2. "Vaccines help prevent infection."
```
### **Step 1: Sentencize**
```
doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."
resp_s0: "COVID-19 is a respiratory disease."
resp_s1: "It spreads via droplets."
```
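The key assignment in Step 1 could be produced by a naive sentencizer like this (regex-based splitting is an assumption; the actual implementation may use a different tokenizer):

```python
import re

def sentencize(docs: list[str], response: str) -> dict[str, str]:
    """Assign doc_{i}_s{j} / resp_s{j} keys to each sentence (naive split)."""
    keys = {}
    for i, doc in enumerate(docs):
        # Split after sentence-ending punctuation followed by whitespace.
        for j, sent in enumerate(re.split(r"(?<=[.!?])\s+", doc.strip())):
            keys[f"doc_{i}_s{j}"] = sent
    for j, sent in enumerate(re.split(r"(?<=[.!?])\s+", response.strip())):
        keys[f"resp_s{j}"] = sent
    return keys

keys = sentencize(
    ["COVID-19 is a respiratory disease caused by SARS-CoV-2. "
     "The virus spreads through respiratory droplets.",
     "Vaccines help prevent infection."],
    "COVID-19 is a respiratory disease. It spreads via droplets.",
)
print(keys["doc_0_s1"])  # The virus spreads through respiratory droplets.
```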
### **Step 2: Send to GPT Labeling Prompt**
GPT analyzes and returns:
```json
{
"all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
"all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
"sentence_support_information": [
{"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
{"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
]
}
```
### **Step 3: Calculate TRACE Metrics**
**Relevance (R)**:
- Relevant keys: 2 (doc_0_s0, doc_0_s1)
- Formula: 2/20 = **0.10** (10%)
- Meaning: 10% of the documents are relevant
**Utilization (T)**:
- Used: 2, Relevant: 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Used all relevant information
**Completeness (C)**:
- Relevant ∩ Used = 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Response covers all relevant info
**Adherence (A)**:
- All sentences: fully_supported=true?
- YES → **1.0** (No hallucinations!)
**Average Score**:
- (0.10 + 1.00 + 1.00 + 1.0) / 4 = **0.775** (77.5% overall quality)
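The whole walkthrough can be reproduced in a few lines (variable names are illustrative):

```python
# Inputs taken from the GPT labels in Step 2.
relevant = ["doc_0_s0", "doc_0_s1"]
utilized = ["doc_0_s0", "doc_0_s1"]
support = [{"fully_supported": True}, {"fully_supported": True}]

R = min(len(relevant) / 20, 1.0)                        # Relevance
T = len(utilized) / len(relevant) if relevant else 0.0  # Utilization
C = (len(set(relevant) & set(utilized)) / len(relevant)
     if relevant else 0.0)                              # Completeness
A = 1.0 if all(s["fully_supported"] for s in support) else 0.0  # Adherence

average = (R + T + C + A) / 4
print(f"R={R:.2f} T={T:.2f} C={C:.2f} A={A:.1f} avg={average:.3f}")
# R=0.10 T=1.00 C=1.00 A=1.0 avg=0.775
```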
---
## πŸŽ“ Why This Is Better Than Simple Metrics
| Aspect | Simple Keywords | GPT Labeling |
|--------|-----------------|--------------|
| Understanding | ❌ Keyword matching | ✅ Semantic understanding |
| Hallucination Detection | ❌ Can't detect | ✅ Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | ✅ Understands meaning |
| Explainability | ❌ "Just a number" | ✅ Shows exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | ✅ Works across domains |
---
## πŸ”‘ Key Files to Reference
| File | Purpose | Key Lines |
|------|---------|-----------|
| `advanced_rag_evaluator.py` | Main evaluation engine | All calculations |
| `advanced_rag_evaluator.py` | Prompt template | Lines 305-350 |
| `advanced_rag_evaluator.py` | Get GPT response | Lines 470-552 |
| `advanced_rag_evaluator.py` | Calculate R metric | Lines 554-562 |
| `advanced_rag_evaluator.py` | Calculate T metric | Lines 564-576 |
| `advanced_rag_evaluator.py` | Calculate C metric | Lines 577-591 |
| `advanced_rag_evaluator.py` | Calculate A metric | Lines 593-609 |
| `llm_client.py` | Groq API calls | LLM integration |
---
## πŸ’‘ Key Insights
1. **All metrics come from ONE GPT response**: They're consistent and complementary
2. **Sentence keys enable traceability**: Can show exactly which doc supported which claim
3. **Adherence is binary**: Either fully supported (1.0) or not (0.0) - catches all hallucinations
4. **Relevance normalization**: Divided by a fixed cap of 20 sentences so the score stays in the 0-1 range regardless of document length
5. **LLM as Judge**: Semantic understanding without any code-based rule engineering
---
## 🎯 Summary in One Sentence
**GPT analyzes which document sentences support which response sentences, then metrics are calculated from this mapping to assess RAG quality.**
---
## πŸ“š Complete Documentation Available
1. **TRACE_METRICS_QUICK_REFERENCE.md** - Quick lookup
2. **TRACE_METRICS_EXPLANATION.md** - Detailed explanation
3. **TRACE_Metrics_Flow.png** - Visual process flow
4. **Sentence_Mapping_Example.png** - Sentence-level details
5. **RAG_Architecture_Diagram.png** - System overview
6. **RAG_Data_Flow_Diagram.png** - Complete pipeline
7. **RAG_Capstone_Project_Presentation.pptx** - Full presentation
8. **DOCUMENTATION_INDEX.md** - Navigation guide