# GPT Labeling Prompt → TRACE Metrics: Complete Explanation ✨

## 🎯 The Big Picture

Your RAG Capstone Project uses **an LLM (GPT) to evaluate RAG responses** instead of simple keyword matching. Here's how it works:

```
┌──────────────┐
│   Query      │
│ + Response   │
│ + Documents  │
└──────┬───────┘
       │
       ▼
┌──────────────────────────────┐
│ Sentencize (Add keys:        │
│ doc_0_s0, resp_s0, etc.)     │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Generate Structured GPT      │
│ Labeling Prompt              │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Call Groq LLM API            │
│ (llm_client.generate)        │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ LLM Returns JSON with:       │
│ - relevant_sentence_keys     │
│ - utilized_sentence_keys     │
│ - support_info               │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Extract and Calculate:       │
│ R (Relevance)   = 0.15       │
│ T (Utilization) = 0.67       │
│ C (Completeness)= 0.67       │
│ A (Adherence)   = 1.0        │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Return AdvancedTRACEScores   │
│ with all metrics + metadata  │
└──────────────────────────────┘
```

---

## 📋 What the GPT Prompt Asks

The GPT labeling prompt (in `advanced_rag_evaluator.py`, line 305) instructs the LLM to:

**"You are a Fact-Checking and Citation Specialist"**

1. **Identify Relevant Information**: Which document sentences are relevant to the question?
2. **Verify Support**: Which document sentences support each response sentence?
3. **Check Completeness**: Is all important information covered?
4. **Detect Hallucinations**: Are there any unsupported claims?
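The actual template lives in `advanced_rag_evaluator.py` (lines 305-350). As a hypothetical sketch, such a prompt might be assembled from the keyed sentences like this; the wording and function name below are illustrative, not the project's actual code:

```python
def build_labeling_prompt(question, doc_sents, resp_sents):
    """Assemble a fact-checking prompt from keyed sentences.

    doc_sents/resp_sents map keys like "doc_0_s0"/"resp_s0" to sentence text.
    The wording here is illustrative, not the project's exact template.
    """
    doc_block = "\n".join(f"{key}: {text}" for key, text in doc_sents.items())
    resp_block = "\n".join(f"{key}: {text}" for key, text in resp_sents.items())
    return (
        "You are a Fact-Checking and Citation Specialist.\n\n"
        f"Question:\n{question}\n\n"
        f"Document sentences:\n{doc_block}\n\n"
        f"Response sentences:\n{resp_block}\n\n"
        "Return JSON with the keys: all_relevant_sentence_keys, "
        "all_utilized_sentence_keys, sentence_support_information, "
        "overall_supported."
    )
```

Keying every sentence in the prompt is what lets the LLM answer with compact references instead of quoting text back.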

---

## πŸ” What the LLM Returns (JSON)

```json
{
  "relevance_explanation": "Documents 1-2 are relevant, document 3 is not",

  "all_relevant_sentence_keys": [
    "doc_0_s0",  ← Sentence 0 from document 0
    "doc_0_s1",  ← Sentence 1 from document 0
    "doc_1_s0"   ← Sentence 0 from document 1
  ],

  "sentence_support_information": [
    {
      "response_sentence_key": "resp_s0",
      "explanation": "Matches doc_0_s0 about COVID-19",
      "supporting_sentence_keys": ["doc_0_s0"],
      "fully_supported": true  ← ✓ No hallucination
    },
    {
      "response_sentence_key": "resp_s1",
      "explanation": "Matches doc_0_s1 about droplet spread",
      "supporting_sentence_keys": ["doc_0_s1"],
      "fully_supported": true  ← ✓ No hallucination
    }
  ],

  "all_utilized_sentence_keys": [
    "doc_0_s0",
    "doc_0_s1"
  ],

  "overall_supported": true  ← Response is fully grounded
}
```

(The `←` annotations are explanatory labels, not part of the JSON the LLM returns.)
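The evaluator has to parse this reply before any metric can be computed. A minimal validation sketch; the function name and error handling are illustrative, not the project's actual code:

```python
import json

# Fields the metric calculations below depend on
REQUIRED_KEYS = {
    "all_relevant_sentence_keys",
    "all_utilized_sentence_keys",
    "sentence_support_information",
}

def parse_labels(raw_reply: str) -> dict:
    """Parse the LLM's JSON reply and fail loudly if a field is missing."""
    labels = json.loads(raw_reply)
    missing = REQUIRED_KEYS - labels.keys()
    if missing:
        raise ValueError(f"LLM reply is missing keys: {sorted(missing)}")
    return labels
```

Validating up front keeps a malformed LLM reply from silently producing zero or nonsense scores downstream.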

---

## 📊 How Each TRACE Metric is Calculated

### **Metric 1: RELEVANCE (R)**

**Question Being Answered**: "How much of the retrieved documents are relevant to the question?"

**Code Location**: `advanced_rag_evaluator.py`, Lines 554-562

**Calculation**:
```python
R = len(all_relevant_sentence_keys) / 20
```

**From GPT Response**:
- Uses: `all_relevant_sentence_keys` count
- Example: `["doc_0_s0", "doc_0_s1", "doc_1_s0"]` → 3 keys
- Divided by 20 (normalized max)
- Result: 3/20 = **0.15** (15%)

**Interpretation**: Only 15% of the document context is relevant to the query; the rest is noise.
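The formula above can be sketched as a standalone function (the name is mine; the project computes this inline in `advanced_rag_evaluator.py`):

```python
def relevance_score(labels: dict, max_sentences: int = 20) -> float:
    """R: count of relevant document sentences, normalized by a fixed cap."""
    return len(labels["all_relevant_sentence_keys"]) / max_sentences
```

Note that with more than 20 relevant sentences this would exceed 1.0, so clamping with `min(1.0, ...)` may be worth considering.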

---

### **Metric 2: UTILIZATION (T)**

**Question Being Answered**: "Of the relevant information, how much did the LLM actually use?"

**Code Location**: `advanced_rag_evaluator.py`, Lines 564-576

**Calculation**:
```python
T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)
```

**From GPT Response**:
- Numerator: `all_utilized_sentence_keys` count (e.g., 2)
- Denominator: `all_relevant_sentence_keys` count (e.g., 3)
- Result: 2/3 = **0.67** (67%)

**Interpretation**: The LLM used 67% of the relevant information. It ignored one relevant sentence.
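A sketch of T with a guard for the case where nothing is relevant (the name and the zero-division guard are my assumptions, not confirmed project code):

```python
def utilization_score(labels: dict) -> float:
    """T: share of the relevant sentences that the response actually used."""
    relevant = labels["all_relevant_sentence_keys"]
    if not relevant:
        return 0.0  # avoid dividing by zero when no sentence is relevant
    return len(labels["all_utilized_sentence_keys"]) / len(relevant)
```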

---

### **Metric 3: COMPLETENESS (C)**

**Question Being Answered**: "Does the response cover all the relevant information?"

**Code Location**: `advanced_rag_evaluator.py`, Lines 577-591

**Calculation**:
```python
C = len(relevant_AND_utilized) / len(relevant)
```

**From GPT Response**:
- Find intersection of:
  - `all_relevant_sentence_keys` = {doc_0_s0, doc_0_s1, doc_1_s0}
  - `all_utilized_sentence_keys` = {doc_0_s0, doc_0_s1}
- Intersection = {doc_0_s0, doc_0_s1} → 2 items
- Result: 2/3 = **0.67** (67%)

**Interpretation**: The response covers 67% of the relevant information; doc_1_s0 is missing. (With these formulas, C equals T whenever every utilized sentence is also marked relevant; the two diverge only when the LLM utilizes sentences it did not flag as relevant.)
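A sketch of C using set intersection (illustrative name; the guard for an empty relevant set is my assumption):

```python
def completeness_score(labels: dict) -> float:
    """C: share of relevant sentences actually covered by the response."""
    relevant = set(labels["all_relevant_sentence_keys"])
    utilized = set(labels["all_utilized_sentence_keys"])
    if not relevant:
        return 0.0
    return len(relevant & utilized) / len(relevant)
```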

---

### **Metric 4: ADHERENCE (A) - Hallucination Detection**

**Question Being Answered**: "Does the response contain hallucinations? (Are all claims supported by documents?)"

**Code Location**: `advanced_rag_evaluator.py`, Lines 593-609

**Calculation**:
```python
# A is 1.0 only when every response sentence is fully supported
if all(s["fully_supported"] for s in sentence_support_information):
    A = 1.0
else:
    A = 0.0  # at least one unsupported claim (hallucination) found
```

**From GPT Response**:
- Check each item in `sentence_support_information`
- Look at the `fully_supported` field
- Example:
  ```
  resp_s0: fully_supported = true ✓
  resp_s1: fully_supported = true ✓
  ```
- All are true → Result: **1.0** (No hallucinations!)

- If any were false:
  ```
  resp_s0: fully_supported = true ✓
  resp_s1: fully_supported = false ✗ HALLUCINATION!
  ```
  Result: **0.0** (Contains hallucination)

**Interpretation**: 1.0 = Response is completely grounded in documents. 0.0 = Contains at least one unsupported claim.

---

## 📈 Real Example: Full Walkthrough

### **Input**:
```
Question:  "What is COVID-19?"
Response:  "COVID-19 is a respiratory disease. It spreads via droplets."

Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2. The virus spreads through respiratory droplets."
2. "Vaccines help prevent infection."
```

### **Step 1: Sentencize**
```
doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."

resp_s0: "COVID-19 is a respiratory disease."
resp_s1: "It spreads via droplets."
```
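A naive sentencizer producing these keys might look like the following. This is a sketch using a simple regex split; the project's actual splitter isn't shown here and may well use a proper NLP library:

```python
import re

# Split after sentence-ending punctuation followed by whitespace (naive)
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def sentencize_docs(docs):
    """Key every document sentence as doc_<d>_s<s>."""
    keyed = {}
    for d, doc in enumerate(docs):
        for s, sent in enumerate(_SENT_BOUNDARY.split(doc.strip())):
            keyed[f"doc_{d}_s{s}"] = sent
    return keyed

def sentencize_response(response):
    """Key every response sentence as resp_s<i>."""
    return {f"resp_s{i}": sent
            for i, sent in enumerate(_SENT_BOUNDARY.split(response.strip()))}
```

A regex split like this mishandles abbreviations ("Dr. Smith"), which is one reason real pipelines prefer trained sentence tokenizers.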

### **Step 2: Send to GPT Labeling Prompt**
GPT analyzes and returns:

```json
{
  "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "sentence_support_information": [
    {"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
    {"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
  ]
}
```

### **Step 3: Calculate TRACE Metrics**

**Relevance (R)**:
- Relevant keys: 2 (doc_0_s0, doc_0_s1)
- Formula: 2/20 = **0.10** (10%)
- Meaning: 10% of the documents are relevant

**Utilization (T)**:
- Used: 2, Relevant: 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Used all relevant information

**Completeness (C)**:
- Relevant ∩ Used = 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Response covers all relevant info

**Adherence (A)**:
- All sentences: fully_supported=true?
- YES → **1.0** (No hallucinations!)

**Average Score**:
- (0.10 + 1.00 + 1.00 + 1.00) / 4 = **0.775** (77.5% overall quality)
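The whole of Step 3 can be reproduced in a few lines. This is a sketch mirroring the formulas above; `trace_scores` is a name chosen here, not the project's API:

```python
def trace_scores(labels: dict, max_sentences: int = 20):
    """Compute (R, T, C, A) from a single set of GPT labels."""
    relevant = set(labels["all_relevant_sentence_keys"])
    utilized = set(labels["all_utilized_sentence_keys"])
    support = labels["sentence_support_information"]
    R = len(relevant) / max_sentences
    T = len(utilized) / len(relevant) if relevant else 0.0
    C = len(relevant & utilized) / len(relevant) if relevant else 0.0
    A = 1.0 if all(s["fully_supported"] for s in support) else 0.0
    return R, T, C, A

# The Step 2 labels from the walkthrough
labels = {
    "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
    "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
    "sentence_support_information": [
        {"response_sentence_key": "resp_s0", "fully_supported": True},
        {"response_sentence_key": "resp_s1", "fully_supported": True},
    ],
}
R, T, C, A = trace_scores(labels)
print(R, T, C, A, (R + T + C + A) / 4)  # 0.1 1.0 1.0 1.0 0.775
```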

---

## 🎓 Why This Is Better Than Simple Metrics

| Aspect | Simple Keywords | GPT Labeling |
|--------|-----------------|--------------|
| Understanding | ❌ Keyword matching | ✅ Semantic understanding |
| Hallucination Detection | ❌ Can't detect | ✅ Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | ✅ Understands meaning |
| Explainability | ❌ "Just a number" | ✅ Shows exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | ✅ Generalizes across domains |

---

## 🔑 Key Files to Reference

| File | Purpose | Key Lines |
|------|---------|-----------|
| `advanced_rag_evaluator.py` | Main evaluation engine | All calculations |
| `advanced_rag_evaluator.py` | Prompt template | Lines 305-350 |
| `advanced_rag_evaluator.py` | Get GPT response | Lines 470-552 |
| `advanced_rag_evaluator.py` | Calculate R metric | Lines 554-562 |
| `advanced_rag_evaluator.py` | Calculate T metric | Lines 564-576 |
| `advanced_rag_evaluator.py` | Calculate C metric | Lines 577-591 |
| `advanced_rag_evaluator.py` | Calculate A metric | Lines 593-609 |
| `llm_client.py` | Groq API calls | LLM integration |

---

## 💡 Key Insights

1. **All metrics come from ONE GPT response**: They're consistent and complementary
2. **Sentence keys enable traceability**: Can show exactly which doc supported which claim
3. **Adherence is binary**: Either fully supported (1.0) or not (0.0) - a single unsupported claim zeroes the score
4. **Relevance normalization**: Divided by a fixed cap of 20, which keeps the score in the 0-1 range as long as no more than 20 sentences are relevant
5. **LLM as Judge**: Semantic understanding without any code-based rule engineering

---

## 🎯 Summary in One Sentence

**GPT analyzes which document sentences support which response sentences, then metrics are calculated from this mapping to assess RAG quality.**

---

## 📚 Complete Documentation Available

1. **TRACE_METRICS_QUICK_REFERENCE.md** - Quick lookup
2. **TRACE_METRICS_EXPLANATION.md** - Detailed explanation
3. **TRACE_Metrics_Flow.png** - Visual process flow
4. **Sentence_Mapping_Example.png** - Sentence-level details
5. **RAG_Architecture_Diagram.png** - System overview
6. **RAG_Data_Flow_Diagram.png** - Complete pipeline
7. **RAG_Capstone_Project_Presentation.pptx** - Full presentation
8. **DOCUMENTATION_INDEX.md** - Navigation guide