# GPT Labeling Prompt → TRACE Metrics: Complete Explanation ✨
## 🎯 The Big Picture
Your RAG Capstone Project uses **GPT (LLM) to evaluate RAG responses** instead of simple keyword matching. Here's how it works:
```
┌──────────────┐
│ Query        │
│ + Response   │
│ + Documents  │
└──────┬───────┘
       │
       ▼
┌──────────────────────────────┐
│ Sentencize (Add keys:        │
│ doc_0_s0, resp_s0, etc.)     │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Generate Structured GPT      │
│ Labeling Prompt              │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Call Groq LLM API            │
│ (llm_client.generate)        │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ LLM Returns JSON with:       │
│ - relevant_sentence_keys     │
│ - utilized_sentence_keys     │
│ - support_info               │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Extract and Calculate:       │
│ R (Relevance)    = 0.15      │
│ T (Utilization)  = 0.67      │
│ C (Completeness) = 0.67      │
│ A (Adherence)    = 1.0       │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Return AdvancedTRACEScores   │
│ with all metrics + metadata  │
└──────────────────────────────┘
```
---
## πŸ“‹ What the GPT Prompt Asks
The GPT labeling prompt (in `advanced_rag_evaluator.py`, line 305) casts the LLM as a **"Fact-Checking and Citation Specialist"** and instructs it to:
1. **Identify Relevant Information**: Which document sentences are relevant to the question?
2. **Verify Support**: Which document sentences support each response sentence?
3. **Check Completeness**: Is all important information covered?
4. **Detect Hallucinations**: Are there any unsupported claims?
---
## πŸ” What the LLM Returns (JSON)
```json
{
"relevance_explanation": "Documents 1-2 are relevant, document 3 is not",
"all_relevant_sentence_keys": [
"doc_0_s0", ← Sentence 0 from document 0
"doc_0_s1", ← Sentence 1 from document 0
"doc_1_s0" ← Sentence 0 from document 1
],
"sentence_support_information": [
{
"response_sentence_key": "resp_s0",
"explanation": "Matches doc_0_s0 about COVID-19",
"supporting_sentence_keys": ["doc_0_s0"],
"fully_supported": true ← ✓ No hallucination
},
{
"response_sentence_key": "resp_s1",
"explanation": "Matches doc_0_s1 about droplet spread",
"supporting_sentence_keys": ["doc_0_s1"],
"fully_supported": true ← ✓ No hallucination
}
],
"all_utilized_sentence_keys": [
"doc_0_s0",
"doc_0_s1"
],
"overall_supported": true ← Response is fully grounded
}
```
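Before any metric can be computed, these fields have to be pulled out of the raw model output. A minimal parsing sketch (the helper name `parse_labeling_json` is illustrative, not the project's actual API); it also strips the markdown fences some models wrap around JSON:

```python
import json

def parse_labeling_json(raw: str) -> dict:
    """Strip optional markdown code fences, then parse the labeling JSON."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional "json" tag)...
        text = text.split("\n", 1)[1]
        # ...and everything from the closing fence onward.
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

raw = '```json\n{"all_relevant_sentence_keys": ["doc_0_s0"], "overall_supported": true}\n```'
labels = parse_labeling_json(raw)
print(labels["all_relevant_sentence_keys"])  # ['doc_0_s0']
```

If the model returns malformed JSON, `json.loads` raises `json.JSONDecodeError`, which the caller can catch to retry or fall back.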
---
## πŸ“Š How Each TRACE Metric is Calculated
### **Metric 1: RELEVANCE (R)**
**Question Being Answered**: "How much of the retrieved documents are relevant to the question?"
**Code Location**: `advanced_rag_evaluator.py`, Lines 554-562
**Calculation**:
```python
R = len(all_relevant_sentence_keys) / 20
```
**From GPT Response**:
- Uses: `all_relevant_sentence_keys` count
- Example: `["doc_0_s0", "doc_0_s1", "doc_1_s0"]` → 3 keys
- Divided by 20 (normalized max)
- Result: 3/20 = **0.15** (15%)
**Interpretation**: Only 15% of the document context is relevant to the query; the rest is noise.
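A minimal sketch of this formula (the constant name `MAX_SENTENCES` and the `min` cap are assumptions, added so the score cannot exceed 1.0 if more than 20 sentences are marked relevant):

```python
# Fixed normalization cap described above (name assumed for illustration).
MAX_SENTENCES = 20

def relevance(all_relevant_sentence_keys: list[str]) -> float:
    # Count of relevant sentence keys, normalized to the 0-1 range.
    return min(len(all_relevant_sentence_keys) / MAX_SENTENCES, 1.0)

print(relevance(["doc_0_s0", "doc_0_s1", "doc_1_s0"]))  # 0.15
```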
---
### **Metric 2: UTILIZATION (T)**
**Question Being Answered**: "Of the relevant information, how much did the LLM actually use?"
**Code Location**: `advanced_rag_evaluator.py`, Lines 564-576
**Calculation**:
```python
T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)
```
**From GPT Response**:
- Numerator: `all_utilized_sentence_keys` count (e.g., 2)
- Denominator: `all_relevant_sentence_keys` count (e.g., 3)
- Result: 2/3 = **0.67** (67%)
**Interpretation**: The LLM used 67% of the relevant information. It ignored one relevant sentence.
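Sketched in Python, with an added guard for an empty relevant set (function name illustrative):

```python
def utilization(utilized: list[str], relevant: list[str]) -> float:
    # Guard: if nothing is relevant, there is nothing to utilize.
    if not relevant:
        return 0.0
    # Fraction of relevant sentences the response actually used.
    return len(utilized) / len(relevant)

print(round(utilization(["doc_0_s0", "doc_0_s1"],
                        ["doc_0_s0", "doc_0_s1", "doc_1_s0"]), 2))  # 0.67
```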
---
### **Metric 3: COMPLETENESS (C)**
**Question Being Answered**: "Does the response cover all the relevant information?"
**Code Location**: `advanced_rag_evaluator.py`, Lines 577-591
**Calculation**:
```python
C = len(set(relevant) & set(utilized)) / len(relevant)
```
**From GPT Response**:
- Find intersection of:
- `all_relevant_sentence_keys` = {doc_0_s0, doc_0_s1, doc_1_s0}
- `all_utilized_sentence_keys` = {doc_0_s0, doc_0_s1}
- Intersection = {doc_0_s0, doc_0_s1} → 2 items
- Result: 2/3 = **0.67** (67%)
**Interpretation**: The response covers 67% of the relevant information. Missing doc_1_s0.
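The intersection step can be sketched with Python sets (function name illustrative, with a guard for an empty relevant set):

```python
def completeness(relevant: list[str], utilized: list[str]) -> float:
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    # Relevant sentences the response actually drew on.
    covered = relevant_set & set(utilized)
    return len(covered) / len(relevant_set)

print(round(completeness(["doc_0_s0", "doc_0_s1", "doc_1_s0"],
                         ["doc_0_s0", "doc_0_s1"]), 2))  # 0.67
```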
---
### **Metric 4: ADHERENCE (A) - Hallucination Detection**
**Question Being Answered**: "Does the response contain hallucinations? (Are all claims supported by documents?)"
**Code Location**: `advanced_rag_evaluator.py`, Lines 593-609
**Calculation**:
```python
# A is binary: 1.0 only if every response sentence is fully supported
if all(s["fully_supported"] for s in sentence_support_information):
    A = 1.0
else:
    A = 0.0  # at least one hallucination found!
```
**From GPT Response**:
- Check each item in `sentence_support_information`
- Look at the `fully_supported` field
- Example:
```
resp_s0: fully_supported = true ✓
resp_s1: fully_supported = true ✓
```
- All are true → Result: **1.0** (No hallucinations!)
- If any were false:
```
resp_s0: fully_supported = true ✓
resp_s1: fully_supported = false ✗ HALLUCINATION!
```
Result: **0.0** (Contains hallucination)
**Interpretation**: 1.0 = Response is completely grounded in documents. 0.0 = Contains at least one unsupported claim.
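The binary check can be sketched with `all()` (function name illustrative; treating a missing `fully_supported` field as unsupported is an added assumption):

```python
def adherence(sentence_support_information: list[dict]) -> float:
    # 1.0 only if every response sentence is fully supported, else 0.0.
    supported = all(item.get("fully_supported", False)
                    for item in sentence_support_information)
    return 1.0 if supported else 0.0

print(adherence([{"fully_supported": True}, {"fully_supported": True}]))   # 1.0
print(adherence([{"fully_supported": True}, {"fully_supported": False}]))  # 0.0
```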
---
## πŸ“ˆ Real Example: Full Walkthrough
### **Input**:
```
Question: "What is COVID-19?"
Response: "COVID-19 is a respiratory disease. It spreads via droplets."
Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2. The virus spreads through respiratory droplets."
2. "Vaccines help prevent infection."
```
### **Step 1: Sentencize**
```
doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."
resp_s0: "COVID-19 is a respiratory disease."
resp_s1: "It spreads via droplets."
```
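The key assignment in Step 1 could be produced by a naive sentencizer like this (regex-based splitting is an assumption; the actual implementation may use a different tokenizer):

```python
import re

def sentencize(docs: list[str], response: str) -> dict[str, str]:
    """Assign doc_{i}_s{j} / resp_s{j} keys to each sentence (naive split)."""
    keys = {}
    for i, doc in enumerate(docs):
        # Split after sentence-ending punctuation followed by whitespace.
        for j, sent in enumerate(re.split(r"(?<=[.!?])\s+", doc.strip())):
            keys[f"doc_{i}_s{j}"] = sent
    for j, sent in enumerate(re.split(r"(?<=[.!?])\s+", response.strip())):
        keys[f"resp_s{j}"] = sent
    return keys

keys = sentencize(
    ["COVID-19 is a respiratory disease caused by SARS-CoV-2. "
     "The virus spreads through respiratory droplets.",
     "Vaccines help prevent infection."],
    "COVID-19 is a respiratory disease. It spreads via droplets.",
)
print(keys["doc_0_s1"])  # The virus spreads through respiratory droplets.
```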
### **Step 2: Send to GPT Labeling Prompt**
GPT analyzes and returns:
```json
{
"all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
"all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
"sentence_support_information": [
{"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
{"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
]
}
```
### **Step 3: Calculate TRACE Metrics**
**Relevance (R)**:
- Relevant keys: 2 (doc_0_s0, doc_0_s1)
- Formula: 2/20 = **0.10** (10%)
- Meaning: 10% of the documents are relevant
**Utilization (T)**:
- Used: 2, Relevant: 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Used all relevant information
**Completeness (C)**:
- Relevant ∩ Used = 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Response covers all relevant info
**Adherence (A)**:
- All sentences: fully_supported=true?
- YES → **1.0** (No hallucinations!)
**Average Score**:
- (0.10 + 1.00 + 1.00 + 1.0) / 4 = **0.775** (77.5% overall quality)
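The whole walkthrough can be reproduced in a few lines (variable names are illustrative):

```python
# Inputs taken from the GPT labels in Step 2.
relevant = ["doc_0_s0", "doc_0_s1"]
utilized = ["doc_0_s0", "doc_0_s1"]
support = [{"fully_supported": True}, {"fully_supported": True}]

R = min(len(relevant) / 20, 1.0)                        # Relevance
T = len(utilized) / len(relevant) if relevant else 0.0  # Utilization
C = (len(set(relevant) & set(utilized)) / len(relevant)
     if relevant else 0.0)                              # Completeness
A = 1.0 if all(s["fully_supported"] for s in support) else 0.0  # Adherence

average = (R + T + C + A) / 4
print(f"R={R:.2f} T={T:.2f} C={C:.2f} A={A:.1f} avg={average:.3f}")
# R=0.10 T=1.00 C=1.00 A=1.0 avg=0.775
```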
---
## πŸŽ“ Why This Is Better Than Simple Metrics
| Aspect | Simple Keywords | GPT Labeling |
|--------|-----------------|--------------|
| Understanding | ❌ Keyword matching | ✅ Semantic understanding |
| Hallucination Detection | ❌ Can't detect | ✅ Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | ✅ Understands meaning |
| Explainability | ❌ "Just a number" | ✅ Shows exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | ✅ Works across domains |
---
## πŸ”‘ Key Files to Reference
| File | Purpose | Key Lines |
|------|---------|-----------|
| `advanced_rag_evaluator.py` | Main evaluation engine | All calculations |
| `advanced_rag_evaluator.py` | Prompt template | Lines 305-350 |
| `advanced_rag_evaluator.py` | Get GPT response | Lines 470-552 |
| `advanced_rag_evaluator.py` | Calculate R metric | Lines 554-562 |
| `advanced_rag_evaluator.py` | Calculate T metric | Lines 564-576 |
| `advanced_rag_evaluator.py` | Calculate C metric | Lines 577-591 |
| `advanced_rag_evaluator.py` | Calculate A metric | Lines 593-609 |
| `llm_client.py` | Groq API calls | LLM integration |
---
## πŸ’‘ Key Insights
1. **All metrics come from ONE GPT response**: They're consistent and complementary
2. **Sentence keys enable traceability**: Can show exactly which doc supported which claim
3. **Adherence is binary**: Either fully supported (1.0) or not (0.0) - catches all hallucinations
4. **Relevance normalization**: Divided by a fixed cap of 20 sentences so the score stays in the 0-1 range regardless of document length
5. **LLM as Judge**: Semantic understanding without any code-based rule engineering
---
## 🎯 Summary in One Sentence
**GPT analyzes which document sentences support which response sentences, then metrics are calculated from this mapping to assess RAG quality.**
---
## πŸ“š Complete Documentation Available
1. **TRACE_METRICS_QUICK_REFERENCE.md** - Quick lookup
2. **TRACE_METRICS_EXPLANATION.md** - Detailed explanation
3. **TRACE_Metrics_Flow.png** - Visual process flow
4. **Sentence_Mapping_Example.png** - Sentence-level details
5. **RAG_Architecture_Diagram.png** - System overview
6. **RAG_Data_Flow_Diagram.png** - Complete pipeline
7. **RAG_Capstone_Project_Presentation.pptx** - Full presentation
8. **DOCUMENTATION_INDEX.md** - Navigation guide