CapStoneRAG10/docs/HOW_GPT_LABELING_CALCULATES_TRACE_METRICS.md

GPT Labeling Prompt β†’ TRACE Metrics: Complete Explanation ✨

🎯 The Big Picture

Your RAG Capstone Project uses GPT (LLM) to evaluate RAG responses instead of simple keyword matching. Here's how it works:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Query      β”‚
β”‚ + Response   β”‚
β”‚ + Documents  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sentencize (add keys:        β”‚
β”‚ doc_0_s0, resp_s0, etc.)     β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Generate Structured GPT      β”‚
β”‚ Labeling Prompt              β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Call Groq LLM API            β”‚
β”‚ (llm_client.generate)        β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LLM Returns JSON with:       β”‚
β”‚ - relevant_sentence_keys     β”‚
β”‚ - utilized_sentence_keys     β”‚
β”‚ - support_info               β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Extract and Calculate:       β”‚
β”‚ R (Relevance)   = 0.15       β”‚
β”‚ T (Utilization) = 0.67       β”‚
β”‚ C (Completeness)= 0.67       β”‚
β”‚ A (Adherence)   = 1.0        β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Return AdvancedTRACEScores   β”‚
β”‚ with all metrics + metadata  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“‹ What the GPT Prompt Asks

The GPT labeling prompt (in advanced_rag_evaluator.py, line 305) instructs the LLM to:

"You are a Fact-Checking and Citation Specialist"

  1. Identify Relevant Information: Which document sentences are relevant to the question?
  2. Verify Support: Which document sentences support each response sentence?
  3. Check Completeness: Is all important information covered?
  4. Detect Hallucinations: Are there any unsupported claims?
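A hedged sketch of what such a prompt might look like once assembled (the real template lives in advanced_rag_evaluator.py, lines 305-350; the function name and exact wording here are illustrative only):

```python
def build_labeling_prompt(question, doc_sentences, resp_sentences):
    """Assemble a labeling prompt from sentencized inputs (illustrative sketch).

    doc_sentences / resp_sentences map keys like doc_0_s0 / resp_s0 to text.
    """
    docs = "\n".join(f"{k}: {v}" for k, v in doc_sentences.items())
    resp = "\n".join(f"{k}: {v}" for k, v in resp_sentences.items())
    return (
        "You are a Fact-Checking and Citation Specialist.\n"
        f"Question: {question}\n"
        f"Document sentences:\n{docs}\n"
        f"Response sentences:\n{resp}\n"
        "Return JSON with: all_relevant_sentence_keys, "
        "all_utilized_sentence_keys, sentence_support_information, "
        "overall_supported."
    )
```

Listing every sentence with its key is what lets the LLM answer in terms of keys rather than free text, which in turn makes the metrics computable.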

πŸ” What the LLM Returns (JSON)

{
  "relevance_explanation": "doc_0 and doc_1 contain relevant sentences; doc_2 does not",
  
  "all_relevant_sentence_keys": [
    "doc_0_s0",  ← Sentence 0 from document 0
    "doc_0_s1",  ← Sentence 1 from document 0
    "doc_1_s0"   ← Sentence 0 from document 1
  ],
  
  "sentence_support_information": [
    {
      "response_sentence_key": "resp_s0",
      "explanation": "Matches doc_0_s0 about COVID-19",
      "supporting_sentence_keys": ["doc_0_s0"],
      "fully_supported": true  ← βœ“ No hallucination
    },
    {
      "response_sentence_key": "resp_s1",
      "explanation": "Matches doc_0_s1 about droplet spread",
      "supporting_sentence_keys": ["doc_0_s1"],
      "fully_supported": true  ← βœ“ No hallucination
    }
  ],
  
  "all_utilized_sentence_keys": [
    "doc_0_s0",
    "doc_0_s1"
  ],
  
  "overall_supported": true  ← Response is fully grounded
}
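A minimal sketch of pulling out the three fields the metrics need (the defensive defaults are an assumption; the project's actual parsing lives in advanced_rag_evaluator.py, lines 470-552):

```python
import json

def parse_labeling_response(raw: str) -> dict:
    """Extract the fields used by the TRACE calculations, tolerating missing keys."""
    data = json.loads(raw)
    return {
        "relevant": data.get("all_relevant_sentence_keys", []),
        "utilized": data.get("all_utilized_sentence_keys", []),
        "support": data.get("sentence_support_information", []),
    }
```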

πŸ“Š How Each TRACE Metric is Calculated

Metric 1: RELEVANCE (R)

Question Being Answered: "How much of the retrieved documents are relevant to the question?"

Code Location: advanced_rag_evaluator.py, Lines 554-562

Calculation:

R = len(all_relevant_sentence_keys) / 20

From GPT Response:

  • Uses: all_relevant_sentence_keys count
  • Example: ["doc_0_s0", "doc_0_s1", "doc_1_s0"] β†’ 3 keys
  • Divided by 20 (normalized max)
  • Result: 3/20 = 0.15 (15%)

Interpretation: Only 15% of the document context is relevant to the query; the rest is noise.
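As a sketch (the fixed normalizer of 20 comes from the formula above; the cap at 1.0 is an assumption, needed to keep the score in range when more than 20 sentences are relevant):

```python
def relevance_score(relevant_keys) -> float:
    """R = relevant document sentences / 20, assumed capped at 1.0."""
    return min(len(relevant_keys) / 20, 1.0)

relevance_score(["doc_0_s0", "doc_0_s1", "doc_1_s0"])  # 3/20 = 0.15
```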


Metric 2: UTILIZATION (T)

Question Being Answered: "Of the relevant information, how much did the LLM actually use?"

Code Location: advanced_rag_evaluator.py, Lines 564-576

Calculation:

T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)

From GPT Response:

  • Numerator: all_utilized_sentence_keys count (e.g., 2)
  • Denominator: all_relevant_sentence_keys count (e.g., 3)
  • Result: 2/3 = 0.67 (67%)

Interpretation: The LLM used 67% of the relevant information. It ignored one relevant sentence.
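A sketch of the same ratio, with the zero-division guard the real code presumably needs when no sentence is relevant:

```python
def utilization_score(utilized_keys, relevant_keys) -> float:
    """T = utilized relevant sentences / relevant sentences."""
    if not relevant_keys:  # guard: no relevant context was retrieved
        return 0.0
    return len(utilized_keys) / len(relevant_keys)

round(utilization_score(["doc_0_s0", "doc_0_s1"],
                        ["doc_0_s0", "doc_0_s1", "doc_1_s0"]), 2)  # 0.67
```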


Metric 3: COMPLETENESS (C)

Question Being Answered: "Does the response cover all the relevant information?"

Code Location: advanced_rag_evaluator.py, Lines 577-591

Calculation:

C = len(relevant_AND_utilized) / len(relevant)

From GPT Response:

  • Find intersection of:
    • all_relevant_sentence_keys = {doc_0_s0, doc_0_s1, doc_1_s0}
    • all_utilized_sentence_keys = {doc_0_s0, doc_0_s1}
  • Intersection = {doc_0_s0, doc_0_s1} β†’ 2 items
  • Result: 2/3 = 0.67 (67%)

Interpretation: The response covers 67% of the relevant information. Missing doc_1_s0.
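Sketched out, the intersection is the key difference from Utilization: a utilized key that was never judged relevant cannot inflate the score.

```python
def completeness_score(relevant_keys, utilized_keys) -> float:
    """C = |relevant ∩ utilized| / |relevant|."""
    relevant, utilized = set(relevant_keys), set(utilized_keys)
    if not relevant:  # guard: nothing relevant to cover
        return 0.0
    return len(relevant & utilized) / len(relevant)

round(completeness_score({"doc_0_s0", "doc_0_s1", "doc_1_s0"},
                         {"doc_0_s0", "doc_0_s1"}), 2)  # 0.67
```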


Metric 4: ADHERENCE (A) - Hallucination Detection

Question Being Answered: "Does the response contain hallucinations? (Are all claims supported by documents?)"

Code Location: advanced_rag_evaluator.py, Lines 593-609

Calculation:

if ALL response sentences have fully_supported=true:
    A = 1.0
else:
    A = 0.0  (at least one hallucination found!)

From GPT Response:

  • Check each item in sentence_support_information

  • Look at the fully_supported field

  • Example:

    resp_s0: fully_supported = true βœ“
    resp_s1: fully_supported = true βœ“
    
  • All are true β†’ Result: 1.0 (No hallucinations!)

  • If any were false:

    resp_s0: fully_supported = true βœ“
    resp_s1: fully_supported = false βœ— HALLUCINATION!
    

    Result: 0.0 (Contains hallucination)

Interpretation: 1.0 = Response is completely grounded in documents. 0.0 = Contains at least one unsupported claim.
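The all-or-nothing rule above reduces to a single `all(...)` check over the support list:

```python
def adherence_score(sentence_support_information) -> float:
    """A = 1.0 only if every response sentence is fully supported; else 0.0."""
    return 1.0 if all(item["fully_supported"]
                      for item in sentence_support_information) else 0.0

adherence_score([{"fully_supported": True}, {"fully_supported": True}])   # 1.0
adherence_score([{"fully_supported": True}, {"fully_supported": False}])  # 0.0
```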


πŸ“ˆ Real Example: Full Walkthrough

Input:

Question:  "What is COVID-19?"
Response:  "COVID-19 is a respiratory disease. It spreads via droplets."

Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2. The virus spreads through respiratory droplets."
2. "Vaccines help prevent infection."

Step 1: Sentencize

doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."

resp_s0: "COVID-19 is a respiratory disease."
resp_s1: "It spreads via droplets."
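The key scheme above can be sketched with a naive splitter (the real sentencizer may use a proper NLP library; the regex split and function names here are assumptions):

```python
import re

def split_sentences(text):
    """Naive split on sentence-ending punctuation (illustrative only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentencize(documents, response):
    """Build the doc_{i}_s{j} and resp_s{j} key maps fed to the labeling prompt."""
    doc_keys = {f"doc_{d}_s{s}": sent
                for d, doc in enumerate(documents)
                for s, sent in enumerate(split_sentences(doc))}
    resp_keys = {f"resp_s{s}": sent
                 for s, sent in enumerate(split_sentences(response))}
    return doc_keys, resp_keys
```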

Step 2: Send to GPT Labeling Prompt

GPT analyzes and returns:

{
  "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "sentence_support_information": [
    {"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
    {"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
  ]
}

Step 3: Calculate TRACE Metrics

Relevance (R):

  • Relevant keys: 2 (doc_0_s0, doc_0_s1)
  • Formula: 2/20 = 0.10 (10%)
  • Meaning: 10% of the documents are relevant

Utilization (T):

  • Used: 2, Relevant: 2
  • Formula: 2/2 = 1.00 (100%)
  • Meaning: Used all relevant information

Completeness (C):

  • Relevant ∩ Used = 2
  • Formula: 2/2 = 1.00 (100%)
  • Meaning: Response covers all relevant info

Adherence (A):

  • All sentences: fully_supported=true?
  • YES β†’ 1.0 (No hallucinations!)

Average Score:

  • (0.10 + 1.00 + 1.00 + 1.00) / 4 = 0.775 (77.5% overall quality)
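Putting the Step 3 numbers together (counts taken from the walkthrough above; this is plain arithmetic, not the project's actual code):

```python
# Counts from the walkthrough: 2 relevant document sentences, both utilized,
# an overlap of 2, and every response sentence fully supported.
n_relevant, n_utilized, n_overlap = 2, 2, 2
fully_supported = [True, True]

R = n_relevant / 20                       # Relevance:    0.10
T = n_utilized / n_relevant               # Utilization:  1.00
C = n_overlap / n_relevant                # Completeness: 1.00
A = 1.0 if all(fully_supported) else 0.0  # Adherence:    1.0
average = (R + T + C + A) / 4             # 0.775
```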

πŸŽ“ Why This Is Better Than Simple Metrics

| Aspect | Simple Keywords | GPT Labeling |
|--------|-----------------|--------------|
| Understanding | ❌ Keyword matching | βœ… Semantic understanding |
| Hallucination Detection | ❌ Cannot detect | βœ… Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | βœ… Understands meaning |
| Explainability | ❌ "Just a number" | βœ… Shows exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | βœ… Generalizes across domains |

πŸ”‘ Key Files to Reference

| File | Purpose | Key Lines |
|------|---------|-----------|
| advanced_rag_evaluator.py | Main evaluation engine | All calculations |
| advanced_rag_evaluator.py | Prompt template | Lines 305-350 |
| advanced_rag_evaluator.py | Get GPT response | Lines 470-552 |
| advanced_rag_evaluator.py | Calculate R metric | Lines 554-562 |
| advanced_rag_evaluator.py | Calculate T metric | Lines 564-576 |
| advanced_rag_evaluator.py | Calculate C metric | Lines 577-591 |
| advanced_rag_evaluator.py | Calculate A metric | Lines 593-609 |
| llm_client.py | Groq API calls | LLM integration |

πŸ’‘ Key Insights

  1. All metrics come from ONE GPT response: They're consistent and complementary
  2. Sentence keys enable traceability: Can show exactly which doc supported which claim
  3. Adherence is binary: either fully supported (1.0) or not (0.0); a single unsupported claim zeroes the score
  4. Relevance normalization: Divided by 20 to ensure 0-1 range regardless of doc length
  5. LLM as Judge: Semantic understanding without any code-based rule engineering

🎯 Summary in One Sentence

GPT analyzes which document sentences support which response sentences, then metrics are calculated from this mapping to assess RAG quality.


πŸ“š Complete Documentation Available

  1. TRACE_METRICS_QUICK_REFERENCE.md - Quick lookup
  2. TRACE_METRICS_EXPLANATION.md - Detailed explanation
  3. TRACE_Metrics_Flow.png - Visual process flow
  4. Sentence_Mapping_Example.png - Sentence-level details
  5. RAG_Architecture_Diagram.png - System overview
  6. RAG_Data_Flow_Diagram.png - Complete pipeline
  7. RAG_Capstone_Project_Presentation.pptx - Full presentation
  8. DOCUMENTATION_INDEX.md - Navigation guide