CapStoneRAG10/docs/HOW_GPT_LABELING_CALCULATES_TRACE_METRICS.md

GPT Labeling Prompt β†’ TRACE Metrics: Complete Explanation ✨

🎯 The Big Picture

Your RAG Capstone Project uses GPT (LLM) to evaluate RAG responses instead of simple keyword matching. Here's how it works:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Query      β”‚
β”‚ + Response   β”‚
β”‚ + Documents  β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Sentencize (add keys:        β”‚
β”‚ doc_0_s0, resp_s0, etc.)     β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Generate Structured GPT      β”‚
β”‚ Labeling Prompt              β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Call Groq LLM API            β”‚
β”‚ (llm_client.generate)        β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ LLM Returns JSON with:       β”‚
β”‚ - relevant_sentence_keys     β”‚
β”‚ - utilized_sentence_keys     β”‚
β”‚ - support_info               β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Extract and Calculate:       β”‚
β”‚ R (Relevance)   = 0.15       β”‚
β”‚ T (Utilization) = 0.67       β”‚
β”‚ C (Completeness)= 0.67       β”‚
β”‚ A (Adherence)   = 1.0        β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Return AdvancedTRACEScores   β”‚
β”‚ with all metrics + metadata  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“‹ What the GPT Prompt Asks

The GPT labeling prompt (in advanced_rag_evaluator.py, line 305) instructs the LLM to:

"You are a Fact-Checking and Citation Specialist"

  1. Identify Relevant Information: Which document sentences are relevant to the question?
  2. Verify Support: Which document sentences support each response sentence?
  3. Check Completeness: Is all important information covered?
  4. Detect Hallucinations: Are there any unsupported claims?
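A hedged sketch of what such a prompt might look like once assembled (the real template lives in advanced_rag_evaluator.py, lines 305-350; the function name and exact wording here are illustrative only):

```python
def build_labeling_prompt(question, doc_sentences, resp_sentences):
    """Assemble a labeling prompt from sentencized inputs (illustrative sketch).

    doc_sentences / resp_sentences map keys like doc_0_s0 / resp_s0 to text.
    """
    docs = "\n".join(f"{k}: {v}" for k, v in doc_sentences.items())
    resp = "\n".join(f"{k}: {v}" for k, v in resp_sentences.items())
    return (
        "You are a Fact-Checking and Citation Specialist.\n"
        f"Question: {question}\n"
        f"Document sentences:\n{docs}\n"
        f"Response sentences:\n{resp}\n"
        "Return JSON with: all_relevant_sentence_keys, "
        "all_utilized_sentence_keys, sentence_support_information, "
        "overall_supported."
    )
```

Listing every sentence with its key is what lets the LLM answer in terms of keys rather than free text, which in turn makes the metrics computable.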

πŸ” What the LLM Returns (JSON)

{
  "relevance_explanation": "doc_0 and doc_1 contain relevant sentences; doc_2 does not",
  
  "all_relevant_sentence_keys": [
    "doc_0_s0",  ← Sentence 0 from document 0
    "doc_0_s1",  ← Sentence 1 from document 0
    "doc_1_s0"   ← Sentence 0 from document 1
  ],
  
  "sentence_support_information": [
    {
      "response_sentence_key": "resp_s0",
      "explanation": "Matches doc_0_s0 about COVID-19",
      "supporting_sentence_keys": ["doc_0_s0"],
      "fully_supported": true  ← βœ“ No hallucination
    },
    {
      "response_sentence_key": "resp_s1",
      "explanation": "Matches doc_0_s1 about droplet spread",
      "supporting_sentence_keys": ["doc_0_s1"],
      "fully_supported": true  ← βœ“ No hallucination
    }
  ],
  
  "all_utilized_sentence_keys": [
    "doc_0_s0",
    "doc_0_s1"
  ],
  
  "overall_supported": true  ← Response is fully grounded
}
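A minimal sketch of pulling out the three fields the metrics need (the defensive defaults are an assumption; the project's actual parsing lives in advanced_rag_evaluator.py, lines 470-552):

```python
import json

def parse_labeling_response(raw: str) -> dict:
    """Extract the fields used by the TRACE calculations, tolerating missing keys."""
    data = json.loads(raw)
    return {
        "relevant": data.get("all_relevant_sentence_keys", []),
        "utilized": data.get("all_utilized_sentence_keys", []),
        "support": data.get("sentence_support_information", []),
    }
```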

πŸ“Š How Each TRACE Metric is Calculated

Metric 1: RELEVANCE (R)

Question Being Answered: "How much of the retrieved documents are relevant to the question?"

Code Location: advanced_rag_evaluator.py, Lines 554-562

Calculation:

R = len(all_relevant_sentence_keys) / 20

From GPT Response:

  • Uses: all_relevant_sentence_keys count
  • Example: ["doc_0_s0", "doc_0_s1", "doc_1_s0"] β†’ 3 keys
  • Divided by 20 (normalized max)
  • Result: 3/20 = 0.15 (15%)

Interpretation: Only 15% of the document context is relevant to the query; the rest is noise.
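As a sketch (the fixed normalizer of 20 comes from the formula above; the cap at 1.0 is an assumption, needed to keep the score in range when more than 20 sentences are relevant):

```python
def relevance_score(relevant_keys) -> float:
    """R = relevant document sentences / 20, assumed capped at 1.0."""
    return min(len(relevant_keys) / 20, 1.0)

relevance_score(["doc_0_s0", "doc_0_s1", "doc_1_s0"])  # 3/20 = 0.15
```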


Metric 2: UTILIZATION (T)

Question Being Answered: "Of the relevant information, how much did the LLM actually use?"

Code Location: advanced_rag_evaluator.py, Lines 564-576

Calculation:

T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)

From GPT Response:

  • Numerator: all_utilized_sentence_keys count (e.g., 2)
  • Denominator: all_relevant_sentence_keys count (e.g., 3)
  • Result: 2/3 = 0.67 (67%)

Interpretation: The LLM used 67% of the relevant information. It ignored one relevant sentence.
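A sketch of the same ratio, with the zero-division guard the real code presumably needs when no sentence is relevant:

```python
def utilization_score(utilized_keys, relevant_keys) -> float:
    """T = utilized relevant sentences / relevant sentences."""
    if not relevant_keys:  # guard: no relevant context was retrieved
        return 0.0
    return len(utilized_keys) / len(relevant_keys)

round(utilization_score(["doc_0_s0", "doc_0_s1"],
                        ["doc_0_s0", "doc_0_s1", "doc_1_s0"]), 2)  # 0.67
```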


Metric 3: COMPLETENESS (C)

Question Being Answered: "Does the response cover all the relevant information?"

Code Location: advanced_rag_evaluator.py, Lines 577-591

Calculation:

C = len(relevant_AND_utilized) / len(relevant)

From GPT Response:

  • Find intersection of:
    • all_relevant_sentence_keys = {doc_0_s0, doc_0_s1, doc_1_s0}
    • all_utilized_sentence_keys = {doc_0_s0, doc_0_s1}
  • Intersection = {doc_0_s0, doc_0_s1} β†’ 2 items
  • Result: 2/3 = 0.67 (67%)

Interpretation: The response covers 67% of the relevant information. Missing doc_1_s0.
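Sketched out, the intersection is the key difference from Utilization: a utilized key that was never judged relevant cannot inflate the score.

```python
def completeness_score(relevant_keys, utilized_keys) -> float:
    """C = |relevant ∩ utilized| / |relevant|."""
    relevant, utilized = set(relevant_keys), set(utilized_keys)
    if not relevant:  # guard: nothing relevant to cover
        return 0.0
    return len(relevant & utilized) / len(relevant)

round(completeness_score({"doc_0_s0", "doc_0_s1", "doc_1_s0"},
                         {"doc_0_s0", "doc_0_s1"}), 2)  # 0.67
```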


Metric 4: ADHERENCE (A) - Hallucination Detection

Question Being Answered: "Does the response contain hallucinations? (Are all claims supported by documents?)"

Code Location: advanced_rag_evaluator.py, Lines 593-609

Calculation:

if ALL response sentences have fully_supported=true:
    A = 1.0
else:
    A = 0.0  (at least one hallucination found!)

From GPT Response:

  • Check each item in sentence_support_information

  • Look at the fully_supported field

  • Example:

    resp_s0: fully_supported = true βœ“
    resp_s1: fully_supported = true βœ“
    
  • All are true β†’ Result: 1.0 (No hallucinations!)

  • If any were false:

    resp_s0: fully_supported = true βœ“
    resp_s1: fully_supported = false βœ— HALLUCINATION!
    

    Result: 0.0 (Contains hallucination)

Interpretation: 1.0 = Response is completely grounded in documents. 0.0 = Contains at least one unsupported claim.
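The all-or-nothing rule above reduces to a single `all(...)` check over the support list:

```python
def adherence_score(sentence_support_information) -> float:
    """A = 1.0 only if every response sentence is fully supported; else 0.0."""
    return 1.0 if all(item["fully_supported"]
                      for item in sentence_support_information) else 0.0

adherence_score([{"fully_supported": True}, {"fully_supported": True}])   # 1.0
adherence_score([{"fully_supported": True}, {"fully_supported": False}])  # 0.0
```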


πŸ“ˆ Real Example: Full Walkthrough

Input:

Question:  "What is COVID-19?"
Response:  "COVID-19 is a respiratory disease. It spreads via droplets."

Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2. The virus spreads through respiratory droplets."
2. "Vaccines help prevent infection."

Step 1: Sentencize

doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."

resp_s0: "COVID-19 is a respiratory disease."
resp_s1: "It spreads via droplets."
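The key scheme above can be sketched with a naive splitter (the real sentencizer may use a proper NLP library; the regex split and function names here are assumptions):

```python
import re

def split_sentences(text):
    """Naive split on sentence-ending punctuation (illustrative only)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentencize(documents, response):
    """Build the doc_{i}_s{j} and resp_s{j} key maps fed to the labeling prompt."""
    doc_keys = {f"doc_{d}_s{s}": sent
                for d, doc in enumerate(documents)
                for s, sent in enumerate(split_sentences(doc))}
    resp_keys = {f"resp_s{s}": sent
                 for s, sent in enumerate(split_sentences(response))}
    return doc_keys, resp_keys
```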

Step 2: Send to GPT Labeling Prompt

GPT analyzes and returns:

{
  "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "sentence_support_information": [
    {"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
    {"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
  ]
}

Step 3: Calculate TRACE Metrics

Relevance (R):

  • Relevant keys: 2 (doc_0_s0, doc_0_s1)
  • Formula: 2/20 = 0.10 (10%)
  • Meaning: 10% of the documents are relevant

Utilization (T):

  • Used: 2, Relevant: 2
  • Formula: 2/2 = 1.00 (100%)
  • Meaning: Used all relevant information

Completeness (C):

  • Relevant ∩ Used = 2
  • Formula: 2/2 = 1.00 (100%)
  • Meaning: Response covers all relevant info

Adherence (A):

  • All sentences: fully_supported=true?
  • YES β†’ 1.0 (No hallucinations!)

Average Score:

  • (0.10 + 1.00 + 1.00 + 1.00) / 4 = 0.775 (77.5% overall quality)
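Putting the Step 3 numbers together (counts taken from the walkthrough above; this is plain arithmetic, not the project's actual code):

```python
# Counts from the walkthrough: 2 relevant document sentences, both utilized,
# an overlap of 2, and every response sentence fully supported.
n_relevant, n_utilized, n_overlap = 2, 2, 2
fully_supported = [True, True]

R = n_relevant / 20                       # Relevance:    0.10
T = n_utilized / n_relevant               # Utilization:  1.00
C = n_overlap / n_relevant                # Completeness: 1.00
A = 1.0 if all(fully_supported) else 0.0  # Adherence:    1.0
average = (R + T + C + A) / 4             # 0.775
```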

πŸŽ“ Why This Is Better Than Simple Metrics

| Aspect | Simple Keywords | GPT Labeling |
|--------|-----------------|--------------|
| Understanding | ❌ Keyword matching | βœ… Semantic understanding |
| Hallucination Detection | ❌ Cannot detect | βœ… Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | βœ… Understands meaning |
| Explainability | ❌ "Just a number" | βœ… Shows exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | βœ… Generalizes across domains |

πŸ”‘ Key Files to Reference

| File | Purpose | Key Lines |
|------|---------|-----------|
| advanced_rag_evaluator.py | Main evaluation engine | All calculations |
| advanced_rag_evaluator.py | Prompt template | Lines 305-350 |
| advanced_rag_evaluator.py | Get GPT response | Lines 470-552 |
| advanced_rag_evaluator.py | Calculate R metric | Lines 554-562 |
| advanced_rag_evaluator.py | Calculate T metric | Lines 564-576 |
| advanced_rag_evaluator.py | Calculate C metric | Lines 577-591 |
| advanced_rag_evaluator.py | Calculate A metric | Lines 593-609 |
| llm_client.py | Groq API calls | LLM integration |

πŸ’‘ Key Insights

  1. All metrics come from ONE GPT response: They're consistent and complementary
  2. Sentence keys enable traceability: Can show exactly which doc supported which claim
  3. Adherence is binary: either fully supported (1.0) or not (0.0); a single unsupported claim zeroes the score
  4. Relevance normalization: Divided by 20 to ensure 0-1 range regardless of doc length
  5. LLM as Judge: Semantic understanding without any code-based rule engineering

🎯 Summary in One Sentence

GPT analyzes which document sentences support which response sentences, then metrics are calculated from this mapping to assess RAG quality.


πŸ“š Complete Documentation Available

  1. TRACE_METRICS_QUICK_REFERENCE.md - Quick lookup
  2. TRACE_METRICS_EXPLANATION.md - Detailed explanation
  3. TRACE_Metrics_Flow.png - Visual process flow
  4. Sentence_Mapping_Example.png - Sentence-level details
  5. RAG_Architecture_Diagram.png - System overview
  6. RAG_Data_Flow_Diagram.png - Complete pipeline
  7. RAG_Capstone_Project_Presentation.pptx - Full presentation
  8. DOCUMENTATION_INDEX.md - Navigation guide