# GPT Labeling Prompt → TRACE Metrics: Complete Explanation ✨

## 🎯 The Big Picture

Your RAG Capstone Project uses **GPT (LLM) to evaluate RAG responses** instead of simple keyword matching. Here's how it works:

```
┌──────────────┐
│ Query        │
│ + Response   │
│ + Documents  │
└──────┬───────┘
       │
       ▼
┌──────────────────────────────┐
│ Sentencize (add keys:        │
│ doc_0_s0, resp_s0, etc.)     │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Generate structured GPT      │
│ labeling prompt              │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Call Groq LLM API            │
│ (llm_client.generate)        │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ LLM returns JSON with:       │
│ - relevant_sentence_keys     │
│ - utilized_sentence_keys     │
│ - support_info               │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Extract and calculate:       │
│ R (Relevance)    = 0.15      │
│ T (Utilization)  = 0.67      │
│ C (Completeness) = 0.67      │
│ A (Adherence)    = 1.0       │
└──────┬───────────────────────┘
       │
       ▼
┌──────────────────────────────┐
│ Return AdvancedTRACEScores   │
│ with all metrics + metadata  │
└──────────────────────────────┘
```

---

## 📋 What the GPT Prompt Asks

The GPT labeling prompt (in `advanced_rag_evaluator.py`, line 305) instructs the LLM:

**"You are a Fact-Checking and Citation Specialist"**

1. **Identify Relevant Information**: Which document sentences are relevant to the question?
2. **Verify Support**: Which document sentences support each response sentence?
3. **Check Completeness**: Is all important information covered?
4. **Detect Hallucinations**: Are there any unsupported claims?
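Before the prompt can ask these questions, every sentence needs a stable key. The real sentencizing code lives in `advanced_rag_evaluator.py`; the minimal sketch below only illustrates the `doc_<i>_s<j>` / `resp_s<j>` keying scheme, assuming a simple punctuation-based split rather than the project's actual sentence tokenizer:

```python
import re

def sentencize(documents, response):
    """Assign stable keys (doc_<i>_s<j>, resp_s<j>) to every sentence.

    Illustrative sketch: splits on sentence-ending punctuation, which is
    cruder than a real sentence tokenizer but demonstrates the key format.
    """
    keyed = {}
    for i, doc in enumerate(documents):
        for j, sent in enumerate(re.split(r"(?<=[.!?])\s+", doc.strip())):
            keyed[f"doc_{i}_s{j}"] = sent
    for j, sent in enumerate(re.split(r"(?<=[.!?])\s+", response.strip())):
        keyed[f"resp_s{j}"] = sent
    return keyed

keys = sentencize(
    ["COVID-19 is a respiratory disease. The virus spreads through droplets."],
    "COVID-19 is a respiratory disease. It spreads via droplets.",
)
# keys maps doc_0_s0, doc_0_s1, resp_s0, resp_s1 to sentence text
```

These keys are what the LLM echoes back in its JSON labels, which is what makes every metric traceable to specific sentences.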
---

## 🔍 What the LLM Returns (JSON)

```json
{
  "relevance_explanation": "Documents 1-2 are relevant, document 3 is not",
  "all_relevant_sentence_keys": [
    "doc_0_s0",   ← Sentence 0 from document 0
    "doc_0_s1",   ← Sentence 1 from document 0
    "doc_1_s0"    ← Sentence 0 from document 1
  ],
  "sentence_support_information": [
    {
      "response_sentence_key": "resp_s0",
      "explanation": "Matches doc_0_s0 about COVID-19",
      "supporting_sentence_keys": ["doc_0_s0"],
      "fully_supported": true   ← ✓ No hallucination
    },
    {
      "response_sentence_key": "resp_s1",
      "explanation": "Matches doc_0_s1 about droplet spread",
      "supporting_sentence_keys": ["doc_0_s1"],
      "fully_supported": true   ← ✓ No hallucination
    }
  ],
  "all_utilized_sentence_keys": [
    "doc_0_s0",
    "doc_0_s1"
  ],
  "overall_supported": true   ← Response is fully grounded
}
```

(The `←` annotations are explanatory only; they are not part of the actual JSON.)

---

## 📊 How Each TRACE Metric Is Calculated

### **Metric 1: RELEVANCE (R)**

**Question Being Answered**: "How much of the retrieved document content is relevant to the question?"

**Code Location**: `advanced_rag_evaluator.py`, lines 554-562

**Calculation**:

```python
R = len(all_relevant_sentence_keys) / 20
```

**From GPT Response**:

- Uses: `all_relevant_sentence_keys` count
- Example: `["doc_0_s0", "doc_0_s1", "doc_1_s0"]` → 3 keys
- Divided by 20 (normalization cap)
- Result: 3/20 = **0.15** (15%)

**Interpretation**: Only 15% of the document context is relevant to the query; the rest is noise.

---

### **Metric 2: UTILIZATION (T)**

**Question Being Answered**: "Of the relevant information, how much did the LLM actually use?"

**Code Location**: `advanced_rag_evaluator.py`, lines 564-576

**Calculation**:

```python
T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)
```

**From GPT Response**:

- Numerator: `all_utilized_sentence_keys` count (e.g., 2)
- Denominator: `all_relevant_sentence_keys` count (e.g., 3)
- Result: 2/3 = **0.67** (67%)

**Interpretation**: The LLM used 67% of the relevant information; it ignored one relevant sentence.
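Both of these ratios fall straight out of the parsed JSON. A sketch under the formulas above (`compute_r_t` is an illustrative name, not the project's actual API; the divide-by-20 cap follows the stated normalization):

```python
def compute_r_t(labels, max_sentences=20):
    """Relevance = relevant count / normalization cap;
    Utilization = utilized count / relevant count."""
    relevant = labels.get("all_relevant_sentence_keys", [])
    utilized = labels.get("all_utilized_sentence_keys", [])
    r = len(relevant) / max_sentences
    # Guard against division by zero when nothing is relevant
    t = len(utilized) / len(relevant) if relevant else 0.0
    return r, t

labels = {
    "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1", "doc_1_s0"],
    "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
}
r, t = compute_r_t(labels)  # r = 0.15, t ≈ 0.67
```

Note the zero-relevant guard: if the judge marks no sentence relevant, a naive implementation would divide by zero when computing T.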
---

### **Metric 3: COMPLETENESS (C)**

**Question Being Answered**: "Does the response cover all the relevant information?"

**Code Location**: `advanced_rag_evaluator.py`, lines 577-591

**Calculation**:

```python
relevant_and_utilized = set(all_relevant_sentence_keys) & set(all_utilized_sentence_keys)
C = len(relevant_and_utilized) / len(all_relevant_sentence_keys)
```

**From GPT Response**:

- Find the intersection of:
  - `all_relevant_sentence_keys` = {doc_0_s0, doc_0_s1, doc_1_s0}
  - `all_utilized_sentence_keys` = {doc_0_s0, doc_0_s1}
- Intersection = {doc_0_s0, doc_0_s1} → 2 items
- Result: 2/3 = **0.67** (67%)

**Interpretation**: The response covers 67% of the relevant information; doc_1_s0 is missing.

---

### **Metric 4: ADHERENCE (A) - Hallucination Detection**

**Question Being Answered**: "Does the response contain hallucinations? (Are all claims supported by the documents?)"

**Code Location**: `advanced_rag_evaluator.py`, lines 593-609

**Calculation**:

```python
# A = 1.0 only if every response sentence is fully supported
A = 1.0 if all(s["fully_supported"] for s in sentence_support_information) else 0.0
```

**From GPT Response**:

- Check each item in `sentence_support_information`
- Look at the `fully_supported` field
- Example:
  ```
  resp_s0: fully_supported = true ✓
  resp_s1: fully_supported = true ✓
  ```
- All are true → Result: **1.0** (no hallucinations!)
- If any were false:
  ```
  resp_s0: fully_supported = true  ✓
  resp_s1: fully_supported = false ✗ HALLUCINATION!
  ```
  Result: **0.0** (contains a hallucination)

**Interpretation**: 1.0 = the response is completely grounded in the documents; 0.0 = it contains at least one unsupported claim.

---

## 📈 Real Example: Full Walkthrough

### **Input**:

```
Question: "What is COVID-19?"
Response: "COVID-19 is a respiratory disease. It spreads via droplets."
Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2."
2. "The virus spreads through respiratory droplets."
3. "Vaccines help prevent infection."
```

### **Step 1: Sentencize**

```
doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."
resp_s0: "COVID-19 is a respiratory disease."
resp_s1: "It spreads via droplets."
```

### **Step 2: Send to GPT Labeling Prompt**

GPT analyzes and returns:

```json
{
  "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "sentence_support_information": [
    {"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
    {"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
  ]
}
```

### **Step 3: Calculate TRACE Metrics**

**Relevance (R)**:
- Relevant keys: 2 (doc_0_s0, doc_0_s1)
- Formula: 2/20 = **0.10** (10%)
- Meaning: 10% of the document context is relevant

**Utilization (T)**:
- Used: 2, Relevant: 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Used all relevant information

**Completeness (C)**:
- Relevant ∩ Used = 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: Response covers all relevant info

**Adherence (A)**:
- All sentences fully_supported=true?
- YES → **1.0** (No hallucinations!)
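The whole walkthrough can be reproduced in a few lines. This sketch computes all four metrics from the labeled JSON above (`compute_trace` is an illustrative name, not the project's actual function; the /20 cap follows the Relevance formula described earlier):

```python
def compute_trace(labels, max_sentences=20):
    """Compute R, T, C, A from one GPT labeling response (illustrative)."""
    relevant = set(labels["all_relevant_sentence_keys"])
    utilized = set(labels["all_utilized_sentence_keys"])
    support = labels["sentence_support_information"]
    r = len(relevant) / max_sentences                           # Relevance
    t = len(utilized) / len(relevant) if relevant else 0.0      # Utilization
    c = len(relevant & utilized) / len(relevant) if relevant else 0.0  # Completeness
    a = 1.0 if all(s["fully_supported"] for s in support) else 0.0     # Adherence
    return {"R": r, "T": t, "C": c, "A": a}

labels = {
    "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
    "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
    "sentence_support_information": [
        {"response_sentence_key": "resp_s0", "fully_supported": True,
         "supporting_sentence_keys": ["doc_0_s0"]},
        {"response_sentence_key": "resp_s1", "fully_supported": True,
         "supporting_sentence_keys": ["doc_0_s1"]},
    ],
}
scores = compute_trace(labels)
# scores == {"R": 0.1, "T": 1.0, "C": 1.0, "A": 1.0}
```

Flipping either `fully_supported` flag to `False` drops A to 0.0 while leaving R, T, and C unchanged, which is exactly the binary hallucination behavior described above.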
**Average Score**:

- (0.10 + 1.00 + 1.00 + 1.0) / 4 = **0.775** (77.5% overall quality)

---

## 🎓 Why This Is Better Than Simple Metrics

| Aspect | Simple Keywords | GPT Labeling |
|--------|-----------------|--------------|
| Understanding | ❌ Keyword matching | ✅ Semantic understanding |
| Hallucination Detection | ❌ Can't detect | ✅ Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | ✅ Understands meaning |
| Explainability | ❌ "Just a number" | ✅ Shows exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | ✅ Works across domains |

---

## 🔑 Key Files to Reference

| File | Purpose | Key Lines |
|------|---------|-----------|
| `advanced_rag_evaluator.py` | Main evaluation engine | All calculations |
| `advanced_rag_evaluator.py` | Prompt template | Lines 305-350 |
| `advanced_rag_evaluator.py` | Get GPT response | Lines 470-552 |
| `advanced_rag_evaluator.py` | Calculate R metric | Lines 554-562 |
| `advanced_rag_evaluator.py` | Calculate T metric | Lines 564-576 |
| `advanced_rag_evaluator.py` | Calculate C metric | Lines 577-591 |
| `advanced_rag_evaluator.py` | Calculate A metric | Lines 593-609 |
| `llm_client.py` | Groq API calls | LLM integration |

---

## 💡 Key Insights

1. **All metrics come from ONE GPT response**: they are consistent and complementary
2. **Sentence keys enable traceability**: you can show exactly which document sentence supported which claim
3. **Adherence is binary**: either fully supported (1.0) or not (0.0), so a single unsupported claim zeroes the score
4. **Relevance normalization**: dividing by 20 keeps R in the 0-1 range regardless of document length
5. **LLM as judge**: semantic understanding without hand-written rule engineering

---

## 🎯 Summary in One Sentence

**GPT labels which document sentences support which response sentences, and the TRACE metrics are calculated from this mapping to assess RAG quality.**

---

## 📚 Complete Documentation Available

1. **TRACE_METRICS_QUICK_REFERENCE.md** - Quick lookup
2. **TRACE_METRICS_EXPLANATION.md** - Detailed explanation
3. **TRACE_Metrics_Flow.png** - Visual process flow
4. **Sentence_Mapping_Example.png** - Sentence-level details
5. **RAG_Architecture_Diagram.png** - System overview
6. **RAG_Data_Flow_Diagram.png** - Complete pipeline
7. **RAG_Capstone_Project_Presentation.pptx** - Full presentation
8. **DOCUMENTATION_INDEX.md** - Navigation guide