# GPT Labeling Prompt → TRACE Metrics: Complete Explanation

## The Big Picture

Your RAG Capstone Project uses **GPT (an LLM) to evaluate RAG responses** instead of simple keyword matching. Here's how it works:

```
┌────────────────┐
│ Query          │
│ + Response     │
│ + Documents    │
└───────┬────────┘
        │
        ▼
┌──────────────────────────────┐
│ Sentencize (add keys:        │
│ doc_0_s0, resp_s0, etc.)     │
└───────┬──────────────────────┘
        │
        ▼
┌──────────────────────────────┐
│ Generate structured GPT      │
│ labeling prompt              │
└───────┬──────────────────────┘
        │
        ▼
┌──────────────────────────────┐
│ Call Groq LLM API            │
│ (llm_client.generate)        │
└───────┬──────────────────────┘
        │
        ▼
┌──────────────────────────────┐
│ LLM returns JSON with:       │
│ - relevant_sentence_keys     │
│ - utilized_sentence_keys     │
│ - support_info               │
└───────┬──────────────────────┘
        │
        ▼
┌──────────────────────────────┐
│ Extract and calculate:       │
│ R (Relevance)    = 0.15      │
│ T (Utilization)  = 0.67      │
│ C (Completeness) = 0.67      │
│ A (Adherence)    = 1.0       │
└───────┬──────────────────────┘
        │
        ▼
┌──────────────────────────────┐
│ Return AdvancedTRACEScores   │
│ with all metrics + metadata  │
└──────────────────────────────┘
```
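The final step returns an `AdvancedTRACEScores` object. A minimal sketch of what such a container might look like (field names other than the four metrics are assumptions for illustration, not the project's actual class definition):

```python
from dataclasses import dataclass, field

@dataclass
class AdvancedTRACEScores:
    """Hypothetical sketch of the evaluator's result container."""
    relevance: float       # R: fraction of the context judged relevant
    utilization: float     # T: fraction of relevant info the response used
    completeness: float    # C: coverage of the relevant info
    adherence: float       # A: 1.0 if fully grounded, else 0.0
    metadata: dict = field(default_factory=dict)

    def average(self) -> float:
        # Simple unweighted mean of the four TRACE metrics
        return (self.relevance + self.utilization
                + self.completeness + self.adherence) / 4
```
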
---

## What the GPT Prompt Asks

The GPT labeling prompt (in `advanced_rag_evaluator.py`, line 305) instructs the LLM to act as follows:

**"You are a Fact-Checking and Citation Specialist"**

1. **Identify Relevant Information**: Which document sentences are relevant to the question?
2. **Verify Support**: Which document sentences support each response sentence?
3. **Check Completeness**: Is all important information covered?
4. **Detect Hallucinations**: Are there any unsupported claims?
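A prompt with these four instructions might be assembled like this. This is a reconstruction for illustration only; the actual template at line 305 of `advanced_rag_evaluator.py` differs in wording, and `build_labeling_prompt` is a hypothetical helper name:

```python
def build_labeling_prompt(question: str,
                          doc_sentences: dict,
                          resp_sentences: dict) -> str:
    """Assemble a labeling prompt from keyed sentences (illustrative sketch)."""
    docs = "\n".join(f"{k}: {s}" for k, s in doc_sentences.items())
    resp = "\n".join(f"{k}: {s}" for k, s in resp_sentences.items())
    return (
        "You are a Fact-Checking and Citation Specialist.\n\n"
        f"Question: {question}\n\n"
        f"Document sentences:\n{docs}\n\n"
        f"Response sentences:\n{resp}\n\n"
        "Return JSON with keys: relevance_explanation, "
        "all_relevant_sentence_keys, sentence_support_information, "
        "all_utilized_sentence_keys, overall_supported."
    )
```
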
---

## What the LLM Returns (JSON)

```json
{
  "relevance_explanation": "Documents 1-2 are relevant, document 3 is not",
  "all_relevant_sentence_keys": [
    "doc_0_s0",   ← Sentence 0 from document 0
    "doc_0_s1",   ← Sentence 1 from document 0
    "doc_1_s0"    ← Sentence 0 from document 1
  ],
  "sentence_support_information": [
    {
      "response_sentence_key": "resp_s0",
      "explanation": "Matches doc_0_s0 about COVID-19",
      "supporting_sentence_keys": ["doc_0_s0"],
      "fully_supported": true   ← ✅ No hallucination
    },
    {
      "response_sentence_key": "resp_s1",
      "explanation": "Matches doc_0_s1 about droplet spread",
      "supporting_sentence_keys": ["doc_0_s1"],
      "fully_supported": true   ← ✅ No hallucination
    }
  ],
  "all_utilized_sentence_keys": [
    "doc_0_s0",
    "doc_0_s1"
  ],
  "overall_supported": true   ← Response is fully grounded
}
```
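Before any metric can be computed, this JSON has to be parsed out of the raw LLM reply. A defensive sketch (the project's own parser may differ; `parse_labels` is an illustrative name):

```python
import json

def parse_labels(raw: str) -> dict:
    """Parse the LLM's JSON labels, tolerating missing keys."""
    data = json.loads(raw)
    return {
        "relevant": set(data.get("all_relevant_sentence_keys", [])),
        "utilized": set(data.get("all_utilized_sentence_keys", [])),
        "support": data.get("sentence_support_information", []),
        "overall_supported": bool(data.get("overall_supported", False)),
    }
```

Using sets for the key lists makes the intersection needed by Completeness trivial later on.
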
---

## How Each TRACE Metric Is Calculated

### **Metric 1: RELEVANCE (R)**

**Question Being Answered**: "How much of the retrieved document context is relevant to the question?"

**Code Location**: `advanced_rag_evaluator.py`, lines 554-562

**Calculation**:
```python
R = len(all_relevant_sentence_keys) / 20
```

**From GPT Response**:
- Uses: the count of `all_relevant_sentence_keys`
- Example: `["doc_0_s0", "doc_0_s1", "doc_1_s0"]` → 3 keys
- Divided by 20 (a fixed normalization cap)
- Result: 3/20 = **0.15** (15%)

**Interpretation**: Only 15% of the document context is relevant to the query; the rest is noise.
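The R computation can be sketched as a small helper (a reconstruction for illustration, not the exact code at lines 554-562; the clamp to 1.0 is a defensive assumption in case more than 20 sentences are ever labeled relevant):

```python
MAX_CONTEXT_SENTENCES = 20  # fixed normalization cap used by the formula above

def relevance_score(relevant_keys) -> float:
    # R = |relevant sentences| / 20, clamped to the 0-1 range
    return min(len(relevant_keys) / MAX_CONTEXT_SENTENCES, 1.0)
```
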
---

### **Metric 2: UTILIZATION (T)**

**Question Being Answered**: "Of the relevant information, how much did the LLM actually use?"

**Code Location**: `advanced_rag_evaluator.py`, lines 564-576

**Calculation**:
```python
T = len(all_utilized_sentence_keys) / len(all_relevant_sentence_keys)
```

**From GPT Response**:
- Numerator: count of `all_utilized_sentence_keys` (e.g., 2)
- Denominator: count of `all_relevant_sentence_keys` (e.g., 3)
- Result: 2/3 = **0.67** (67%)

**Interpretation**: The LLM used 67% of the relevant information; it ignored one relevant sentence.
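A sketch of the T computation, with a guard for the empty-relevant-set edge case (an assumption; the actual code at lines 564-576 may handle it differently):

```python
def utilization_score(utilized_keys, relevant_keys) -> float:
    # T = |utilized| / |relevant|; avoid dividing by zero when
    # the LLM labels no document sentences as relevant.
    if not relevant_keys:
        return 0.0
    return len(utilized_keys) / len(relevant_keys)
```
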
---

### **Metric 3: COMPLETENESS (C)**

**Question Being Answered**: "Does the response cover all the relevant information?"

**Code Location**: `advanced_rag_evaluator.py`, lines 577-591

**Calculation**:
```python
C = len(relevant_and_utilized) / len(relevant)
```

**From GPT Response**:
- Find the intersection of:
  - `all_relevant_sentence_keys` = {doc_0_s0, doc_0_s1, doc_1_s0}
  - `all_utilized_sentence_keys` = {doc_0_s0, doc_0_s1}
- Intersection = {doc_0_s0, doc_0_s1} → 2 items
- Result: 2/3 = **0.67** (67%)

**Interpretation**: The response covers 67% of the relevant information; it misses doc_1_s0.
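The intersection step maps directly onto Python sets. A sketch (illustrative, not the exact code at lines 577-591):

```python
def completeness_score(relevant_keys, utilized_keys) -> float:
    # C = |relevant ∩ utilized| / |relevant|
    relevant, utilized = set(relevant_keys), set(utilized_keys)
    if not relevant:
        return 0.0
    return len(relevant & utilized) / len(relevant)
```
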
---

### **Metric 4: ADHERENCE (A) - Hallucination Detection**

**Question Being Answered**: "Does the response contain hallucinations? (Are all claims supported by the documents?)"

**Code Location**: `advanced_rag_evaluator.py`, lines 593-609

**Calculation**:
```python
if all(item["fully_supported"] for item in sentence_support_information):
    A = 1.0
else:
    A = 0.0  # at least one hallucination found
```

**From GPT Response**:
- Check each item in `sentence_support_information`
- Look at the `fully_supported` field
- Example:

  ```
  resp_s0: fully_supported = true ✅
  resp_s1: fully_supported = true ✅
  ```

- All are true → Result: **1.0** (no hallucinations)
- If any were false:

  ```
  resp_s0: fully_supported = true  ✅
  resp_s1: fully_supported = false ❌ HALLUCINATION!
  ```

  Result: **0.0** (contains a hallucination)

**Interpretation**: 1.0 = the response is completely grounded in the documents; 0.0 = it contains at least one unsupported claim.
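The all-or-nothing check above fits in one function. A sketch (illustrative; `.get` with a `False` default treats a missing `fully_supported` field as unsupported, which is an assumption on my part):

```python
def adherence_score(sentence_support_information) -> float:
    # A = 1.0 only if every response sentence is fully supported;
    # a single unsupported sentence zeroes the score.
    supported = all(item.get("fully_supported", False)
                    for item in sentence_support_information)
    return 1.0 if supported else 0.0
```
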
---

## Real Example: Full Walkthrough

### **Input**:
```
Question: "What is COVID-19?"
Response: "COVID-19 is a respiratory disease. It spreads via droplets."
Documents:
1. "COVID-19 is a respiratory disease caused by SARS-CoV-2."
2. "The virus spreads through respiratory droplets."
3. "Vaccines help prevent infection."
```

### **Step 1: Sentencize**
```
doc_0_s0: "COVID-19 is a respiratory disease caused by SARS-CoV-2."
doc_0_s1: "The virus spreads through respiratory droplets."
doc_1_s0: "Vaccines help prevent infection."
resp_s0:  "COVID-19 is a respiratory disease."
resp_s1:  "It spreads via droplets."
```

(Note the keys: the first two sentences share the `doc_0` prefix, i.e., they are treated as coming from one retrieved chunk.)
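The sentencize step can be sketched with a naive regex splitter. This is an illustration only; the project likely uses a proper NLP sentence splitter, and here the first document is assumed to contain the first two sentences, matching the keys above:

```python
import re

def _split(text: str) -> list:
    # Naive split on sentence-ending punctuation followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentencize(documents, response) -> dict:
    """Assign doc_{d}_s{s} / resp_s{s} keys to every sentence."""
    keys = {}
    for d, doc in enumerate(documents):
        for s, sent in enumerate(_split(doc)):
            keys[f"doc_{d}_s{s}"] = sent
    for s, sent in enumerate(_split(response)):
        keys[f"resp_s{s}"] = sent
    return keys
```
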
### **Step 2: Send to GPT Labeling Prompt**

GPT analyzes the sentences and returns:
```json
{
  "all_relevant_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "all_utilized_sentence_keys": ["doc_0_s0", "doc_0_s1"],
  "sentence_support_information": [
    {"response_sentence_key": "resp_s0", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s0"]},
    {"response_sentence_key": "resp_s1", "fully_supported": true, "supporting_sentence_keys": ["doc_0_s1"]}
  ]
}
```

### **Step 3: Calculate TRACE Metrics**

**Relevance (R)**:
- Relevant keys: 2 (doc_0_s0, doc_0_s1)
- Formula: 2/20 = **0.10** (10%)
- Meaning: 10% of the document context is relevant

**Utilization (T)**:
- Used: 2, Relevant: 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: used all relevant information

**Completeness (C)**:
- Relevant ∩ Used = 2
- Formula: 2/2 = **1.00** (100%)
- Meaning: the response covers all relevant info

**Adherence (A)**:
- Are all sentences fully_supported = true?
- YES → **1.0** (no hallucinations)

**Average Score**:
- (0.10 + 1.00 + 1.00 + 1.0) / 4 = **0.775** (77.5% overall quality)
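The whole Step 3 calculation fits in a few lines. A standalone sketch of the arithmetic above (variable names are illustrative):

```python
# Labels returned by the LLM in Step 2
relevant = {"doc_0_s0", "doc_0_s1"}
utilized = {"doc_0_s0", "doc_0_s1"}
support  = [{"fully_supported": True}, {"fully_supported": True}]

R = len(relevant) / 20                       # 2/20 = 0.10
T = len(utilized) / len(relevant)            # 2/2  = 1.00
C = len(relevant & utilized) / len(relevant) # 2/2  = 1.00
A = 1.0 if all(s["fully_supported"] for s in support) else 0.0
average = (R + T + C + A) / 4

print(R, T, C, A, average)  # 0.1 1.0 1.0 1.0 0.775
```
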
---

## Why This Is Better Than Simple Metrics

| Aspect | Simple Keywords | GPT Labeling |
|--------|-----------------|--------------|
| Understanding | ❌ Keyword matching | ✅ Semantic understanding |
| Hallucination Detection | ❌ Can't detect | ✅ Flags unsupported claims |
| Paraphrasing | ❌ Misses rephrased info | ✅ Understands meaning |
| Explainability | ❌ "Just a number" | ✅ Shows the exact support mapping |
| Domain Specificity | ⚠️ Needs tuning | ✅ Works across domains |
---

## Key Files to Reference

| File | Purpose | Key Lines |
|------|---------|-----------|
| `advanced_rag_evaluator.py` | Main evaluation engine | All calculations |
| `advanced_rag_evaluator.py` | Prompt template | Lines 305-350 |
| `advanced_rag_evaluator.py` | Get GPT response | Lines 470-552 |
| `advanced_rag_evaluator.py` | Calculate R metric | Lines 554-562 |
| `advanced_rag_evaluator.py` | Calculate T metric | Lines 564-576 |
| `advanced_rag_evaluator.py` | Calculate C metric | Lines 577-591 |
| `advanced_rag_evaluator.py` | Calculate A metric | Lines 593-609 |
| `llm_client.py` | Groq API calls | LLM integration |

---
## Key Insights

1. **All metrics come from ONE GPT response**: they are consistent and complementary
2. **Sentence keys enable traceability**: you can show exactly which document sentence supported which claim
3. **Adherence is binary**: either fully supported (1.0) or not (0.0), so a single unsupported sentence zeroes the score
4. **Relevance normalization**: dividing by a fixed cap of 20 keeps R in the 0-1 range, assuming the context never holds more than 20 relevant sentences
5. **LLM as Judge**: semantic understanding without hand-written rule engineering

---

## Summary in One Sentence

**GPT labels which document sentences support which response sentences, and the TRACE metrics are calculated from this mapping to assess RAG quality.**

---
## Complete Documentation Available

1. **TRACE_METRICS_QUICK_REFERENCE.md** - Quick lookup
2. **TRACE_METRICS_EXPLANATION.md** - Detailed explanation
3. **TRACE_Metrics_Flow.png** - Visual process flow
4. **Sentence_Mapping_Example.png** - Sentence-level details
5. **RAG_Architecture_Diagram.png** - System overview
6. **RAG_Data_Flow_Diagram.png** - Complete pipeline
7. **RAG_Capstone_Project_Presentation.pptx** - Full presentation
8. **DOCUMENTATION_INDEX.md** - Navigation guide