# GPT Labeling Approach - Main Metrics Calculation

## Overview

The RAG evaluation system uses **GPT-based labeling** to calculate four TRACE metrics. The GPT model analyzes responses sentence by sentence and identifies which parts are supported by the retrieved documents, enabling precise metric calculation.

---

## The Four TRACE Metrics

### 1. **Context Relevance (R)** - What's Actually Relevant?

**Definition:** Fraction of the retrieved context that is relevant to answering the user's question.

**Calculation:**

```
Context Relevance = Number of relevant sentences / Total retrieved sentences
Implementation normalizes the count against a fixed baseline: min(1.0, count / 20)
```

**Formula:**

```
R = |Relevant Sentences| / |Total Sentences|
```

**Data Source from GPT:**

```python
gpt_labels.all_relevant_sentence_keys
# List of document sentence keys identified as relevant
# Example: ["0a", "0b", "1c", "2a"]
```

**Example:**

```
Retrieved 30 sentences total
GPT identifies 12 as relevant to the question "What is machine learning?"

Context Relevance = 12/30 = 0.40 (40%)
```

(This example follows the definition-level formula; the implementation's fixed baseline would instead give min(1.0, 12/20) = 0.60.)

**What It Tells You:**
- ✓ How good was the retrieval?
- ✓ Did we pull documents about the right topic?
- ✓ Are there irrelevant documents in the results?

---

### 2. **Context Utilization (T)** - How Much Was Used?

**Definition:** Fraction of the relevant context that the response actually used to generate its answer.

**Calculation:**

```
Context Utilization = Number of utilized sentences / Number of relevant sentences
```

**Formula:**

```
U = |Utilized Sentences| / |Relevant Sentences|
```

**Data Source from GPT:**

```python
gpt_labels.all_utilized_sentence_keys
# List of document sentence keys actually used in the response
# Example: ["0a", "0b", "1c"]
```

**Example:**

```
Context Relevance found 12 relevant sentences: ["0a", "0b", "1a", "1c", "2a", ...]
GPT identifies 8 actually used in the response: ["0a", "0b", "1c", "2a", ...]

Context Utilization = 8/12 = 0.67 (67%)
```

**What It Tells You:**
- ✓ Did the LLM actually use the available information?
- ✓ Is the response limited by context availability?
- ✓ Is context being ignored/wasted?

**Problem Pattern:**
- High Relevance (0.9) + Low Utilization (0.3)
  → Retrieval is good, but the LLM isn't using it
  → Fix: Improve prompt instructions

---

### 3. **Completeness (C)** - Was It Comprehensive?

**Definition:** Fraction of the relevant information that is covered by the response.

**Calculation:**

```
Completeness = |Relevant ∩ Utilized| / |Relevant|
             = Relevant sentences that were used / All relevant sentences
```

Note: Completeness differs from Utilization only when the utilized list contains keys that are not also in the relevant list; if every utilized sentence is relevant, the two metrics are equal.

**Formula:**

```
C = |Relevant ∩ Utilized| / |Relevant|
```

**Data Source from GPT:**

```python
# Set intersection:
relevant_set = set(gpt_labels.all_relevant_sentence_keys)
utilized_set = set(gpt_labels.all_utilized_sentence_keys)
intersection = relevant_set & utilized_set
completeness = len(intersection) / len(relevant_set)
```

**Example:**

```
Relevant sentences: {"0a", "0b", "1a", "1c", "2a", "2b", "3a"}  (7 total)
Utilized sentences: {"0a", "0b", "1c", "2a"}  (4 used)

Overlap (Relevant AND Used): {"0a", "0b", "1c", "2a"}  (4 in both)

Completeness = 4/7 = 0.57 (57%)
Missing: "1a", "2b", "3a" were relevant but not mentioned
```

**What It Tells You:**
- ✓ Did the response cover all important information?
- ✓ What relevant details were omitted?
- ✓ Is the response comprehensive?

**Problem Pattern:**
- Low Completeness (0.4) with High Adherence (1.0)
  → Response is accurate but limited
  → Missing important information
  → Fix: Improve retrieval coverage or summarization

---

### 4. **Adherence (A)** - Was It Grounded?

**Definition:** Whether the response is fully grounded in the retrieved context (no hallucinations).
**Calculation:**

```
Adherence = 1.0 if ALL sentences are fully supported, 0.0 otherwise
(Boolean: fully grounded or contains a hallucination)
```

**Formula:**

```
A = 1.0 if all(s.fully_supported for s in sentences) else 0.0
```

**Data Source from GPT:**

```python
gpt_labels.sentence_support_information
# For each response sentence:
# {
#   "response_sentence_key": "a",
#   "fully_supported": true/false,   # ← This determines adherence
#   "supporting_sentence_keys": ["0a", "0b"],
#   "explanation": "..."
# }

fully_supported_count = sum(
    1 for s in sentence_support_information
    if s.get("fully_supported", False)
)
adherence = 1.0 if fully_supported_count == total_sentences else 0.0
```

**Example 1 - Perfect Adherence:**

```
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It uses algorithms to learn from data."
   └─ Fully supported by document 1b ✓
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓

ALL sentences fully supported → Adherence = 1.0 (100%)
```

**Example 2 - Contains Hallucination:**

```
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It requires quantum computers."
   └─ NOT supported by any document ✗ (Hallucination!)
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓

ONE sentence NOT fully supported → Adherence = 0.0 (0%)
```

**What It Tells You:**
- ✓ Is the response truthful/grounded?
- ✓ Does it contain hallucinations?
- ✓ Can we trust the answer?

---

## How GPT Labeling Works

### Step 1: Sentencization

**Documents:**

```
Document 0: "Machine learning is AI. It learns from data."
Document 1: "Neural networks are models. They mimic brains."

↓ Splits into sentences with keys:

0a: "Machine learning is AI."
0b: "It learns from data."
1a: "Neural networks are models."
1b: "They mimic brains."
```

**Response:**

```
"Machine learning uses neural networks. They learn patterns."
↓ Splits into sentences:

a: "Machine learning uses neural networks."
b: "They learn patterns."
```

### Step 2: GPT Labeling

GPT analyzes the inputs and identifies:

1. **Relevance:** Which document sentences are relevant to the question?
2. **Utilization:** Which document sentences were actually used in the response?
3. **Support:** Is each response sentence fully/partially/not supported?

**GPT Output (JSON):**

```json
{
  "relevance_explanation": "Document discusses ML basics...",
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "overall_supported_explanation": "Response is grounded...",
  "overall_supported": true,
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "explanation": "Matches document sentences...",
      "supporting_sentence_keys": ["0a", "1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "explanation": "Partially supported...",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false
    }
  ],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"]
}
```

### Step 3: Metric Calculation

```
Relevant: ["0a", "0b", "1a"]  (3 sentences)
Utilized: ["0a", "1a", "1b"]  (3 sentences)

Context Relevance   = min(1.0, |Relevant| / 20) = 3 / 20 = 0.15

Context Utilization = |Utilized| / |Relevant| = 3 / 3 = 1.0

Completeness = |Relevant ∩ Utilized| / |Relevant|
             = |{"0a", "1a"}| / |{"0a", "0b", "1a"}|
             = 2 / 3 = 0.67

Adherence = All sentences fully supported?
→ Check each = sentence_a (true) AND sentence_b (false)
             = false → 0.0
```

---

## Code Implementation

### Metric Calculation Methods

```python
from typing import Optional

def _compute_context_relevance(self, gpt_labels: GPTLabelingOutput) -> float:
    """Count relevant sentences, normalized to 0-1 against a fixed baseline of 20."""
    if not gpt_labels.all_relevant_sentence_keys:
        return 0.0
    return min(1.0, len(gpt_labels.all_relevant_sentence_keys) / 20.0)

def _compute_context_utilization(self, gpt_labels: GPTLabelingOutput) -> float:
    """Utilized / Relevant."""
    relevant_count = len(gpt_labels.all_relevant_sentence_keys)
    utilized_count = len(gpt_labels.all_utilized_sentence_keys)
    if relevant_count == 0:
        return 0.0
    return min(1.0, utilized_count / relevant_count)

def _compute_completeness(self, gpt_labels: GPTLabelingOutput,
                          ground_truth: Optional[str] = None) -> float:
    """(Relevant AND Utilized) / Relevant."""
    relevant_set = set(gpt_labels.all_relevant_sentence_keys)
    utilized_set = set(gpt_labels.all_utilized_sentence_keys)
    intersection = len(relevant_set & utilized_set)
    if len(relevant_set) == 0:
        return 1.0 if len(utilized_set) == 0 else 0.0
    return intersection / len(relevant_set)

def _compute_adherence(self, gpt_labels: GPTLabelingOutput) -> float:
    """All sentences fully supported? Boolean: 1.0 or 0.0."""
    total_sentences = len(gpt_labels.sentence_support_information)
    if total_sentences == 0:
        return 1.0
    fully_supported_count = sum(
        1 for s in gpt_labels.sentence_support_information
        if s.get("fully_supported", False)
    )
    return 1.0 if fully_supported_count == total_sentences else 0.0
```

---

## Complete Example: Full Calculation

### Input

**Question:** "What is machine learning?"

**Retrieved Documents:**

```
Doc 0: "Machine learning is a subset of AI. It learns patterns from data. Algorithms improve through experience."
Doc 1: "Deep learning uses neural networks. It's popular in computer vision."
Doc 2: "Supervised learning needs labeled data. Unsupervised learning finds patterns."
```

**LLM Response:**

```
"Machine learning is a field of AI that learns from data. Deep learning uses neural networks. It's powerful for image recognition."
```

### GPT Labeling Process

**Sentencized Documents:**

```
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Algorithms improve through experience."
1a: "Deep learning uses neural networks."
1b: "It's popular in computer vision."
2a: "Supervised learning needs labeled data."
2b: "Unsupervised learning finds patterns."
```

**Sentencized Response:**

```
a: "Machine learning is a field of AI that learns from data."
b: "Deep learning uses neural networks."
c: "It's powerful for image recognition."
```

**GPT Analysis:**

```json
{
  "all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
  "all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "supporting_sentence_keys": ["0a", "0b"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "supporting_sentence_keys": ["1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "c",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false  // "powerful for image recognition" not explicitly in docs
    }
  ]
}
```

### Metric Calculation

```
Relevant: ["0a", "0b", "1a", "1b"]  (4 sentences)
Utilized: ["0a", "0b", "1a", "1b"]  (4 sentences)
Total sentences retrieved: 7

Context Relevance = min(1.0, 4 / 20) = 0.20 (20%)
└─ Normalized against the fixed 20-sentence baseline, not the 7 actually retrieved

Context Utilization = 4 / 4 = 1.0 (100%)
└─ All relevant information was used

Completeness = |{"0a","0b","1a","1b"} ∩ {"0a","0b","1a","1b"}| / 4
             = 4 / 4 = 1.0 (100%)
└─ All relevant info was covered

Adherence = All fully supported?
          = sentence_a (true) AND sentence_b (true) AND sentence_c (false)
          = false → 0.0 (0%)
└─ Contains an unsupported claim: "powerful for image recognition"

Average Score = (0.20 + 1.0 + 1.0 + 0.0) / 4 = 0.55
```

---

## Key Insights
### 1. Complementary Metrics

| Metric | Measures | Ideal Value |
|--------|----------|-------------|
| **Relevance** | Quality of retrieval | High (0.7+) |
| **Utilization** | LLM uses available info | High (0.7+) |
| **Completeness** | Coverage of information | High (0.7+) |
| **Adherence** | Grounding (no hallucination) | Perfect (1.0) |

### 2. Common Patterns

**Pattern 1: Good Retrieval, Bad Generation**

```
Relevance: 0.85 (good retrieval)
Utilization: 0.40 (not using it)
→ Problem: LLM not leveraging context
→ Fix: Improve prompt instructions
```

**Pattern 2: Conservative but Accurate**

```
Completeness: 0.50 (missing info)
Adherence: 1.0 (all correct)
→ Problem: Limited but grounded response
→ Fix: Improve retrieval coverage
```

**Pattern 3: Comprehensive and Grounded**

```
Relevance: 0.75, Utilization: 0.80, Completeness: 0.85, Adherence: 1.0
→ Excellent RAG system
→ Action: Monitor and maintain
```

### 3. Mathematical Relationships

```
Completeness ≤ Utilization
(Because |Relevant ∩ Utilized| ≤ |Utilized|; they are equal when every
utilized sentence is also relevant)

Relevance uses a different denominator (the fixed baseline of 20), so it is
not directly comparable to the other two ratios: the Step 3 example above
has Utilization = 1.0 with Relevance = 0.15.

Also:
If Relevance = 0 → Utilization = 0, Completeness = 0
If Utilization = 0 → Completeness = 0 (but Relevance can be > 0)
```

---

## Advantages of GPT Labeling

✅ **Semantic Understanding**
- Not just keyword matching
- Understands meaning and context
- Detects subtle hallucinations

✅ **Fine-Grained Analysis**
- Sentence-level support mapping
- Identifies exactly which info is supported
- Pinpoints problematic claims

✅ **Comprehensive**
- Evaluates all four TRACE metrics
- Single pass through documents
- Complete audit trail

✅ **Transparent**
- Full explanation for each metric
- Shows supporting evidence
- Human-verifiable results

---

## Limitations

❌ **Cost**
- API calls per evaluation (~2.5 s per eval with rate limiting)
- At 30 RPM: 50 evals = 3-5 minutes

❌ **Semantic Brittleness**
- Depends on GPT's understanding
- May miss implicit knowledge
- Sensitive to phrasing

❌ **Normalization**
- Context Relevance is normalized by a fixed 20 (an arbitrary baseline)
- Different domain sizes affect scaling

❌ **Binary Adherence**
- One hallucination = 0.0 adherence
- No partial credit for mostly correct responses

---

## Summary

The GPT labeling approach calculates TRACE metrics by:

1. **Splitting** documents and the response into sentences
2. **Analyzing** with GPT to identify relevant/utilized/supported information
3. **Computing** metrics from the labeled sentence keys:
   - **Relevance**: What's relevant in the retrieved docs?
   - **Utilization**: How much of that was actually used?
   - **Completeness**: How much of the relevant info was covered?
   - **Adherence**: Is all information grounded?

This enables precise, interpretable evaluation of RAG system quality.
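The whole pipeline can be exercised end to end. Below is a minimal, self-contained sketch that recomputes the four metrics from a GPT label payload shaped like the JSON shown earlier; `compute_trace_metrics` and the `labels` dict are illustrative stand-ins for the document's `GPTLabelingOutput`-based class methods, not the actual implementation.

```python
# Sketch: recompute the four TRACE metrics from a GPT label payload.
# Assumes the payload is a plain dict mirroring the JSON shape shown above.

def compute_trace_metrics(labels: dict, baseline: int = 20) -> dict:
    relevant = set(labels.get("all_relevant_sentence_keys", []))
    utilized = set(labels.get("all_utilized_sentence_keys", []))
    support = labels.get("sentence_support_information", [])

    # Relevance: count of relevant keys against the fixed 20-sentence baseline
    relevance = min(1.0, len(relevant) / baseline) if relevant else 0.0
    # Utilization: utilized / relevant, capped at 1.0
    utilization = min(1.0, len(utilized) / len(relevant)) if relevant else 0.0
    # Completeness: |relevant ∩ utilized| / |relevant|
    if relevant:
        completeness = len(relevant & utilized) / len(relevant)
    else:
        completeness = 1.0 if not utilized else 0.0
    # Adherence: boolean — every response sentence must be fully supported
    adherence = 1.0 if all(s.get("fully_supported", False) for s in support) else 0.0

    return {
        "context_relevance": relevance,
        "context_utilization": utilization,
        "completeness": completeness,
        "adherence": adherence,
    }

# The worked "Complete Example" from above:
labels = {
    "all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
    "all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
metrics = compute_trace_metrics(labels)
print(metrics)  # relevance 0.2, utilization 1.0, completeness 1.0, adherence 0.0
```

This reproduces the worked example's scores (0.20, 1.0, 1.0, 0.0), including the fixed-baseline relevance normalization and the all-or-nothing adherence check.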