# TRACE Metrics Calculation - Visual Guide

## Step-by-Step Visualization

### STEP 1: Sentencization

```
DOCUMENTS                                   RESPONSE
═════════                                   ════════
Doc 0: "Machine learning is AI that         "ML is AI. It learns from data.
        learns from data. Deep learning      Algorithms improve through time."
        uses neural networks. It's
        powerful for images."

        ↓ Split at sentence ends                ↓ Split at sentence ends

0a: "Machine learning is AI that            a: "ML is AI."
     learns from data."                     b: "It learns from data."
0b: "Deep learning uses neural              c: "Algorithms improve
     networks."                                 through time."
0c: "It's powerful for images."
```

### STEP 2: GPT Analysis

```
GPT MODEL PROCESSES:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│  INPUT: Sentencized docs + response + question              │
│                                                             │
│  ANALYSIS:                                                  │
│   ✓ Which doc sentences are relevant to the question?       │
│   ✓ Which doc sentences does the response use?              │
│   ✓ Is each response sentence fully/partially supported?    │
│                                                             │
│  OUTPUT: JSON with sentence keys and support mappings       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### STEP 3: Metric Calculation

```
GPT OUTPUT (SIMPLIFIED):
{
  "all_relevant_sentence_keys": ["0a", "0b"],
  "all_utilized_sentence_keys": ["0a", "0b"],
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": true},
    {"response_sentence_key": "c", "fully_supported": false}
  ]
}
        ↓
METRIC CALCULATION:
├─ Context Relevance   = |relevant| / 20                    = 2/20 = 0.10
├─ Context Utilization = |utilized| / |relevant|            = 2/2  = 1.0
├─ Completeness        = |relevant ∩ utilized| / |relevant| = 2/2  = 1.0
└─ Adherence           = all_fully_supported?
                       = false → 0.0
```

---

## Metric Formulas with Venn Diagrams

### Context Relevance (R)

```
ALL RETRIEVED SENTENCES (fixed normalization cap: 20)
┌──────────────────────────┐
│                          │
│   ┌──────────────────┐   │
│   │ RELEVANT:        │   │
│   │ ["0a", "0b"]     │   │
│   │ Count: 2         │   │
│   └──────────────────┘   │
│                          │
│   Irrelevant: 18         │
│                          │
└──────────────────────────┘

Formula: R = 2 / 20 = 0.10 (10%)
Interpretation: 10% of the retrieved content is relevant to the question
```

### Context Utilization (U)

```
RELEVANT SENTENCES
┌──────────────────────────┐
│ RELEVANT: ["0a", "0b"]   │
│                          │
│   ┌────────────────────┐ │
│   │ UTILIZED:          │ │
│   │ ["0a", "0b"]       │ │
│   │ Count: 2           │ │
│   └────────────────────┘ │
│                          │
│   NOT USED: 0            │
└──────────────────────────┘

Formula: U = 2 / 2 = 1.0 (100%)
Interpretation: All relevant information was used
```

### Completeness (C)

```
   RELEVANT              UTILIZED
┌──────────────┐     ┌──────────────┐
│ ["0a", "0b"] │     │ ["0a", "0b"] │
│ Count: 2     │     │ Count: 2     │
└──────┬───────┘     └───────┬──────┘
       │                     │
       └──────────┬──────────┘
                  │
        OVERLAP: ["0a", "0b"]
        Count: 2

Formula: C = 2 / 2 = 1.0 (100%)
Interpretation: All relevant info appears in the response
```

### Adherence (A)

```
RESPONSE SENTENCES                        SUPPORT STATUS
a: "ML is AI."                       ───→ ✓ Fully supported
b: "It learns from data."            ───→ ✓ Fully supported
c: "Algorithms improve through time."───→ ✗ Not supported

Formula: A = (all supported) ? 1.0 : 0.0
           = (true AND true AND false) ? 1.0 : 0.0
           = 0.0   (a single unsupported sentence drives the score to 0)

Interpretation: Response contains a hallucination (adherence fails)
```

---

## Complete Example Walkthrough

### Input

```
QUESTION:
"What makes machine learning different from traditional programming?"

RETRIEVED DOCUMENTS:
0: "Machine learning is a subset of AI. It learns patterns from data.
    Traditional programming requires explicit instructions."
1: "ML algorithms improve through experience. They adapt to new data.
    Rule-based systems are rigid and hard to maintain."

LLM RESPONSE:
"Machine learning differs because it learns from data rather than requiring
explicit instructions. ML algorithms improve over time. It's the future of
all computing."
```

### Step 1: Sentencization

```
DOCUMENTS:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Traditional programming requires explicit instructions."
1a: "ML algorithms improve through experience."
1b: "They adapt to new data."
1c: "Rule-based systems are rigid and hard to maintain."

RESPONSE:
a: "Machine learning differs because it learns from data rather than
    requiring explicit instructions."
b: "ML algorithms improve over time."
c: "It's the future of all computing."
```

### Step 2: GPT Labeling

```
ANALYSIS BY GPT:

Question focus: Differences between ML and traditional programming
└─ "learns from data" vs "explicit instructions"
└─ "improves through experience"
└─ Adaptability

RELEVANT SENTENCES (to question):
├─ 0a: "subset of AI"                   → Partially relevant
├─ 0b: "learns patterns from data"      → RELEVANT ✓
├─ 0c: "requires explicit instructions" → RELEVANT ✓
├─ 1a: "improve through experience"     → RELEVANT ✓
├─ 1b: "adapt to new data"              → RELEVANT ✓
└─ 1c: "rule-based systems rigid"       → Partially relevant

UTILIZED SENTENCES (used in response):
├─ response_a uses: 0b, 0c → Document references: [0b, 0c]
├─ response_b uses: 1a     → Document references: [1a]
└─ response_c uses: NONE   → No support → [hallucination]

FULLY SUPPORTED CHECK:
├─ response_a "learns from data, not explicit" → Supported by 0b, 0c ✓
├─ response_b "algorithms improve"             → Supported by 1a ✓
└─ response_c "future of all computing"        → NOT in documents ✗
```

### Step 3: Metric Calculation

```
EXTRACTED DATA:
all_relevant_sentence_keys   = ["0b", "0c", "1a", "1b"]   (4 sentences)
all_utilized_sentence_keys   = ["0b", "0c", "1a"]         (3 sentences)
sentence_support_information = [
  {key: "a", fully_supported: true},
  {key: "b", fully_supported: true},
  {key: "c", fully_supported: false}
]

CALCULATIONS:

1. Context Relevance   = |relevant| / 20
                       = 4 / 20 = 0.20 (20%)

2. Context Utilization = |utilized| / |relevant|
                       = 3 / 4 = 0.75 (75%)

3. Completeness        = |relevant ∩ utilized| / |relevant|
                       = |{0b, 0c, 1a}| / |{0b, 0c, 1a, 1b}|
                       = 3 / 4 = 0.75 (75%)

4. Adherence           = all fully_supported?
                       = true AND true AND false = FALSE → 0.0 (0%)
```

### Results

```
┌─────────────────────────────────────────┐
│         TRACE METRICS RESULTS           │
├─────────────────────────────────────────┤
│ Context Relevance:    0.20  (20%)       │
│ Context Utilization:  0.75  (75%)       │
│ Completeness:         0.75  (75%)       │
│ Adherence:            0.0   (0%)        │
├─────────────────────────────────────────┤
│ Average:              0.425 (42.5%)     │
│ RMSE Aggregation:     0.437             │
│ Consistency Score:    0.563             │
└─────────────────────────────────────────┘

INTERPRETATION:
✓ Good relevance targeting (20%)
✓ Decent information usage (75%)
✓ Good coverage of relevant info (75%)
✗ Contains a hallucination (0% adherence)

ACTION: Address the hallucination about "future of all computing"
```

---

## Calculation Pseudocode

```python
# INPUT: GPT-labeled output
gpt_labels = {
    "all_relevant_sentence_keys": [...],
    "all_utilized_sentence_keys": [...],
    "sentence_support_information": [...]
}

# METRIC 1: Context Relevance
relevant_count = len(gpt_labels["all_relevant_sentence_keys"])
context_relevance = min(1.0, relevant_count / 20.0)  # 20 = fixed normalization cap

# METRIC 2: Context Utilization
utilized_count = len(gpt_labels["all_utilized_sentence_keys"])
if relevant_count == 0:
    context_utilization = 0.0
else:
    context_utilization = min(1.0, utilized_count / relevant_count)

# METRIC 3: Completeness
relevant_set = set(gpt_labels["all_relevant_sentence_keys"])
utilized_set = set(gpt_labels["all_utilized_sentence_keys"])
overlap_count = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
    completeness = 1.0 if len(utilized_set) == 0 else 0.0
else:
    completeness = overlap_count / len(relevant_set)

# METRIC 4: Adherence (binary: one unsupported sentence zeroes the score)
fully_supported_count = sum(
    1 for sentence in gpt_labels["sentence_support_information"]
    if sentence["fully_supported"]
)
total_sentences = len(gpt_labels["sentence_support_information"])
if total_sentences == 0:
    adherence = 1.0
else:
    adherence = 1.0 if fully_supported_count == total_sentences else 0.0

# OUTPUT
scores = {
    "context_relevance": context_relevance,
    "context_utilization": context_utilization,
    "completeness": completeness,
    "adherence": adherence,
    "average": (context_relevance + context_utilization +
                completeness + adherence) / 4
}
```

---

## Key Takeaways

### 1. Each Metric Answers a Different Question

| Metric | Question | Data Source |
|--------|----------|-------------|
| **R** | Is retrieval good? | Relevant sentences |
| **U** | Does the LLM use it? | Utilized sentences |
| **C** | Is the response comprehensive? | Overlap |
| **A** | Is the response truthful? | Support flags |

### 2. Metrics Are Independent

- Low R, high U is possible (little is relevant, but all of it is used)
- High R, low U is possible (retrieval good, generation ignores it)
- Low C, high A is possible (limited but correct)

### 3. GPT Labeling is Sentence-Level

- Fine-grained sentence keys (0a, 0b, 1c, etc.)
- Exact mapping of support
- Transparent and verifiable

### 4. All Four Metrics Are Required for the Full Picture

```
Relevance:    ← "Did we retrieve the right docs?"
Utilization:  ← "Did the LLM use them?"
Completeness: ← "Did it cover the information?"
Adherence:    ← "Is it accurate?"
```

All four are needed to understand RAG quality.
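The sentencization step (STEP 1) can be sketched in a few lines of Python. This is a minimal sketch, assuming a naive regex splitter at sentence-ending punctuation; a real pipeline would typically use a proper NLP sentencizer (e.g. spaCy or NLTK), and the function names here are illustrative, not part of TRACE itself.

```python
import re
from typing import Dict, List, Tuple

def _split(text: str) -> List[str]:
    """Naively split text at sentence-ending punctuation (. ! ?)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def sentencize(docs: List[str], response: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Assign TRACE-style keys: document sentences get '<doc_idx><letter>'
    (0a, 0b, ...), response sentences get bare letters (a, b, ...)."""
    doc_keys = {}
    for i, doc in enumerate(docs):
        for j, sent in enumerate(_split(doc)):
            doc_keys[f"{i}{chr(ord('a') + j)}"] = sent
    resp_keys = {chr(ord('a') + j): s for j, s in enumerate(_split(response))}
    return doc_keys, resp_keys

# The walkthrough inputs:
docs = [
    "Machine learning is a subset of AI. It learns patterns from data. "
    "Traditional programming requires explicit instructions.",
    "ML algorithms improve through experience. They adapt to new data. "
    "Rule-based systems are rigid and hard to maintain.",
]
response = ("Machine learning differs because it learns from data rather than "
            "requiring explicit instructions. ML algorithms improve over time. "
            "It's the future of all computing.")

doc_keys, resp_keys = sentencize(docs, response)
# doc_keys holds 0a..0c and 1a..1c; resp_keys holds a..c
```

The naive splitter breaks on abbreviations and decimals ("e.g.", "3.5"), which is exactly why production sentencizers are model- or rule-based rather than a single regex.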
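The calculation pseudocode can be packaged into a single function and checked against the walkthrough numbers. This is a sketch under the same assumptions as the pseudocode section (fixed normalization cap of 20, binary adherence); `compute_trace_scores` is a name chosen here for illustration.

```python
def compute_trace_scores(labels: dict, cap: int = 20) -> dict:
    """Compute the four TRACE metrics from GPT-labeled output."""
    relevant = set(labels["all_relevant_sentence_keys"])
    utilized = set(labels["all_utilized_sentence_keys"])
    support = labels["sentence_support_information"]

    # Context Relevance: relevant count over a fixed retrieval cap, clamped to 1.0
    relevance = min(1.0, len(relevant) / cap)
    # Context Utilization: share of relevant sentences the response actually used
    utilization = 0.0 if not relevant else min(1.0, len(utilized) / len(relevant))
    # Completeness: overlap of relevant and utilized, over relevant
    if not relevant:
        completeness = 1.0 if not utilized else 0.0
    else:
        completeness = len(relevant & utilized) / len(relevant)
    # Adherence: 1.0 only if every response sentence is fully supported
    # (all([]) is True, so an empty response scores 1.0, as in the pseudocode)
    adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0

    scores = {
        "context_relevance": relevance,
        "context_utilization": utilization,
        "completeness": completeness,
        "adherence": adherence,
    }
    scores["average"] = sum(scores.values()) / 4
    return scores

# Labels extracted in the walkthrough's Step 3:
labels = {
    "all_relevant_sentence_keys": ["0b", "0c", "1a", "1b"],
    "all_utilized_sentence_keys": ["0b", "0c", "1a"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
scores = compute_trace_scores(labels)
# → relevance 0.20, utilization 0.75, completeness 0.75, adherence 0.0, average 0.425
```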
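The independence of the metrics (Takeaway 2) is easy to demonstrate with toy label sets. This hypothetical example assumes the same formulas and the cap of 20; the sentence keys are made up.

```python
CAP = 20  # fixed normalization cap, as in the pseudocode section

def relevance(relevant: set) -> float:
    return min(1.0, len(relevant) / CAP)

def utilization(relevant: set, utilized: set) -> float:
    return 0.0 if not relevant else min(1.0, len(utilized) / len(relevant))

# "Low R, high U": only one retrieved sentence is relevant, and it is used.
r1 = relevance({"0a"})            # 0.05 -> retrieval is poor
u1 = utilization({"0a"}, {"0a"})  # 1.0  -> generation uses what little there is

# "High R, low U": ten relevant sentences, the response uses only one.
ten = {f"0{c}" for c in "abcdefghij"}
r2 = relevance(ten)               # 0.5  -> retrieval is good
u2 = utilization(ten, {"0a"})     # 0.1  -> generation ignores most of it
```

Because each metric divides by a different denominator, no single score can stand in for the others, which is why all four are reported together.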