# TRACE Metrics Calculation - Visual Guide

## Step-by-Step Visualization

### STEP 1: Sentencization

```
DOCUMENTS                               RESPONSE
─────────────────────────────────       ─────────────────────────────────
Doc 0:                                  "Machine learning is AI that learns
"ML is AI. It learns from data.         from data. Deep learning uses neural
Algorithms improve through time."       networks. It's powerful for images."

        ↓ Split by sentence ends                ↓ Split by sentence ends

0a: "ML is AI."                         a: "Machine learning is AI that
0b: "It learns from data."                  learns from data."
0c: "Algorithms improve                 b: "Deep learning uses neural
    through time."                          networks."
                                        c: "It's powerful for images."
```
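The split above can be sketched in a few lines. This is a simplification that splits on sentence-ending punctuation; a real pipeline would use a proper sentence tokenizer (e.g. NLTK's `sent_tokenize` or spaCy), and the `sentencize` helper name is illustrative, not part of any library:

```python
import re

def sentencize(documents, response):
    """Split documents and the response into keyed sentences.

    Keys follow the guide's convention: "<doc_index><letter>" for
    document sentences (0a, 0b, ...) and a bare letter (a, b, ...)
    for response sentences.
    """
    letters = "abcdefghijklmnopqrstuvwxyz"

    def split(text):
        # Naive split at sentence-ending punctuation followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    doc_sentences = {}
    for i, doc in enumerate(documents):
        for j, sent in enumerate(split(doc)):
            doc_sentences[f"{i}{letters[j]}"] = sent

    response_sentences = {letters[j]: s for j, s in enumerate(split(response))}
    return doc_sentences, response_sentences

docs, resp = sentencize(
    ["ML is AI. It learns from data. Algorithms improve through time."],
    "Machine learning is AI that learns from data. "
    "Deep learning uses neural networks. It's powerful for images.",
)
# docs -> {"0a": "ML is AI.", "0b": "It learns from data.", "0c": ...}
# resp -> {"a": ..., "b": ..., "c": "It's powerful for images."}
```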
### STEP 2: GPT Analysis

```
GPT MODEL PROCESSES:
┌───────────────────────────────────────────────────────────────┐
│                                                               │
│  INPUT: Sentencized docs + response + question                │
│                                                               │
│  ANALYSIS:                                                    │
│  • Which doc sentences are relevant to the question?          │
│  • Which doc sentences does the response use?                 │
│  • Is each response sentence fully/partially supported?       │
│                                                               │
│  OUTPUT: JSON with sentence keys and support mappings         │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
### STEP 3: Metric Calculation

```
GPT OUTPUT (SIMPLIFIED):
{
  "all_relevant_sentence_keys": ["0a", "0b"],
  "all_utilized_sentence_keys": ["0a", "0b"],
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": true},
    {"response_sentence_key": "c", "fully_supported": false}
  ]
}
          ↓
METRIC CALCULATION:
├─ Context Relevance   = |relevant| / |retrieved| (20 here) = 2/20 = 0.10
├─ Context Utilization = |utilized| / |relevant|            = 2/2  = 1.0
├─ Completeness        = |relevant ∩ utilized| / |relevant| = 2/2  = 1.0
└─ Adherence           = all_fully_supported? = false → 0.0
```

---
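The four calculations above can be reproduced directly from the GPT output. A minimal sketch, assuming the example's total of 20 retrieved sentences:

```python
# Metrics computed from the simplified GPT output shown above.
gpt_output = {
    "all_relevant_sentence_keys": ["0a", "0b"],
    "all_utilized_sentence_keys": ["0a", "0b"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
total_retrieved = 20  # assumed sentence count across all retrieved docs

relevant = set(gpt_output["all_relevant_sentence_keys"])
utilized = set(gpt_output["all_utilized_sentence_keys"])

context_relevance = len(relevant) / total_retrieved       # 2/20 = 0.10
context_utilization = len(utilized) / len(relevant)       # 2/2  = 1.0
completeness = len(relevant & utilized) / len(relevant)   # 2/2  = 1.0
adherence = float(all(s["fully_supported"]
                      for s in gpt_output["sentence_support_information"]))
# one unsupported sentence -> adherence = 0.0
```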
## Metric Formulas with Venn Diagrams

### Context Relevance (R)

```
ALL RETRIEVED SENTENCES
┌────────────────────────────┐
│                            │
│  Total: ~20 sentences      │
│                            │
│  ┌────────────────────┐    │
│  │ RELEVANT:          │    │
│  │ ["0a", "0b"]       │    │
│  │ Count: 2           │    │
│  └────────────────────┘    │
│                            │
│  Irrelevant: 18            │
│                            │
└────────────────────────────┘

Formula: R = 2 / 20 = 0.10 (10%)
Interpretation: 10% of the retrieved content is relevant to the question
```
### Context Utilization (U)

```
RELEVANT SENTENCES
┌────────────────────────────┐
│  RELEVANT: ["0a", "0b"]    │
│                            │
│  ┌──────────────────────┐  │
│  │ UTILIZED:            │  │
│  │ ["0a", "0b"]         │  │
│  │ Count: 2             │  │
│  └──────────────────────┘  │
│                            │
│  NOT USED: 0               │
│                            │
└────────────────────────────┘

Formula: U = 2 / 2 = 1.0 (100%)
Interpretation: All relevant information was used
```
### Completeness (C)

```
RELEVANT                       UTILIZED
┌────────────────┐             ┌────────────────┐
│ ["0a", "0b"]   │             │ ["0a", "0b"]   │
│                │             │                │
│ COUNT: 2       │             │ COUNT: 2       │
└────────────────┘             └────────────────┘
        │                              │
        └──────────────┬───────────────┘
                       ↓
           OVERLAP: ["0a", "0b"]
           COUNT: 2

Formula: C = 2 / 2 = 1.0 (100%)
Interpretation: All relevant info is in the response
```
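The overlap is just a set intersection. A sketch using hypothetical keys where one relevant sentence (`"1b"`) goes unused, so the two sets differ:

```python
# Completeness = |relevant ∩ utilized| / |relevant|
relevant = {"0b", "0c", "1a", "1b"}
utilized = {"0b", "0c", "1a"}

overlap = relevant & utilized  # set intersection: {"0b", "0c", "1a"}
completeness = len(overlap) / len(relevant) if relevant else 1.0
# -> 3/4 = 0.75
```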
### Adherence (A)

```
RESPONSE SENTENCES:             SUPPORT STATUS:
┌────────────────────┐          ┌────────────────────┐
│ a: "ML is AI..."   │ ───────→ │ Fully supported    │
│                    │          │                    │
│ b: "Deep learns..."│ ───────→ │ Fully supported    │
│                    │          │                    │
│ c: "Powerful..."   │ ───────→ │ NOT supported      │
└────────────────────┘          └────────────────────┘

Formula: A = (all_supported) ? 1.0 : 0.0
           = (true AND true AND false) ? 1.0 : 0.0
           = 0.0 (a single unsupported sentence drives adherence to 0)

Interpretation: The response contains a hallucination, so adherence fails
```

---
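The all-or-nothing check above maps directly onto Python's `all()`:

```python
# Adherence is binary: every response sentence must be fully supported.
support = [
    {"response_sentence_key": "a", "fully_supported": True},
    {"response_sentence_key": "b", "fully_supported": True},
    {"response_sentence_key": "c", "fully_supported": False},
]
adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0
# -> 0.0 because sentence "c" is unsupported
```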
## Complete Example Walkthrough

### Input

```
QUESTION:
"What makes machine learning different from traditional programming?"

RETRIEVED DOCUMENTS:
0: "Machine learning is a subset of AI. It learns patterns from data.
    Traditional programming requires explicit instructions."
1: "ML algorithms improve through experience. They adapt to new data.
    Rule-based systems are rigid and hard to maintain."

LLM RESPONSE:
"Machine learning differs because it learns from data rather than
requiring explicit instructions. ML algorithms improve over time.
It's the future of all computing."
```
### Step 1: Sentencization

```
DOCUMENTS:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Traditional programming requires explicit instructions."
1a: "ML algorithms improve through experience."
1b: "They adapt to new data."
1c: "Rule-based systems are rigid and hard to maintain."

RESPONSE:
a: "Machine learning differs because it learns from data rather than
    requiring explicit instructions."
b: "ML algorithms improve over time."
c: "It's the future of all computing."
```
### Step 2: GPT Labeling

```
ANALYSIS BY GPT:

Question focus: differences between ML and traditional programming
├─ "learns from data" vs "explicit instructions"
├─ "improves through experience"
└─ adaptability

RELEVANT SENTENCES (to the question):
├─ 0a: "subset of AI"                   → partially relevant
├─ 0b: "learns patterns from data"      → RELEVANT ✓
├─ 0c: "requires explicit instructions" → RELEVANT ✓
├─ 1a: "improve through experience"     → RELEVANT ✓
├─ 1b: "adapt to new data"              → RELEVANT ✓
└─ 1c: "rule-based systems rigid"       → partially relevant

UTILIZED SENTENCES (used in the response):
├─ response_a uses: 0b, 0c → document references: [0b, 0c]
├─ response_b uses: 1a     → document references: [1a]
└─ response_c uses: NONE   → no support → hallucination

FULLY SUPPORTED CHECK:
├─ response_a "learns from data, not explicit" → supported by 0b, 0c ✓
├─ response_b "algorithms improve"             → supported by 1a ✓
└─ response_c "future of all computing"        → NOT in documents ✗
```
### Step 3: Metric Calculation

```
EXTRACTED DATA:
all_relevant_sentence_keys = ["0b", "0c", "1a", "1b"]   (4 sentences)
all_utilized_sentence_keys = ["0b", "0c", "1a"]         (3 sentences)
sentence_support_information = [
  {key: "a", fully_supported: true},
  {key: "b", fully_supported: true},
  {key: "c", fully_supported: false}
]

CALCULATIONS:

1. Context Relevance
   = |relevant| / 20        (fixed normalization denominator)
   = 4 / 20
   = 0.20 (20%)

2. Context Utilization
   = |utilized| / |relevant|
   = 3 / 4
   = 0.75 (75%)

3. Completeness
   = |relevant ∩ utilized| / |relevant|
   = |{0b, 0c, 1a}| / |{0b, 0c, 1a, 1b}|
   = 3 / 4
   = 0.75 (75%)

4. Adherence
   = all fully_supported?
   = true AND true AND false
   = FALSE → 0.0 (0%)
```
### Results

```
┌───────────────────────────────────────────┐
│           TRACE METRICS RESULTS           │
├───────────────────────────────────────────┤
│ Context Relevance:    0.20  (20%)         │
│ Context Utilization:  0.75  (75%)         │
│ Completeness:         0.75  (75%)         │
│ Adherence:            0.0   (0%)          │
├───────────────────────────────────────────┤
│ Average:              0.425 (42.5%)       │
│ RMSE Aggregation:     0.437               │
│ Consistency Score:    0.563               │
└───────────────────────────────────────────┘

INTERPRETATION:
✓ Good relevance targeting (20%)
✓ Decent information usage (75%)
✓ Good coverage of relevant info (75%)
✗ Contains a hallucination (0% adherence)

ACTION: Address the hallucination about "future of all computing"
```

---
## Calculation Pseudocode

```python
# INPUT: GPT-labeled output
gpt_labels = {
    "all_relevant_sentence_keys": [...],
    "all_utilized_sentence_keys": [...],
    "sentence_support_information": [...]
}

# METRIC 1: Context Relevance
# The denominator is a fixed cap of 20 retrieved sentences (the
# normalization used throughout this guide); min() keeps the score <= 1.
relevant_count = len(gpt_labels["all_relevant_sentence_keys"])
context_relevance = min(1.0, relevant_count / 20.0)

# METRIC 2: Context Utilization (guard against division by zero)
utilized_count = len(gpt_labels["all_utilized_sentence_keys"])
if relevant_count == 0:
    context_utilization = 0.0
else:
    context_utilization = min(1.0, utilized_count / relevant_count)

# METRIC 3: Completeness (set overlap of relevant and utilized keys)
relevant_set = set(gpt_labels["all_relevant_sentence_keys"])
utilized_set = set(gpt_labels["all_utilized_sentence_keys"])
overlap_count = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
    completeness = 1.0 if len(utilized_set) == 0 else 0.0
else:
    completeness = overlap_count / len(relevant_set)

# METRIC 4: Adherence (binary: every response sentence must be supported)
fully_supported_count = sum(
    1 for sentence in gpt_labels["sentence_support_information"]
    if sentence["fully_supported"]
)
total_sentences = len(gpt_labels["sentence_support_information"])
if total_sentences == 0:
    adherence = 1.0
else:
    adherence = 1.0 if fully_supported_count == total_sentences else 0.0

# OUTPUT
scores = {
    "context_relevance": context_relevance,
    "context_utilization": context_utilization,
    "completeness": completeness,
    "adherence": adherence,
    "average": (context_relevance + context_utilization +
                completeness + adherence) / 4
}
```
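The same logic can be exercised against the walkthrough data. A sketch that packages the pseudocode as a function (`trace_scores` is an illustrative name, not a library API) and reproduces the results from the example:

```python
def trace_scores(gpt_labels, total_retrieved=20):
    """Compute the four TRACE metrics from GPT-labeled output."""
    relevant = set(gpt_labels["all_relevant_sentence_keys"])
    utilized = set(gpt_labels["all_utilized_sentence_keys"])
    support = gpt_labels["sentence_support_information"]

    context_relevance = min(1.0, len(relevant) / total_retrieved)
    context_utilization = (
        min(1.0, len(utilized) / len(relevant)) if relevant else 0.0
    )
    if relevant:
        completeness = len(relevant & utilized) / len(relevant)
    else:
        completeness = 1.0 if not utilized else 0.0
    # all() is True for an empty list, matching the "no sentences -> 1.0" rule
    adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0

    return {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "completeness": completeness,
        "adherence": adherence,
        "average": (context_relevance + context_utilization
                    + completeness + adherence) / 4,
    }

# Walkthrough data from the complete example above
scores = trace_scores({
    "all_relevant_sentence_keys": ["0b", "0c", "1a", "1b"],
    "all_utilized_sentence_keys": ["0b", "0c", "1a"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
})
# context_relevance 0.20, utilization 0.75, completeness 0.75,
# adherence 0.0, average 0.425 -- matching the results table
```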
| --- | |
| ## Key Takeaways | |
| ### 1. Each Metric Answers a Different Question | |
| | Metric | Question | Data Source | | |
| |--------|----------|-------------| | |
| | **R** | Is retrieval good? | Relevant sentences | | |
| | **U** | Does LLM use it? | Utilized sentences | | |
| | **C** | Is response comprehensive? | Overlap | | |
| | **A** | Is response truthful? | Support flags | | |
| ### 2. Metrics Are Independent | |
| - Low R, high U is possible (ignore irrelevant) | |
| - Low U, high R is possible (retrieval good, generation bad) | |
| - Low C, high A is possible (limited but correct) | |
| ### 3. GPT Labeling is Sentence-Level | |
| - Fine-grained sentence keys (0a, 0b, 1c, etc.) | |
| - Exact mapping of support | |
| - Transparent and verifiable | |
| ### 4. All Four Metrics Required for Full Picture | |
| ``` | |
| Relevance: β "Did we retrieve the right docs?" | |
| Utilization: β "Did the LLM use them?" | |
| Completeness: β "Did it cover the information?" | |
| Adherence: β "Is it accurate?" | |
| ``` | |
| All four needed to understand RAG quality. | |