# GPT Labeling Approach - Main Metrics Calculation
## Overview
The RAG evaluation system uses **GPT-based labeling** to calculate four TRACE metrics. The GPT model analyzes responses sentence-by-sentence and identifies which parts are supported by retrieved documents, enabling precise metric calculation.
---
## The Four TRACE Metrics
### 1. **Context Relevance (R)** - What's Actually Relevant?
**Definition:** Fraction of retrieved context that is relevant to answering the user's question.
**Calculation:**
```
Context Relevance = Number of relevant sentences / Total retrieved sentences
Normalized: min(1.0, count / 20)
```
**Formula:**
```
R = |Relevant Sentences| / |Total Sentences|
```
**Data Source from GPT:**
```python
gpt_labels.all_relevant_sentence_keys
# List of document sentence keys identified as relevant
# Example: ["0a", "0b", "1c", "2a"]
```
**Example:**
```
Retrieved 30 sentences total
GPT identifies 12 as relevant to question "What is machine learning?"
Context Relevance = 12/30 = 0.40 (40%)
(The implementation's normalized form would give min(1.0, 12/20) = 0.60)
```
**What It Tells You:**
- ✅ How good was the retrieval?
- ✅ Did we pull documents about the right topic?
- ✅ Are there irrelevant documents in the results?
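The two variants above (the raw fraction from the formula and the fixed-baseline normalization used by the implementation) differ in their denominator, which is worth seeing side by side. A minimal sketch using the example's numbers:

```python
relevant_count = 12   # sentences GPT marked relevant
total_retrieved = 30  # sentences in the retrieved documents

raw_relevance = relevant_count / total_retrieved  # formula above: 12/30 = 0.40
normalized = min(1.0, relevant_count / 20.0)      # implementation: 12/20 = 0.60

print(raw_relevance, normalized)
```

The two values diverge whenever the retrieved total differs from the fixed baseline of 20.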
---
### 2. **Context Utilization (T)** - How Much Was Used?
**Definition:** Fraction of relevant context that the response actually used to generate its answer.
**Calculation:**
```
Context Utilization = Number of utilized sentences / Number of relevant sentences
```
**Formula:**
```
U = |Utilized Sentences| / |Relevant Sentences|
```
**Data Source from GPT:**
```python
gpt_labels.all_utilized_sentence_keys
# List of document sentence keys actually used in response
# Example: ["0a", "0b", "1c"]
```
**Example:**
```
Context Relevance found 12 relevant sentences: ["0a", "0b", "1a", "1c", "2a", ...]
GPT identifies 8 actually used in response: ["0a", "0b", "1c", "2a", ...]
Context Utilization = 8/12 = 0.67 (67%)
```
**What It Tells You:**
- ✅ Did the LLM actually use the available information?
- ✅ Is the response limited by context availability?
- ✅ Is context being ignored/wasted?
**Problem Pattern:**
- High Relevance (0.9) + Low Utilization (0.3)
  → Retrieval is good, but the LLM isn't using it
  → Fix: Improve prompt instructions
---
### 3. **Completeness (C)** - Was It Comprehensive?
**Definition:** Fraction of relevant information that is covered by the response.
**Calculation:**
```
Completeness = (Relevant ∩ Utilized) / Relevant
             = Relevant sentences that were used / All relevant sentences
```
**Formula:**
```
C = |Relevant ∩ Utilized| / |Relevant|
```
**Data Source from GPT:**
```python
# Set intersection:
relevant_set = set(gpt_labels.all_relevant_sentence_keys)
utilized_set = set(gpt_labels.all_utilized_sentence_keys)
intersection = relevant_set & utilized_set
completeness = len(intersection) / len(relevant_set)
```
**Example:**
```
Relevant sentences: {"0a", "0b", "1a", "1c", "2a", "2b", "3a"} (7 total)
Utilized sentences: {"0a", "0b", "1c", "2a"} (4 used)
Overlap (Relevant AND Used): {"0a", "0b", "1c", "2a"} (4 in both)
Completeness = 4/7 = 0.57 (57%)
Missing: "1a", "2b", "3a" were relevant but not mentioned
```
**What It Tells You:**
- ✅ Did the response cover all important information?
- ✅ What relevant details were omitted?
- ✅ Is the response comprehensive?
**Problem Pattern:**
- Low Completeness (0.4) with High Adherence (1.0)
  → Response is accurate but limited
  → Missing important information
  → Fix: Improve retrieval coverage or summarization
---
### 4. **Adherence (A)** - Was It Grounded?
**Definition:** Whether the response is fully grounded in the retrieved context (no hallucinations).
**Calculation:**
```
Adherence = 1.0 if ALL sentences are fully supported, 0.0 otherwise
(Boolean: fully grounded or contains hallucination)
```
**Formula:**
```
A = 1.0 if all(s.fully_supported for s in sentences) else 0.0
```
**Data Source from GPT:**
```python
gpt_labels.sentence_support_information
# For each response sentence:
# {
#   "response_sentence_key": "a",
#   "fully_supported": true/false,  # ← This determines adherence
#   "supporting_sentence_keys": ["0a", "0b"],
#   "explanation": "..."
# }
fully_supported_count = sum(
    1 for s in gpt_labels.sentence_support_information
    if s.get("fully_supported", False)
)
adherence = 1.0 if fully_supported_count == total_sentences else 0.0
```
**Example 1 - Perfect Adherence:**
```
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It uses algorithms to learn from data."
   └─ Fully supported by document 1b ✓
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓
ALL sentences fully supported → Adherence = 1.0 (100%)
```
**Example 2 - Contains Hallucination:**
```
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It requires quantum computers."
   └─ NOT supported by any document ✗ (Hallucination!)
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓
ONE sentence NOT fully supported → Adherence = 0.0 (0%)
```
**What It Tells You:**
- ✅ Is the response truthful/grounded?
- ✅ Does it contain hallucinations?
- ✅ Can we trust the answer?
---
## How GPT Labeling Works
### Step 1: Sentencization
**Documents:**
```
Document 0: "Machine learning is AI. It learns from data."
Document 1: "Neural networks are models. They mimic brains."
→ Splits into sentences with keys:
0a: "Machine learning is AI."
0b: "It learns from data."
1a: "Neural networks are models."
1b: "They mimic brains."
```
**Response:**
```
"Machine learning uses neural networks. They learn patterns."
→ Splits into sentences:
a: "Machine learning uses neural networks."
b: "They learn patterns."
```
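The keying scheme above (document index + letter) can be sketched with a naive sentence splitter. This is an illustration only: `sentencize` and `key_document_sentences` are hypothetical names, and real pipelines typically use a proper sentence tokenizer rather than a regex.

```python
import re

def sentencize(text: str) -> list[str]:
    # Naive split on whitespace following sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def key_document_sentences(docs: list[str]) -> dict[str, str]:
    # Key scheme from the text: document index + letter, e.g. "0a", "1b".
    keyed = {}
    for doc_idx, doc in enumerate(docs):
        for sent_idx, sent in enumerate(sentencize(doc)):
            keyed[f"{doc_idx}{chr(ord('a') + sent_idx)}"] = sent
    return keyed

docs = [
    "Machine learning is AI. It learns from data.",
    "Neural networks are models. They mimic brains.",
]
keyed = key_document_sentences(docs)
print(keyed["0a"])  # Machine learning is AI.
```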
### Step 2: GPT Labeling
GPT analyzes and identifies:
1. **Relevance:** Which document sentences are relevant to the question?
2. **Utilization:** Which document sentences were actually used in the response?
3. **Support:** Is each response sentence fully/partially/not supported?
**GPT Output (JSON):**
```json
{
  "relevance_explanation": "Document discusses ML basics...",
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "overall_supported_explanation": "Response is grounded...",
  "overall_supported": true,
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "explanation": "Matches document sentences...",
      "supporting_sentence_keys": ["0a", "1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "explanation": "Partially supported...",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false
    }
  ],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"]
}
```
### Step 3: Metric Calculation
```
Relevant: ["0a", "0b", "1a"] (3 sentences)
Utilized: ["0a", "1a", "1b"] (3 sentences)
Context Relevance = min(1.0, |Relevant| / 20)
                  = 3 / 20 = 0.15   (fixed baseline of 20, per the implementation)
Context Utilization = |Utilized| / |Relevant|
                    = 3 / 3 = 1.0
Completeness = |Relevant ∩ Utilized| / |Relevant|
             = |{"0a", "1a"}| / |{"0a", "0b", "1a"}|
             = 2 / 3 = 0.67
Adherence = all sentences fully supported? → check each
          = sentence_a (true) AND sentence_b (false)
          = false → 0.0
```
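The Step 3 arithmetic can be reproduced directly from the Step 2 JSON. A minimal sketch (the JSON here is an abridged version of the example output above):

```python
import json

# Abridged GPT labeling output from Step 2, as a raw JSON string.
gpt_output = """{
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"],
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": false}
  ]
}"""
labels = json.loads(gpt_output)

relevant = set(labels["all_relevant_sentence_keys"])
utilized = set(labels["all_utilized_sentence_keys"])

utilization = min(1.0, len(utilized) / len(relevant))    # 3/3 = 1.0
completeness = len(relevant & utilized) / len(relevant)  # |{0a,1a}|/3 ≈ 0.67
adherence = 1.0 if all(s["fully_supported"]
                       for s in labels["sentence_support_information"]) else 0.0  # 0.0
```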
---
## Code Implementation
### Metric Calculation Methods
```python
from typing import Optional

def _compute_context_relevance(self, gpt_labels: GPTLabelingOutput) -> float:
    """Count relevant sentences, normalize to 0-1."""
    if not gpt_labels.all_relevant_sentence_keys:
        return 0.0
    return min(1.0, len(gpt_labels.all_relevant_sentence_keys) / 20.0)

def _compute_context_utilization(self, gpt_labels: GPTLabelingOutput) -> float:
    """Utilized / Relevant."""
    relevant_count = len(gpt_labels.all_relevant_sentence_keys)
    utilized_count = len(gpt_labels.all_utilized_sentence_keys)
    if relevant_count == 0:
        return 0.0
    return min(1.0, utilized_count / relevant_count)

def _compute_completeness(self, gpt_labels: GPTLabelingOutput,
                          ground_truth: Optional[str] = None) -> float:
    """(Relevant AND Utilized) / Relevant."""
    relevant_set = set(gpt_labels.all_relevant_sentence_keys)
    utilized_set = set(gpt_labels.all_utilized_sentence_keys)
    intersection = len(relevant_set & utilized_set)
    if len(relevant_set) == 0:
        return 1.0 if len(utilized_set) == 0 else 0.0
    return intersection / len(relevant_set)

def _compute_adherence(self, gpt_labels: GPTLabelingOutput) -> float:
    """All sentences fully supported? Boolean: 1.0 or 0.0."""
    total_sentences = len(gpt_labels.sentence_support_information)
    if total_sentences == 0:
        return 1.0
    fully_supported_count = sum(
        1 for s in gpt_labels.sentence_support_information
        if s.get("fully_supported", False)
    )
    return 1.0 if fully_supported_count == total_sentences else 0.0
```
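As a standalone usage sketch (assuming `GPTLabelingOutput` is a plain container; its real definition isn't shown here), the same four computations can be run on the labels from the complete example below:

```python
from dataclasses import dataclass, field

@dataclass
class GPTLabelingOutput:
    # Minimal stand-in for the real labeling output type (assumed shape).
    all_relevant_sentence_keys: list = field(default_factory=list)
    all_utilized_sentence_keys: list = field(default_factory=list)
    sentence_support_information: list = field(default_factory=list)

labels = GPTLabelingOutput(
    all_relevant_sentence_keys=["0a", "0b", "1a", "1b"],
    all_utilized_sentence_keys=["0a", "0b", "1a", "1b"],
    sentence_support_information=[
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
)

relevant = set(labels.all_relevant_sentence_keys)
utilized = set(labels.all_utilized_sentence_keys)

# Inline versions of the four method bodies:
relevance = min(1.0, len(relevant) / 20.0)               # 4/20 = 0.20
utilization = min(1.0, len(utilized) / len(relevant))    # 4/4 = 1.0
completeness = len(relevant & utilized) / len(relevant)  # 4/4 = 1.0
adherence = 1.0 if all(s.get("fully_supported", False)
                       for s in labels.sentence_support_information) else 0.0  # 0.0
print((relevance + utilization + completeness + adherence) / 4)  # 0.55
```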
---
## Complete Example: Full Calculation
### Input
**Question:** "What is machine learning?"
**Retrieved Documents:**
```
Doc 0: "Machine learning is a subset of AI. It learns patterns from data.
        Algorithms improve through experience."
Doc 1: "Deep learning uses neural networks. It's popular in computer vision."
Doc 2: "Supervised learning needs labeled data. Unsupervised learning finds patterns."
```
**LLM Response:**
```
"Machine learning is a field of AI that learns from data. Deep learning
uses neural networks. It's powerful for image recognition."
```
### GPT Labeling Process
**Sentencized Documents:**
```
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Algorithms improve through experience."
1a: "Deep learning uses neural networks."
1b: "It's popular in computer vision."
2a: "Supervised learning needs labeled data."
2b: "Unsupervised learning finds patterns."
```
**Sentencized Response:**
```
a: "Machine learning is a field of AI that learns from data."
b: "Deep learning uses neural networks."
c: "It's powerful for image recognition."
```
**GPT Analysis:**
```json
{
  "all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
  "all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "supporting_sentence_keys": ["0a", "0b"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "supporting_sentence_keys": ["1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "c",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false
    }
  ]
}
```
Note: sentence "c" is marked not fully supported because "powerful for image recognition" is not explicitly stated in the documents.
### Metric Calculation
```
Relevant: ["0a", "0b", "1a", "1b"] (4 sentences)
Utilized: ["0a", "0b", "1a", "1b"] (4 sentences)
Total sentences retrieved: 7
Context Relevance = min(1.0, 4 / 20) = 0.20 (20%)
  └─ Normalized against the fixed baseline of 20, not the 7 actually
     retrieved (the raw fraction would be 4/7 ≈ 0.57)
Context Utilization = 4 / 4 = 1.0 (100%)
  └─ All relevant information was used
Completeness = |{"0a","0b","1a","1b"} ∩ {"0a","0b","1a","1b"}| / 4
             = 4 / 4 = 1.0 (100%)
  └─ All relevant info was covered
Adherence = all fully supported?
          = sentence_a (true) AND sentence_b (true) AND sentence_c (false)
          = false → 0.0 (0%)
  └─ Contains an unsupported claim: "powerful for image recognition"
Average Score = (0.20 + 1.0 + 1.0 + 0.0) / 4 = 0.55
```
---
## Key Insights
### 1. Complementary Metrics
| Metric | Measures | Ideal Value |
|--------|----------|-------------|
| **Relevance** | Quality of retrieval | High (0.7+) |
| **Utilization** | LLM uses available info | High (0.7+) |
| **Completeness** | Coverage of information | High (0.7+) |
| **Adherence** | Grounding (no hallucination) | Perfect (1.0) |
### 2. Common Patterns
**Pattern 1: Good Retrieval, Bad Generation**
```
Relevance: 0.85 (good retrieval)
Utilization: 0.40 (not using it)
→ Problem: LLM not leveraging context
→ Fix: Improve prompt instructions
```
**Pattern 2: Conservative but Accurate**
```
Completeness: 0.50 (missing info)
Adherence: 1.0 (all correct)
→ Problem: Limited but grounded response
→ Fix: Improve retrieval coverage
```
**Pattern 3: Comprehensive and Grounded**
```
Relevance: 0.75, Utilization: 0.80, Completeness: 0.85, Adherence: 1.0
→ Excellent RAG system
→ Action: Monitor and maintain
```
### 3. Mathematical Relationships
```
Completeness ≤ Utilization
(Because |Relevant ∩ Utilized| ≤ |Utilized|, completeness can never exceed utilization.)
Note: Relevance is not directly comparable to the other two, since it is
normalized against the total (or baseline) sentence count rather than the
relevant count.
Also:
If Relevance = 0 → Utilization = 0 and Completeness = 0 (no relevant sentences)
If Utilization = 0 → Completeness = 0 (but Relevance can be > 0)
```
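The Completeness ≤ Utilization relationship can be spot-checked with random label sets. This is an illustrative sketch, not part of the implementation:

```python
import random

random.seed(0)
universe = [f"{d}{c}" for d in range(3) for c in "abc"]  # keys "0a".."2c"
for _ in range(1000):
    relevant = set(random.sample(universe, random.randint(1, len(universe))))
    utilized = set(random.sample(universe, random.randint(0, len(universe))))
    utilization = min(1.0, len(utilized) / len(relevant))
    completeness = len(relevant & utilized) / len(relevant)
    # |relevant ∩ utilized| <= |utilized|, so completeness <= utilization,
    # even when utilization is capped at 1.0.
    assert completeness <= utilization
```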
---
## Advantages of GPT Labeling
✅ **Semantic Understanding**
- Not just keyword matching
- Understands meaning and context
- Detects subtle hallucinations
✅ **Fine-Grained Analysis**
- Sentence-level support mapping
- Identifies exactly which info is supported
- Pinpoints problematic claims
✅ **Comprehensive**
- Evaluates all four TRACE metrics
- Single pass through documents
- Complete audit trail
✅ **Transparent**
- Full explanation for each metric
- Shows supporting evidence
- Human-verifiable results
---
## Limitations
❌ **Cost**
- API calls per evaluation (~2.5s per eval with rate limiting)
- At 30 RPM: 50 evals = 3-5 minutes
❌ **Semantic Brittleness**
- Depends on GPT's understanding
- May miss implicit knowledge
- Sensitive to phrasing
❌ **Normalization**
- Context Relevance normalized by 20 (arbitrary baseline)
- Different domain sizes affect scaling
❌ **Binary Adherence**
- One hallucination = 0.0 adherence
- No partial credit for mostly correct
---
## Summary
The GPT labeling approach calculates TRACE metrics by:
1. **Splitting** documents and response into sentences
2. **Analyzing** with GPT to identify relevant/utilized/supported information
3. **Computing** metrics from the labeled sentence keys:
   - **Relevance**: What's relevant in the retrieved docs?
   - **Utilization**: How much of that was actually used?
   - **Completeness**: How much of the relevant info was covered?
   - **Adherence**: Is all information grounded?
This enables precise, interpretable evaluation of RAG system quality.