# GPT Labeling Approach - Main Metrics Calculation
## Overview
The RAG evaluation system uses **GPT-based labeling** to calculate four TRACE metrics. The GPT model analyzes responses sentence-by-sentence and identifies which parts are supported by retrieved documents, enabling precise metric calculation.
---
## The Four TRACE Metrics
### 1. **Context Relevance (R)** - What's Actually Relevant?
**Definition:** Fraction of retrieved context that is relevant to answering the user's question.
**Calculation:**
```
Conceptual:     Context Relevance = Number of relevant sentences / Total retrieved sentences
As implemented: Context Relevance = min(1.0, Number of relevant sentences / 20)
```
**Formula:**
```
R = |Relevant Sentences| / |Total Sentences|   (conceptual)
R = min(1.0, |Relevant Sentences| / 20)        (as implemented, fixed baseline of 20)
```
**Data Source from GPT:**
```python
gpt_labels.all_relevant_sentence_keys
# List of document sentence keys identified as relevant
# Example: ["0a", "0b", "1c", "2a"]
```
**Example:**
```
Retrieved 30 sentences total
GPT identifies 12 as relevant to question "What is machine learning?"
Conceptual:     12/30 = 0.40 (40%)
As implemented: min(1.0, 12/20) = 0.60
```
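The normalized calculation can be sketched as follows (a minimal stand-alone version; the fixed baseline of 20 mirrors the implementation shown later, and the sentence keys are illustrative):

```python
# Minimal sketch of the normalized Context Relevance calculation.
# The baseline of 20 is the fixed denominator used by the implementation.
def context_relevance(relevant_keys, baseline=20):
    """min(1.0, count / baseline); 0.0 when nothing is relevant."""
    if not relevant_keys:
        return 0.0
    return min(1.0, len(relevant_keys) / baseline)

print(context_relevance(["0a", "0b", "1c", "2a"]))  # 4 relevant sentences -> 0.2
```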
**What It Tells You:**
- βœ“ How good was the retrieval?
- βœ“ Did we pull documents about the right topic?
- βœ“ Are there irrelevant documents in the results?
---
### 2. **Context Utilization (T)** - How Much Was Used?
**Definition:** Fraction of relevant context that the response actually used to generate its answer.
**Calculation:**
```
Context Utilization = Number of utilized sentences / Number of relevant sentences
(capped at 1.0 in the implementation)
```
**Formula:**
```
T = min(1.0, |Utilized Sentences| / |Relevant Sentences|)
```
**Data Source from GPT:**
```python
gpt_labels.all_utilized_sentence_keys
# List of document sentence keys actually used in response
# Example: ["0a", "0b", "1c"]
```
**Example:**
```
Context Relevance found 12 relevant sentences: ["0a", "0b", "1a", "1c", "2a", ...]
GPT identifies 8 actually used in response: ["0a", "0b", "1c", "2a", ...]
Context Utilization = 8/12 = 0.67 (67%)
```
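The example ratio can be reproduced with a small sketch (hypothetical sentence keys standing in for the real ones):

```python
def context_utilization(relevant_keys, utilized_keys):
    """Utilized / Relevant, capped at 1.0; 0.0 when nothing is relevant."""
    if not relevant_keys:
        return 0.0
    return min(1.0, len(utilized_keys) / len(relevant_keys))

# 12 relevant sentences, 8 of them actually used in the response
relevant = ["0a", "0b", "1a", "1c", "2a", "2b", "3a", "3b", "4a", "4b", "5a", "5b"]
utilized = ["0a", "0b", "1c", "2a", "3a", "4a", "4b", "5a"]
print(round(context_utilization(relevant, utilized), 2))  # 8/12 -> 0.67
```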
**What It Tells You:**
- βœ“ Did the LLM actually use the available information?
- βœ“ Is the response limited by context availability?
- βœ“ Is context being ignored/wasted?
**Problem Pattern:**
- High Relevance (0.9) + Low Utilization (0.3)
β†’ Retrieval is good, but LLM isn't using it
β†’ Fix: Improve prompt instructions
---
### 3. **Completeness (C)** - Was It Comprehensive?
**Definition:** Fraction of relevant information that is covered by the response.
**Calculation:**
```
Completeness = (Relevant ∩ Utilized) / Relevant
= Relevant sentences that were used / All relevant sentences
```
**Formula:**
```
C = |Relevant ∩ Utilized| / |Relevant|
```
**Data Source from GPT:**
```python
# Set intersection:
relevant_set = set(gpt_labels.all_relevant_sentence_keys)
utilized_set = set(gpt_labels.all_utilized_sentence_keys)
intersection = relevant_set & utilized_set
completeness = len(intersection) / len(relevant_set)
```
**Example:**
```
Relevant sentences: {"0a", "0b", "1a", "1c", "2a", "2b", "3a"} (7 total)
Utilized sentences: {"0a", "0b", "1c", "2a"} (4 used)
Overlap (Relevant AND Used): {"0a", "0b", "1c", "2a"} (4 in both)
Completeness = 4/7 = 0.57 (57%)
Missing: "1a", "2b", "3a" were relevant but not mentioned
```
**What It Tells You:**
- βœ“ Did the response cover all important information?
- βœ“ What relevant details were omitted?
- βœ“ Is the response comprehensive?
**Problem Pattern:**
- Low Completeness (0.4) with High Adherence (1.0)
β†’ Response is accurate but limited
β†’ Missing important information
β†’ Fix: Improve retrieval coverage or summarization
---
### 4. **Adherence (A)** - Was It Grounded?
**Definition:** Whether the response is fully grounded in the retrieved context (no hallucinations).
**Calculation:**
```
Adherence = 1.0 if ALL sentences are fully supported, 0.0 otherwise
(Boolean: fully grounded or contains hallucination)
```
**Formula:**
```
A = 1.0 if all(s.fully_supported for s in response_sentences) else 0.0
```
**Data Source from GPT:**
```python
gpt_labels.sentence_support_information
# For each response sentence:
# {
# "response_sentence_key": "a",
# "fully_supported": true/false, # ← This determines adherence
# "supporting_sentence_keys": ["0a", "0b"],
# "explanation": "..."
# }
fully_supported_count = sum(
1 for s in sentence_support_information
if s.get("fully_supported", False)
)
adherence = 1.0 if fully_supported_count == total_sentences else 0.0
```
**Example 1 - Perfect Adherence:**
```
Response sentences:
a. "Machine learning is a subset of AI."
└─ Fully supported by document 0a βœ“
b. "It uses algorithms to learn from data."
└─ Fully supported by document 1b βœ“
c. "Common applications include image recognition."
└─ Fully supported by documents 2a, 2b βœ“
ALL sentences fully supported β†’ Adherence = 1.0 (100%)
```
**Example 2 - Contains Hallucination:**
```
Response sentences:
a. "Machine learning is a subset of AI."
└─ Fully supported by document 0a βœ“
b. "It requires quantum computers."
└─ NOT supported by any document βœ— (Hallucination!)
c. "Common applications include image recognition."
└─ Fully supported by documents 2a, 2b βœ“
ONE sentence NOT fully supported β†’ Adherence = 0.0 (0%)
```
**What It Tells You:**
- βœ“ Is the response truthful/grounded?
- βœ“ Does it contain hallucinations?
- βœ“ Can we trust the answer?
---
## How GPT Labeling Works
### Step 1: Sentencization
**Documents:**
```
Document 0: "Machine learning is AI. It learns from data."
Document 1: "Neural networks are models. They mimic brains."
↓ Splits into sentences with keys:
0a: "Machine learning is AI."
0b: "It learns from data."
1a: "Neural networks are models."
1b: "They mimic brains."
```
**Response:**
```
"Machine learning uses neural networks. They learn patterns."
↓ Splits into sentences:
a: "Machine learning uses neural networks."
b: "They learn patterns."
```
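The keying scheme above can be sketched like this. Note the naive period split is a hypothetical stand-in for whatever sentencizer the pipeline actually uses:

```python
import string

def sentencize(text):
    """Naive sentence splitter: split on periods, drop empties."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def key_documents(docs):
    """Map each document sentence to a key like '0a', '0b', '1a', ..."""
    keyed = {}
    for doc_idx, doc in enumerate(docs):
        for sent_idx, sent in enumerate(sentencize(doc)):
            keyed[f"{doc_idx}{string.ascii_lowercase[sent_idx]}"] = sent
    return keyed

docs = [
    "Machine learning is AI. It learns from data.",
    "Neural networks are models. They mimic brains.",
]
print(key_documents(docs))
```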
### Step 2: GPT Labeling
GPT analyzes and identifies:
1. **Relevance:** Which document sentences are relevant to the question?
2. **Utilization:** Which document sentences were actually used in the response?
3. **Support:** Is each response sentence fully/partially/not supported?
**GPT Output (JSON):**
```json
{
"relevance_explanation": "Document discusses ML basics...",
"all_relevant_sentence_keys": ["0a", "0b", "1a"],
"overall_supported_explanation": "Response is grounded...",
"overall_supported": true,
"sentence_support_information": [
{
"response_sentence_key": "a",
"explanation": "Matches document sentences...",
"supporting_sentence_keys": ["0a", "1a"],
"fully_supported": true
},
{
"response_sentence_key": "b",
"explanation": "Partially supported...",
"supporting_sentence_keys": ["1b"],
"fully_supported": false
}
],
"all_utilized_sentence_keys": ["0a", "1a", "1b"]
}
```
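Consuming this output is a straightforward JSON parse. The sketch below uses a plain dict rather than the real `GPTLabelingOutput` class, with field names taken from the example above:

```python
import json

# Abbreviated labeling output, following the field names in the example above.
raw = """{
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"],
  "overall_supported": true,
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": false}
  ]
}"""

labels = json.loads(raw)
print(labels["all_relevant_sentence_keys"])  # ['0a', '0b', '1a']
```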
### Step 3: Metric Calculation
```
Relevant: ["0a", "0b", "1a"] (3 sentences)
Utilized: ["0a", "1a", "1b"] (3 sentences)
Context Relevance = min(1.0, |Relevant| / 20)
                  = 3 / 20 = 0.15
Context Utilization = |Utilized| / |Relevant|
= 3 / 3 = 1.0
Completeness = |Relevant ∩ Utilized| / |Relevant|
= |{"0a", "1a"}| / |{"0a", "0b", "1a"}|
= 2 / 3 = 0.67
Adherence = All sentences fully supported? β†’ Check each
= sentence_a (true) AND sentence_b (false)
= false β†’ 0.0
```
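The same arithmetic as a runnable sketch, with the key sets taken from the labeling JSON above and 20 as the fixed normalization baseline:

```python
# Step 3 metric calculation from labeled sentence keys.
relevant = {"0a", "0b", "1a"}
utilized = {"0a", "1a", "1b"}
support = [("a", True), ("b", False)]  # (response sentence key, fully_supported)

context_relevance = min(1.0, len(relevant) / 20)               # 3/20 = 0.15
context_utilization = min(1.0, len(utilized) / len(relevant))  # 3/3  = 1.0
completeness = len(relevant & utilized) / len(relevant)        # 2/3  ~ 0.67
adherence = 1.0 if all(ok for _, ok in support) else 0.0       # b fails -> 0.0
print(context_relevance, context_utilization, round(completeness, 2), adherence)
```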
---
## Code Implementation
### Metric Calculation Methods
```python
def _compute_context_relevance(self, gpt_labels: GPTLabelingOutput) -> float:
"""Count relevant sentences, normalize to 0-1."""
if not gpt_labels.all_relevant_sentence_keys:
return 0.0
return min(1.0, len(gpt_labels.all_relevant_sentence_keys) / 20.0)
def _compute_context_utilization(self, gpt_labels: GPTLabelingOutput) -> float:
"""Utilized / Relevant."""
relevant_count = len(gpt_labels.all_relevant_sentence_keys)
utilized_count = len(gpt_labels.all_utilized_sentence_keys)
if relevant_count == 0:
return 0.0
return min(1.0, utilized_count / relevant_count)
def _compute_completeness(self, gpt_labels: GPTLabelingOutput,
ground_truth: Optional[str] = None) -> float:
"""(Relevant AND Utilized) / Relevant."""
relevant_set = set(gpt_labels.all_relevant_sentence_keys)
utilized_set = set(gpt_labels.all_utilized_sentence_keys)
intersection = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
return 1.0 if len(utilized_set) == 0 else 0.0
return intersection / len(relevant_set)
def _compute_adherence(self, gpt_labels: GPTLabelingOutput) -> float:
"""All sentences fully supported? Boolean: 1.0 or 0.0."""
total_sentences = len(gpt_labels.sentence_support_information)
if total_sentences == 0:
return 1.0
fully_supported_count = sum(
1 for s in gpt_labels.sentence_support_information
if s.get("fully_supported", False)
)
return 1.0 if fully_supported_count == total_sentences else 0.0
```
---
## Complete Example: Full Calculation
### Input
**Question:** "What is machine learning?"
**Retrieved Documents:**
```
Doc 0: "Machine learning is a subset of AI. It learns patterns from data.
Algorithms improve through experience."
Doc 1: "Deep learning uses neural networks. It's popular in computer vision."
Doc 2: "Supervised learning needs labeled data. Unsupervised learning finds patterns."
```
**LLM Response:**
```
"Machine learning is a field of AI that learns from data. Deep learning
uses neural networks. It's powerful for image recognition."
```
### GPT Labeling Process
**Sentencized Documents:**
```
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Algorithms improve through experience."
1a: "Deep learning uses neural networks."
1b: "It's popular in computer vision."
2a: "Supervised learning needs labeled data."
2b: "Unsupervised learning finds patterns."
```
**Sentencized Response:**
```
a: "Machine learning is a field of AI that learns from data."
b: "Deep learning uses neural networks."
c: "It's powerful for image recognition."
```
**GPT Analysis:**
```json
{
"all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
"all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
"sentence_support_information": [
{
"response_sentence_key": "a",
"supporting_sentence_keys": ["0a", "0b"],
"fully_supported": true
},
{
"response_sentence_key": "b",
"supporting_sentence_keys": ["1a"],
"fully_supported": true
},
{
"response_sentence_key": "c",
"supporting_sentence_keys": ["1b"],
"fully_supported": false // "powerful for image recognition" not explicitly in docs
}
]
}
```
### Metric Calculation
```
Relevant: ["0a", "0b", "1a", "1b"] (4 sentences)
Utilized: ["0a", "0b", "1a", "1b"] (4 sentences)
Total sentences retrieved: 7
Context Relevance = min(1.0, 4 / 20) = 0.20 (20%)
    └─ Normalized by the fixed baseline of 20, not the 7 sentences actually retrieved
Context Utilization = 4 / 4 = 1.0 (100%)
└─ All relevant information was used
Completeness = |{"0a","0b","1a","1b"} ∩ {"0a","0b","1a","1b"}| / 4
= 4 / 4 = 1.0 (100%)
└─ All relevant info was covered
Adherence = All fully supported?
= sentence_a (true) AND sentence_b (true) AND sentence_c (false)
= false β†’ 0.0 (0%)
└─ Contains hallucination about "powerful for image recognition"
Average Score = (0.20 + 1.0 + 1.0 + 0.0) / 4 = 0.55
```
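The full walkthrough can be reproduced in a few lines (key sets and support flags come from the GPT analysis above):

```python
# Reproduce the complete-example numbers from the labeled keys.
relevant = {"0a", "0b", "1a", "1b"}
utilized = {"0a", "0b", "1a", "1b"}
fully_supported = [True, True, False]  # response sentences a, b, c

relevance = min(1.0, len(relevant) / 20)                 # 0.20
utilization = min(1.0, len(utilized) / len(relevant))    # 1.0
completeness = len(relevant & utilized) / len(relevant)  # 1.0
adherence = 1.0 if all(fully_supported) else 0.0         # 0.0
average = (relevance + utilization + completeness + adherence) / 4
print(round(average, 2))  # 0.55
```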
---
## Key Insights
### 1. Complementary Metrics
| Metric | Measures | Ideal Value |
|--------|----------|-------------|
| **Relevance** | Quality of retrieval | High (0.7+) |
| **Utilization** | LLM uses available info | High (0.7+) |
| **Completeness** | Coverage of information | High (0.7+) |
| **Adherence** | Grounding (no hallucination) | Perfect (1.0) |
### 2. Common Patterns
**Pattern 1: Good Retrieval, Bad Generation**
```
Relevance: 0.85 (good retrieval)
Utilization: 0.40 (not using it)
β†’ Problem: LLM not leveraging context
β†’ Fix: Improve prompt instructions
```
**Pattern 2: Conservative but Accurate**
```
Completeness: 0.50 (missing info)
Adherence: 1.0 (all correct)
β†’ Problem: Limited but grounded response
β†’ Fix: Improve retrieval coverage
```
**Pattern 3: Comprehensive and Grounded**
```
Relevance: 0.75, Utilization: 0.80, Completeness: 0.85, Adherence: 1.0
β†’ Excellent RAG system
β†’ Action: Monitor and maintain
```
### 3. Mathematical Relationships
```
Completeness ≤ Utilization
(Because |Relevant ∩ Utilized| ≤ |Utilized|, the covered fraction can never exceed the used fraction)
Also:
If Relevance = 0 β†’ Utilization = 0 and Completeness = 0
If Utilization = 0 β†’ Completeness = 0 (but Relevance can still be > 0)
Note: Utilization is not bounded by Relevance as implemented, because Relevance
is normalized by a fixed baseline of 20 while Utilization is a ratio of counts.
```
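The inequality Completeness ≤ Utilization can be spot-checked numerically over random key sets (a hypothetical sketch; the keys here are arbitrary integers, not real sentence keys):

```python
import random

# Spot-check Completeness <= Utilization for random relevant/utilized sets.
random.seed(0)
universe = list(range(10))
for _ in range(1000):
    relevant = set(random.sample(universe, random.randint(1, 10)))
    utilized = set(random.sample(universe, random.randint(0, 10)))
    utilization = min(1.0, len(utilized) / len(relevant))
    completeness = len(relevant & utilized) / len(relevant)
    assert completeness <= utilization
print("Completeness <= Utilization held in all trials")
```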
---
## Advantages of GPT Labeling
βœ… **Semantic Understanding**
- Not just keyword matching
- Understands meaning and context
- Detects subtle hallucinations
βœ… **Fine-Grained Analysis**
- Sentence-level support mapping
- Identifies exactly which info is supported
- Pinpoints problematic claims
βœ… **Comprehensive**
- Evaluates all four TRACE metrics
- Single pass through documents
- Complete audit trail
βœ… **Transparent**
- Full explanation for each metric
- Shows supporting evidence
- Human-verifiable results
---
## Limitations
❌ **Cost**
- API calls per evaluation (~2.5s per eval with rate limiting)
- At 30 RPM: 50 evals = 3-5 minutes
❌ **Semantic Brittleness**
- Depends on GPT's understanding
- May miss implicit knowledge
- Sensitive to phrasing
❌ **Normalization**
- Context Relevance normalized by 20 (arbitrary baseline)
- Different domain sizes affect scaling
❌ **Binary Adherence**
- One hallucination = 0.0 adherence
- No partial credit for mostly correct
---
## Summary
The GPT labeling approach calculates TRACE metrics by:
1. **Splitting** documents and response into sentences
2. **Analyzing** with GPT to identify relevant/utilized/supported information
3. **Computing** metrics from the labeled sentence keys:
- **Relevance**: What's relevant in retrieved docs?
- **Utilization**: What of that was actually used?
- **Completeness**: What coverage of relevant info?
- **Adherence**: Is all information grounded?
This enables precise, interpretable evaluation of RAG system quality.