GPT Labeling Approach - Main Metrics Calculation
Overview
The RAG evaluation system uses GPT-based labeling to calculate four TRACE metrics. The GPT model analyzes responses sentence-by-sentence and identifies which parts are supported by retrieved documents, enabling precise metric calculation.
The Four TRACE Metrics
1. Context Relevance (R) - What's Actually Relevant?
Definition: Fraction of retrieved context that is relevant to answering the user's question.
Calculation:
Conceptually: Context Relevance = Number of relevant sentences / Total retrieved sentences
As implemented, the total is approximated by a fixed baseline: min(1.0, relevant_count / 20)
Formula:
R = |Relevant Sentences| / |Total Sentences|
Data Source from GPT:
gpt_labels.all_relevant_sentence_keys
# List of document sentence keys identified as relevant
# Example: ["0a", "0b", "1c", "2a"]
Example:
Retrieved 30 sentences total
GPT identifies 12 as relevant to question "What is machine learning?"
Context Relevance = 12/30 = 0.40 (40%)
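The 40% above follows the conceptual formula (dividing by the actual total of 30). The implementation instead normalizes by the fixed baseline of 20, which would score the same count as 0.60. A minimal sketch contrasting the two (the function name is illustrative, not from the codebase):

```python
def context_relevance(relevant_count: int, baseline: int = 20) -> float:
    """Implemented form: normalize the relevant count against a fixed baseline."""
    return min(1.0, relevant_count / baseline)

print(12 / 30)                # conceptual formula: 0.4
print(context_relevance(12))  # implemented baseline of 20: 0.6
```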
What It Tells You:
- How good was the retrieval?
- Did we pull documents about the right topic?
- Are there irrelevant documents in the results?
2. Context Utilization (T) - How Much Was Used?
Definition: Fraction of relevant context that the response actually used to generate its answer.
Calculation:
Context Utilization = Number of utilized sentences / Number of relevant sentences
Formula:
U = |Utilized Sentences| / |Relevant Sentences|
Data Source from GPT:
gpt_labels.all_utilized_sentence_keys
# List of document sentence keys actually used in response
# Example: ["0a", "0b", "1c"]
Example:
Context Relevance found 12 relevant sentences: ["0a", "0b", "1a", "1c", "2a", ...]
GPT identifies 8 actually used in response: ["0a", "0b", "1c", "2a", ...]
Context Utilization = 8/12 = 0.67 (67%)
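The arithmetic above, as a short sketch (values taken from the example; the min cap mirrors the implementation):

```python
relevant_count, utilized_count = 12, 8  # counts from the example above
utilization = min(1.0, utilized_count / relevant_count)  # capped at 1.0
print(round(utilization, 2))  # 0.67
```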
What It Tells You:
- Did the LLM actually use the available information?
- Is the response limited by context availability?
- Is context being ignored/wasted?
Problem Pattern:
- High Relevance (0.9) + Low Utilization (0.3) → Retrieval is good, but the LLM isn't using it → Fix: Improve prompt instructions
3. Completeness (C) - Was It Comprehensive?
Definition: Fraction of relevant information that is covered by the response.
Calculation:
Completeness = (Relevant ∩ Utilized) / Relevant
             = Relevant sentences that were used / All relevant sentences
Formula:
C = |Relevant ∩ Utilized| / |Relevant|
Data Source from GPT:
# Set intersection:
relevant_set = set(gpt_labels.all_relevant_sentence_keys)
utilized_set = set(gpt_labels.all_utilized_sentence_keys)
intersection = relevant_set & utilized_set
completeness = len(intersection) / len(relevant_set)
Example:
Relevant sentences: {"0a", "0b", "1a", "1c", "2a", "2b", "3a"} (7 total)
Utilized sentences: {"0a", "0b", "1c", "2a"} (4 used)
Overlap (Relevant AND Used): {"0a", "0b", "1c", "2a"} (4 in both)
Completeness = 4/7 = 0.57 (57%)
Missing: "1a", "2b", "3a" were relevant but not mentioned
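The set arithmetic above can be checked directly; the difference Relevant − Utilized recovers the missing keys:

```python
# Sets taken from the completeness example above.
relevant = {"0a", "0b", "1a", "1c", "2a", "2b", "3a"}
utilized = {"0a", "0b", "1c", "2a"}

completeness = len(relevant & utilized) / len(relevant)
missing = relevant - utilized  # relevant but never mentioned in the response

print(round(completeness, 2))  # 0.57
print(sorted(missing))         # ['1a', '2b', '3a']
```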
What It Tells You:
- Did the response cover all important information?
- What relevant details were omitted?
- Is the response comprehensive?
Problem Pattern:
- Low Completeness (0.4) with High Adherence (1.0) → Response is accurate but limited → Missing important information → Fix: Improve retrieval coverage or summarization
4. Adherence (A) - Was It Grounded?
Definition: Whether the response is fully grounded in the retrieved context (no hallucinations).
Calculation:
Adherence = 1.0 if ALL sentences are fully supported, 0.0 otherwise
(Boolean: fully grounded or contains hallucination)
Formula:
A = 1.0 if all(s.fully_supported for s in sentences) else 0.0
Data Source from GPT:
gpt_labels.sentence_support_information
# For each response sentence:
# {
#     "response_sentence_key": "a",
#     "fully_supported": true/false,   # ← This determines adherence
#     "supporting_sentence_keys": ["0a", "0b"],
#     "explanation": "..."
# }
fully_supported_count = sum(
    1 for s in sentence_support_information
    if s.get("fully_supported", False)
)
adherence = 1.0 if fully_supported_count == total_sentences else 0.0
Example 1 - Perfect Adherence:
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It uses algorithms to learn from data."
   └─ Fully supported by document 1b ✓
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓
ALL sentences fully supported → Adherence = 1.0 (100%)
Example 2 - Contains Hallucination:
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It requires quantum computers."
   └─ NOT supported by any document ✗ (Hallucination!)
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓
ONE sentence NOT fully supported → Adherence = 0.0 (0%)
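The all-or-nothing rule reduces to a single `all(...)` check; a few-line sketch using the flags from Example 2:

```python
# Per-sentence support flags from Example 2; sentence "b" is the hallucination.
support_flags = {"a": True, "b": False, "c": True}

adherence = 1.0 if all(support_flags.values()) else 0.0
print(adherence)  # 0.0
```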
What It Tells You:
- Is the response truthful/grounded?
- Does it contain hallucinations?
- Can we trust the answer?
How GPT Labeling Works
Step 1: Sentencization
Documents:
Document 0: "Machine learning is AI. It learns from data."
Document 1: "Neural networks are models. They mimic brains."
→ Splits into sentences with keys:
0a: "Machine learning is AI."
0b: "It learns from data."
1a: "Neural networks are models."
1b: "They mimic brains."
Response:
"Machine learning uses neural networks. They learn patterns."
→ Splits into sentences:
a: "Machine learning uses neural networks."
b: "They learn patterns."
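A minimal sentencizer sketch under the keying scheme above. The naive punctuation split and the `{doc_index}{letter}` keying are illustrative; the real pipeline may well use a proper sentence segmenter (e.g. spaCy or NLTK):

```python
import re
from string import ascii_lowercase

def sentencize_documents(documents: list[str]) -> dict[str, str]:
    """Key each sentence as '0a', '0b', ... (doc index + letter).
    Assumes at most 26 sentences per document; naive split on .!? boundaries."""
    keyed = {}
    for doc_idx, doc in enumerate(documents):
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
        for sent_idx, sentence in enumerate(sentences):
            keyed[f"{doc_idx}{ascii_lowercase[sent_idx]}"] = sentence
    return keyed

docs = [
    "Machine learning is AI. It learns from data.",
    "Neural networks are models. They mimic brains.",
]
print(sentencize_documents(docs))
# {'0a': 'Machine learning is AI.', '0b': 'It learns from data.',
#  '1a': 'Neural networks are models.', '1b': 'They mimic brains.'}
```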
Step 2: GPT Labeling
GPT analyzes and identifies:
- Relevance: Which document sentences are relevant to the question?
- Utilization: Which document sentences were actually used in the response?
- Support: Is each response sentence fully/partially/not supported?
GPT Output (JSON):
{
  "relevance_explanation": "Document discusses ML basics...",
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "overall_supported_explanation": "Response is grounded...",
  "overall_supported": true,
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "explanation": "Matches document sentences...",
      "supporting_sentence_keys": ["0a", "1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "explanation": "Partially supported...",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false
    }
  ],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"]
}
Step 3: Metric Calculation
Relevant: ["0a", "0b", "1a"] (3 sentences)
Utilized: ["0a", "1a", "1b"] (3 sentences)
Context Relevance = |["0a", "0b", "1a"]| / total_sentences
                  = 3 / ~20 = 0.15
Context Utilization = |Utilized| / |Relevant|
                    = 3 / 3 = 1.0
Completeness = |Relevant ∩ Utilized| / |Relevant|
             = |{"0a", "1a"}| / |{"0a", "0b", "1a"}|
             = 2 / 3 = 0.67
Adherence = All sentences fully supported? → Check each
          = sentence_a (true) AND sentence_b (false)
          = false → 0.0
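Putting Steps 2 and 3 together, the four metrics can be derived from the GPT JSON in a few lines. The /20 baseline follows the implementation's normalization; the dict shape mirrors the GPT output shown above:

```python
# GPT output from Step 2 (explanation fields omitted for brevity).
gpt_output = {
    "all_relevant_sentence_keys": ["0a", "0b", "1a"],
    "all_utilized_sentence_keys": ["0a", "1a", "1b"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": False},
    ],
}

relevant = set(gpt_output["all_relevant_sentence_keys"])
utilized = set(gpt_output["all_utilized_sentence_keys"])
support = gpt_output["sentence_support_information"]

relevance = min(1.0, len(relevant) / 20.0)                       # 3/20 = 0.15
utilization = min(1.0, len(utilized) / len(relevant))            # 3/3  = 1.0
completeness = len(relevant & utilized) / len(relevant)          # 2/3  ≈ 0.67
adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0  # 0.0
```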
Code Implementation
Metric Calculation Methods
# Methods of the evaluator class; GPTLabelingOutput is defined elsewhere.
from typing import Optional

def _compute_context_relevance(self, gpt_labels: GPTLabelingOutput) -> float:
    """Count relevant sentences, normalize to 0-1 against a fixed baseline of 20."""
    if not gpt_labels.all_relevant_sentence_keys:
        return 0.0
    return min(1.0, len(gpt_labels.all_relevant_sentence_keys) / 20.0)

def _compute_context_utilization(self, gpt_labels: GPTLabelingOutput) -> float:
    """Utilized / Relevant, capped at 1.0."""
    relevant_count = len(gpt_labels.all_relevant_sentence_keys)
    utilized_count = len(gpt_labels.all_utilized_sentence_keys)
    if relevant_count == 0:
        return 0.0
    return min(1.0, utilized_count / relevant_count)

def _compute_completeness(self, gpt_labels: GPTLabelingOutput,
                          ground_truth: Optional[str] = None) -> float:
    """(Relevant AND Utilized) / Relevant."""
    relevant_set = set(gpt_labels.all_relevant_sentence_keys)
    utilized_set = set(gpt_labels.all_utilized_sentence_keys)
    intersection = len(relevant_set & utilized_set)
    if len(relevant_set) == 0:
        return 1.0 if len(utilized_set) == 0 else 0.0
    return intersection / len(relevant_set)

def _compute_adherence(self, gpt_labels: GPTLabelingOutput) -> float:
    """All sentences fully supported? Boolean: 1.0 or 0.0."""
    total_sentences = len(gpt_labels.sentence_support_information)
    if total_sentences == 0:
        return 1.0
    fully_supported_count = sum(
        1 for s in gpt_labels.sentence_support_information
        if s.get("fully_supported", False)
    )
    return 1.0 if fully_supported_count == total_sentences else 0.0
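`GPTLabelingOutput` is not defined in this excerpt. A minimal stand-in, with the same computations applied inline, might look like this (the dataclass shape is an assumption inferred from the field accesses above):

```python
from dataclasses import dataclass, field

@dataclass
class GPTLabelingOutput:
    """Assumed shape of the labeling output; the real type lives elsewhere."""
    all_relevant_sentence_keys: list = field(default_factory=list)
    all_utilized_sentence_keys: list = field(default_factory=list)
    sentence_support_information: list = field(default_factory=list)

labels = GPTLabelingOutput(
    all_relevant_sentence_keys=["0a", "0b", "1a", "1b"],
    all_utilized_sentence_keys=["0a", "0b", "1a", "1b"],
    sentence_support_information=[
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
)

# Same logic as the methods above, applied inline:
relevance = min(1.0, len(labels.all_relevant_sentence_keys) / 20.0)
utilization = min(1.0, len(labels.all_utilized_sentence_keys)
                 / len(labels.all_relevant_sentence_keys))
completeness = (len(set(labels.all_relevant_sentence_keys)
                    & set(labels.all_utilized_sentence_keys))
                / len(set(labels.all_relevant_sentence_keys)))
adherence = 1.0 if all(s.get("fully_supported", False)
                       for s in labels.sentence_support_information) else 0.0
print(relevance, utilization, completeness, adherence)  # 0.2 1.0 1.0 0.0
```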
Complete Example: Full Calculation
Input
Question: "What is machine learning?"
Retrieved Documents:
Doc 0: "Machine learning is a subset of AI. It learns patterns from data.
Algorithms improve through experience."
Doc 1: "Deep learning uses neural networks. It's popular in computer vision."
Doc 2: "Supervised learning needs labeled data. Unsupervised learning finds patterns."
LLM Response:
"Machine learning is a field of AI that learns from data. Deep learning
uses neural networks. It's powerful for image recognition."
GPT Labeling Process
Sentencized Documents:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Algorithms improve through experience."
1a: "Deep learning uses neural networks."
1b: "It's popular in computer vision."
2a: "Supervised learning needs labeled data."
2b: "Unsupervised learning finds patterns."
Sentencized Response:
a: "Machine learning is a field of AI that learns from data."
b: "Deep learning uses neural networks."
c: "It's powerful for image recognition."
GPT Analysis:
{
  "all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
  "all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "supporting_sentence_keys": ["0a", "0b"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "supporting_sentence_keys": ["1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "c",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false  // "powerful for image recognition" not explicitly in docs
    }
  ]
}
Metric Calculation
Relevant: ["0a", "0b", "1a", "1b"] (4 sentences)
Utilized: ["0a", "0b", "1a", "1b"] (4 sentences)
Total sentences retrieved: 7
Context Relevance = 4 / 20 = 0.20 (20%)
└─ Normalized against the fixed baseline of 20, not the 7 actually retrieved
Context Utilization = 4 / 4 = 1.0 (100%)
└─ All relevant information was used
Completeness = |{"0a","0b","1a","1b"} ∩ {"0a","0b","1a","1b"}| / 4
             = 4 / 4 = 1.0 (100%)
└─ All relevant info was covered
Adherence = All fully supported?
          = sentence_a (true) AND sentence_b (true) AND sentence_c (false)
          = false → 0.0 (0%)
└─ Contains unsupported claim: "powerful for image recognition"
Average Score = (0.20 + 1.0 + 1.0 + 0.0) / 4 = 0.55
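The overall score shown is a plain unweighted mean of the four metrics (whether production code weights them differently is not stated here):

```python
# Metric values from the worked example above.
metrics = {"relevance": 0.20, "utilization": 1.0, "completeness": 1.0, "adherence": 0.0}

average = sum(metrics.values()) / len(metrics)
print(average)  # 0.55
```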
Key Insights
1. Complementary Metrics
| Metric | Measures | Ideal Value |
|---|---|---|
| Relevance | Quality of retrieval | High (0.7+) |
| Utilization | LLM uses available info | High (0.7+) |
| Completeness | Coverage of information | High (0.7+) |
| Adherence | Grounding (no hallucination) | Perfect (1.0) |
2. Common Patterns
Pattern 1: Good Retrieval, Bad Generation
Relevance: 0.85 (good retrieval)
Utilization: 0.40 (not using it)
→ Problem: LLM not leveraging context
→ Fix: Improve prompt instructions
Pattern 2: Conservative but Accurate
Completeness: 0.50 (missing info)
Adherence: 1.0 (all correct)
→ Problem: Limited but grounded response
→ Fix: Improve retrieval coverage
Pattern 3: Comprehensive and Grounded
Relevance: 0.75, Utilization: 0.80, Completeness: 0.85, Adherence: 1.0
→ Excellent RAG system
→ Action: Monitor and maintain
3. Mathematical Relationships
Completeness ≤ Utilization
(Because |Relevant ∩ Utilized| ≤ |Utilized|, the intersection ratio can never exceed the utilization ratio.)
Note: Relevance does not belong in this chain as implemented, since it is normalized by a fixed baseline of 20 rather than by the relevant count — in the worked example above, Utilization = 1.0 while Relevance = 0.20.
Also:
If Relevance = 0 (no relevant sentences) → the code returns Utilization = 0.0, and Completeness = 1.0 if nothing was utilized, else 0.0
If Utilization = 0 → Completeness = 0 (but Relevance can be > 0)
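The Completeness ≤ Utilization relationship can be property-checked against the implemented formulas; a quick randomized sketch:

```python
import random

# Completeness <= Utilization holds for any labeling, because
# |Relevant & Utilized| <= |Utilized|, and min() only lowers Utilization to 1.0,
# which Completeness can never exceed anyway.
keys = [f"{d}{c}" for d in range(3) for c in "abc"]
for _ in range(1000):
    relevant = set(random.sample(keys, random.randint(1, len(keys))))
    utilized = set(random.sample(keys, random.randint(0, len(keys))))
    utilization = min(1.0, len(utilized) / len(relevant))
    completeness = len(relevant & utilized) / len(relevant)
    assert completeness <= utilization
print("property holds")
```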
Advantages of GPT Labeling
✓ Semantic Understanding
- Not just keyword matching
- Understands meaning and context
- Detects subtle hallucinations
✓ Fine-Grained Analysis
- Sentence-level support mapping
- Identifies exactly which info is supported
- Pinpoints problematic claims
✓ Comprehensive
- Evaluates all four TRACE metrics
- Single pass through documents
- Complete audit trail
✓ Transparent
- Full explanation for each metric
- Shows supporting evidence
- Human-verifiable results
Limitations
✗ Cost
- API calls per evaluation (~2.5s per eval with rate limiting)
- At 30 RPM: 50 evals = 3-5 minutes
✗ Semantic Brittleness
- Depends on GPT's understanding
- May miss implicit knowledge
- Sensitive to phrasing
✗ Normalization
- Context Relevance normalized by 20 (arbitrary baseline)
- Different domain sizes affect scaling
✗ Binary Adherence
- One hallucination = 0.0 adherence
- No partial credit for mostly correct
Summary
The GPT labeling approach calculates TRACE metrics by:
- Splitting documents and response into sentences
- Analyzing with GPT to identify relevant/utilized/supported information
- Computing metrics from the labeled sentence keys:
- Relevance: What's relevant in retrieved docs?
- Utilization: What of that was actually used?
- Completeness: What coverage of relevant info?
- Adherence: Is all information grounded?
This enables precise, interpretable evaluation of RAG system quality.