
GPT Labeling Approach - Main Metrics Calculation

Overview

The RAG evaluation system uses GPT-based labeling to calculate four TRACE metrics. The GPT model analyzes responses sentence-by-sentence and identifies which parts are supported by retrieved documents, enabling precise metric calculation.


The Four TRACE Metrics

1. Context Relevance (R) - What's Actually Relevant?

Definition: Fraction of retrieved context that is relevant to answering the user's question.

Calculation:

Context Relevance = Number of relevant sentences / Total retrieved sentences
Implementation normalizes against a fixed baseline instead: min(1.0, relevant_count / 20)

Formula:

R = |Relevant Sentences| / |Total Sentences|

Data Source from GPT:

gpt_labels.all_relevant_sentence_keys
# List of document sentence keys identified as relevant
# Example: ["0a", "0b", "1c", "2a"]

Example:

Retrieved 30 sentences total
GPT identifies 12 as relevant to the question "What is machine learning?"
Raw fraction = 12/30 = 0.40; implementation: min(1.0, 12/20) = 0.60
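The definition and the implementation's fixed-baseline normalization give different numbers; a minimal sketch using the example's counts (values are illustrative):

```python
relevant_count = 12      # len(gpt_labels.all_relevant_sentence_keys)
total_sentences = 30     # sentences across all retrieved documents

raw_fraction = relevant_count / total_sentences   # definition: relevant / total
normalized = min(1.0, relevant_count / 20.0)      # what the implementation computes
```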

What It Tells You:

  • βœ“ How good was the retrieval?
  • βœ“ Did we pull documents about the right topic?
  • βœ“ Are there irrelevant documents in the results?

2. Context Utilization (T) - How Much Was Used?

Definition: Fraction of relevant context that the response actually used to generate its answer.

Calculation:

Context Utilization = Number of utilized sentences / Number of relevant sentences

Formula:

T = |Utilized Sentences| / |Relevant Sentences|

Data Source from GPT:

gpt_labels.all_utilized_sentence_keys
# List of document sentence keys actually used in response
# Example: ["0a", "0b", "1c"]

Example:

Context Relevance found 12 relevant sentences: ["0a", "0b", "1a", "1c", "2a", ...]
GPT identifies 8 actually used in response: ["0a", "0b", "1c", "2a", ...]
Context Utilization = 8/12 = 0.67 (67%)
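A minimal sketch with the counts from the example (the implementation clamps the ratio at 1.0, since utilized keys can occasionally fall outside the relevant set):

```python
relevant_count = 12   # len(gpt_labels.all_relevant_sentence_keys)
utilized_count = 8    # len(gpt_labels.all_utilized_sentence_keys)

utilization = min(1.0, utilized_count / relevant_count) if relevant_count else 0.0
```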

What It Tells You:

  • βœ“ Did the LLM actually use the available information?
  • βœ“ Is the response limited by context availability?
  • βœ“ Is context being ignored/wasted?

Problem Pattern:

  • High Relevance (0.9) + Low Utilization (0.3) β†’ Retrieval is good, but LLM isn't using it β†’ Fix: Improve prompt instructions

3. Completeness (C) - Was It Comprehensive?

Definition: Fraction of relevant information that is covered by the response.

Calculation:

Completeness = (Relevant ∩ Utilized) / Relevant
            = Relevant sentences that were used / All relevant sentences

Formula:

C = |Relevant ∩ Utilized| / |Relevant|

Data Source from GPT:

# Set intersection:
relevant_set = set(gpt_labels.all_relevant_sentence_keys)
utilized_set = set(gpt_labels.all_utilized_sentence_keys)
intersection = relevant_set & utilized_set

completeness = len(intersection) / len(relevant_set)

Example:

Relevant sentences: {"0a", "0b", "1a", "1c", "2a", "2b", "3a"}  (7 total)
Utilized sentences: {"0a", "0b", "1c", "2a"}                    (4 used)

Overlap (Relevant AND Used): {"0a", "0b", "1c", "2a"}            (4 in both)
Completeness = 4/7 = 0.57 (57%)

Missing: "1a", "2b", "3a" were relevant but not mentioned
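The same set arithmetic as runnable Python, using the example's keys:

```python
relevant = {"0a", "0b", "1a", "1c", "2a", "2b", "3a"}
utilized = {"0a", "0b", "1c", "2a"}

completeness = len(relevant & utilized) / len(relevant)   # 4/7
missing = relevant - utilized                             # relevant but never used
```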

What It Tells You:

  • βœ“ Did the response cover all important information?
  • βœ“ What relevant details were omitted?
  • βœ“ Is the response comprehensive?

Problem Pattern:

  • Low Completeness (0.4) with High Adherence (1.0) β†’ Response is accurate but limited β†’ Missing important information β†’ Fix: Improve retrieval coverage or summarization

4. Adherence (A) - Was It Grounded?

Definition: Whether the response is fully grounded in the retrieved context (no hallucinations).

Calculation:

Adherence = 1.0 if ALL sentences are fully supported, 0.0 otherwise
           (Boolean: fully grounded or contains hallucination)

Formula:

A = 1.0 if all(s.fully_supported for s in response_sentences) else 0.0

Data Source from GPT:

gpt_labels.sentence_support_information
# For each response sentence:
# {
#   "response_sentence_key": "a",
#   "fully_supported": true/false,    # ← This determines adherence
#   "supporting_sentence_keys": ["0a", "0b"],
#   "explanation": "..."
# }

fully_supported_count = sum(
    1 for s in sentence_support_information
    if s.get("fully_supported", False)
)

adherence = 1.0 if fully_supported_count == total_sentences else 0.0

Example 1 - Perfect Adherence:

Response sentences:
  a. "Machine learning is a subset of AI."
     └─ Fully supported by document 0a βœ“
  b. "It uses algorithms to learn from data."
     └─ Fully supported by document 1b βœ“
  c. "Common applications include image recognition."
     └─ Fully supported by documents 2a, 2b βœ“

ALL sentences fully supported β†’ Adherence = 1.0 (100%)

Example 2 - Contains Hallucination:

Response sentences:
  a. "Machine learning is a subset of AI."
     └─ Fully supported by document 0a βœ“
  b. "It requires quantum computers."
     └─ NOT supported by any document βœ— (Hallucination!)
  c. "Common applications include image recognition."
     └─ Fully supported by documents 2a, 2b βœ“

ONE sentence NOT fully supported β†’ Adherence = 0.0 (0%)
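The all-or-nothing rule applied to the hallucination example, sketched with illustrative support flags:

```python
support = [
    {"response_sentence_key": "a", "fully_supported": True},
    {"response_sentence_key": "b", "fully_supported": False},  # hallucination
    {"response_sentence_key": "c", "fully_supported": True},
]
fully_supported = sum(1 for s in support if s.get("fully_supported", False))
adherence = 1.0 if fully_supported == len(support) else 0.0
```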

What It Tells You:

  • βœ“ Is the response truthful/grounded?
  • βœ“ Does it contain hallucinations?
  • βœ“ Can we trust the answer?

How GPT Labeling Works

Step 1: Sentencization

Documents:

Document 0: "Machine learning is AI. It learns from data."
Document 1: "Neural networks are models. They mimic brains."

↓ Splits into sentences with keys:

0a: "Machine learning is AI."
0b: "It learns from data."
1a: "Neural networks are models."
1b: "They mimic brains."

Response:

"Machine learning uses neural networks. They learn patterns."

↓ Splits into sentences:

a: "Machine learning uses neural networks."
b: "They learn patterns."
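The keying scheme above can be sketched as follows (the helper name is hypothetical; the actual sentencizer lives in the evaluation pipeline):

```python
import string

# Document sentences get keys like "0a" (doc index + letter);
# response sentences get bare letters "a", "b", ...
def key_sentences(docs):
    keyed = {}
    for d, sentences in enumerate(docs):
        for i, sentence in enumerate(sentences):
            keyed[f"{d}{string.ascii_lowercase[i]}"] = sentence
    return keyed

docs = [
    ["Machine learning is AI.", "It learns from data."],
    ["Neural networks are models.", "They mimic brains."],
]
keyed = key_sentences(docs)
```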

Step 2: GPT Labeling

GPT analyzes and identifies:

  1. Relevance: Which document sentences are relevant to the question?
  2. Utilization: Which document sentences were actually used in the response?
  3. Support: Is each response sentence fully/partially/not supported?

GPT Output (JSON):

{
  "relevance_explanation": "Document discusses ML basics...",
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "overall_supported_explanation": "Response is grounded...",
  "overall_supported": true,
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "explanation": "Matches document sentences...",
      "supporting_sentence_keys": ["0a", "1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "explanation": "Partially supported...",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false
    }
  ],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"]
}

Step 3: Metric Calculation

Relevant: ["0a", "0b", "1a"]              (3 sentences)
Utilized: ["0a", "1a", "1b"]              (3 sentences)

Context Relevance = min(1.0, |["0a", "0b", "1a"]| / 20)
                  = 3 / 20 = 0.15

Context Utilization = |Utilized| / |Relevant|
                    = 3 / 3 = 1.0

Completeness = |Relevant ∩ Utilized| / |Relevant|
             = |{"0a", "1a"}| / |{"0a", "0b", "1a"}|
             = 2 / 3 = 0.67

Adherence = All sentences fully supported? β†’ Check each
          = sentence_a (true) AND sentence_b (false)
          = false β†’ 0.0
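The whole Step 3 computation as a runnable sketch, using the labels from Step 2 (the baseline constant of 20 matches the implementation):

```python
relevant = {"0a", "0b", "1a"}
utilized = {"0a", "1a", "1b"}
support_flags = [True, False]   # fully_supported for response sentences a, b
BASELINE = 20                   # fixed normalization baseline

context_relevance = min(1.0, len(relevant) / BASELINE)         # 0.15
context_utilization = min(1.0, len(utilized) / len(relevant))  # 1.0
completeness = len(relevant & utilized) / len(relevant)        # 2/3
adherence = 1.0 if all(support_flags) else 0.0                 # 0.0
```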

Code Implementation

Metric Calculation Methods

from typing import Optional

def _compute_context_relevance(self, gpt_labels: GPTLabelingOutput) -> float:
    """Count relevant sentences; normalize against a fixed baseline of 20."""
    if not gpt_labels.all_relevant_sentence_keys:
        return 0.0
    return min(1.0, len(gpt_labels.all_relevant_sentence_keys) / 20.0)

def _compute_context_utilization(self, gpt_labels: GPTLabelingOutput) -> float:
    """Utilized / Relevant."""
    relevant_count = len(gpt_labels.all_relevant_sentence_keys)
    utilized_count = len(gpt_labels.all_utilized_sentence_keys)
    if relevant_count == 0:
        return 0.0
    return min(1.0, utilized_count / relevant_count)

def _compute_completeness(self, gpt_labels: GPTLabelingOutput, 
                        ground_truth: Optional[str] = None) -> float:
    """(Relevant AND Utilized) / Relevant."""
    relevant_set = set(gpt_labels.all_relevant_sentence_keys)
    utilized_set = set(gpt_labels.all_utilized_sentence_keys)
    intersection = len(relevant_set & utilized_set)
    if len(relevant_set) == 0:
        return 1.0 if len(utilized_set) == 0 else 0.0
    return intersection / len(relevant_set)

def _compute_adherence(self, gpt_labels: GPTLabelingOutput) -> float:
    """All sentences fully supported? Boolean: 1.0 or 0.0."""
    total_sentences = len(gpt_labels.sentence_support_information)
    if total_sentences == 0:
        return 1.0
    fully_supported_count = sum(
        1 for s in gpt_labels.sentence_support_information
        if s.get("fully_supported", False)
    )
    return 1.0 if fully_supported_count == total_sentences else 0.0

Complete Example: Full Calculation

Input

Question: "What is machine learning?"

Retrieved Documents:

Doc 0: "Machine learning is a subset of AI. It learns patterns from data. 
        Algorithms improve through experience."
Doc 1: "Deep learning uses neural networks. It's popular in computer vision."
Doc 2: "Supervised learning needs labeled data. Unsupervised learning finds patterns."

LLM Response:

"Machine learning is a field of AI that learns from data. Deep learning 
uses neural networks. It's powerful for image recognition."

GPT Labeling Process

Sentencized Documents:

0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Algorithms improve through experience."
1a: "Deep learning uses neural networks."
1b: "It's popular in computer vision."
2a: "Supervised learning needs labeled data."
2b: "Unsupervised learning finds patterns."

Sentencized Response:

a: "Machine learning is a field of AI that learns from data."
b: "Deep learning uses neural networks."
c: "It's powerful for image recognition."

GPT Analysis:

{
  "all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
  "all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "supporting_sentence_keys": ["0a", "0b"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "supporting_sentence_keys": ["1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "c",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false  // "powerful for image recognition" not explicitly in docs
    }
  ]
}

Metric Calculation

Relevant: ["0a", "0b", "1a", "1b"]       (4 sentences)
Utilized: ["0a", "0b", "1a", "1b"]       (4 sentences)
Total sentences retrieved: 7

Context Relevance = min(1.0, 4 / 20) = 0.20 (20%)
  └─ 4 relevant sentences, normalized by the fixed baseline of 20
     (not by the 7 sentences actually retrieved)

Context Utilization = 4 / 4 = 1.0 (100%)
  └─ All relevant information was used

Completeness = |{"0a","0b","1a","1b"} ∩ {"0a","0b","1a","1b"}| / 4
             = 4 / 4 = 1.0 (100%)
  └─ All relevant info was covered

Adherence = All fully supported?
          = sentence_a (true) AND sentence_b (true) AND sentence_c (false)
          = false β†’ 0.0 (0%)
  └─ Contains hallucination about "powerful for image recognition"

Average Score = (0.20 + 1.0 + 1.0 + 0.0) / 4 = 0.55
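The final scores and their average, reproduced in Python:

```python
scores = {
    "context_relevance": min(1.0, 4 / 20),   # 0.20
    "context_utilization": min(1.0, 4 / 4),  # 1.0
    "completeness": 4 / 4,                   # 1.0
    "adherence": 0.0,                        # sentence "c" is not fully supported
}
average = sum(scores.values()) / len(scores)  # 0.55
```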

Key Insights

1. Complementary Metrics

| Metric       | Measures                     | Ideal Value   |
|--------------|------------------------------|---------------|
| Relevance    | Quality of retrieval         | High (0.7+)   |
| Utilization  | LLM uses available info      | High (0.7+)   |
| Completeness | Coverage of information      | High (0.7+)   |
| Adherence    | Grounding (no hallucination) | Perfect (1.0) |

2. Common Patterns

Pattern 1: Good Retrieval, Bad Generation

Relevance: 0.85 (good retrieval)
Utilization: 0.40 (not using it)
β†’ Problem: LLM not leveraging context
β†’ Fix: Improve prompt instructions

Pattern 2: Conservative but Accurate

Completeness: 0.50 (missing info)
Adherence: 1.0 (all correct)
β†’ Problem: Limited but grounded response
β†’ Fix: Improve retrieval coverage

Pattern 3: Comprehensive and Grounded

Relevance: 0.75, Utilization: 0.80, Completeness: 0.85, Adherence: 1.0
β†’ Excellent RAG system
β†’ Action: Monitor and maintain

3. Mathematical Relationships

Completeness ≀ Utilization
(Both share the denominator |Relevant|, and |Relevant ∩ Utilized| ≀ |Utilized|.
Relevance is not part of this chain: it uses a different denominator, so a low
Relevance can coexist with Utilization = 1.0, as in the complete example above.)

Also:
If no sentences are relevant β†’ Utilization = 0 (and Completeness = 0 unless
nothing was utilized either)
If Utilization = 0 β†’ Completeness = 0 (but Relevance can still be > 0)
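The Completeness ≀ Utilization bound can be spot-checked with random key sets (a quick property test, not part of the implementation):

```python
import random

random.seed(0)
keys = [f"{d}{c}" for d in range(3) for c in "abc"]  # "0a" ... "2c"
for _ in range(1000):
    relevant = set(random.sample(keys, random.randint(1, len(keys))))
    utilized = set(random.sample(keys, random.randint(0, len(keys))))
    utilization = min(1.0, len(utilized) / len(relevant))
    completeness = len(relevant & utilized) / len(relevant)
    assert completeness <= utilization + 1e-12
```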

Advantages of GPT Labeling

βœ… Semantic Understanding

  • Not just keyword matching
  • Understands meaning and context
  • Detects subtle hallucinations

βœ… Fine-Grained Analysis

  • Sentence-level support mapping
  • Identifies exactly which info is supported
  • Pinpoints problematic claims

βœ… Comprehensive

  • Evaluates all four TRACE metrics
  • Single pass through documents
  • Complete audit trail

βœ… Transparent

  • Full explanation for each metric
  • Shows supporting evidence
  • Human-verifiable results

Limitations

❌ Cost

  • API calls per evaluation (~2.5s per eval with rate limiting)
  • At 30 RPM: 50 evals = 3-5 minutes

❌ Semantic Brittleness

  • Depends on GPT's understanding
  • May miss implicit knowledge
  • Sensitive to phrasing

❌ Normalization

  • Context Relevance normalized by 20 (arbitrary baseline)
  • Different domain sizes affect scaling

❌ Binary Adherence

  • One hallucination = 0.0 adherence
  • No partial credit for mostly correct

Summary

The GPT labeling approach calculates TRACE metrics by:

  1. Splitting documents and response into sentences
  2. Analyzing with GPT to identify relevant/utilized/supported information
  3. Computing metrics from the labeled sentence keys:
    • Relevance: What's relevant in retrieved docs?
    • Utilization: What of that was actually used?
    • Completeness: What coverage of relevant info?
    • Adherence: Is all information grounded?

This enables precise, interpretable evaluation of RAG system quality.