# GPT Labeling Approach - Main Metrics Calculation

## Overview

The RAG evaluation system uses **GPT-based labeling** to calculate four TRACE metrics. The GPT model analyzes responses sentence by sentence and identifies which parts are supported by the retrieved documents, enabling precise metric calculation.

---

## The Four TRACE Metrics

### 1. **Context Relevance (R)** - What's Actually Relevant?

**Definition:** Fraction of the retrieved context that is relevant to answering the user's question.

**Calculation:**

```
Context Relevance = Number of relevant sentences / Total retrieved sentences
Implementation normalizes the count against a fixed baseline: min(1.0, count / 20)
```

**Formula:**

```
R = |Relevant Sentences| / |Total Sentences|
```

**Data Source from GPT:**

```python
gpt_labels.all_relevant_sentence_keys
# List of document sentence keys identified as relevant
# Example: ["0a", "0b", "1c", "2a"]
```

**Example:**

```
Retrieved 30 sentences total
GPT identifies 12 as relevant to the question "What is machine learning?"

Context Relevance = 12/30 = 0.40 (40%)
```

(This example follows the definition-level formula; the implementation's fixed baseline would instead give min(1.0, 12/20) = 0.60.)

**What It Tells You:**
- ✓ How good was the retrieval?
- ✓ Did we pull documents about the right topic?
- ✓ Are there irrelevant documents in the results?

---

### 2. **Context Utilization (T)** - How Much Was Used?

**Definition:** Fraction of the relevant context that the response actually used to generate its answer.

**Calculation:**

```
Context Utilization = Number of utilized sentences / Number of relevant sentences
```

**Formula:**

```
U = |Utilized Sentences| / |Relevant Sentences|
```

**Data Source from GPT:**

```python
gpt_labels.all_utilized_sentence_keys
# List of document sentence keys actually used in the response
# Example: ["0a", "0b", "1c"]
```

**Example:**

```
Context Relevance found 12 relevant sentences: ["0a", "0b", "1a", "1c", "2a", ...]
GPT identifies 8 actually used in the response: ["0a", "0b", "1c", "2a", ...]

Context Utilization = 8/12 = 0.67 (67%)
```

**What It Tells You:**
- ✓ Did the LLM actually use the available information?
- ✓ Is the response limited by context availability?
- ✓ Is context being ignored/wasted?

**Problem Pattern:**
- High Relevance (0.9) + Low Utilization (0.3)
  → Retrieval is good, but the LLM isn't using it
  → Fix: Improve prompt instructions

---

### 3. **Completeness (C)** - Was It Comprehensive?

**Definition:** Fraction of the relevant information that is covered by the response.

**Calculation:**

```
Completeness = |Relevant ∩ Utilized| / |Relevant|
             = Relevant sentences that were used / All relevant sentences
```

Note: Completeness differs from Utilization only when the utilized list contains keys that are not also in the relevant list; if every utilized sentence is relevant, the two metrics are equal.

**Formula:**

```
C = |Relevant ∩ Utilized| / |Relevant|
```

**Data Source from GPT:**

```python
# Set intersection:
relevant_set = set(gpt_labels.all_relevant_sentence_keys)
utilized_set = set(gpt_labels.all_utilized_sentence_keys)
intersection = relevant_set & utilized_set
completeness = len(intersection) / len(relevant_set)
```

**Example:**

```
Relevant sentences: {"0a", "0b", "1a", "1c", "2a", "2b", "3a"}  (7 total)
Utilized sentences: {"0a", "0b", "1c", "2a"}  (4 used)

Overlap (Relevant AND Used): {"0a", "0b", "1c", "2a"}  (4 in both)

Completeness = 4/7 = 0.57 (57%)
Missing: "1a", "2b", "3a" were relevant but not mentioned
```

**What It Tells You:**
- ✓ Did the response cover all important information?
- ✓ What relevant details were omitted?
- ✓ Is the response comprehensive?

**Problem Pattern:**
- Low Completeness (0.4) with High Adherence (1.0)
  → Response is accurate but limited
  → Missing important information
  → Fix: Improve retrieval coverage or summarization

---

### 4. **Adherence (A)** - Was It Grounded?

**Definition:** Whether the response is fully grounded in the retrieved context (no hallucinations).
**Calculation:**

```
Adherence = 1.0 if ALL sentences are fully supported, 0.0 otherwise
(Boolean: fully grounded or contains a hallucination)
```

**Formula:**

```
A = 1.0 if all(s.fully_supported for s in sentences) else 0.0
```

**Data Source from GPT:**

```python
gpt_labels.sentence_support_information
# For each response sentence:
# {
#   "response_sentence_key": "a",
#   "fully_supported": true/false,   # ← This determines adherence
#   "supporting_sentence_keys": ["0a", "0b"],
#   "explanation": "..."
# }

fully_supported_count = sum(
    1 for s in sentence_support_information
    if s.get("fully_supported", False)
)
adherence = 1.0 if fully_supported_count == total_sentences else 0.0
```

**Example 1 - Perfect Adherence:**

```
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It uses algorithms to learn from data."
   └─ Fully supported by document 1b ✓
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓

ALL sentences fully supported → Adherence = 1.0 (100%)
```

**Example 2 - Contains Hallucination:**

```
Response sentences:
a. "Machine learning is a subset of AI."
   └─ Fully supported by document 0a ✓
b. "It requires quantum computers."
   └─ NOT supported by any document ✗ (Hallucination!)
c. "Common applications include image recognition."
   └─ Fully supported by documents 2a, 2b ✓

ONE sentence NOT fully supported → Adherence = 0.0 (0%)
```

**What It Tells You:**
- ✓ Is the response truthful/grounded?
- ✓ Does it contain hallucinations?
- ✓ Can we trust the answer?

---

## How GPT Labeling Works

### Step 1: Sentencization

**Documents:**

```
Document 0: "Machine learning is AI. It learns from data."
Document 1: "Neural networks are models. They mimic brains."

↓ Splits into sentences with keys:

0a: "Machine learning is AI."
0b: "It learns from data."
1a: "Neural networks are models."
1b: "They mimic brains."
```

**Response:**

```
"Machine learning uses neural networks. They learn patterns."
↓ Splits into sentences:

a: "Machine learning uses neural networks."
b: "They learn patterns."
```

### Step 2: GPT Labeling

GPT analyzes the inputs and identifies:

1. **Relevance:** Which document sentences are relevant to the question?
2. **Utilization:** Which document sentences were actually used in the response?
3. **Support:** Is each response sentence fully/partially/not supported?

**GPT Output (JSON):**

```json
{
  "relevance_explanation": "Document discusses ML basics...",
  "all_relevant_sentence_keys": ["0a", "0b", "1a"],
  "overall_supported_explanation": "Response is grounded...",
  "overall_supported": true,
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "explanation": "Matches document sentences...",
      "supporting_sentence_keys": ["0a", "1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "explanation": "Partially supported...",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false
    }
  ],
  "all_utilized_sentence_keys": ["0a", "1a", "1b"]
}
```

### Step 3: Metric Calculation

```
Relevant: ["0a", "0b", "1a"]  (3 sentences)
Utilized: ["0a", "1a", "1b"]  (3 sentences)

Context Relevance   = min(1.0, |Relevant| / 20) = 3 / 20 = 0.15

Context Utilization = |Utilized| / |Relevant| = 3 / 3 = 1.0

Completeness = |Relevant ∩ Utilized| / |Relevant|
             = |{"0a", "1a"}| / |{"0a", "0b", "1a"}|
             = 2 / 3 = 0.67

Adherence = All sentences fully supported?
→ Check each = sentence_a (true) AND sentence_b (false)
             = false → 0.0
```

---

## Code Implementation

### Metric Calculation Methods

```python
from typing import Optional

def _compute_context_relevance(self, gpt_labels: GPTLabelingOutput) -> float:
    """Count relevant sentences, normalized to 0-1 against a fixed baseline of 20."""
    if not gpt_labels.all_relevant_sentence_keys:
        return 0.0
    return min(1.0, len(gpt_labels.all_relevant_sentence_keys) / 20.0)

def _compute_context_utilization(self, gpt_labels: GPTLabelingOutput) -> float:
    """Utilized / Relevant."""
    relevant_count = len(gpt_labels.all_relevant_sentence_keys)
    utilized_count = len(gpt_labels.all_utilized_sentence_keys)
    if relevant_count == 0:
        return 0.0
    return min(1.0, utilized_count / relevant_count)

def _compute_completeness(self, gpt_labels: GPTLabelingOutput,
                          ground_truth: Optional[str] = None) -> float:
    """(Relevant AND Utilized) / Relevant."""
    relevant_set = set(gpt_labels.all_relevant_sentence_keys)
    utilized_set = set(gpt_labels.all_utilized_sentence_keys)
    intersection = len(relevant_set & utilized_set)
    if len(relevant_set) == 0:
        return 1.0 if len(utilized_set) == 0 else 0.0
    return intersection / len(relevant_set)

def _compute_adherence(self, gpt_labels: GPTLabelingOutput) -> float:
    """All sentences fully supported? Boolean: 1.0 or 0.0."""
    total_sentences = len(gpt_labels.sentence_support_information)
    if total_sentences == 0:
        return 1.0
    fully_supported_count = sum(
        1 for s in gpt_labels.sentence_support_information
        if s.get("fully_supported", False)
    )
    return 1.0 if fully_supported_count == total_sentences else 0.0
```

---

## Complete Example: Full Calculation

### Input

**Question:** "What is machine learning?"

**Retrieved Documents:**

```
Doc 0: "Machine learning is a subset of AI. It learns patterns from data. Algorithms improve through experience."
Doc 1: "Deep learning uses neural networks. It's popular in computer vision."
Doc 2: "Supervised learning needs labeled data. Unsupervised learning finds patterns."
```

**LLM Response:**

```
"Machine learning is a field of AI that learns from data. Deep learning uses neural networks. It's powerful for image recognition."
```

### GPT Labeling Process

**Sentencized Documents:**

```
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Algorithms improve through experience."
1a: "Deep learning uses neural networks."
1b: "It's popular in computer vision."
2a: "Supervised learning needs labeled data."
2b: "Unsupervised learning finds patterns."
```

**Sentencized Response:**

```
a: "Machine learning is a field of AI that learns from data."
b: "Deep learning uses neural networks."
c: "It's powerful for image recognition."
```

**GPT Analysis:**

```json
{
  "all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
  "all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
  "sentence_support_information": [
    {
      "response_sentence_key": "a",
      "supporting_sentence_keys": ["0a", "0b"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "b",
      "supporting_sentence_keys": ["1a"],
      "fully_supported": true
    },
    {
      "response_sentence_key": "c",
      "supporting_sentence_keys": ["1b"],
      "fully_supported": false  // "powerful for image recognition" not explicitly in docs
    }
  ]
}
```

### Metric Calculation

```
Relevant: ["0a", "0b", "1a", "1b"]  (4 sentences)
Utilized: ["0a", "0b", "1a", "1b"]  (4 sentences)
Total sentences retrieved: 7

Context Relevance = min(1.0, 4 / 20) = 0.20 (20%)
└─ Normalized against the fixed 20-sentence baseline, not the 7 actually retrieved

Context Utilization = 4 / 4 = 1.0 (100%)
└─ All relevant information was used

Completeness = |{"0a","0b","1a","1b"} ∩ {"0a","0b","1a","1b"}| / 4
             = 4 / 4 = 1.0 (100%)
└─ All relevant info was covered

Adherence = All fully supported?
          = sentence_a (true) AND sentence_b (true) AND sentence_c (false)
          = false → 0.0 (0%)
└─ Contains an unsupported claim: "powerful for image recognition"

Average Score = (0.20 + 1.0 + 1.0 + 0.0) / 4 = 0.55
```

---

## Key Insights
### 1. Complementary Metrics

| Metric | Measures | Ideal Value |
|--------|----------|-------------|
| **Relevance** | Quality of retrieval | High (0.7+) |
| **Utilization** | LLM uses available info | High (0.7+) |
| **Completeness** | Coverage of information | High (0.7+) |
| **Adherence** | Grounding (no hallucination) | Perfect (1.0) |

### 2. Common Patterns

**Pattern 1: Good Retrieval, Bad Generation**

```
Relevance: 0.85 (good retrieval)
Utilization: 0.40 (not using it)
→ Problem: LLM not leveraging context
→ Fix: Improve prompt instructions
```

**Pattern 2: Conservative but Accurate**

```
Completeness: 0.50 (missing info)
Adherence: 1.0 (all correct)
→ Problem: Limited but grounded response
→ Fix: Improve retrieval coverage
```

**Pattern 3: Comprehensive and Grounded**

```
Relevance: 0.75, Utilization: 0.80, Completeness: 0.85, Adherence: 1.0
→ Excellent RAG system
→ Action: Monitor and maintain
```

### 3. Mathematical Relationships

```
Completeness ≤ Utilization
(Because |Relevant ∩ Utilized| ≤ |Utilized|; they are equal when every
utilized sentence is also relevant)

Relevance uses a different denominator (the fixed baseline of 20), so it is
not directly comparable to the other two ratios: the Step 3 example above
has Utilization = 1.0 with Relevance = 0.15.

Also:
If Relevance = 0 → Utilization = 0, Completeness = 0
If Utilization = 0 → Completeness = 0 (but Relevance can be > 0)
```

---

## Advantages of GPT Labeling

✅ **Semantic Understanding**
- Not just keyword matching
- Understands meaning and context
- Detects subtle hallucinations

✅ **Fine-Grained Analysis**
- Sentence-level support mapping
- Identifies exactly which info is supported
- Pinpoints problematic claims

✅ **Comprehensive**
- Evaluates all four TRACE metrics
- Single pass through documents
- Complete audit trail

✅ **Transparent**
- Full explanation for each metric
- Shows supporting evidence
- Human-verifiable results

---

## Limitations

❌ **Cost**
- API calls per evaluation (~2.5 s per eval with rate limiting)
- At 30 RPM: 50 evals = 3-5 minutes

❌ **Semantic Brittleness**
- Depends on GPT's understanding
- May miss implicit knowledge
- Sensitive to phrasing

❌ **Normalization**
- Context Relevance is normalized by a fixed 20 (an arbitrary baseline)
- Different domain sizes affect scaling

❌ **Binary Adherence**
- One hallucination = 0.0 adherence
- No partial credit for mostly correct responses

---

## Summary

The GPT labeling approach calculates TRACE metrics by:

1. **Splitting** documents and the response into sentences
2. **Analyzing** with GPT to identify relevant/utilized/supported information
3. **Computing** metrics from the labeled sentence keys:
   - **Relevance**: What's relevant in the retrieved docs?
   - **Utilization**: How much of that was actually used?
   - **Completeness**: How much of the relevant info was covered?
   - **Adherence**: Is all information grounded?

This enables precise, interpretable evaluation of RAG system quality.
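The whole pipeline can be exercised end to end. Below is a minimal, self-contained sketch that recomputes the four metrics from a GPT label payload shaped like the JSON shown earlier; `compute_trace_metrics` and the `labels` dict are illustrative stand-ins for the document's `GPTLabelingOutput`-based class methods, not the actual implementation.

```python
# Sketch: recompute the four TRACE metrics from a GPT label payload.
# Assumes the payload is a plain dict mirroring the JSON shape shown above.

def compute_trace_metrics(labels: dict, baseline: int = 20) -> dict:
    relevant = set(labels.get("all_relevant_sentence_keys", []))
    utilized = set(labels.get("all_utilized_sentence_keys", []))
    support = labels.get("sentence_support_information", [])

    # Relevance: count of relevant keys against the fixed 20-sentence baseline
    relevance = min(1.0, len(relevant) / baseline) if relevant else 0.0
    # Utilization: utilized / relevant, capped at 1.0
    utilization = min(1.0, len(utilized) / len(relevant)) if relevant else 0.0
    # Completeness: |relevant ∩ utilized| / |relevant|
    if relevant:
        completeness = len(relevant & utilized) / len(relevant)
    else:
        completeness = 1.0 if not utilized else 0.0
    # Adherence: boolean — every response sentence must be fully supported
    adherence = 1.0 if all(s.get("fully_supported", False) for s in support) else 0.0

    return {
        "context_relevance": relevance,
        "context_utilization": utilization,
        "completeness": completeness,
        "adherence": adherence,
    }

# The worked "Complete Example" from above:
labels = {
    "all_relevant_sentence_keys": ["0a", "0b", "1a", "1b"],
    "all_utilized_sentence_keys": ["0a", "0b", "1a", "1b"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
metrics = compute_trace_metrics(labels)
print(metrics)  # relevance 0.2, utilization 1.0, completeness 1.0, adherence 0.0
```

This reproduces the worked example's scores (0.20, 1.0, 1.0, 0.0), including the fixed-baseline relevance normalization and the all-or-nothing adherence check.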