# TRACE Metrics Calculation - Visual Guide

## Step-by-Step Visualization

### STEP 1: Sentencization

```
DOCUMENTS                          RESPONSE
═══════════════════════════════    ══════════════════════════════
Doc 0:                             "Machine learning is AI that learns
"ML is AI. It learns from data.    from data. Deep learning uses neural
Algorithms improve through time."  networks. It's powerful for images."

↓ Split by sentence ends          ↓ Split by sentence ends

0a: "ML is AI."                   a: "Machine learning is AI that
0b: "It learns from data."           learns from data."
0c: "Algorithms improve            b: "Deep learning uses neural
     through time."                   networks."
                                   c: "It's powerful for images."
```
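
The split above can be sketched with a naive regex sentencizer; the `0a`/`0b` key scheme is the one used throughout this guide, while the function names are illustrative:

```python
import re
import string

def sentencize(text: str) -> list[str]:
    """Naive split on ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def key_documents(docs: list[str]) -> dict[str, str]:
    """Label each document sentence 0a, 0b, ..., 1a, ... (assumes <= 26 sentences per doc)."""
    keyed = {}
    for doc_idx, doc in enumerate(docs):
        for sent_idx, sent in enumerate(sentencize(doc)):
            keyed[f"{doc_idx}{string.ascii_lowercase[sent_idx]}"] = sent
    return keyed

docs = ["ML is AI. It learns from data. Algorithms improve through time."]
print(key_documents(docs))
# {'0a': 'ML is AI.', '0b': 'It learns from data.', '0c': 'Algorithms improve through time.'}
```

A production pipeline would typically use a proper sentence tokenizer (e.g. spaCy or NLTK) rather than a regex, but the keying logic is the same.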

### STEP 2: GPT Analysis

```
GPT MODEL PROCESSES:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                             β”‚
β”‚  INPUT: Sentencized docs + response + question             β”‚
β”‚                                                             β”‚
β”‚  ANALYSIS:                                                  β”‚
β”‚  βœ“ Which doc sentences are relevant to question?           β”‚
β”‚  βœ“ Which doc sentences does response use?                  β”‚
β”‚  βœ“ Is each response sentence fully/partially supported?    β”‚
β”‚                                                             β”‚
β”‚  OUTPUT: JSON with sentence keys and support mappings      β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
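
The JSON contract in the box can be pinned down as types. The field names match the output shown in Step 3; the `TypedDict` class names themselves are illustrative:

```python
from typing import TypedDict

class SentenceSupport(TypedDict):
    """Per-response-sentence verdict from the labeling model."""
    response_sentence_key: str  # e.g. "a", "b", "c"
    fully_supported: bool       # True only if every claim is backed by the docs

class TraceLabels(TypedDict):
    """Top-level labeling output consumed by the metric calculation."""
    all_relevant_sentence_keys: list[str]  # doc sentence keys relevant to the question
    all_utilized_sentence_keys: list[str]  # doc sentence keys the response draws on
    sentence_support_information: list[SentenceSupport]

labels: TraceLabels = {
    "all_relevant_sentence_keys": ["0a", "0b"],
    "all_utilized_sentence_keys": ["0a", "0b"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
```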

### STEP 3: Metric Calculation

```
GPT OUTPUT (SIMPLIFIED):
{
  "all_relevant_sentence_keys": ["0a", "0b"],
  "all_utilized_sentence_keys": ["0a", "0b"],
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": true},
    {"response_sentence_key": "c", "fully_supported": false}
  ]
}

                    ↓

METRIC CALCULATION:
β”œβ”€ Context Relevance = |relevant| / 20 = 2/20 = 0.10
β”œβ”€ Context Utilization = |utilized| / |relevant| = 2/2 = 1.0
β”œβ”€ Completeness = |relevant ∩ utilized| / |relevant| = 2/2 = 1.0
└─ Adherence = all_fully_supported? = false β†’ 0.0
```
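
The calculation above in code, using this guide's fixed ~20-sentence denominator for relevance:

```python
relevant = {"0a", "0b"}            # all_relevant_sentence_keys
utilized = {"0a", "0b"}            # all_utilized_sentence_keys
support = [True, True, False]      # fully_supported flags for a, b, c
TOTAL_SENTENCES = 20               # fixed denominator used in this guide

context_relevance = len(relevant) / TOTAL_SENTENCES       # 0.10
context_utilization = len(utilized) / len(relevant)       # 1.0
completeness = len(relevant & utilized) / len(relevant)   # 1.0
adherence = 1.0 if all(support) else 0.0                  # 0.0
```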

---

## Metric Formulas with Venn Diagrams

### Context Relevance (R)

```
ALL RETRIEVED SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          β”‚
β”‚  Total: ~20 sentences    β”‚
β”‚                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ RELEVANT:        β”‚    β”‚
β”‚  β”‚ ["0a", "0b"]     β”‚    β”‚
β”‚  β”‚ Count: 2         β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                          β”‚
β”‚  Irrelevant: 18          β”‚
β”‚                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: R = 2 / 20 = 0.10 (10%)
Interpretation: 10% of retrieved content is relevant to question
```

### Context Utilization (U)

```
RELEVANT SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RELEVANT: ["0a", "0b"]   β”‚
β”‚                          β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚ β”‚ UTILIZED:          β”‚   β”‚
β”‚ β”‚ ["0a", "0b"]       β”‚   β”‚
β”‚ β”‚ Count: 2           β”‚   β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                          β”‚
β”‚ NOT USED: 0              β”‚
β”‚                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: U = 2 / 2 = 1.0 (100%)
Interpretation: All relevant information was used
```

### Completeness (C)

```
        RELEVANT              UTILIZED
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ ["0a", "0b"] β”‚      β”‚ ["0a", "0b"] β”‚
   β”‚              β”‚      β”‚              β”‚
   β”‚   COUNT: 2   β”‚      β”‚   COUNT: 2   β”‚
   β”‚              β”‚      β”‚              β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                    β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
            OVERLAP: ["0a", "0b"]
            COUNT: 2

Formula: C = 2 / 2 = 1.0 (100%)
Interpretation: All relevant info is in response
```

### Adherence (A)

```
RESPONSE SENTENCES:            SUPPORT STATUS:
┌──────────────────┐           ┌──────────────────┐
│ a: "ML is AI..." │ ────────→ │ ✓ Fully          │
│                  │           │   Supported      │
│ b: "Deep..."     │ ────────→ │ ✓ Fully          │
│                  │           │   Supported      │
│ c: "Powerful..." │ ────────→ │ ✗ Not            │
│                  │           │   Supported      │
└──────────────────┘           └──────────────────┘

Formula: A = (all_supported) ? 1.0 : 0.0
       = (true AND true AND false) ? 1.0 : 0.0
       = 0.0 (a single unsupported sentence zeroes the metric)

Interpretation: Response contains hallucination (adherence fails)
```
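
The all-or-nothing behavior of the formula as a small helper; the function name is illustrative:

```python
def adherence(support_flags: list[bool]) -> float:
    """Binary adherence: 1.0 only if every response sentence is fully supported."""
    if not support_flags:  # no response sentences: treated as adherent
        return 1.0
    return 1.0 if all(support_flags) else 0.0

assert adherence([True, True, True]) == 1.0
assert adherence([True, True, False]) == 0.0  # one failure zeroes the metric
```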

---

## Complete Example Walkthrough

### Input

```
QUESTION:
"What makes machine learning different from traditional programming?"

RETRIEVED DOCUMENTS:
0: "Machine learning is a subset of AI. It learns patterns from data.
    Traditional programming requires explicit instructions."
1: "ML algorithms improve through experience. They adapt to new data.
    Rule-based systems are rigid and hard to maintain."

LLM RESPONSE:
"Machine learning differs because it learns from data rather than 
requiring explicit instructions. ML algorithms improve over time.
It's the future of all computing."
```

### Step 1: Sentencization

```
DOCUMENTS:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Traditional programming requires explicit instructions."
1a: "ML algorithms improve through experience."
1b: "They adapt to new data."
1c: "Rule-based systems are rigid and hard to maintain."

RESPONSE:
a: "Machine learning differs because it learns from data rather than
    requiring explicit instructions."
b: "ML algorithms improve over time."
c: "It's the future of all computing."
```

### Step 2: GPT Labeling

```
ANALYSIS BY GPT:

Question focus: Differences between ML and traditional programming
└─ "learns from data" vs "explicit instructions"
└─ "improves through experience"
└─ Adaptability

RELEVANT SENTENCES (to question):
β”œβ”€ 0a: "subset of AI" β†’ Partially relevant
β”œβ”€ 0b: "learns patterns from data" β†’ RELEVANT βœ“
β”œβ”€ 0c: "requires explicit instructions" β†’ RELEVANT βœ“
β”œβ”€ 1a: "improve through experience" β†’ RELEVANT βœ“
β”œβ”€ 1b: "adapt to new data" β†’ RELEVANT βœ“
└─ 1c: "rule-based systems rigid" β†’ Partially relevant

UTILIZED SENTENCES (used in response):
β”œβ”€ response_a uses: 0b, 0c β†’ Document references: [0b, 0c]
β”œβ”€ response_b uses: 1a β†’ Document references: [1a]
└─ response_c uses: NONE β†’ No support β†’ [hallucination]

FULLY SUPPORTED CHECK:
β”œβ”€ response_a "learns from data, not explicit" β†’ Supported by 0b, 0c βœ“
β”œβ”€ response_b "algorithms improve" β†’ Supported by 1a βœ“
└─ response_c "future of all computing" β†’ NOT in documents βœ—
```

### Step 3: Metric Calculation

```
EXTRACTED DATA:
all_relevant_sentence_keys = ["0b", "0c", "1a", "1b"]  (4 sentences)
all_utilized_sentence_keys = ["0b", "0c", "1a"]        (3 sentences)
sentence_support_information = [
  {key: "a", fully_supported: true},
  {key: "b", fully_supported: true},
  {key: "c", fully_supported: false}
]

CALCULATIONS:

1. Context Relevance
   = |relevant| / 20   (fixed 20-sentence denominator)
   = 4 / 20
   = 0.20 (20%)
   
2. Context Utilization
   = |utilized| / |relevant|
   = 3 / 4
   = 0.75 (75%)
   
3. Completeness
   = |relevant ∩ utilized| / |relevant|
   = |{0b, 0c, 1a}| / |{0b, 0c, 1a, 1b}|
   = 3 / 4
   = 0.75 (75%)
   
4. Adherence
   = all fully_supported?
   = true AND true AND false
   = FALSE β†’ 0.0 (0%)
```

### Results

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ TRACE METRICS RESULTS                   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Context Relevance:  0.20 (20%)         β”‚
β”‚ Context Utilization: 0.75 (75%)        β”‚
β”‚ Completeness:       0.75 (75%)         β”‚
β”‚ Adherence:          0.0  (0%)          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Average:            0.425 (42.5%)      β”‚
β”‚ RMSE Aggregation:   0.437               β”‚
β”‚ Consistency Score:  0.563               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

INTERPRETATION:
βœ“ Good relevance targeting (20%)
βœ“ Decent information usage (75%)
βœ“ Good coverage of relevant info (75%)
βœ— Contains hallucination (0% adherence)

ACTION: Address the hallucination about "future of all computing"
```

---

## Calculation Pseudocode

```python
# INPUT: GPT labeled output
gpt_labels = {
    "all_relevant_sentence_keys": [...],
    "all_utilized_sentence_keys": [...],
    "sentence_support_information": [...]
}

# METRIC 1: Context Relevance
# (relevant sentence count over a fixed denominator of 20, capped at 1.0)
relevant_count = len(gpt_labels["all_relevant_sentence_keys"])
context_relevance = min(1.0, relevant_count / 20.0)

# METRIC 2: Context Utilization
utilized_count = len(gpt_labels["all_utilized_sentence_keys"])
if relevant_count == 0:
    context_utilization = 0.0
else:
    context_utilization = min(1.0, utilized_count / relevant_count)

# METRIC 3: Completeness
relevant_set = set(gpt_labels["all_relevant_sentence_keys"])
utilized_set = set(gpt_labels["all_utilized_sentence_keys"])
overlap_count = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
    completeness = 1.0 if len(utilized_set) == 0 else 0.0
else:
    completeness = overlap_count / len(relevant_set)

# METRIC 4: Adherence
fully_supported_count = sum(
    1 for sentence in gpt_labels["sentence_support_information"]
    if sentence["fully_supported"]
)
total_sentences = len(gpt_labels["sentence_support_information"])
if total_sentences == 0:
    adherence = 1.0
else:
    adherence = 1.0 if fully_supported_count == total_sentences else 0.0

# OUTPUT
scores = {
    "context_relevance": context_relevance,
    "context_utilization": context_utilization,
    "completeness": completeness,
    "adherence": adherence,
    "average": (context_relevance + context_utilization + 
               completeness + adherence) / 4
}
```
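
Wrapping the pseudocode in a function and feeding it the walkthrough's labels reproduces the results table (the function name and `total_sentences` parameter are illustrative):

```python
def trace_scores(gpt_labels: dict, total_sentences: int = 20) -> dict:
    """Compute the four TRACE metrics from a labeled output dict."""
    relevant = set(gpt_labels["all_relevant_sentence_keys"])
    utilized = set(gpt_labels["all_utilized_sentence_keys"])
    support = gpt_labels["sentence_support_information"]

    context_relevance = min(1.0, len(relevant) / total_sentences)
    context_utilization = min(1.0, len(utilized) / len(relevant)) if relevant else 0.0
    if relevant:
        completeness = len(relevant & utilized) / len(relevant)
    else:
        completeness = 1.0 if not utilized else 0.0
    # all() on an empty list is True, so zero response sentences count as adherent
    adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0

    scores = {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "completeness": completeness,
        "adherence": adherence,
    }
    scores["average"] = sum(scores.values()) / 4
    return scores

walkthrough_labels = {
    "all_relevant_sentence_keys": ["0b", "0c", "1a", "1b"],
    "all_utilized_sentence_keys": ["0b", "0c", "1a"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
scores = trace_scores(walkthrough_labels)
# R = 0.20, U = 0.75, C = 0.75, A = 0.0, average = 0.425
```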

---

## Key Takeaways

### 1. Each Metric Answers a Different Question

| Metric | Question | Data Source |
|--------|----------|-------------|
| **R** | Is retrieval good? | Relevant sentences |
| **U** | Does LLM use it? | Utilized sentences |
| **C** | Is response comprehensive? | Overlap |
| **A** | Is response truthful? | Support flags |

### 2. Metrics Are Independent

- Low R, high U is possible (the LLM ignores the irrelevant context)
- High R, low U is possible (retrieval is good, generation underuses it)
- Low C, high A is possible (limited coverage, but everything stated is correct)
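
The first two bullets in numbers (illustrative counts, using this guide's fixed 20-sentence denominator):

```python
TOTAL = 20  # fixed relevance denominator used in this guide

# Low R, high U: only 1 of 20 retrieved sentences is relevant,
# but the response uses that one sentence.
r_low, u_high = 1 / TOTAL, 1 / 1    # R = 0.05, U = 1.0

# High R, low U: 10 relevant sentences retrieved, response uses only 2.
r_high, u_low = 10 / TOTAL, 2 / 10  # R = 0.50, U = 0.20

assert r_low < r_high and u_low < u_high  # the two metrics move independently
```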

### 3. GPT Labeling is Sentence-Level

- Fine-grained sentence keys (0a, 0b, 1c, etc.)
- Exact mapping of support
- Transparent and verifiable

### 4. All Four Metrics Required for Full Picture

```
Relevance:    ← "Did we retrieve the right docs?"
Utilization:  ← "Did the LLM use them?"
Completeness: ← "Did it cover the information?"
Adherence:    ← "Is it accurate?"
```

All four needed to understand RAG quality.