# TRACE Metrics Calculation - Visual Guide
## Step-by-Step Visualization
### STEP 1: Sentencization
```
DOCUMENTS                               RESPONSE
═══════════════════════════════        ══════════════════════════════
Doc 0:                                 "Machine learning is AI that
"ML is AI. It learns from data.        learns from data. Deep learning
Algorithms improve through time."      uses neural networks. It's
                                       powerful for images."

   ↓ Split by sentence ends               ↓ Split by sentence ends

0a: "ML is AI."                        a: "Machine learning is AI that
0b: "It learns from data."                learns from data."
0c: "Algorithms improve                b: "Deep learning uses neural
    through time."                        networks."
                                       c: "It's powerful for images."
```
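This splitting step can be approximated with a naive punctuation-based splitter. A minimal sketch (the names `sentencize` and `key_documents` are illustrative, and a production sentencizer would likely use a proper NLP library to handle abbreviations and decimals):

```python
import re
import string

def sentencize(text: str) -> list[str]:
    # Naive splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def key_documents(docs: list[str]) -> dict[str, str]:
    # Assign keys like "0a", "0b": document index + sentence letter.
    keyed = {}
    for i, doc in enumerate(docs):
        for j, sent in enumerate(sentencize(doc)):
            keyed[f"{i}{string.ascii_lowercase[j]}"] = sent
    return keyed

docs = ["ML is AI. It learns from data. Algorithms improve through time."]
print(key_documents(docs))
# {'0a': 'ML is AI.', '0b': 'It learns from data.', '0c': 'Algorithms improve through time.'}
```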
### STEP 2: GPT Analysis
```
GPT MODEL PROCESSES:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ INPUT: Sentencized docs + response + question              β”‚
β”‚                                                            β”‚
β”‚ ANALYSIS:                                                  β”‚
β”‚   βœ“ Which doc sentences are relevant to the question?      β”‚
β”‚   βœ“ Which doc sentences does the response use?             β”‚
β”‚   βœ“ Is each response sentence fully/partially supported?   β”‚
β”‚                                                            β”‚
β”‚ OUTPUT: JSON with sentence keys and support mappings       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
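In code, this step amounts to building a labeling prompt and parsing the model's JSON reply. A sketch under assumptions: the prompt layout below is hypothetical (the project's real template may differ), and `build_label_prompt` / `parse_labels` are illustrative names:

```python
import json

def build_label_prompt(question: str, keyed_docs: dict, keyed_response: dict) -> str:
    # Hypothetical prompt layout: one "[key] sentence" line per sentence.
    doc_lines = "\n".join(f"[{k}] {s}" for k, s in keyed_docs.items())
    resp_lines = "\n".join(f"[{k}] {s}" for k, s in keyed_response.items())
    return (
        f"Question: {question}\n\n"
        f"Document sentences:\n{doc_lines}\n\n"
        f"Response sentences:\n{resp_lines}\n\n"
        "Return JSON with keys: all_relevant_sentence_keys, "
        "all_utilized_sentence_keys, sentence_support_information."
    )

def parse_labels(raw: str) -> dict:
    # The model is instructed to emit strict JSON, so json.loads is enough;
    # missing keys default to empty lists to keep the metric code safe.
    labels = json.loads(raw)
    for key in ("all_relevant_sentence_keys",
                "all_utilized_sentence_keys",
                "sentence_support_information"):
        labels.setdefault(key, [])
    return labels
```

Defaulting absent keys to empty lists means a malformed-but-parseable reply degrades to zero scores instead of crashing the metric calculation.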
### STEP 3: Metric Calculation
```
GPT OUTPUT (SIMPLIFIED):
{
"all_relevant_sentence_keys": ["0a", "0b"],
"all_utilized_sentence_keys": ["0a", "0b"],
"sentence_support_information": [
{"response_sentence_key": "a", "fully_supported": true},
{"response_sentence_key": "b", "fully_supported": true},
{"response_sentence_key": "c", "fully_supported": false}
]
}
↓
METRIC CALCULATION:
β”œβ”€ Context Relevance = |relevant| / 20 = 2/20 = 0.10
β”œβ”€ Context Utilization = |utilized| / |relevant| = 2/2 = 1.0
β”œβ”€ Completeness = |relevant ∩ utilized| / |relevant| = 2/2 = 1.0
└─ Adherence = all_fully_supported? = false β†’ 0.0
```
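The arithmetic above can be checked directly. A minimal sketch, assuming the fixed denominator of 20 retrieved sentences used throughout this example:

```python
# Simplified GPT output from Step 3 above.
gpt_output = {
    "all_relevant_sentence_keys": ["0a", "0b"],
    "all_utilized_sentence_keys": ["0a", "0b"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}

relevant = set(gpt_output["all_relevant_sentence_keys"])
utilized = set(gpt_output["all_utilized_sentence_keys"])
support = gpt_output["sentence_support_information"]

context_relevance = len(relevant) / 20                   # 2/20 = 0.10
context_utilization = len(utilized) / len(relevant)      # 2/2  = 1.0
completeness = len(relevant & utilized) / len(relevant)  # 2/2  = 1.0
# Adherence is binary: one unsupported sentence drops it to 0.0.
adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0
```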
---
## Metric Formulas with Venn Diagrams
### Context Relevance (R)
```
ALL RETRIEVED SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Total: ~20 sentences       β”‚
β”‚                            β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚   β”‚ RELEVANT:        β”‚    β”‚
β”‚   β”‚ ["0a", "0b"]     β”‚    β”‚
β”‚   β”‚ Count: 2         β”‚    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                            β”‚
β”‚ Irrelevant: 18             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: R = 2 / 20 = 0.10 (10%)
Interpretation: 10% of retrieved content is relevant to the question
```
### Context Utilization (U)
```
RELEVANT SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RELEVANT: ["0a", "0b"]     β”‚
β”‚                            β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚   β”‚ UTILIZED:        β”‚    β”‚
β”‚   β”‚ ["0a", "0b"]     β”‚    β”‚
β”‚   β”‚ Count: 2         β”‚    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                            β”‚
β”‚ NOT USED: 0                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Formula: U = 2 / 2 = 1.0 (100%)
Interpretation: All relevant information was used
```
### Completeness (C)
```
    RELEVANT            UTILIZED
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ ["0a", "0b"] β”‚    β”‚ ["0a", "0b"] β”‚
β”‚              β”‚    β”‚              β”‚
β”‚ COUNT: 2     β”‚    β”‚ COUNT: 2     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚                   β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
        OVERLAP: ["0a", "0b"]
        COUNT: 2

Formula: C = 2 / 2 = 1.0 (100%)
Interpretation: All relevant info appears in the response
```
### Adherence (A)
```
RESPONSE SENTENCES:           SUPPORT STATUS:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a: "ML is AI..."  β”‚ ──────→ β”‚ βœ“ Fully        β”‚
β”‚                   β”‚         β”‚   supported    β”‚
β”‚ b: "Deep..."      β”‚ ──────→ β”‚ βœ“ Fully        β”‚
β”‚                   β”‚         β”‚   supported    β”‚
β”‚ c: "Powerful..."  β”‚ ──────→ β”‚ βœ— Not          β”‚
β”‚                   β”‚         β”‚   supported    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: A = all_fully_supported ? 1.0 : 0.0
           = (true AND true AND false) ? 1.0 : 0.0
           = 0.0  (one unsupported sentence zeroes the score)
Interpretation: Response contains a hallucination, so adherence fails
```
---
## Complete Example Walkthrough
### Input
```
QUESTION:
"What makes machine learning different from traditional programming?"
RETRIEVED DOCUMENTS:
0: "Machine learning is a subset of AI. It learns patterns from data.
Traditional programming requires explicit instructions."
1: "ML algorithms improve through experience. They adapt to new data.
Rule-based systems are rigid and hard to maintain."
LLM RESPONSE:
"Machine learning differs because it learns from data rather than
requiring explicit instructions. ML algorithms improve over time.
It's the future of all computing."
```
### Step 1: Sentencization
```
DOCUMENTS:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Traditional programming requires explicit instructions."
1a: "ML algorithms improve through experience."
1b: "They adapt to new data."
1c: "Rule-based systems are rigid and hard to maintain."
RESPONSE:
a: "Machine learning differs because it learns from data rather than
requiring explicit instructions."
b: "ML algorithms improve over time."
c: "It's the future of all computing."
```
### Step 2: GPT Labeling
```
ANALYSIS BY GPT:
Question focus: Differences between ML and traditional programming
β”œβ”€ "learns from data" vs "explicit instructions"
β”œβ”€ "improves through experience"
└─ Adaptability
RELEVANT SENTENCES (to question):
β”œβ”€ 0a: "subset of AI" β†’ Partially relevant
β”œβ”€ 0b: "learns patterns from data" β†’ RELEVANT βœ“
β”œβ”€ 0c: "requires explicit instructions" β†’ RELEVANT βœ“
β”œβ”€ 1a: "improve through experience" β†’ RELEVANT βœ“
β”œβ”€ 1b: "adapt to new data" β†’ RELEVANT βœ“
└─ 1c: "rule-based systems rigid" β†’ Partially relevant
UTILIZED SENTENCES (used in response):
β”œβ”€ response_a uses: 0b, 0c β†’ Document references: [0b, 0c]
β”œβ”€ response_b uses: 1a β†’ Document references: [1a]
└─ response_c uses: NONE β†’ No support β†’ [hallucination]
FULLY SUPPORTED CHECK:
β”œβ”€ response_a "learns from data, not explicit" β†’ Supported by 0b, 0c βœ“
β”œβ”€ response_b "algorithms improve" β†’ Supported by 1a βœ“
└─ response_c "future of all computing" β†’ NOT in documents βœ—
```
### Step 3: Metric Calculation
```
EXTRACTED DATA:
all_relevant_sentence_keys = ["0b", "0c", "1a", "1b"] (4 sentences)
all_utilized_sentence_keys = ["0b", "0c", "1a"] (3 sentences)
sentence_support_information = [
{key: "a", fully_supported: true},
{key: "b", fully_supported: true},
{key: "c", fully_supported: false}
]
CALCULATIONS:
1. Context Relevance
= |relevant| / 20
= 4 / 20
= 0.20 (20%)
2. Context Utilization
= |utilized| / |relevant|
= 3 / 4
= 0.75 (75%)
3. Completeness
= |relevant ∩ utilized| / |relevant|
= |{0b, 0c, 1a}| / |{0b, 0c, 1a, 1b}|
= 3 / 4
= 0.75 (75%)
4. Adherence
= all fully_supported?
= true AND true AND false
= FALSE β†’ 0.0 (0%)
```
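These calculations can be reproduced in a few lines, again assuming the fixed denominator of 20 retrieved sentences used in this example:

```python
# Extracted data from the GPT labeling step.
relevant = {"0b", "0c", "1a", "1b"}
utilized = {"0b", "0c", "1a"}
fully_supported = [True, True, False]   # response sentences a, b, c

r = len(relevant) / 20                        # 4/20 = 0.20
u = len(utilized) / len(relevant)             # 3/4  = 0.75
c = len(relevant & utilized) / len(relevant)  # 3/4  = 0.75
a = 1.0 if all(fully_supported) else 0.0      # one failure -> 0.0
average = (r + u + c + a) / 4                 # 0.425
```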
### Results
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ TRACE METRICS RESULTS                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Context Relevance:    0.20 (20%)         β”‚
β”‚ Context Utilization:  0.75 (75%)         β”‚
β”‚ Completeness:         0.75 (75%)         β”‚
β”‚ Adherence:            0.0  (0%)          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Average:              0.425 (42.5%)      β”‚
β”‚ RMSE Aggregation:     0.437              β”‚
β”‚ Consistency Score:    0.563              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
INTERPRETATION:
βœ— Low context relevance (20%): retrieval includes much off-topic content
βœ“ Decent information usage (75%)
βœ“ Good coverage of relevant info (75%)
βœ— Contains hallucination (0% adherence)
ACTION: Address the hallucination about "future of all computing"
```
---
## Calculation Pseudocode
```python
# INPUT: GPT-labeled output (example values from the walkthrough above)
gpt_labels = {
    "all_relevant_sentence_keys": ["0b", "0c", "1a", "1b"],
    "all_utilized_sentence_keys": ["0b", "0c", "1a"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}

# METRIC 1: Context Relevance (20 = assumed retrieval size, capped at 1.0)
relevant_count = len(gpt_labels["all_relevant_sentence_keys"])
context_relevance = min(1.0, relevant_count / 20.0)

# METRIC 2: Context Utilization (guard against division by zero)
utilized_count = len(gpt_labels["all_utilized_sentence_keys"])
if relevant_count == 0:
    context_utilization = 0.0
else:
    context_utilization = min(1.0, utilized_count / relevant_count)

# METRIC 3: Completeness (overlap of relevant and utilized sentence keys)
relevant_set = set(gpt_labels["all_relevant_sentence_keys"])
utilized_set = set(gpt_labels["all_utilized_sentence_keys"])
overlap_count = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
    completeness = 1.0 if len(utilized_set) == 0 else 0.0
else:
    completeness = overlap_count / len(relevant_set)

# METRIC 4: Adherence (binary: every response sentence must be supported)
fully_supported_count = sum(
    1 for sentence in gpt_labels["sentence_support_information"]
    if sentence["fully_supported"]
)
total_sentences = len(gpt_labels["sentence_support_information"])
if total_sentences == 0:
    adherence = 1.0
else:
    adherence = 1.0 if fully_supported_count == total_sentences else 0.0

# OUTPUT
scores = {
    "context_relevance": context_relevance,      # 0.20
    "context_utilization": context_utilization,  # 0.75
    "completeness": completeness,                # 0.75
    "adherence": adherence,                      # 0.0
    "average": (context_relevance + context_utilization +
                completeness + adherence) / 4,   # 0.425
}
```
---
## Key Takeaways
### 1. Each Metric Answers a Different Question
| Metric | Question | Data Source |
|--------|----------|-------------|
| **R** | Is retrieval good? | Relevant sentences |
| **U** | Does LLM use it? | Utilized sentences |
| **C** | Is response comprehensive? | Overlap |
| **A** | Is response truthful? | Support flags |
### 2. Metrics Are Independent
- Low R, high U is possible (the response ignores the irrelevant context)
- High R, low U is possible (retrieval is good, but generation underuses it)
- Low C, high A is possible (limited coverage, but everything stated is supported)
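This independence is easy to demonstrate numerically. A minimal sketch (the helper `trace_metrics` mirrors the pseudocode below; `total=20` is the assumed retrieval size) comparing a low-R/high-U case with a high-R/low-U case:

```python
def trace_metrics(relevant: set, utilized: set, supported: list, total: int = 20):
    # Same formulas and edge cases as the calculation pseudocode.
    r = min(1.0, len(relevant) / total)
    u = min(1.0, len(utilized) / len(relevant)) if relevant else 0.0
    if relevant:
        c = len(relevant & utilized) / len(relevant)
    else:
        c = 1.0 if not utilized else 0.0
    a = 1.0 if all(supported) else 0.0  # all([]) is True, matching the pseudocode
    return r, u, c, a

# Low R, high U: noisy retrieval, but the response uses the one relevant sentence.
print(trace_metrics({"0a"}, {"0a"}, [True]))
# (0.05, 1.0, 1.0, 1.0)

# High R, low U: on-target retrieval, but generation uses only 1 of 4 sentences.
print(trace_metrics({"0a", "0b", "0c", "0d"}, {"0a"}, [True]))
# (0.2, 0.25, 0.25, 1.0)
```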
### 3. GPT Labeling is Sentence-Level
- Fine-grained sentence keys (0a, 0b, 1c, etc.)
- Exact mapping of support
- Transparent and verifiable
### 4. All Four Metrics Required for Full Picture
```
Relevance:    ← "Did we retrieve the right docs?"
Utilization:  ← "Did the LLM use them?"
Completeness: ← "Did it cover the information?"
Adherence:    ← "Is it accurate?"
```
All four needed to understand RAG quality.