
TRACE Metrics Calculation - Visual Guide

Step-by-Step Visualization

STEP 1: Sentencization

DOCUMENTS                          RESPONSE
═══════════════════════════════    ══════════════════════════════
Doc 0:                             "Machine learning is AI that learns
"ML is AI. It learns from data.    from data. Deep learning uses neural
Algorithms improve through time."  networks. It's powerful for images."

↓ Split by sentence ends          ↓ Split by sentence ends

0a: "ML is AI."                   a: "Machine learning is AI that
0b: "It learns from data."           learns from data."
0c: "Algorithms improve            b: "Deep learning uses neural
     through time."                   networks."
                                   c: "It's powerful for images."
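The keying scheme above can be sketched in a few lines. This is a minimal illustration, assuming a naive regex splitter on sentence-ending punctuation (`sentencize` and `key_sentences` are hypothetical helper names, not the project's actual code):

```python
import re
import string

def sentencize(text):
    """Naive splitter: break on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def key_sentences(documents, response):
    """Key doc sentences as '<doc index><letter>' and response sentences as '<letter>'."""
    doc_keys = {}
    for i, doc in enumerate(documents):
        for j, sent in enumerate(sentencize(doc)):
            doc_keys[f"{i}{string.ascii_lowercase[j]}"] = sent
    resp_keys = {string.ascii_lowercase[j]: s for j, s in enumerate(sentencize(response))}
    return doc_keys, resp_keys

docs = ["ML is AI. It learns from data. Algorithms improve through time."]
resp = ("Machine learning is AI that learns from data. "
        "Deep learning uses neural networks. It's powerful for images.")
doc_keys, resp_keys = key_sentences(docs, resp)
# doc_keys β†’ {'0a': 'ML is AI.', '0b': 'It learns from data.',
#             '0c': 'Algorithms improve through time.'}
# resp_keys β†’ keys 'a', 'b', 'c'
```

A production system would use a proper sentencizer (e.g. spaCy or NLTK) to handle abbreviations and edge cases; the regex version is only meant to make the keying scheme concrete.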

STEP 2: GPT Analysis

GPT MODEL PROCESSES:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                             β”‚
β”‚  INPUT: Sentencized docs + response + question             β”‚
β”‚                                                             β”‚
β”‚  ANALYSIS:                                                  β”‚
β”‚  βœ“ Which doc sentences are relevant to question?           β”‚
β”‚  βœ“ Which doc sentences does response use?                  β”‚
β”‚  βœ“ Is each response sentence fully/partially supported?    β”‚
β”‚                                                             β”‚
β”‚  OUTPUT: JSON with sentence keys and support mappings      β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
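The exact prompt is not shown in this guide, but one way to assemble the model input and request the JSON schema from Step 3 looks like this (`build_labeling_prompt` is an illustrative sketch, not the project's actual prompt):

```python
def build_labeling_prompt(question, doc_keys, resp_keys):
    """Assemble the GPT input: question + keyed doc sentences + keyed response
    sentences, asking for the JSON schema used in Step 3."""
    doc_lines = "\n".join(f"{k}: {s}" for k, s in doc_keys.items())
    resp_lines = "\n".join(f"{k}: {s}" for k, s in resp_keys.items())
    return (
        "Label the sentences below. Return JSON with keys "
        "'all_relevant_sentence_keys', 'all_utilized_sentence_keys', and "
        "'sentence_support_information' (one entry per response sentence, with "
        "'response_sentence_key' and a boolean 'fully_supported').\n\n"
        f"QUESTION: {question}\n\n"
        f"DOCUMENT SENTENCES:\n{doc_lines}\n\n"
        f"RESPONSE SENTENCES:\n{resp_lines}"
    )

prompt = build_labeling_prompt(
    "What is ML?",
    {"0a": "ML is AI.", "0b": "It learns from data."},
    {"a": "Machine learning is AI that learns from data."},
)
```

The key design point is that the model sees the same sentence keys the metric code expects back, so its JSON output can be mapped directly onto the retrieved and generated sentences.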

STEP 3: Metric Calculation

GPT OUTPUT (SIMPLIFIED):
{
  "all_relevant_sentence_keys": ["0a", "0b"],
  "all_utilized_sentence_keys": ["0a", "0b"],
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": true},
    {"response_sentence_key": "c", "fully_supported": false}
  ]
}

                    ↓

METRIC CALCULATION:
β”œβ”€ Context Relevance = |relevant| / |retrieved| = 2/20 = 0.10
β”œβ”€ Context Utilization = |utilized| / |relevant| = 2/2 = 1.0
β”œβ”€ Completeness = |relevant ∩ utilized| / |relevant| = 2/2 = 1.0
└─ Adherence = all_fully_supported? = false β†’ 0.0
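In code, this simplified example reduces to set arithmetic plus a boolean gate (the denominator 20 is the assumed total number of retrieved sentences):

```python
relevant = {"0a", "0b"}          # from all_relevant_sentence_keys
utilized = {"0a", "0b"}          # from all_utilized_sentence_keys
support = [True, True, False]    # fully_supported flag per response sentence
total_retrieved = 20             # total sentences across retrieved docs

context_relevance = len(relevant) / total_retrieved      # 2/20 = 0.10
context_utilization = len(utilized) / len(relevant)      # 2/2  = 1.0
completeness = len(relevant & utilized) / len(relevant)  # 2/2  = 1.0
adherence = 1.0 if all(support) else 0.0                 # one False β†’ 0.0
```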

Metric Formulas with Venn Diagrams

Context Relevance (R)

ALL RETRIEVED SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          β”‚
β”‚  Total: ~20 sentences    β”‚
β”‚                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ RELEVANT:        β”‚    β”‚
β”‚  β”‚ ["0a", "0b"]     β”‚    β”‚
β”‚  β”‚ Count: 2         β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                          β”‚
β”‚  Irrelevant: 18          β”‚
β”‚                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: R = 2 / 20 = 0.10 (10%)
Interpretation: 10% of retrieved content is relevant to question

Context Utilization (U)

RELEVANT SENTENCES
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RELEVANT: ["0a", "0b"]   β”‚
β”‚                          β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚ β”‚ UTILIZED:          β”‚   β”‚
β”‚ β”‚ ["0a", "0b"]       β”‚   β”‚
β”‚ β”‚ Count: 2           β”‚   β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                          β”‚
β”‚ NOT USED: 0              β”‚
β”‚                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: U = 2 / 2 = 1.0 (100%)
Interpretation: All relevant information was used

Completeness (C)

        RELEVANT              UTILIZED
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚ ["0a", "0b"] β”‚      β”‚ ["0a", "0b"] β”‚
   β”‚              β”‚      β”‚              β”‚
   β”‚   COUNT: 2   β”‚      β”‚   COUNT: 2   β”‚
   β”‚              β”‚      β”‚              β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                    β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                     β”‚
            OVERLAP: ["0a", "0b"]
            COUNT: 2

Formula: C = 2 / 2 = 1.0 (100%)
Interpretation: All relevant info is in response

Adherence (A)

RESPONSE SENTENCES:           SUPPORT STATUS:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a: "ML is AI..." β”‚ ───────→│ βœ“ Fully          β”‚
β”‚                  β”‚         β”‚   Supported      β”‚
β”‚ b: "Deep learns..β”‚ ───────→│ βœ“ Fully          β”‚
β”‚                  β”‚         β”‚   Supported      β”‚
β”‚ c: "Powerful..." β”‚ ───────→│ βœ— Not            β”‚
β”‚                  β”‚         β”‚   Supported      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Formula: A = (all_supported) ? 1.0 : 0.0
       = (true AND true AND false) ? 1.0 : 0.0
       = 0.0 (a single unsupported sentence fails the whole response)

Interpretation: Response contains hallucination (adherence fails)

Complete Example Walkthrough

Input

QUESTION:
"What makes machine learning different from traditional programming?"

RETRIEVED DOCUMENTS:
0: "Machine learning is a subset of AI. It learns patterns from data.
    Traditional programming requires explicit instructions."
1: "ML algorithms improve through experience. They adapt to new data.
    Rule-based systems are rigid and hard to maintain."

LLM RESPONSE:
"Machine learning differs because it learns from data rather than 
requiring explicit instructions. ML algorithms improve over time.
It's the future of all computing."

Step 1: Sentencization

DOCUMENTS:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Traditional programming requires explicit instructions."
1a: "ML algorithms improve through experience."
1b: "They adapt to new data."
1c: "Rule-based systems are rigid and hard to maintain."

RESPONSE:
a: "Machine learning differs because it learns from data rather than
    requiring explicit instructions."
b: "ML algorithms improve over time."
c: "It's the future of all computing."

Step 2: GPT Labeling

ANALYSIS BY GPT:

Question focus: Differences between ML and traditional programming
└─ "learns from data" vs "explicit instructions"
└─ "improves through experience"
└─ Adaptability

RELEVANT SENTENCES (to question):
β”œβ”€ 0a: "subset of AI" β†’ Partially relevant
β”œβ”€ 0b: "learns patterns from data" β†’ RELEVANT βœ“
β”œβ”€ 0c: "requires explicit instructions" β†’ RELEVANT βœ“
β”œβ”€ 1a: "improve through experience" β†’ RELEVANT βœ“
β”œβ”€ 1b: "adapt to new data" β†’ RELEVANT βœ“
└─ 1c: "rule-based systems rigid" β†’ Partially relevant

UTILIZED SENTENCES (used in response):
β”œβ”€ response_a uses: 0b, 0c β†’ Document references: [0b, 0c]
β”œβ”€ response_b uses: 1a β†’ Document references: [1a]
└─ response_c uses: NONE β†’ No support β†’ [hallucination]

FULLY SUPPORTED CHECK:
β”œβ”€ response_a "learns from data, not explicit" β†’ Supported by 0b, 0c βœ“
β”œβ”€ response_b "algorithms improve" β†’ Supported by 1a βœ“
└─ response_c "future of all computing" β†’ NOT in documents βœ—

Step 3: Metric Calculation

EXTRACTED DATA:
all_relevant_sentence_keys = ["0b", "0c", "1a", "1b"]  (4 sentences)
all_utilized_sentence_keys = ["0b", "0c", "1a"]        (3 sentences)
sentence_support_information = [
  {key: "a", fully_supported: true},
  {key: "b", fully_supported: true},
  {key: "c", fully_supported: false}
]

CALCULATIONS:

1. Context Relevance
   = |relevant| / 20
   = 4 / 20
   = 0.20 (20%)
   
2. Context Utilization
   = |utilized| / |relevant|
   = 3 / 4
   = 0.75 (75%)
   
3. Completeness
   = |relevant ∩ utilized| / |relevant|
   = |{0b, 0c, 1a}| / |{0b, 0c, 1a, 1b}|
   = 3 / 4
   = 0.75 (75%)
   
4. Adherence
   = all fully_supported?
   = true AND true AND false
   = FALSE β†’ 0.0 (0%)

Results

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ TRACE METRICS RESULTS                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Context Relevance:   0.20  (20%)     β”‚
β”‚ Context Utilization: 0.75  (75%)     β”‚
β”‚ Completeness:        0.75  (75%)     β”‚
β”‚ Adherence:           0.00  (0%)      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Average:             0.425 (42.5%)   β”‚
β”‚ RMSE Aggregation:    0.437           β”‚
β”‚ Consistency Score:   0.563           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

INTERPRETATION:
βœ“ Acceptable relevance targeting (20% of retrieved content is on-topic)
βœ“ Decent information usage (75%)
βœ“ Good coverage of relevant info (75%)
βœ— Contains a hallucination (0% adherence)

ACTION: Address the hallucination about "future of all computing"

Calculation Pseudocode

# INPUT: GPT labeled output
gpt_labels = {
    "all_relevant_sentence_keys": [...],
    "all_utilized_sentence_keys": [...],
    "sentence_support_information": [...]
}

# METRIC 1: Context Relevance
# (denominator 20.0 = total number of retrieved sentences assumed in this setup)
relevant_count = len(gpt_labels["all_relevant_sentence_keys"])
context_relevance = min(1.0, relevant_count / 20.0)

# METRIC 2: Context Utilization
utilized_count = len(gpt_labels["all_utilized_sentence_keys"])
if relevant_count == 0:
    context_utilization = 0.0
else:
    context_utilization = min(1.0, utilized_count / relevant_count)

# METRIC 3: Completeness
relevant_set = set(gpt_labels["all_relevant_sentence_keys"])
utilized_set = set(gpt_labels["all_utilized_sentence_keys"])
overlap_count = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
    completeness = 1.0 if len(utilized_set) == 0 else 0.0
else:
    completeness = overlap_count / len(relevant_set)

# METRIC 4: Adherence
fully_supported_count = sum(
    1 for sentence in gpt_labels["sentence_support_information"]
    if sentence["fully_supported"]
)
total_sentences = len(gpt_labels["sentence_support_information"])
if total_sentences == 0:
    adherence = 1.0
else:
    adherence = 1.0 if fully_supported_count == total_sentences else 0.0

# OUTPUT
scores = {
    "context_relevance": context_relevance,
    "context_utilization": context_utilization,
    "completeness": completeness,
    "adherence": adherence,
    "average": (context_relevance + context_utilization + 
               completeness + adherence) / 4
}
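Wrapped into a single function and run on the walkthrough example above, the pseudocode reproduces the reported scores:

```python
def trace_scores(gpt_labels, total_retrieved=20):
    """Compute the four TRACE metrics from GPT-labeled output."""
    relevant = set(gpt_labels["all_relevant_sentence_keys"])
    utilized = set(gpt_labels["all_utilized_sentence_keys"])
    support = gpt_labels["sentence_support_information"]

    context_relevance = min(1.0, len(relevant) / total_retrieved)
    context_utilization = min(1.0, len(utilized) / len(relevant)) if relevant else 0.0
    if relevant:
        completeness = len(relevant & utilized) / len(relevant)
    else:
        completeness = 1.0 if not utilized else 0.0
    # all([]) is True, so an empty response sentence list yields adherence 1.0
    adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0

    scores = {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "completeness": completeness,
        "adherence": adherence,
    }
    scores["average"] = sum(scores.values()) / 4
    return scores

labels = {
    "all_relevant_sentence_keys": ["0b", "0c", "1a", "1b"],
    "all_utilized_sentence_keys": ["0b", "0c", "1a"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
scores = trace_scores(labels)
# β†’ relevance 0.20, utilization 0.75, completeness 0.75, adherence 0.0, average 0.425
```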

Key Takeaways

1. Each Metric Answers a Different Question

| Metric | Question                       | Data Source        |
|--------|--------------------------------|--------------------|
| R      | Is retrieval good?             | Relevant sentences |
| U      | Does the LLM use it?           | Utilized sentences |
| C      | Is the response comprehensive? | Overlap            |
| A      | Is the response truthful?      | Support flags      |

2. Metrics Are Independent

  • Low R, high U is possible (ignore irrelevant)
  • Low U, high R is possible (retrieval good, generation bad)
  • Low C, high A is possible (limited but correct)

3. GPT Labeling is Sentence-Level

  • Fine-grained sentence keys (0a, 0b, 1c, etc.)
  • Exact mapping of support
  • Transparent and verifiable

4. All Four Metrics Required for Full Picture

Relevance:    ← "Did we retrieve the right docs?"
Utilization:  ← "Did the LLM use them?"
Completeness: ← "Did it cover the information?"
Adherence:    ← "Is it accurate?"

All four needed to understand RAG quality.