TRACE Metrics Calculation - Visual Guide
Step-by-Step Visualization
STEP 1: Sentencization
DOCUMENTS:
  Doc 0: "ML is AI. It learns from data. Algorithms improve through time."
          ↓ split at sentence boundaries
  0a: "ML is AI."
  0b: "It learns from data."
  0c: "Algorithms improve through time."

RESPONSE:
  "Machine learning is AI that learns from data. Deep learning uses
   neural networks. It's powerful for images."
          ↓ split at sentence boundaries
  a: "Machine learning is AI that learns from data."
  b: "Deep learning uses neural networks."
  c: "It's powerful for images."
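A minimal sentencizer that produces keys like 0a and 0b might look like the sketch below. The regex split is a naive stand-in; a production pipeline would use a dedicated sentence splitter (e.g. nltk or spaCy).

```python
import re
import string

def sentencize(doc_texts):
    """Split documents into sentences keyed '0a', '0b', ..., '1a', ...

    Naive split on sentence-ending punctuation followed by whitespace;
    real pipelines use a proper sentence splitter instead.
    """
    keyed = {}
    for doc_idx, text in enumerate(doc_texts):
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        for sent_idx, sentence in enumerate(sentences):
            keyed[f"{doc_idx}{string.ascii_lowercase[sent_idx]}"] = sentence
    return keyed

docs = ["ML is AI. It learns from data. Algorithms improve through time."]
print(sentencize(docs))
# {'0a': 'ML is AI.', '0b': 'It learns from data.',
#  '0c': 'Algorithms improve through time.'}
```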
STEP 2: GPT Analysis
GPT MODEL PROCESSES:
┌──────────────────────────────────────────────────────────┐
│                                                          │
│  INPUT: Sentencized docs + response + question           │
│                                                          │
│  ANALYSIS:                                               │
│    • Which doc sentences are relevant to the question?   │
│    • Which doc sentences does the response use?          │
│    • Is each response sentence fully/partially supported?│
│                                                          │
│  OUTPUT: JSON with sentence keys and support mappings    │
│                                                          │
└──────────────────────────────────────────────────────────┘
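Before scoring, the model's JSON reply should be parsed and checked for the expected fields. Here `parse_gpt_labels` is a hypothetical helper built around the three field names this guide uses; a real implementation would also validate the key formats (e.g. "0a", "b").

```python
import json

REQUIRED_FIELDS = (
    "all_relevant_sentence_keys",
    "all_utilized_sentence_keys",
    "sentence_support_information",
)

def parse_gpt_labels(raw):
    """Parse the model's JSON reply and verify the expected fields exist."""
    labels = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in labels]
    if missing:
        raise ValueError(f"GPT output is missing fields: {missing}")
    return labels
```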
STEP 3: Metric Calculation
GPT OUTPUT (SIMPLIFIED):
{
  "all_relevant_sentence_keys": ["0a", "0b"],
  "all_utilized_sentence_keys": ["0a", "0b"],
  "sentence_support_information": [
    {"response_sentence_key": "a", "fully_supported": true},
    {"response_sentence_key": "b", "fully_supported": true},
    {"response_sentence_key": "c", "fully_supported": false}
  ]
}
        ↓
METRIC CALCULATION:
  ├─ Context Relevance   = |relevant| / 20 = 2/20 = 0.10
  ├─ Context Utilization = |utilized| / |relevant| = 2/2 = 1.0
  ├─ Completeness        = |relevant ∩ utilized| / |relevant| = 2/2 = 1.0
  └─ Adherence           = all_fully_supported? = false → 0.0
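The arithmetic in this step can be checked with a few lines of Python. `TOTAL_SENTENCES` reflects the guide's assumed retrieval size of ~20 sentences:

```python
# Data from the simplified GPT output above
relevant = {"0a", "0b"}              # all_relevant_sentence_keys
utilized = {"0a", "0b"}              # all_utilized_sentence_keys
support_flags = [True, True, False]  # fully_supported for a, b, c
TOTAL_SENTENCES = 20                 # retrieval size assumed by the guide

context_relevance = len(relevant) / TOTAL_SENTENCES      # 0.10
context_utilization = len(utilized) / len(relevant)      # 1.0
completeness = len(relevant & utilized) / len(relevant)  # 1.0
adherence = 1.0 if all(support_flags) else 0.0           # 0.0
```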
Metric Formulas with Venn Diagrams
Context Relevance (R)
ALL RETRIEVED SENTENCES
┌──────────────────────────┐
│                          │
│  Total: ~20 sentences    │
│                          │
│  ┌────────────────────┐  │
│  │ RELEVANT:          │  │
│  │ ["0a", "0b"]       │  │
│  │ Count: 2           │  │
│  └────────────────────┘  │
│                          │
│  Irrelevant: 18          │
│                          │
└──────────────────────────┘

Formula: R = 2 / 20 = 0.10 (10%)
Interpretation: 10% of the retrieved content is relevant to the question
Context Utilization (U)
RELEVANT SENTENCES
┌──────────────────────────┐
│  RELEVANT: ["0a", "0b"]  │
│                          │
│  ┌────────────────────┐  │
│  │ UTILIZED:          │  │
│  │ ["0a", "0b"]       │  │
│  │ Count: 2           │  │
│  └────────────────────┘  │
│                          │
│  NOT USED: 0             │
│                          │
└──────────────────────────┘

Formula: U = 2 / 2 = 1.0 (100%)
Interpretation: all relevant information was used
Completeness (C)
RELEVANT                  UTILIZED
┌──────────────┐          ┌──────────────┐
│ ["0a", "0b"] │          │ ["0a", "0b"] │
│              │          │              │
│ COUNT: 2     │          │ COUNT: 2     │
└──────┬───────┘          └───────┬──────┘
       │                          │
       └────────────┬─────────────┘
                    │
         OVERLAP: ["0a", "0b"]
         COUNT: 2

Formula: C = 2 / 2 = 1.0 (100%)
Interpretation: all relevant info appears in the response
Adherence (A)
RESPONSE SENTENCES:            SUPPORT STATUS:
┌────────────────────┐         ┌────────────────────┐
│ a: "ML is AI..."   │  ────►  │ Fully supported    │
│                    │         │                    │
│ b: "Deep learns.." │  ────►  │ Fully supported    │
│                    │         │                    │
│ c: "Powerful..."   │  ────►  │ NOT supported      │
└────────────────────┘         └────────────────────┘

Formula: A = (all_supported) ? 1.0 : 0.0
           = (true AND true AND false) ? 1.0 : 0.0
           = 0.0 (a single unsupported sentence zeroes the score)
Interpretation: the response contains a hallucination, so adherence fails
Complete Example Walkthrough
Input
QUESTION:
"What makes machine learning different from traditional programming?"
RETRIEVED DOCUMENTS:
0: "Machine learning is a subset of AI. It learns patterns from data.
Traditional programming requires explicit instructions."
1: "ML algorithms improve through experience. They adapt to new data.
Rule-based systems are rigid and hard to maintain."
LLM RESPONSE:
"Machine learning differs because it learns from data rather than
requiring explicit instructions. ML algorithms improve over time.
It's the future of all computing."
Step 1: Sentencization
DOCUMENTS:
0a: "Machine learning is a subset of AI."
0b: "It learns patterns from data."
0c: "Traditional programming requires explicit instructions."
1a: "ML algorithms improve through experience."
1b: "They adapt to new data."
1c: "Rule-based systems are rigid and hard to maintain."
RESPONSE:
a: "Machine learning differs because it learns from data rather than
requiring explicit instructions."
b: "ML algorithms improve over time."
c: "It's the future of all computing."
Step 2: GPT Labeling
ANALYSIS BY GPT:

Question focus: differences between ML and traditional programming
  ├─ "learns from data" vs "explicit instructions"
  ├─ "improves through experience"
  └─ adaptability

RELEVANT SENTENCES (to question):
  ├─ 0a: "subset of AI"                   → partially relevant
  ├─ 0b: "learns patterns from data"      → RELEVANT ✓
  ├─ 0c: "requires explicit instructions" → RELEVANT ✓
  ├─ 1a: "improve through experience"     → RELEVANT ✓
  ├─ 1b: "adapt to new data"              → RELEVANT ✓
  └─ 1c: "rule-based systems rigid"       → partially relevant

UTILIZED SENTENCES (used in response):
  ├─ response_a uses 0b, 0c → document references: [0b, 0c]
  ├─ response_b uses 1a     → document references: [1a]
  └─ response_c uses none   → no support → hallucination

FULLY SUPPORTED CHECK:
  ├─ response_a "learns from data, not explicit" → supported by 0b, 0c ✓
  ├─ response_b "algorithms improve"             → supported by 1a ✓
  └─ response_c "future of all computing"        → NOT in documents ✗
Step 3: Metric Calculation
EXTRACTED DATA:
all_relevant_sentence_keys = ["0b", "0c", "1a", "1b"]   (4 sentences)
all_utilized_sentence_keys = ["0b", "0c", "1a"]         (3 sentences)
sentence_support_information = [
    {key: "a", fully_supported: true},
    {key: "b", fully_supported: true},
    {key: "c", fully_supported: false}
]
CALCULATIONS:
1. Context Relevance
= |relevant| / 20
= 4 / 20
= 0.20 (20%)
2. Context Utilization
= |utilized| / |relevant|
= 3 / 4
= 0.75 (75%)
3. Completeness
= |relevant ∩ utilized| / |relevant|
= |{0b, 0c, 1a}| / |{0b, 0c, 1a, 1b}|
= 3 / 4
= 0.75 (75%)
4. Adherence
= all fully_supported?
= true AND true AND false
= FALSE → 0.0 (0%)
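These calculations can be reproduced directly in Python:

```python
relevant = {"0b", "0c", "1a", "1b"}  # 4 relevant sentences
utilized = {"0b", "0c", "1a"}        # 3 utilized sentences
support_flags = [True, True, False]  # response sentence c is unsupported

context_relevance = len(relevant) / 20                   # 4/20 = 0.20
context_utilization = len(utilized) / len(relevant)      # 3/4  = 0.75
completeness = len(relevant & utilized) / len(relevant)  # 3/4  = 0.75
adherence = 1.0 if all(support_flags) else 0.0           # 0.0
```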
Results
┌─────────────────────────────────────────┐
│          TRACE METRICS RESULTS          │
├─────────────────────────────────────────┤
│  Context Relevance:    0.20   (20%)     │
│  Context Utilization:  0.75   (75%)     │
│  Completeness:         0.75   (75%)     │
│  Adherence:            0.00   (0%)      │
├─────────────────────────────────────────┤
│  Average:              0.425  (42.5%)   │
│  RMSE Aggregation:     0.437            │
│  Consistency Score:    0.563            │
└─────────────────────────────────────────┘
INTERPRETATION:
  ✓ Good relevance targeting (20%)
  ✓ Decent information usage (75%)
  ✓ Good coverage of relevant info (75%)
  ✗ Contains hallucination (0% adherence)
ACTION: Address the hallucination about "future of all computing"
Calculation Pseudocode
# INPUT: GPT-labeled output
gpt_labels = {
    "all_relevant_sentence_keys": [...],
    "all_utilized_sentence_keys": [...],
    "sentence_support_information": [...]
}

# METRIC 1: Context Relevance (divisor is the assumed ~20 retrieved sentences)
relevant_count = len(gpt_labels["all_relevant_sentence_keys"])
context_relevance = min(1.0, relevant_count / 20.0)

# METRIC 2: Context Utilization
utilized_count = len(gpt_labels["all_utilized_sentence_keys"])
if relevant_count == 0:
    context_utilization = 0.0
else:
    context_utilization = min(1.0, utilized_count / relevant_count)

# METRIC 3: Completeness
relevant_set = set(gpt_labels["all_relevant_sentence_keys"])
utilized_set = set(gpt_labels["all_utilized_sentence_keys"])
overlap_count = len(relevant_set & utilized_set)
if len(relevant_set) == 0:
    completeness = 1.0 if len(utilized_set) == 0 else 0.0
else:
    completeness = overlap_count / len(relevant_set)

# METRIC 4: Adherence
fully_supported_count = sum(
    1 for sentence in gpt_labels["sentence_support_information"]
    if sentence["fully_supported"]
)
total_sentences = len(gpt_labels["sentence_support_information"])
if total_sentences == 0:
    adherence = 1.0
else:
    adherence = 1.0 if fully_supported_count == total_sentences else 0.0

# OUTPUT
scores = {
    "context_relevance": context_relevance,
    "context_utilization": context_utilization,
    "completeness": completeness,
    "adherence": adherence,
    "average": (context_relevance + context_utilization +
                completeness + adherence) / 4
}
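The pseudocode maps directly onto a small runnable function. The name `compute_trace_scores` and its `total_sentences` parameter are illustrative; the divisor defaults to the guide's assumed 20 retrieved sentences, and in practice you would pass the real count.

```python
def compute_trace_scores(gpt_labels, total_sentences=20):
    """Compute the four TRACE metrics from GPT-labeled output.

    total_sentences defaults to the guide's assumed retrieval size of
    ~20 sentences; pass the actual retrieved-sentence count in practice.
    """
    relevant = set(gpt_labels["all_relevant_sentence_keys"])
    utilized = set(gpt_labels["all_utilized_sentence_keys"])
    support = gpt_labels["sentence_support_information"]

    context_relevance = min(1.0, len(relevant) / total_sentences)

    if not relevant:
        context_utilization = 0.0
        completeness = 1.0 if not utilized else 0.0
    else:
        context_utilization = min(1.0, len(utilized) / len(relevant))
        completeness = len(relevant & utilized) / len(relevant)

    # all() over an empty sequence is True, matching the
    # "no response sentences -> adherence 1.0" rule above.
    adherence = 1.0 if all(s["fully_supported"] for s in support) else 0.0

    scores = {
        "context_relevance": context_relevance,
        "context_utilization": context_utilization,
        "completeness": completeness,
        "adherence": adherence,
    }
    scores["average"] = sum(scores.values()) / 4
    return scores

# Worked example from the walkthrough:
example = {
    "all_relevant_sentence_keys": ["0b", "0c", "1a", "1b"],
    "all_utilized_sentence_keys": ["0b", "0c", "1a"],
    "sentence_support_information": [
        {"response_sentence_key": "a", "fully_supported": True},
        {"response_sentence_key": "b", "fully_supported": True},
        {"response_sentence_key": "c", "fully_supported": False},
    ],
}
print(compute_trace_scores(example))
# relevance 0.2, utilization 0.75, completeness 0.75, adherence 0.0,
# average = (0.2 + 0.75 + 0.75 + 0.0) / 4 = 0.425
```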
Key Takeaways
1. Each Metric Answers a Different Question
| Metric | Question | Data Source |
|---|---|---|
| R | Is retrieval good? | Relevant sentences |
| U | Does LLM use it? | Utilized sentences |
| C | Is response comprehensive? | Overlap |
| A | Is response truthful? | Support flags |
2. Metrics Are Independent
- Low R, high U is possible (ignore irrelevant)
- Low U, high R is possible (retrieval good, generation bad)
- Low C, high A is possible (limited but correct)
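Two of these combinations, sketched numerically (the sentence keys and the 20-sentence total are illustrative):

```python
# Scenario 1: low R, high U -- only 1 of 20 retrieved sentences is
# relevant, but the response uses it (the LLM ignored the noise).
r1, u1 = {"0a"}, {"0a"}
print(len(r1) / 20, len(u1) / len(r1))   # 0.05 1.0

# Scenario 2: high R, low U -- retrieval found 10 relevant sentences,
# but the response used only one (good retrieval, weak generation).
r2 = {f"0{letter}" for letter in "abcdefghij"}
u2 = {"0a"}
print(len(r2) / 20, len(u2) / len(r2))   # 0.5 0.1
```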
3. GPT Labeling is Sentence-Level
- Fine-grained sentence keys (0a, 0b, 1c, etc.)
- Exact mapping of support
- Transparent and verifiable
4. All Four Metrics Required for Full Picture
Relevance:    → "Did we retrieve the right docs?"
Utilization:  → "Did the LLM use them?"
Completeness: → "Did it cover the information?"
Adherence:    → "Is it accurate?"
All four needed to understand RAG quality.