# TRACe Metrics - Before & After Fixes
## Issue #1: Evaluation Logs Appearing Multiple Times
**Before ❌**
📋 **Evaluation Logs:**
📋 **Evaluation Logs:**
📋 **Evaluation Logs:** ← repeated header
⏱️ Evaluation started...
📋 **Evaluation Logs:**
📋 **Evaluation Logs:**
📊 Dataset: hotpotqa
**After ✅**
📋 **Evaluation Logs:** ← header appears once
⏱️ Evaluation started...
📊 Dataset: hotpotqa
📊 Total samples: 10
🤖 LLM Model: llama-3.1-8b-instant
...
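The fix amounts to a guard that emits the header only on the first log call. A minimal sketch, assuming a simple logger class; the class and method names here are illustrative, not the project's actual code:

```python
class EvaluationLogger:
    """Streams evaluation log lines, emitting the section header exactly once."""

    def __init__(self) -> None:
        self._header_emitted = False  # guard against repeated headers
        self.lines: list[str] = []

    def log(self, message: str) -> None:
        # Only the first call appends the header; later calls skip it.
        if not self._header_emitted:
            self.lines.append("📋 Evaluation Logs:")
            self._header_emitted = True
        self.lines.append(message)


logger = EvaluationLogger()
logger.log("⏱️ Evaluation started...")
logger.log("📊 Dataset: hotpotqa")
print("\n".join(logger.lines))  # header appears once, followed by both messages
```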
## Issue #2: Adherence Metric (Decimal vs Boolean)
**Before ❌**
Adherence Metric Values:
- Query 1: 0.67 (decimal, not Boolean)
- Query 2: 0.58 (decimal, unclear if grounded)
- Query 3: 0.89 (decimal, hard to interpret)
- Query 4: 0.43 (decimal, is this grounded or not?)
📊 Results:
Adherence: 0.644 (average) ← decimal, not Boolean
Problem: Hard to tell whether a response is grounded or hallucinated.
**After ✅**
Adherence Metric Values (Boolean):
- Query 1: 1.0 ✅ Fully grounded (>50% of words in docs)
- Query 2: 0.0 ❌ Contains hallucinations (<50% grounding)
- Query 3: 1.0 ✅ Fully grounded
- Query 4: 0.0 ❌ Contains hallucinations
📊 Results:
Adherence: 0.5 (50% of responses grounded)
Benefits:
- Clear: 1.0 = trust this response, 0.0 = don't trust it
- Binary decision: grounded vs hallucinated
- Aligns with the RAGBench paper definition (a minimal sketch of the check follows below)
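The >50% word-grounding threshold comes from this section; the `adherence` function name and the word-level tokenization below are assumptions for illustration, not the project's exact implementation:

```python
import re

def adherence(response: str, docs: list[str]) -> float:
    """Return 1.0 if more than 50% of the response's words appear in the docs."""
    doc_words = set(re.findall(r"\w+", " ".join(docs).lower()))
    resp_words = re.findall(r"\w+", response.lower())
    if not resp_words:
        return 0.0
    grounded = sum(word in doc_words for word in resp_words)
    return 1.0 if grounded / len(resp_words) > 0.5 else 0.0


docs = [
    "World War II lasted from 1939 to 1945",
    "About 70 million people died in WW2",
]
# 8 of 10 response words appear in the docs (80% > 50%), so this prints 1.0
print(adherence("World War II started in 1939 and ended in 1945.", docs))
```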
## Issue #3: Completeness Always Returning 1.0
**Before ❌**
Completeness Metric Values:
- Query 1: 1.0 (response has date keyword β score 1.0)
- Query 2: 1.0 (response has location keyword β score 1.0)
- Query 3: 1.0 (response has person name β score 1.0)
- Query 4: 1.0 (response has period keyword β score 1.0)
- Query 5: 1.0 (always 1.0)
- Query 10: 1.0 (always 1.0)
📊 Results:
Completeness: 1.0 (always!) ← no variation, not informative
Problem: The metric is not discriminative; it returns 1.0 for every query.
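The failure mode is easy to reproduce. The following is a hedged reconstruction of the old keyword-style logic (the specific checks are assumptions; the actual code may have differed), showing why nearly every response scored 1.0:

```python
import re

def completeness_old(response: str) -> float:
    """Old heuristic: score 1.0 if the response contains any 'answer-like' token."""
    has_date = bool(re.search(r"\b\d{4}\b", response))             # e.g. "1939"
    has_capitalized = bool(re.search(r"\b[A-Z][a-z]+", response))  # names, places
    # Almost every generated answer contains a year, a name, or a place,
    # so this check is effectively always true and the metric pins at 1.0.
    return 1.0 if (has_date or has_capitalized) else 0.0
```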
**After ✅**
Completeness Metric Values:
- Query 1 (When): 0.72 (length score 1.0, ground truth coverage 60%: 0.3×1.0 + 0.7×0.60 = 0.72)
- Query 2 (Where): 0.45 (ground truth coverage ≈ 21% → low completeness)
- Query 3 (Who): 0.88 (ground truth coverage ≈ 83% → high completeness)
- Query 4 (What): 0.31 (short response, ground truth coverage ≈ 10%)
- Query 5 (Why): 0.55 (ground truth coverage ≈ 36%)
- Query 10 (How): 0.62 (ground truth coverage ≈ 46%)
📊 Results:
Completeness: 0.59 (varies by response quality) ✅ Informative!
Formula Used:
- With ground truth: `completeness = 0.3 * length_score + 0.7 * overlap_ratio`
- Without ground truth: `0.3` (default) or `0.7` (if the response contains answer keywords)
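The formula translates directly into code. A minimal sketch, assuming word-level overlap and a length normalization capped at 20 unique words; both helpers are assumptions of this sketch, not the project's exact implementation:

```python
import re

def completeness(response: str, ground_truth: str | None,
                 has_answer_keywords: bool = False) -> float:
    """0.3 * length_score + 0.7 * overlap_ratio when ground truth is available."""
    if ground_truth is None:
        # Fallback path from the formula above.
        return 0.7 if has_answer_keywords else 0.3
    resp_words = set(re.findall(r"\w+", response.lower()))
    gt_words = re.findall(r"\w+", ground_truth.lower())
    overlap_ratio = sum(w in resp_words for w in gt_words) / max(len(gt_words), 1)
    length_score = min(len(resp_words) / 20, 1.0)  # assumed normalization
    return 0.3 * length_score + 0.7 * overlap_ratio


score = completeness(
    "World War II started in 1939 and ended in 1945.",
    "World War II occurred from 1939-1945.",
)
print(score)  # ≈ 0.635 under these assumed helpers; the real pipeline may differ
```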
Interpretation:
- 0.1–0.3 = Poor coverage of relevant info
- 0.4–0.6 = Moderate coverage
- 0.7–1.0 = Good coverage of relevant information
## Comprehensive Before/After Comparison
Test case query: "When was World War II?"
**Before (Broken Metrics) ❌**
Retrieved Documents:
- Doc1: "World War II lasted from 1939 to 1945"
- Doc2: "About 70 million people died in WW2"
- Doc3: "The war involved many countries"
Response: "World War II started in 1939 and ended in 1945."
Metrics:
├─ Utilization: 0.75 (decimal, somewhat confusing)
├─ Relevance: 0.82 (decimal, okay)
├─ Adherence: 0.85 ❌ WRONG: should be Boolean (1.0)
├─ Completeness: 1.0 ❌ WRONG: always 1.0, not informative
└─ Average: 0.86
**After (Fixed Metrics) ✅**
Retrieved Documents:
- Doc1: "World War II lasted from 1939 to 1945"
- Doc2: "About 70 million people died in WW2"
- Doc3: "The war involved many countries"
Response: "World War II started in 1939 and ended in 1945."
Ground Truth: "World War II occurred from 1939-1945."
Metrics:
├─ Utilization: 0.75 (uses 2/3 docs with good depth)
├─ Relevance: 0.82 (retrieved docs are relevant to the query)
├─ Adherence: 1.0 ✅ CORRECT: response fully grounded in docs
├─ Completeness: 0.85 ✅ CORRECT: response covers 85% of the ground-truth info
└─ Average: 0.85 (reliable score)
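Putting the fixed metrics together, the "Average" row is a plain equal-weight mean. A sketch: the four values are taken from the test case above, and the equal weighting is an assumption that matches the report's numbers:

```python
scores = {
    "utilization": 0.75,   # uses 2/3 docs with good depth
    "relevance": 0.82,     # retrieved docs match the query
    "adherence": 1.0,      # Boolean: response fully grounded
    "completeness": 0.85,  # covers 85% of the ground-truth info
}
average = sum(scores.values()) / len(scores)
print(f"Average TRACe score: {average:.2f}")  # ≈ 0.85
```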
## Summary of Fixes
| Metric | Issue | Before | After | Benefit |
|---|---|---|---|---|
| Logs | Duplicated | Multiple headers | Single header | Cleaner UI |
| Adherence | Wrong type | Decimal (0.67) | Boolean (1.0/0.0) | Clear grounding assessment |
| Completeness | Always max | Always 1.0 | Varies (0.3–1.0) | Discriminative scoring |
All metrics now align with the RAGBench paper's definitions and provide meaningful, actionable insights into RAG system performance. ✅