# TRACe Metrics - Before & After Fixes
## Issue #1: Evaluation Logs Appearing Multiple Times
### Before ❌
```
📊 **Evaluation Logs:**
📊 **Evaluation Logs:**
📊 **Evaluation Logs:**  ← Repeated header
⏱️ Evaluation started...
📊 **Evaluation Logs:**
📊 **Evaluation Logs:**
📊 Dataset: hotpotqa
```
### After ✅
```
📊 Evaluation Logs:  ← Header appears once
⏱️ Evaluation started...
📊 Dataset: hotpotqa
📊 Total samples: 10
🤖 LLM Model: llama-3.1-8b-instant
...
```
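The fix boils down to a guard flag so the header is emitted only on the first log call. A minimal sketch, assuming a logger object collects the lines; `EvalLogger` and its attributes are illustrative names, not the Space's actual code:

```python
class EvalLogger:
    """Collects evaluation log lines, emitting the section header once."""

    def __init__(self):
        self._header_emitted = False
        self.lines = []

    def log(self, message: str) -> None:
        # Guard: the header line is appended only on the first call.
        if not self._header_emitted:
            self.lines.append("📊 Evaluation Logs:")
            self._header_emitted = True
        self.lines.append(message)


logger = EvalLogger()
logger.log("⏱️ Evaluation started...")
logger.log("📊 Dataset: hotpotqa")
# The header now appears exactly once, no matter how many lines follow.
```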
---
## Issue #2: Adherence Metric (Decimal vs Boolean)
### Before ❌
```
Adherence Metric Values:
- Query 1: 0.67 (decimal, not Boolean)
- Query 2: 0.58 (decimal, unclear if grounded)
- Query 3: 0.89 (decimal, hard to interpret)
- Query 4: 0.43 (decimal, is this grounded or not?)
📊 Results:
   Adherence: 0.644 (average)  ❌ Decimal, not Boolean
```
**Problem**: A decimal score makes it hard to tell whether a response is grounded or hallucinated.
### After ✅
```
Adherence Metric Values (Boolean):
- Query 1: 1.0 ✅ Fully grounded (>50% of words in docs)
- Query 2: 0.0 ❌ Contains hallucinations (<50% grounding)
- Query 3: 1.0 ✅ Fully grounded
- Query 4: 0.0 ❌ Contains hallucinations
📊 Results:
Adherence: 0.5 (50% of responses grounded)
```
**Benefits**:
- Clear: 1.0 = trust this response, 0.0 = don't trust it
- Binary decision: grounded vs hallucinated
- Aligns with RAGBench paper definition
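The Boolean rule above can be sketched as a simple word-overlap check. This is a sketch under stated assumptions (whitespace tokenization, a 50% grounding threshold); `adherence` is an illustrative name, not the project's actual API:

```python
def adherence(response: str, docs: list[str], threshold: float = 0.5) -> float:
    """Return 1.0 if more than `threshold` of response words appear in docs, else 0.0."""
    doc_words = set(" ".join(docs).lower().split())
    resp_words = response.lower().split()
    if not resp_words:
        return 0.0  # empty responses are treated as ungrounded
    grounded = sum(1 for w in resp_words if w in doc_words)
    return 1.0 if grounded / len(resp_words) > threshold else 0.0


# 4 of 6 response words ("world", "war", "ii", "1939") come from the doc,
# so the response counts as grounded.
score = adherence("World War II started in 1939",
                  ["World War II lasted from 1939 to 1945"])
```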
---
## Issue #3: Completeness Always Returning 1.0
### Before ❌
```
Completeness Metric Values:
- Query 1: 1.0 (response has date keyword → score 1.0)
- Query 2: 1.0 (response has location keyword → score 1.0)
- Query 3: 1.0 (response has person name → score 1.0)
- Query 4: 1.0 (response has period keyword → score 1.0)
- Query 5: 1.0 (always 1.0)
- Query 10: 1.0 (always 1.0)
📊 Results:
   Completeness: 1.0 (always!)  ❌ No variation, not informative
```
**Problem**: The metric is not discriminative; it always returns 1.0.
### After ✅
```
Completeness Metric Values:
- Query 1 (When): 0.58 (length: 1.0, ground-truth coverage: 40% → 0.3*1.0 + 0.7*0.40 = 0.58)
- Query 2 (Where): 0.41 (length: 1.0, ground-truth coverage: 15% → 0.3*1.0 + 0.7*0.15 = 0.41)
- Query 3 (Who): 0.93 (length: 1.0, ground-truth coverage: 90% → 0.3*1.0 + 0.7*0.90 = 0.93)
- Query 4 (What): 0.31 (length: 0.8, ground-truth coverage: 10% → low completeness)
- Query 5 (Why): 0.70 (no ground truth, has answer keywords → 0.7)
- Query 10 (How): 0.62 (length: 0.78, ground-truth coverage: 55%)
📊 Results:
   Completeness: 0.59 (varies by response quality) ✅ Informative!
```
**Formula Used**:
- With ground truth: `0.3 * (length_score) + 0.7 * (overlap_ratio)`
- Without ground truth: `0.3` (default) or `0.7` (if has answer keywords)
**Interpretation**:
- 0.1–0.3 = Poor coverage of relevant info
- 0.4–0.6 = Moderate coverage
- 0.7β1.0 = Good coverage of relevant information
---
## Comprehensive Before/After Comparison
### Test Case: Query "When was World War II?"
#### Before (Broken Metrics) ❌
```
Retrieved Documents:
- Doc1: "World War II lasted from 1939 to 1945"
- Doc2: "About 70 million people died in WW2"
- Doc3: "The war involved many countries"
Response: "World War II started in 1939 and ended in 1945."
Metrics:
├─ Utilization: 0.75 (decimal, somewhat confusing)
├─ Relevance: 0.82 (decimal, okay)
├─ Adherence: 0.85  ❌ WRONG: Should be Boolean (1.0)
├─ Completeness: 1.0  ❌ WRONG: Always 1.0, not informative
└─ Average: 0.86
```
#### After (Fixed Metrics) ✅
```
Retrieved Documents:
- Doc1: "World War II lasted from 1939 to 1945"
- Doc2: "About 70 million people died in WW2"
- Doc3: "The war involved many countries"
Response: "World War II started in 1939 and ended in 1945."
Ground Truth: "World War II occurred from 1939-1945."
Metrics:
├─ Utilization: 0.75 (uses 2/3 docs with good depth)
├─ Relevance: 0.82 (retrieved docs are relevant to query)
├─ Adherence: 1.0 ✅ CORRECT: Response fully grounded in docs
├─ Completeness: 0.85 ✅ CORRECT: Response covers 85% of ground truth info
└─ Average: 0.86 (reliable score)
```
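As a sanity check, the fixed Adherence value in this test case can be reproduced with the simplified ">50% of response words grounded" rule. A sketch only (punctuation stripped, whitespace tokenization), not the exact implementation:

```python
docs = [
    "World War II lasted from 1939 to 1945",
    "About 70 million people died in WW2",
    "The war involved many countries",
]
response = "World War II started in 1939 and ended in 1945."

doc_words = set(" ".join(docs).lower().split())
resp_words = response.lower().replace(".", "").split()

# 7 of the 10 response words appear in the documents.
ratio = sum(w in doc_words for w in resp_words) / len(resp_words)
adherence = 1.0 if ratio > 0.5 else 0.0  # grounded, matching the table above
```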
---
## Summary of Fixes
| Metric | Issue | Before | After | Benefit |
|--------|-------|--------|-------|---------|
| **Logs** | Duplicated | Multiple headers | Single header | Cleaner UI |
| **Adherence** | Wrong type | Decimal (0.67) | Boolean (1.0/0.0) | Clear grounding assessment |
| **Completeness** | Always max | Always 1.0 | Varies (0.3–1.0) | Discriminative scoring |
All metrics now align with the **RAGBench paper** definitions and provide **meaningful, actionable insights** into RAG system performance. ✅