# Evaluation Report

**Total cases evaluated:** 35

## Metrics

| Metric | Value | Target | Pass |
|--------|-------|--------|------|
| schema_pass_rate | 100.00% | 98.00% | PASS |
| evidence_coverage_rate | 100.00% | 90.00% | PASS |
| review_required_rate | 0.00% | — | — |
| unsupported_recommendation_rate | 0.00% | 2.00% | PASS |
| root_cause_consistency | 100.00% | 70.00% | PASS |
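
The Pass column can be reproduced with a minimal sketch like the one below. It assumes, as the table suggests but the report does not state, that `unsupported_recommendation_rate` is a ceiling (lower is better) while the other targets are floors. `check_metric` and `LOWER_IS_BETTER` are hypothetical names and are not taken from `eval/run_eval.py`.

```python
# Hypothetical helper for illustration; not taken from eval/run_eval.py.
# Assumption: only this metric is "lower is better"; all others are floors.
LOWER_IS_BETTER = {"unsupported_recommendation_rate"}


def check_metric(name, value, target):
    """Return "PASS" or "FAIL", or None when the metric has no target."""
    if target is None:
        return None
    if name in LOWER_IS_BETTER:
        return "PASS" if value <= target else "FAIL"
    return "PASS" if value >= target else "FAIL"


# (value, target) pairs as fractions, copied from the Metrics table.
metrics = {
    "schema_pass_rate": (1.00, 0.98),
    "evidence_coverage_rate": (1.00, 0.90),
    "review_required_rate": (0.00, None),
    "unsupported_recommendation_rate": (0.00, 0.02),
    "root_cause_consistency": (1.00, 0.70),
}

results = {name: check_metric(name, v, t) for name, (v, t) in metrics.items()}
for name, verdict in results.items():
    print(f"{name}: {verdict}")
```

Keeping the direction of each threshold in one explicit set makes it harder to compare an error rate against its target the wrong way round.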

## Gate Distribution

- Auto-routed: 35
- Review-routed: 0
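
As a rough cross-check, the gate counts above are consistent with the `review_required_rate` in the Metrics table, assuming that metric is simply the share of review-routed cases among all cases. The function below is a hypothetical sketch of that assumed definition, not the code in `eval/run_eval.py`.

```python
def review_required_rate(auto_routed, review_routed):
    """Fraction of cases routed to review (assumed definition)."""
    total = auto_routed + review_routed
    # Guard against an empty evaluation set.
    return review_routed / total if total else 0.0


rate = review_required_rate(auto_routed=35, review_routed=0)
print(f"{rate:.2%}")  # 0.00%
```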

## Failure Modes

**Total failures detected:** 26

**Cases affected:** 24

| Mode | Count | Examples |
|------|-------|----------|
| hallucination | 23 | `case-05a46709`: Evidence not found in source: ['I was charged twice for the ; `case-0d2ab501`: Evidence not found in source: ['I was charged twice for the  |
| omission | 3 | `case-380fd7e4`: Urgent signals in text but risk_level=medium; `case-652870dc`: Outage signals in text but root_cause=billing |
| ambiguity | 0 | — |
| overconfidence | 0 | — |
| language_drift | 0 | — |
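
The total failure count (26) exceeds the number of affected cases (24), which implies that at least one case triggered more than one failure. A minimal sketch of how such a tally could be computed is shown below. The `failures` records are hypothetical and purely illustrative; they are not the report's data, and the record shape is an assumption.

```python
from collections import Counter

# Hypothetical (case_id, mode) records; illustrative only, not the report's data.
failures = [
    ("case-a", "hallucination"),
    ("case-a", "omission"),  # same case, second failure
    ("case-b", "hallucination"),
    ("case-c", "omission"),
]

# One count per failure record, grouped by mode.
mode_counts = Counter(mode for _, mode in failures)
total_failures = len(failures)
# A case counts once, however many failures it has.
cases_affected = len({case_id for case_id, _ in failures})

print(dict(mode_counts), total_failures, cases_affected)
```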

---
*Generated by eval/run_eval.py*