# Evaluation Report
**Total cases evaluated:** 35
## Metrics
| Metric | Value | Target | Pass |
|--------|-------|--------|------|
| schema_pass_rate | 100.00% | 98.00% | PASS |
| evidence_coverage_rate | 100.00% | 90.00% | PASS |
| review_required_rate | 0.00% | N/A | N/A |
| unsupported_recommendation_rate | 0.00% | 2.00% | PASS |
| root_cause_consistency | 100.00% | 70.00% | PASS |
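
The Pass column can be read as a threshold comparison whose direction depends on the metric. The sketch below is an illustration only, not the logic in `eval/run_eval.py`: the `MetricCheck` class and the `higher_is_better` flag are hypothetical names, and it assumes `unsupported_recommendation_rate` passes when it stays at or below its target while the other metrics pass when they meet or exceed theirs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricCheck:
    """Hypothetical container for one row of the metrics table."""
    name: str
    value: float                 # observed rate as a fraction (1.00 == 100%)
    target: Optional[float]      # None when the report lists no target
    higher_is_better: bool = True

    def status(self) -> str:
        # Metrics without a target are reported but not gated.
        if self.target is None:
            return "N/A"
        if self.higher_is_better:
            ok = self.value >= self.target
        else:
            ok = self.value <= self.target
        return "PASS" if ok else "FAIL"


# Values copied from the table above.
checks = [
    MetricCheck("schema_pass_rate", 1.00, 0.98),
    MetricCheck("evidence_coverage_rate", 1.00, 0.90),
    MetricCheck("review_required_rate", 0.00, None),
    # Assumption: this is an error rate, so lower is better.
    MetricCheck("unsupported_recommendation_rate", 0.00, 0.02, higher_is_better=False),
    MetricCheck("root_cause_consistency", 1.00, 0.70),
]

statuses = {c.name: c.status() for c in checks}
for name, status in statuses.items():
    print(f"{name}: {status}")
```

Under these assumptions the four gated metrics evaluate to PASS and `review_required_rate` is reported without a gate, matching the table.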
## Gate Distribution
- Auto-routed: 35
- Review-routed: 0
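
The gate counts appear to be the source of `review_required_rate` in the metrics table. A minimal sketch, assuming the rate is simply the share of review-routed cases among all routed cases (the function name and this definition are assumptions, not taken from `eval/run_eval.py`):

```python
def review_required_rate(auto_routed: int, review_routed: int) -> float:
    """Fraction of cases sent to human review (assumed definition)."""
    total = auto_routed + review_routed
    if total == 0:
        # No cases routed at all: report 0 rather than dividing by zero.
        return 0.0
    return review_routed / total


# Counts from the gate distribution above.
rate = review_required_rate(auto_routed=35, review_routed=0)
print(f"{rate:.2%}")  # 0.00%
```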
## Failure Modes
**Total failures detected:** 26
**Cases affected:** 24
| Mode | Count | Examples |
|------|-------|----------|
| hallucination | 23 | `case-05a46709`: Evidence not found in source: ['I was charged twice for the ; `case-0d2ab501`: Evidence not found in source: ['I was charged twice for the |
| omission | 3 | `case-380fd7e4`: Urgent signals in text but risk_level=medium; `case-652870dc`: Outage signals in text but root_cause=billing |
| ambiguity | 0 | N/A |
| overconfidence | 0 | N/A |
| language_drift | 0 | N/A |
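
The hallucination examples above report evidence strings that were "not found in source". One plausible way to detect this is a verbatim-quote check; the sketch below is a guess at that kind of check, not the detector used to produce this report. The function name, the whitespace and case normalization, and the sample text are all hypothetical.

```python
from typing import List


def _normalize(text: str) -> str:
    # Lowercase and collapse runs of whitespace so trivial
    # formatting differences do not count as a mismatch.
    return " ".join(text.lower().split())


def find_unsupported_evidence(source_text: str, evidence: List[str]) -> List[str]:
    """Return the evidence quotes that do not occur verbatim in the source."""
    normalized_source = _normalize(source_text)
    return [quote for quote in evidence if _normalize(quote) not in normalized_source]


# Hypothetical input, not taken from the evaluated cases.
source = "My invoice shows two charges this month.  Please refund one."
quotes = ["two charges this month", "I was charged twice"]

unsupported = find_unsupported_evidence(source, quotes)
print(unsupported)  # ['I was charged twice']
```

A check of this kind flags paraphrased evidence as well as invented evidence, so a high hallucination count may partly reflect quotes that were reworded rather than fabricated.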
---
*Generated by eval/run_eval.py*