Evaluation Report
Total cases evaluated: 35
Metrics
| Metric | Value | Target | Pass |
|---|---|---|---|
| schema_pass_rate | 100.00% | 98.00% | PASS |
| evidence_coverage_rate | 100.00% | 90.00% | PASS |
| review_required_rate | 0.00% | — | — |
| unsupported_recommendation_rate | 0.00% | 2.00% | PASS |
| root_cause_consistency | 100.00% | 70.00% | PASS |
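The Pass column can be reproduced with a simple threshold comparison. The sketch below is illustrative and is not taken from eval/run_eval.py: the `MetricCheck` class and its `higher_is_better` flag are hypothetical names, and it assumes that `unsupported_recommendation_rate` passes when it is at or below its target while the other targeted metrics pass when at or above theirs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricCheck:
    """Hypothetical container for one row of the metrics table."""
    name: str
    value: float                  # observed rate as a fraction (1.00 == 100%)
    target: Optional[float]       # None when the metric has no target
    higher_is_better: bool = True

    def status(self) -> Optional[str]:
        # Metrics without a target are reported but not gated.
        if self.target is None:
            return None
        if self.higher_is_better:
            ok = self.value >= self.target
        else:
            ok = self.value <= self.target
        return "PASS" if ok else "FAIL"


# Values copied from the table above; the direction of each check is an assumption.
checks = [
    MetricCheck("schema_pass_rate", 1.00, 0.98),
    MetricCheck("evidence_coverage_rate", 1.00, 0.90),
    MetricCheck("review_required_rate", 0.00, None),
    MetricCheck("unsupported_recommendation_rate", 0.00, 0.02, higher_is_better=False),
    MetricCheck("root_cause_consistency", 1.00, 0.70),
]

statuses = {check.name: check.status() for check in checks}
for name, status in statuses.items():
    print(f"{name}: {status if status is not None else 'no target'}")
```

Keeping the comparison direction per metric avoids treating an error-style rate such as `unsupported_recommendation_rate` as a failure when it is low.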
Gate Distribution
- Auto-routed: 35
- Review-routed: 0
Failure Modes
- Total failures detected: 26
- Cases affected: 24
| Mode | Count | Examples |
|---|---|---|
| hallucination | 23 | case-05a46709: Evidence not found in source: ['I was charged twice for the ; case-0d2ab501: Evidence not found in source: ['I was charged twice for the |
| omission | 3 | case-380fd7e4: Urgent signals in text but risk_level=medium; case-652870dc: Outage signals in text but root_cause=billing |
| ambiguity | 0 | — |
| overconfidence | 0 | — |
| language_drift | 0 | — |
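The hallucination examples report evidence strings that were "not found in source". One plausible way to run such a check is a normalized substring match, sketched below. The function name `find_unsupported_evidence`, the whitespace and case normalization, and the sample ticket text are all assumptions for illustration; the actual detector in the eval pipeline may work differently.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def find_unsupported_evidence(source_text: str, evidence_quotes: List[str]) -> List[str]:
    """Return the evidence quotes that do not appear verbatim in the source text.

    Hypothetical helper: an empty quote is treated as unsupported, because an
    empty string would otherwise match any source.
    """
    normalized_source = _normalize(source_text)
    unsupported = []
    for quote in evidence_quotes:
        normalized_quote = _normalize(quote)
        if not normalized_quote or normalized_quote not in normalized_source:
            unsupported.append(quote)
    return unsupported


# Made-up ticket text and quotes, not drawn from the evaluated cases.
source = "My invoice shows two payments on the same day."
quotes = ["two payments on the same day", "refund was denied"]
missing = find_unsupported_evidence(source, quotes)
print(missing)
```

A case would be flagged under the hallucination mode whenever the returned list is non-empty. Exact matching is strict by design: a paraphrased quote counts as unsupported, which may explain flagged cases where the evidence is close to, but not identical with, the source wording.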
Generated by eval/run_eval.py