# Evaluation Report

**Total cases evaluated:** 35

## Metrics

| Metric | Value | Target | Pass |
|--------|-------|--------|------|
| schema_pass_rate | 100.00% | 98.00% | PASS |
| evidence_coverage_rate | 100.00% | 90.00% | PASS |
| review_required_rate | 0.00% | n/a | n/a |
| unsupported_recommendation_rate | 0.00% | 2.00% | PASS |
| root_cause_consistency | 100.00% | 70.00% | PASS |
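
The Pass column above can be read as a threshold check per metric. The sketch below is one way such a gate might be computed; it is not taken from `eval/run_eval.py`. It assumes that `unsupported_recommendation_rate` is a "lower is better" metric (pass when value is at or below target), that the other targeted metrics are "higher is better", and that a metric with no target is reported but not gated. The names `LOWER_IS_BETTER`, `gate`, and `REPORT` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Assumption: only this metric passes when it stays at or below its target.
LOWER_IS_BETTER = {"unsupported_recommendation_rate"}


def gate(metric: str, value: float, target: Optional[float]) -> Optional[bool]:
    """Return True/False for a gated metric, or None when no target is set."""
    if target is None:
        return None  # reported only, e.g. review_required_rate
    if metric in LOWER_IS_BETTER:
        return value <= target
    return value >= target


# Values and targets copied from the Metrics table, as fractions.
REPORT: Dict[str, Tuple[float, Optional[float]]] = {
    "schema_pass_rate": (1.00, 0.98),
    "evidence_coverage_rate": (1.00, 0.90),
    "review_required_rate": (0.00, None),
    "unsupported_recommendation_rate": (0.00, 0.02),
    "root_cause_consistency": (1.00, 0.70),
}

results = {name: gate(name, value, target) for name, (value, target) in REPORT.items()}

for name, passed in results.items():
    label = "n/a" if passed is None else ("PASS" if passed else "FAIL")
    print(f"{name}: {label}")
```

Under these assumptions the four targeted metrics come out as PASS and `review_required_rate` as n/a, matching the table.
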

## Gate Distribution

- Auto-routed: 35
- Review-routed: 0

## Failure Modes

**Total failures detected:** 26

**Cases affected:** 24

| Mode | Count | Examples |
|------|-------|----------|
| hallucination | 23 | `case-05a46709`: Evidence not found in source: ['I was charged twice for the ; `case-0d2ab501`: Evidence not found in source: ['I was charged twice for the |
| omission | 3 | `case-380fd7e4`: Urgent signals in text but risk_level=medium; `case-652870dc`: Outage signals in text but root_cause=billing |
| ambiguity | 0 | n/a |
| overconfidence | 0 | n/a |
| language_drift | 0 | n/a |
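
The hallucination rows report evidence strings that were not found in the source text. The actual check in `eval/run_eval.py` is not shown in this report. The sketch below is one plausible version, assuming each evidence quote must appear in the source as a substring after lowercasing and whitespace normalization. The function names and the sample text are hypothetical.

```python
import re
from typing import List


def _normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so formatting differences do not count."""
    return re.sub(r"\s+", " ", text).strip().lower()


def missing_evidence(source: str, quotes: List[str]) -> List[str]:
    """Return the quotes that do not occur in the source after normalization."""
    normalized_source = _normalize(source)
    return [q for q in quotes if _normalize(q) not in normalized_source]


# Hypothetical ticket text and extracted evidence quotes.
source_text = "My invoice shows   two payments this month.\nPlease advise."
evidence = ["two payments this month", "refund was denied"]

flagged = missing_evidence(source_text, evidence)
print(flagged)
```

A check this strict also flags evidence that was paraphrased or truncated rather than invented, so a high hallucination count may partly reflect the matching rule.
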

---

*Generated by eval/run_eval.py*