
Evaluation Report

Total cases evaluated: 35

Metrics

Metric                           Value    Target  Pass
schema_pass_rate                 100.00%  98.00%  PASS
evidence_coverage_rate           100.00%  90.00%  PASS
review_required_rate             0.00%
unsupported_recommendation_rate  0.00%    2.00%   PASS
root_cause_consistency           100.00%  70.00%  PASS
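
The Pass column can be read as a comparison of each value against its target. The sketch below is a minimal illustration, not the logic of eval/run_eval.py: the `METRICS` layout, the `check` helper, and the comparison directions are assumptions. It treats `unsupported_recommendation_rate` as an upper bound (lower is better) and the other targeted metrics as lower bounds, and leaves `review_required_rate` ungraded because the report lists no target for it.

```python
from typing import Optional

# Hypothetical layout: (value, target, direction), with rates as fractions.
# "min" means the value must reach the target; "max" means it must not exceed it.
# The directions are assumptions inferred from the metric names.
METRICS = {
    "schema_pass_rate": (1.00, 0.98, "min"),
    "evidence_coverage_rate": (1.00, 0.90, "min"),
    "review_required_rate": (0.00, None, None),  # reported without a target
    "unsupported_recommendation_rate": (0.00, 0.02, "max"),
    "root_cause_consistency": (1.00, 0.70, "min"),
}


def check(value: float, target: Optional[float], direction: Optional[str]) -> Optional[bool]:
    """Return True/False for a graded metric, or None when no target is set."""
    if target is None:
        return None
    if direction == "min":
        return value >= target
    return value <= target


results = {name: check(*spec) for name, spec in METRICS.items()}

for name, passed in results.items():
    label = "n/a" if passed is None else ("PASS" if passed else "FAIL")
    print(f"{name}: {label}")
```

Keeping the direction next to each target avoids silently grading an error rate with a "higher is better" rule.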

Gate Distribution

  • Auto-routed: 35
  • Review-routed: 0

Failure Modes

Total failures detected: 26
Cases affected: 24

Mode            Count  Examples
hallucination   23     case-05a46709: Evidence not found in source: ['I was charged twice for the ; case-0d2ab501: Evidence not found in source: ['I was charged twice for the
omission        3      case-380fd7e4: Urgent signals in text but risk_level=medium; case-652870dc: Outage signals in text but root_cause=billing
ambiguity       0
overconfidence  0
language_drift  0
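
The example messages suggest two kinds of check: a quoted evidence string that does not appear in the source text (hallucination), and a signal in the text that the structured output does not reflect (omission). The sketch below shows one way such checks could be written and tallied. Everything in it is hypothetical: the record fields (`source_text`, `evidence`, `risk_level`), the `URGENT_TERMS` keyword list, the assumption that urgent text should map to `risk_level` "high", and the three sample cases, which are invented for illustration and are not cases from this report.

```python
from collections import Counter

# Assumed keyword list; the real detector's signals are not shown in the report.
URGENT_TERMS = ("urgent", "immediately", "asap")


def find_failures(case: dict) -> list[tuple[str, str]]:
    """Return (mode, message) pairs for one case record (hypothetical schema)."""
    failures = []
    source = case["source_text"]

    # Hallucination: every evidence quote should be a verbatim substring of the source.
    missing = [quote for quote in case["evidence"] if quote not in source]
    if missing:
        failures.append(("hallucination", f"Evidence not found in source: {missing}"))

    # Omission: urgent wording in the text but a risk level below "high" (assumed rule).
    lowered = source.lower()
    if any(term in lowered for term in URGENT_TERMS) and case["risk_level"] != "high":
        failures.append(
            ("omission", f"Urgent signals in text but risk_level={case['risk_level']}")
        )
    return failures


# Invented sample records, for illustration only.
cases = [
    {"id": "case-a", "source_text": "Please fix this immediately, the app is down.",
     "evidence": ["the app crashed"], "risk_level": "medium"},
    {"id": "case-b", "source_text": "I was billed twice this month.",
     "evidence": ["I was charged twice"], "risk_level": "low"},
    {"id": "case-c", "source_text": "Thanks for the quick reply.",
     "evidence": ["quick reply"], "risk_level": "low"},
]

mode_counts = Counter()
affected = set()
for case in cases:
    for mode, _message in find_failures(case):
        mode_counts[mode] += 1
        affected.add(case["id"])

total_failures = sum(mode_counts.values())
print(f"Total failures detected: {total_failures}")
print(f"Cases affected: {len(affected)}")
```

In this sample, case-a triggers both checks, so the failure total exceeds the number of affected cases. The same effect accounts for a report showing 26 failures across 24 cases: at least one case must carry more than one failure.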

Generated by eval/run_eval.py