Evaluation Report

Total cases evaluated: 35

| Metric | Value | Target | Pass |
|---|---|---|---|
| schema_pass_rate | 100.00% | 98.00% | PASS |
| evidence_coverage_rate | 100.00% | 90.00% | PASS |
| review_required_rate | 0.00% | — | — |
| unsupported_recommendation_rate | 0.00% | 2.00% | PASS |
| root_cause_consistency | 100.00% | 70.00% | PASS |
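
The Pass column can be read as a comparison of each metric against its target. The following is a minimal sketch of that comparison, not the actual logic of eval/run_eval.py. It assumes that targets are lower bounds, except for unsupported_recommendation_rate, which is treated as an upper bound; that direction is inferred from the table. The names `TARGETS` and `check_metric` are hypothetical.

```python
# Hypothetical sketch: not taken from eval/run_eval.py.
# Each target is (threshold, direction). The direction is an assumption
# inferred from the table: "min" means value >= threshold passes,
# "max" means value <= threshold passes.
TARGETS = {
    "schema_pass_rate": (0.98, "min"),
    "evidence_coverage_rate": (0.90, "min"),
    "review_required_rate": None,  # the report lists no target
    "unsupported_recommendation_rate": (0.02, "max"),
    "root_cause_consistency": (0.70, "min"),
}


def check_metric(name, value):
    """Return "PASS" or "FAIL", or None when the metric has no target."""
    target = TARGETS.get(name)
    if target is None:
        return None
    threshold, direction = target
    if direction == "min":
        ok = value >= threshold
    else:
        ok = value <= threshold
    return "PASS" if ok else "FAIL"


# Values as reported in the table above, expressed as fractions.
observed = {
    "schema_pass_rate": 1.0,
    "evidence_coverage_rate": 1.0,
    "review_required_rate": 0.0,
    "unsupported_recommendation_rate": 0.0,
    "root_cause_consistency": 1.0,
}

results = {name: check_metric(name, value) for name, value in observed.items()}
```

Under these assumptions, every metric with a target evaluates to PASS, and review_required_rate is reported without a verdict.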

Total failures detected: 26

Cases affected: 24
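
The failure total (26) is higher than the number of affected cases (24), which is consistent with one case being able to trigger more than one failure. The sketch below shows how the two counts can diverge, assuming failures are stored as (case ID, mode) pairs. The record layout and the case IDs are hypothetical and are not taken from this report.

```python
from collections import Counter

# Hypothetical failure records as (case_id, failure_mode) pairs.
# These IDs are made up for illustration; they are not cases from the report.
failures = [
    ("case-a", "hallucination"),
    ("case-a", "omission"),  # same case, second failure
    ("case-b", "hallucination"),
]

# Every record counts toward the failure total.
total_failures = len(failures)

# A case counts once, no matter how many failures it has.
cases_affected = len({case_id for case_id, _ in failures})

# Per-mode breakdown, as in a Mode/Count table.
by_mode = Counter(mode for _, mode in failures)
```

Here `total_failures` is 3 while `cases_affected` is 2, because `case-a` appears twice.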

| Mode | Count | Examples |
|---|---|---|
| hallucination | 23 | `case-05a46709`: Evidence not found in source: ['I was charged twice for the ; `case-0d2ab501`: Evidence not found in source: ['I was charged twice for the |
| omission | 3 | `case-380fd7e4`: Urgent signals in text but risk_level=medium; `case-652870dc`: Outage signals in text but root_cause=billing |
| ambiguity | 0 | — |
| overconfidence | 0 | — |
| language_drift | 0 | — |
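
The two non-zero modes describe checks of model output against the source text: a hallucination is an evidence quote that does not appear in the source, and an omission is a signal in the text that the output fields do not reflect. A minimal sketch of both checks follows. It is not the detector used by eval/run_eval.py: the function names, the keyword list, and the set of risk levels counted as elevated are all assumptions made for illustration.

```python
# Hypothetical sketch of the two checks; not the report's actual detector.
URGENT_KEYWORDS = ("urgent", "immediately", "asap")  # assumed keyword list
ELEVATED_RISK_LEVELS = ("high", "critical")          # assumed level names


def _normalize(text):
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return " ".join(text.lower().split())


def find_unsupported_evidence(source_text, evidence_quotes):
    """Return the evidence quotes that do not occur in the source text."""
    source = _normalize(source_text)
    return [q for q in evidence_quotes if _normalize(q) not in source]


def has_urgency_omission(source_text, risk_level):
    """True when the text has an urgent keyword but the risk level is not elevated."""
    text = _normalize(source_text)
    has_signal = any(keyword in text for keyword in URGENT_KEYWORDS)
    return has_signal and risk_level not in ELEVATED_RISK_LEVELS


# Illustrative input, not a case from the report.
source = "I was charged twice for the same order. Please fix this ASAP."
missing = find_unsupported_evidence(
    source, ["I was charged twice", "my account was locked"]
)
omission = has_urgency_omission(source, "medium")
```

In this example `missing` contains only the second quote, and `omission` is `True` because the text contains "ASAP" while the risk level is `medium`. Substring matching is a deliberately strict choice: a paraphrased quote would also be flagged, so a real detector may need fuzzier matching.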

Generated by eval/run_eval.py