	# Evaluation Report

	Total cases evaluated: 35

	## Metrics

	| Metric | Value | Target | Pass |
	|--------|-------|--------|------|
	| schema_pass_rate | 100.00% | 98.00% | PASS |
	| evidence_coverage_rate | 100.00% | 90.00% | PASS |
	| review_required_rate | 0.00% | — | — |
	| unsupported_recommendation_rate | 0.00% | 2.00% | PASS |
	| root_cause_consistency | 100.00% | 70.00% | PASS |
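The Pass column can be read as a threshold check of each value against its target. The sketch below is one plausible way to reproduce it, not the logic in `eval/run_eval.py`. The `MetricCheck` class and the `higher_is_better` flag are hypothetical. The comparison direction is an assumption inferred from the table: `unsupported_recommendation_rate` passes at 0.00% against a 2.00% target, which only makes sense if lower is better.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricCheck:
    """One row of the metrics table (hypothetical structure)."""
    name: str
    value: float                   # observed rate as a fraction in [0, 1]
    target: Optional[float]        # None: metric is reported but not gated
    higher_is_better: bool = True  # assumed direction of the comparison

    def status(self) -> str:
        if self.target is None:
            return "N/A"
        if self.higher_is_better:
            ok = self.value >= self.target
        else:
            ok = self.value <= self.target
        return "PASS" if ok else "FAIL"


# Values and targets copied from the table above.
CHECKS = [
    MetricCheck("schema_pass_rate", 1.00, 0.98),
    MetricCheck("evidence_coverage_rate", 1.00, 0.90),
    MetricCheck("review_required_rate", 0.00, None),
    MetricCheck("unsupported_recommendation_rate", 0.00, 0.02,
                higher_is_better=False),
    MetricCheck("root_cause_consistency", 1.00, 0.70),
]

for check in CHECKS:
    print(f"{check.name}: {check.value:.2%} -> {check.status()}")
```

Keeping the direction explicit per metric avoids silently passing an error-rate metric that exceeds its ceiling.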

	## Gate Distribution

	- Auto-routed: 35
	- Review-routed: 0
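The gate counts line up with `review_required_rate` in the metrics table if that rate is the share of review-routed cases, which is an assumption here (0 of 35 gives 0.00%). A minimal sketch of that arithmetic, with the hypothetical helper `gate_distribution` and hypothetical route labels `"auto"` and `"review"`:

```python
from collections import Counter
from typing import Dict, List, Union


def gate_distribution(routes: List[str]) -> Dict[str, Union[int, float]]:
    """Count routing decisions and derive the review rate (assumed definition)."""
    counts = Counter(routes)
    total = len(routes)
    # Guard against an empty run so the rate is defined as 0.0.
    rate = counts["review"] / total if total else 0.0
    return {
        "auto": counts["auto"],
        "review": counts["review"],
        "review_required_rate": rate,
    }


# The distribution reported above: all 35 cases auto-routed.
result = gate_distribution(["auto"] * 35)
print(result)
```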

	## Failure Modes

	- Total failures detected: 26
	- Cases affected: 24

	| Mode | Count | Examples |
	|------|-------|----------|
	| hallucination | 23 | `case-05a46709`: Evidence not found in source: ['I was charged twice for the ; `case-0d2ab501`: Evidence not found in source: ['I was charged twice for the |
	| omission | 3 | `case-380fd7e4`: Urgent signals in text but risk_level=medium; `case-652870dc`: Outage signals in text but root_cause=billing |
	| ambiguity | 0 | — |
	| overconfidence | 0 | — |
	| language_drift | 0 | — |
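The example messages suggest two kinds of checks: an evidence quote that cannot be found in the source text, and a signal in the text that the structured output does not reflect. The sketch below illustrates both and shows why the failure total can exceed the number of affected cases: one case can trigger more than one failure. Everything in it is hypothetical and is not the logic in `eval/run_eval.py`: the `Failure` tuple, the `URGENT_TERMS` keyword list, the output fields `evidence` and `risk_level`, and the three sample cases are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Failure(NamedTuple):
    case_id: str
    mode: str
    detail: str


# Hypothetical keyword list; the real signal detection is not shown in the report.
URGENT_TERMS = ("urgent", "immediately", "asap")


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so quotes match despite formatting."""
    return " ".join(text.lower().split())


def detect_failures(case_id: str, source_text: str, output: Dict) -> List[Failure]:
    failures = []
    source = _normalize(source_text)
    # Hallucination: a quoted evidence span that does not occur in the source.
    for quote in output.get("evidence", []):
        if _normalize(quote) not in source:
            failures.append(Failure(
                case_id, "hallucination",
                f"Evidence not found in source: {quote!r}"))
    # Omission: urgency in the text that the risk level does not reflect.
    risk = output.get("risk_level")
    if any(term in source for term in URGENT_TERMS) and risk != "high":
        failures.append(Failure(
            case_id, "omission",
            f"Urgent signals in text but risk_level={risk}"))
    return failures


def summarize(failures: List[Failure]) -> Dict:
    return {
        "total_failures": len(failures),
        "cases_affected": len({f.case_id for f in failures}),
        "by_mode": dict(Counter(f.mode for f in failures)),
    }


# Invented sample cases, not taken from the evaluated dataset.
cases = [
    ("case-a", "I was charged twice. Please fix this immediately.",
     {"evidence": ["I was charged three times"], "risk_level": "medium"}),
    ("case-b", "The app is slow today.",
     {"evidence": ["The app is slow"], "risk_level": "low"}),
    ("case-c", "Double charge on my card.",
     {"evidence": ["refund was promised"], "risk_level": "low"}),
]

all_failures = [f for cid, text, out in cases
                for f in detect_failures(cid, text, out)]
summary = summarize(all_failures)
print(summary)
```

In this sample, `case-a` contributes both a hallucination and an omission, so three failures are spread over two cases, the same shape as 26 failures over 24 cases in the report.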

	---
	Generated by eval/run_eval.py
