routercore / docs /baseline_eval.md
Jayteare's picture
Deploy RouterCore Gradio demo
1137e50 verified
# Baseline Evaluation
## Baseline Summary
The deterministic `FakeRouter` establishes a conservative pre-fine-tuning baseline. It shows what RouterCore can do with keyword routing plus schema validation and an authoritative policy layer, before adding a prompted base model or fine-tuned router.
| Metric | Score |
| --- | ---: |
| `json_validity_rate` | 100.00% |
| `workflow_accuracy` | 97.01% |
| `status_accuracy` | 57.33% |
| `required_field_presence_accuracy` | 28.57% |
| `unsafe_rejection_accuracy` | 100.00% |
| `false_route_rate` | 0.00% |
## Interpretation
JSON validity is perfect because `FakeRouter` emits structured output.
Workflow accuracy is high because keyword routing can usually identify the broad workflow.
Status accuracy is limited because routing state decisions require more nuance than simple keyword matching.
Required-field presence accuracy is low because the deterministic router does not reliably extract complete structured parameters.
Unsafe rejection accuracy and false route rate are strong because the policy layer is conservative and authoritative.
## Why Fine-Tuning Is Still Needed
Fine-tuning should target:
- Better parameter extraction
- Better status classification
- Better clarification question generation
- Better handling of ambiguous phrasing
The goal is not to replace the policy layer. The goal is to improve the router proposal while preserving validation and policy redundancy.
## Track 2 Framing
This creates a clear Track 2 fine-tuning objective:
> Improve structured routing quality while preserving safe-routing behavior.