# SupplyMind Final Eval Set

Use this small held-out set for final model comparisons:

| tier | task_id | seed |
|---|---|---:|
| easy | `v2_train_easy` | `131` |
| medium | `v2_train_medium` | `211` |
| hard | `v2_train_hard` | `307` |
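
For scripting, the table above could be mirrored as a small constant. This is only a sketch: the `EvalCase` class and the `FINAL_EVAL_SET` name are hypothetical and not part of the SupplyMind codebase; only the tier, task ID, and seed values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One row of the final eval table (hypothetical helper type)."""
    tier: str
    task_id: str
    seed: int


# Values copied from the table; the constant name is an assumption.
FINAL_EVAL_SET = (
    EvalCase("easy", "v2_train_easy", 131),
    EvalCase("medium", "v2_train_medium", 211),
    EvalCase("hard", "v2_train_hard", 307),
)
```

Keeping the cases in a frozen, immutable structure makes it harder to change a seed by accident between runs.
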

Run the same three cases under each of these four setups (center role + warehouse role):

1. Base center + base warehouse
2. SFT center + SFT warehouse
3. GRPO center + best warehouse
4. Best center + best warehouse
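
One way to make sure every setup sees exactly the same cases is to enumerate the full setup-by-case grid up front. The sketch below assumes hypothetical checkpoint labels (`"base"`, `"sft"`, `"grpo"`, `"best"`) and a hypothetical `build_runs` helper; how each label maps to an actual checkpoint is not specified here.

```python
from itertools import product

# Setup name -> (center checkpoint label, warehouse checkpoint label).
# The labels are placeholders, not real checkpoint paths.
SETUPS = {
    "base+base": ("base", "base"),
    "sft+sft": ("sft", "sft"),
    "grpo+best": ("grpo", "best"),
    "best+best": ("best", "best"),
}

# (tier, task_id, seed) as listed in the eval table.
CASES = [
    ("easy", "v2_train_easy", 131),
    ("medium", "v2_train_medium", 211),
    ("hard", "v2_train_hard", 307),
]


def build_runs(setups, cases):
    """Return one run description per (setup, case) pair."""
    runs = []
    for (name, (center, warehouse)), (tier, task_id, seed) in product(
        setups.items(), cases
    ):
        runs.append(
            {
                "setup": name,
                "center": center,
                "warehouse": warehouse,
                "tier": tier,
                "task_id": task_id,
                "seed": seed,
            }
        )
    return runs


RUNS = build_runs(SETUPS, CASES)  # 4 setups x 3 cases = 12 runs
```
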

Primary plots:

- global score by setup
- center role score by setup
- warehouse role score by setup
- invalid payload/action counts by setup
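
The per-setup numbers behind these plots could be computed with a small aggregation step. This is a sketch under assumptions: each run is assumed to yield a record with the hypothetical keys `setup`, `global_score`, `center_score`, `warehouse_score`, and `invalid_count`; the real result format may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field names for the three score plots.
SCORE_FIELDS = ("global_score", "center_score", "warehouse_score")


def summarize_by_setup(records):
    """Average each score and total the invalid counts per setup."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["setup"]].append(record)

    summary = {}
    for setup, rows in grouped.items():
        entry = {field: mean(row[field] for row in rows) for field in SCORE_FIELDS}
        # Invalid payloads/actions are summed rather than averaged.
        entry["invalid_count"] = sum(row["invalid_count"] for row in rows)
        summary[setup] = entry
    return summary
```

Each key of the returned dictionary is one bar group in a "by setup" plot, so the same summary can feed all four charts.
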

This set is intentionally small so that it can be rerun quickly during the hackathon. Do not use these seeds for further training; otherwise the set is no longer held out.
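
A cheap guard against accidental leakage is to check training seeds against the eval seeds before a training job starts. The function name `assert_no_eval_leak` and the idea of passing in a list of training seeds are assumptions for illustration; only the three seed values come from this document.

```python
# Seeds reserved for the final eval set (from the table in this document).
FINAL_EVAL_SEEDS = frozenset({131, 211, 307})


def assert_no_eval_leak(training_seeds):
    """Raise ValueError if any training seed is reserved for final eval."""
    leaked = sorted(FINAL_EVAL_SEEDS.intersection(training_seeds))
    if leaked:
        raise ValueError(f"training seeds overlap the final eval set: {leaked}")
```

Calling this once at the top of a training script turns a silent contamination into an immediate failure.
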