# SupplyMind Final Eval Set

Use this small held-out set for final model comparisons:

| tier | task_id | seed |
|---|---|---:|
| easy | `v2_train_easy` | `131` |
| medium | `v2_train_medium` | `211` |
| hard | `v2_train_hard` | `307` |

Run the same three cases under each of these setups (center model + warehouse model):

1. Base center + base warehouse
2. SFT center + SFT warehouse
3. GRPO center + best warehouse
4. Best center + best warehouse
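
As a rough sketch, the comparison grid above can be enumerated as the cross product of setups and cases. The names `EVAL_CASES`, `SETUPS`, and `eval_runs` and the short checkpoint labels (`"base"`, `"sft"`, `"grpo"`, `"best"`) are hypothetical placeholders, not part of the SupplyMind codebase; map them to whatever the evaluation script actually expects.

```python
from itertools import product

# The three eval cases from the table above: (tier, task_id, seed).
EVAL_CASES = [
    ("easy", "v2_train_easy", 131),
    ("medium", "v2_train_medium", 211),
    ("hard", "v2_train_hard", 307),
]

# (center checkpoint, warehouse checkpoint) per setup.
# These labels are assumptions used for illustration only.
SETUPS = [
    ("base", "base"),
    ("sft", "sft"),
    ("grpo", "best"),
    ("best", "best"),
]


def eval_runs():
    """Return one run description per (setup, case) pair."""
    return [
        {
            "center": center,
            "warehouse": warehouse,
            "tier": tier,
            "task_id": task_id,
            "seed": seed,
        }
        for (center, warehouse), (tier, task_id, seed) in product(SETUPS, EVAL_CASES)
    ]
```

With four setups and three cases this yields 12 runs, ordered setup by setup, so every setup sees exactly the same task IDs and seeds.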

Primary plots:

- global score by setup
- center role score by setup
- warehouse role score by setup
- invalid payload/action counts by setup
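
One possible way to prepare the data behind these plots is to aggregate per-run results by setup. This is a minimal sketch that assumes each run produces a dict with the hypothetical keys `setup`, `global_score`, `center_score`, `warehouse_score`, and `invalid_count`; the real result schema may differ. It averages the scores and sums the invalid counts, which is one reasonable choice among several.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical score fields; rename to match the actual result records.
SCORE_FIELDS = ("global_score", "center_score", "warehouse_score")


def summarize_by_setup(results):
    """Average each score and total the invalid counts for every setup."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["setup"]].append(row)

    summary = {}
    for setup, rows in grouped.items():
        stats = {field: mean(row[field] for row in rows) for field in SCORE_FIELDS}
        stats["invalid_count"] = sum(row["invalid_count"] for row in rows)
        summary[setup] = stats
    return summary
```

Each entry of the returned dict then corresponds to one bar (or point) per setup in the four plots.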

This set is intentionally small so that it can be rerun quickly during the hackathon. Do not use these seeds for further training; doing so would mean the set is no longer held out.
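
A small guard along these lines could help keep the eval seeds out of training runs. The function name `assert_no_eval_seeds` is hypothetical, and the sketch assumes the training code can provide its seeds as an iterable of integers.

```python
# Seeds reserved for the final eval set (from the table above).
HELD_OUT_SEEDS = frozenset({131, 211, 307})


def assert_no_eval_seeds(training_seeds):
    """Raise ValueError if any reserved eval seed appears in training_seeds."""
    leaked = HELD_OUT_SEEDS.intersection(training_seeds)
    if leaked:
        raise ValueError(f"held-out eval seeds used for training: {sorted(leaked)}")
```

Calling it once before a training job starts, for example `assert_no_eval_seeds(range(1000, 1100))`, fails fast if a reserved seed slips into the training configuration.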