# SupplyMind Final Eval Set
Use this small held-out set for final model comparisons:
| tier | task_id | seed |
|---|---|---:|
| easy | `v2_train_easy` | `131` |
| medium | `v2_train_medium` | `211` |
| hard | `v2_train_hard` | `307` |
Compare the same cases across:
1. Base center + base warehouse
2. SFT center + SFT warehouse
3. GRPO center + best warehouse
4. Best center + best warehouse
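
As a rough illustration of "same cases across setups", the sketch below crosses the three held-out cases with the four pairings to get one run spec per combination. The setup labels (`base`, `sft`, `grpo`, `best`) and the dictionary keys are hypothetical shorthand, not names taken from the SupplyMind code.

```python
from itertools import product

# Held-out cases copied from the table above: (tier, task_id, seed).
EVAL_CASES = [
    ("easy", "v2_train_easy", 131),
    ("medium", "v2_train_medium", 211),
    ("hard", "v2_train_hard", 307),
]

# Hypothetical labels for the four (center, warehouse) pairings listed above.
SETUPS = [
    ("base", "base"),
    ("sft", "sft"),
    ("grpo", "best"),
    ("best", "best"),
]


def build_eval_matrix():
    """Return one run spec per (setup, case) pair so every setup sees identical cases."""
    runs = []
    for (center, warehouse), (tier, task_id, seed) in product(SETUPS, EVAL_CASES):
        runs.append(
            {
                "center": center,
                "warehouse": warehouse,
                "tier": tier,
                "task_id": task_id,
                "seed": seed,
            }
        )
    return runs
```

Calling `build_eval_matrix()` yields 4 x 3 = 12 run specs, which is small enough to rerun in full whenever a checkpoint changes.
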
Primary plots:
- global score by setup
- center role score by setup
- warehouse role score by setup
- invalid payload/action counts by setup
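
One possible way to prepare the numbers behind these plots is to average each score per setup and total the invalid counts. This is only a sketch: the record fields (`setup`, `global_score`, `center_score`, `warehouse_score`, `invalid_count`) are assumed names for whatever the evaluation harness actually logs.

```python
from collections import defaultdict
from statistics import mean

# Assumed field names for per-run results; adjust to the real log schema.
SCORE_FIELDS = ("global_score", "center_score", "warehouse_score")


def summarize_by_setup(results):
    """Group per-run records by setup, then average scores and sum invalid counts."""
    grouped = defaultdict(list)
    for record in results:
        grouped[record["setup"]].append(record)

    summary = {}
    for setup, records in grouped.items():
        # Mean of each score across the held-out cases for this setup.
        row = {field: mean(r[field] for r in records) for field in SCORE_FIELDS}
        # Invalid payloads/actions are counts, so they are summed rather than averaged.
        row["invalid_count"] = sum(r["invalid_count"] for r in records)
        summary[setup] = row
    return summary
```

The resulting dictionary has one row per setup, which maps directly onto one bar per setup in each of the four plots.
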
This set is intentionally small so it can be rerun quickly during the hackathon. Do not use these seeds for further training.
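
To make the "do not train on these seeds" rule harder to break by accident, a training script could check its seed list before starting. The helper below is a hypothetical guard, not part of the existing pipeline; only the three seed values come from the table above.

```python
# Seeds reserved for the final eval set (from the table above).
HELD_OUT_SEEDS = frozenset({131, 211, 307})


def assert_no_heldout_overlap(training_seeds):
    """Raise ValueError if any training seed is reserved for the final eval set."""
    leaked = sorted(HELD_OUT_SEEDS.intersection(training_seeds))
    if leaked:
        raise ValueError(f"Training seeds overlap the held-out eval set: {leaked}")
```

Calling `assert_no_heldout_overlap(seeds)` at the top of a training run fails fast instead of silently contaminating the final comparison.
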