# SupplyMind Final Eval Set

Use this small held-out set for final model comparisons:

| tier | task_id | seed |
|---|---|---:|
| easy | `v2_train_easy` | `131` |
| medium | `v2_train_medium` | `211` |
| hard | `v2_train_hard` | `307` |

Compare the same cases across:

1. Base center + base warehouse
2. SFT center + SFT warehouse
3. GRPO center + best warehouse
4. Best center + best warehouse
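
One way to make sure every setup sees exactly the same cases is to enumerate the (setup, case) pairs up front. The following is a minimal sketch, not the project's runner: `EVAL_CASES`, `SETUPS`, and `build_run_matrix` are hypothetical names, and the setup strings are just labels taken from the list above that would still need to be mapped to real checkpoints.

```python
from itertools import product

# Held-out cases copied from the eval table: (tier, task_id, seed).
EVAL_CASES = [
    ("easy", "v2_train_easy", 131),
    ("medium", "v2_train_medium", 211),
    ("hard", "v2_train_hard", 307),
]

# The four setups as (center, warehouse) labels. These strings are
# placeholders (assumption); map them to actual checkpoints yourself.
SETUPS = [
    ("base", "base"),
    ("sft", "sft"),
    ("grpo", "best"),
    ("best", "best"),
]


def build_run_matrix(cases=EVAL_CASES, setups=SETUPS):
    """Return one run spec per (setup, case) pair, grouped by setup."""
    return [
        {
            "center": center,
            "warehouse": warehouse,
            "tier": tier,
            "task_id": task_id,
            "seed": seed,
        }
        for (center, warehouse), (tier, task_id, seed) in product(setups, cases)
    ]
```

With four setups and three cases this yields twelve run specs, which keeps a full rerun short.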

Primary plots:

- global score by setup
- center role score by setup
- warehouse role score by setup
- invalid payload/action counts by setup
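
Each of these plots reduces to a per-setup aggregate over the three cases. The sketch below shows one possible aggregation, assuming each run produces a record with the hypothetical fields `setup`, `global_score`, `center_score`, `warehouse_score`, and `invalid_count`; the real field names depend on the evaluation harness. Averaging the scores and summing the invalid counts is also an assumption, not something the eval set prescribes.

```python
from collections import defaultdict
from statistics import mean

# Score fields averaged per setup (hypothetical field names).
SCORE_FIELDS = ("global_score", "center_score", "warehouse_score")


def summarize_by_setup(records):
    """Group run records by setup; average scores and sum invalid counts."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["setup"]].append(record)

    summary = {}
    for setup, runs in grouped.items():
        row = {field: mean(run[field] for run in runs) for field in SCORE_FIELDS}
        row["invalid_count"] = sum(run["invalid_count"] for run in runs)
        summary[setup] = row
    return summary
```

The resulting dictionary has one row per setup, so each plot can be drawn as a single bar chart over its keys.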

This set is intentionally small so it can be rerun quickly during the hackathon. Do not use these seeds for further training.
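
A small guard in the training script can enforce the held-out rule mechanically. This is a sketch under assumptions: `HELD_OUT_SEEDS` and `assert_no_held_out_seeds` are hypothetical names, and the check is deliberately conservative in that it rejects the three seeds for any task, not only for the task each seed is paired with in the table.

```python
# Seeds reserved for the final eval set (from the eval table).
HELD_OUT_SEEDS = frozenset({131, 211, 307})


def assert_no_held_out_seeds(training_seeds):
    """Raise ValueError if any held-out eval seed appears in training_seeds."""
    leaked = sorted(HELD_OUT_SEEDS.intersection(training_seeds))
    if leaked:
        raise ValueError(f"held-out eval seeds used for training: {leaked}")
```

Calling `assert_no_held_out_seeds(seeds)` once before training starts is enough to catch an accidental overlap.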