# SupplyMind Final Eval Set

Use this small held-out set for final model comparisons:

| tier | task_id | seed |
|---|---|---:|
| easy | `v2_train_easy` | `131` |
| medium | `v2_train_medium` | `211` |
| hard | `v2_train_hard` | `307` |
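
To keep these cases from drifting between scripts, the table can be mirrored as a constant. The sketch below is one possible shape, not the project's actual code: `EvalCase`, `FINAL_EVAL_CASES`, `HELD_OUT_SEEDS`, and `assert_no_held_out_seeds` are hypothetical names, and only the tier, task_id, and seed values come from the table.

```python
from typing import Iterable, NamedTuple


class EvalCase(NamedTuple):
    """One row of the final eval table (hypothetical container)."""
    tier: str
    task_id: str
    seed: int


# Values copied from the table above.
FINAL_EVAL_CASES = [
    EvalCase("easy", "v2_train_easy", 131),
    EvalCase("medium", "v2_train_medium", 211),
    EvalCase("hard", "v2_train_hard", 307),
]

# Seeds reserved for evaluation only.
HELD_OUT_SEEDS = frozenset(case.seed for case in FINAL_EVAL_CASES)


def assert_no_held_out_seeds(training_seeds: Iterable[int]) -> None:
    """Raise ValueError if any held-out eval seed appears in training_seeds."""
    leaked = sorted(HELD_OUT_SEEDS.intersection(training_seeds))
    if leaked:
        raise ValueError(f"held-out eval seeds used for training: {leaked}")
```

Calling `assert_no_held_out_seeds(...)` on the seed list of a training run is a cheap way to catch an accidental overlap before the run starts.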

Compare the same cases across these four setups:

1. Base center + base warehouse
2. SFT center + SFT warehouse
3. GRPO center + best warehouse
4. Best center + best warehouse
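
Crossing the four setups with the three cases gives 12 runs. A minimal sketch of that grid follows; the short labels (`"base"`, `"sft"`, `"grpo"`, `"best"`), the dictionary keys, and `build_run_grid` are assumptions made for illustration, and how each label maps to a checkpoint is left to the evaluation script.

```python
from itertools import product

# (center, warehouse) pairs, in the order listed above. Labels are hypothetical.
SETUPS = [
    ("base", "base"),
    ("sft", "sft"),
    ("grpo", "best"),
    ("best", "best"),
]

# (tier, task_id, seed) rows from the eval table.
CASES = [
    ("easy", "v2_train_easy", 131),
    ("medium", "v2_train_medium", 211),
    ("hard", "v2_train_hard", 307),
]


def build_run_grid(setups=SETUPS, cases=CASES):
    """Return one run description per (setup, case) combination."""
    return [
        {
            "center": center,
            "warehouse": warehouse,
            "tier": tier,
            "task_id": task_id,
            "seed": seed,
        }
        for (center, warehouse), (tier, task_id, seed) in product(setups, cases)
    ]
```

Because every setup sees the same task_id and seed, score differences between setups are attributable to the models rather than to the sampled scenario.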

Primary plots:

- global score by setup
- center role score by setup
- warehouse role score by setup
- invalid payload/action counts by setup
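
Each of these plots needs one value per setup. The sketch below shows one way to aggregate per-case results before plotting, assuming each result is a dict with the hypothetical keys `setup`, `global_score`, `center_score`, `warehouse_score`, and `invalid_count`; the real result schema may differ. Scores are averaged over the cases and invalid counts are summed.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_setup(results):
    """Aggregate per-case result rows into one summary dict per setup.

    The row keys used here are assumed, not taken from the project.
    """
    grouped = defaultdict(list)
    for row in results:
        grouped[row["setup"]].append(row)

    summary = {}
    for setup, rows in grouped.items():
        summary[setup] = {
            "global_score": mean(r["global_score"] for r in rows),
            "center_score": mean(r["center_score"] for r in rows),
            "warehouse_score": mean(r["warehouse_score"] for r in rows),
            # Counts are totals across cases, not averages.
            "invalid_count": sum(r["invalid_count"] for r in rows),
        }
    return summary
```

The returned mapping can be passed directly to any plotting tool as one bar per setup and metric.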

This set is intentionally small so it can be rerun quickly during the hackathon. Do not use these seeds for further training.