supplymind / docs /final_eval_set.md

Rishav

Add final eval set

0af60f8 2 months ago

	# SupplyMind Final Eval Set

	Use this small held-out set for final model comparisons:

	| tier | task_id | seed |
	|---|---|---:|
	| easy | `v2_train_easy` | `131` |
	| medium | `v2_train_medium` | `211` |
	| hard | `v2_train_hard` | `307` |

	Compare the same cases across:

	1. Base center + base warehouse
	2. SFT center + SFT warehouse
	3. GRPO center + best warehouse
	4. Best center + best warehouse
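
The comparison above amounts to running every held-out case under every setup. A minimal sketch of that run matrix follows. It assumes nothing about the actual SupplyMind runner: `EVAL_CASES`, `SETUPS`, and `build_run_matrix` are hypothetical names, and only the tiers, task IDs, seeds, and setup labels are taken from this document.

```python
from itertools import product

# Held-out cases copied from the table above: (tier, task_id, seed).
EVAL_CASES = [
    ("easy", "v2_train_easy", 131),
    ("medium", "v2_train_medium", 211),
    ("hard", "v2_train_hard", 307),
]

# Setup labels copied from the list above: (center model, warehouse model).
# The short strings are illustrative tags, not real checkpoint names.
SETUPS = [
    ("base", "base"),
    ("sft", "sft"),
    ("grpo", "best"),
    ("best", "best"),
]


def build_run_matrix(cases=EVAL_CASES, setups=SETUPS):
    """Return one run spec per (setup, case) pair so every setup sees identical cases."""
    return [
        {"center": center, "warehouse": warehouse,
         "tier": tier, "task_id": task_id, "seed": seed}
        for (center, warehouse), (tier, task_id, seed) in product(setups, cases)
    ]


runs = build_run_matrix()
```

With three cases and four setups this yields twelve runs, which keeps a full rerun short.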

	Primary plots:

	- global score by setup
	- center role score by setup
	- warehouse role score by setup
	- invalid payload/action counts by setup
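
Each of these plots needs one aggregate value per setup. The sketch below shows one way to compute them, assuming each run produces a record with the hypothetical fields `setup`, `global_score`, `center_score`, `warehouse_score`, and `invalid_count`; the real result schema may differ. Averaging scores and summing invalid counts is also an assumption, not something this document specifies.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical score fields, one per plot listed above.
SCORE_KEYS = ("global_score", "center_score", "warehouse_score")


def summarize_by_setup(records):
    """Group run records by setup, average each score, and total the invalid counts."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["setup"]].append(rec)

    summary = {}
    for setup, recs in grouped.items():
        row = {key: mean(r[key] for r in recs) for key in SCORE_KEYS}
        row["invalid_count"] = sum(r["invalid_count"] for r in recs)
        summary[setup] = row
    return summary
```

The returned dictionary maps each setup name to one row of values, which can be passed directly to a bar chart per metric.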

	This set is intentionally small so it can be rerun quickly during the hackathon. Do not use these seeds for further training.
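
One way to enforce the seed rule is a guard that runs before any training job starts. This is a sketch only: `EVAL_SEEDS` and `assert_no_eval_seeds` are hypothetical names, and how training seeds are actually listed in the project is an assumption.

```python
# Seeds reserved for the final eval set (from the table above).
EVAL_SEEDS = frozenset({131, 211, 307})


def assert_no_eval_seeds(training_seeds):
    """Raise ValueError if any reserved eval seed appears in the training seeds."""
    leaked = sorted(EVAL_SEEDS.intersection(training_seeds))
    if leaked:
        raise ValueError(f"eval seeds present in training set: {leaked}")
```

Calling `assert_no_eval_seeds(seeds)` at the top of a training script fails fast instead of silently contaminating the held-out set.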