# GPU and Slurm Configuration

## ALFWorld Best Checkpoints

The best ALFWorld checkpoints were trained with:

```bash
#SBATCH -p a100
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=200G
#SBATCH -t 2-00:00:00
```

Important training overrides:

```bash
trainer.n_gpus_per_node=4
trainer.nnodes=1
trainer.total_training_steps=180
trainer.save_freq=10
trainer.test_freq=10
env.env_name=alfworld/AlfredTWEnv
env.rollout.n=4
data.train_batch_size=8
data.val_batch_size=16
actor_rollout_ref.rollout.gpu_memory_utilization=0.4
actor_rollout_ref.rollout.max_model_len=3072
```
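As a quick sanity check on the overrides above: `data.train_batch_size=8` with `env.rollout.n=4` implies 8 × 4 = 32 rollouts per training step, and `save_freq=10` over 180 total steps yields 18 checkpoints. A minimal shell sketch of that arithmetic (values copied from this section, not from the repo's scripts):

```shell
# Sanity-check arithmetic for the training overrides above
TRAIN_BATCH_SIZE=8   # data.train_batch_size
ROLLOUT_N=4          # env.rollout.n
TOTAL_STEPS=180      # trainer.total_training_steps
SAVE_FREQ=10         # trainer.save_freq

echo "rollouts per step: $((TRAIN_BATCH_SIZE * ROLLOUT_N))"   # 32
echo "checkpoints saved: $((TOTAL_STEPS / SAVE_FREQ))"        # 18
```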
## Search Run

The Search run used one node with 4 A100 GPUs allocated:

```bash
#SBATCH -p a100
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=220G
#SBATCH -t 2-00:00:00
```

GPU assignment:

```bash
CUDA_VISIBLE_DEVICES=3      # local retriever service
CUDA_VISIBLE_DEVICES=0,1,2  # training
```
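Since the retriever service and the trainer share one node, the two `CUDA_VISIBLE_DEVICES` sets must be disjoint. A small bash sketch that verifies this split before launching (variable names are illustrative, not from the repo):

```shell
# Verify the retriever and training GPU sets do not overlap (requires bash
# for process substitution)
RETRIEVER_GPUS="3"
TRAIN_GPUS="0,1,2"

# comm -12 prints only lines common to both sorted lists
overlap=$(comm -12 \
  <(tr ',' '\n' <<< "$TRAIN_GPUS" | sort) \
  <(tr ',' '\n' <<< "$RETRIEVER_GPUS" | sort))

if [ -z "$overlap" ]; then
  echo "GPU sets are disjoint"
else
  echo "conflict on GPU(s): $overlap" >&2
  exit 1
fi
```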
Important Search fix:

```bash
data.max_prompt_length=6144
actor_rollout_ref.rollout.max_model_len=6144
```

This avoids the Qwen2-VL RoPE shape mismatch observed when the accumulated prompt state grew beyond 4096 tokens.
## Docker Runtime

Suggested runtime command:

```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /path/to/SkillZero:/workspace/SkillZero \
  -v /path/to/checkpoints:/workspace/SkillZero/checkpoints \
  -it skillzero:export
```

For Slurm clusters, prefer running through the provided Slurm scripts rather than plain Docker unless the cluster explicitly supports Docker or Enroot/Singularity.
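On clusters that provide Apptainer/Singularity instead of Docker, a roughly equivalent invocation might look like the following. This is a sketch only: the `skillzero_export.sif` image name is an assumption, and the command is echoed as a dry run since the container runtime is only available on the cluster.

```shell
# Dry run: echo the command instead of executing it. The .sif image name is
# hypothetical; --nv exposes the host GPUs, --bind mirrors the Docker -v mounts.
echo singularity exec --nv \
  --bind /path/to/SkillZero:/workspace/SkillZero \
  --bind /path/to/checkpoints:/workspace/SkillZero/checkpoints \
  skillzero_export.sif bash
```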