# GPU and Slurm Configuration
## ALFWorld Best Checkpoints
The best ALFWorld checkpoints were trained with the following Slurm allocation:
```bash
#SBATCH -p a100
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=200G
#SBATCH -t 2-00:00:00
```
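Taken together, a complete submission script might look like the sketch below. The environment-activation step and the launcher script name (`run_alfworld.sh`) are assumptions for illustration, not taken from this repo:

```bash
#!/bin/bash
#SBATCH -p a100
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=200G
#SBATCH -t 2-00:00:00

# Activate the project environment (placeholder; adapt to your cluster).
source activate skillzero

# Launch training with the overrides listed in the next section
# (the script name here is hypothetical).
bash run_alfworld.sh
```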
Important training overrides:
```bash
trainer.n_gpus_per_node=4
trainer.nnodes=1
trainer.total_training_steps=180
trainer.save_freq=10
trainer.test_freq=10
env.env_name=alfworld/AlfredTWEnv
env.rollout.n=4
data.train_batch_size=8
data.val_batch_size=16
actor_rollout_ref.rollout.gpu_memory_utilization=0.4
actor_rollout_ref.rollout.max_model_len=3072
```
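These are Hydra-style overrides in the verl convention. A full launch line might look like the sketch below; the entry point `verl.trainer.main_ppo` is verl's usual PPO entry point and is an assumption about this repo's launcher, not confirmed by it:

```bash
# Sketch: verl-style Hydra overrides appended to the trainer entry point.
# The module path is an assumption; substitute the repo's actual launcher.
python3 -m verl.trainer.main_ppo \
  trainer.n_gpus_per_node=4 \
  trainer.nnodes=1 \
  trainer.total_training_steps=180 \
  trainer.save_freq=10 \
  trainer.test_freq=10 \
  env.env_name=alfworld/AlfredTWEnv \
  env.rollout.n=4 \
  data.train_batch_size=8 \
  data.val_batch_size=16 \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
  actor_rollout_ref.rollout.max_model_len=3072
```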
## Search Run
The Search run was allocated a single node with 4 A100 GPUs:
```bash
#SBATCH -p a100
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=220G
#SBATCH -t 2-00:00:00
```
GPU assignment:
```bash
CUDA_VISIBLE_DEVICES=3 # local retriever service
CUDA_VISIBLE_DEVICES=0,1,2 # training
```
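In a job script, this split is realized by setting `CUDA_VISIBLE_DEVICES` per process, since the variable is scoped to each child. The snippet below demonstrates the scoping with placeholder `echo` commands standing in for the retriever service and the trainer:

```bash
# Pin the retriever to GPU 3 and the trainer to GPUs 0-2.
# The echo commands are placeholders for the real service/training commands.
CUDA_VISIBLE_DEVICES=3 sh -c 'echo "retriever sees GPUs: $CUDA_VISIBLE_DEVICES"' &
wait  # in practice the retriever keeps running in the background
CUDA_VISIBLE_DEVICES=0,1,2 sh -c 'echo "trainer sees GPUs: $CUDA_VISIBLE_DEVICES"'
```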
Important Search fix:
```bash
data.max_prompt_length=6144
actor_rollout_ref.rollout.max_model_len=6144
```
This avoids the Qwen2-VL RoPE shape mismatch that was observed when the accumulated prompt state exceeded 4096 tokens.
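A simple pre-flight guard (a sketch, not part of this repo) can catch this class of mismatch before a job is submitted, by checking that the configured prompt budget fits within the model context:

```bash
# Hypothetical sanity check: the prompt budget must fit in the model context.
MAX_PROMPT_LENGTH=6144   # mirrors data.max_prompt_length
MAX_MODEL_LEN=6144       # mirrors actor_rollout_ref.rollout.max_model_len
if [ "$MAX_PROMPT_LENGTH" -gt "$MAX_MODEL_LEN" ]; then
  echo "config error: max_prompt_length > max_model_len" >&2
  exit 1
fi
echo "ok: prompt budget $MAX_PROMPT_LENGTH fits in context $MAX_MODEL_LEN"
```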
## Docker Runtime
Suggested runtime command:
```bash
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  -v /path/to/SkillZero:/workspace/SkillZero \
  -v /path/to/checkpoints:/workspace/SkillZero/checkpoints \
  -it skillzero:export
```
On Slurm clusters, prefer the provided Slurm scripts over plain Docker unless the cluster explicitly supports Docker or a container runtime such as Enroot or Singularity.
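On clusters that provide Apptainer/Singularity rather than Docker, an equivalent invocation might look like the sketch below, assuming the Docker image is first converted to a `.sif` file (the `.sif` filename is hypothetical):

```bash
# Convert the Docker image once (requires a local Docker daemon holding it),
# then run with GPU access and the same bind mounts as the docker command above.
apptainer build skillzero_export.sif docker-daemon://skillzero:export
apptainer exec --nv \
  --bind /path/to/SkillZero:/workspace/SkillZero \
  --bind /path/to/checkpoints:/workspace/SkillZero/checkpoints \
  skillzero_export.sif bash
```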