# RL Nemotron-Bash FP8 Terminus-2 (Qwen3-32B)
RL-trained Qwen3-32B for terminal-based coding tasks (nl2bash with test verification), using FP8 inference for faster generation during training.
## Training Details
- Base model: laion/sft_GLM-4-7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k_Qwen3-32B
- Dataset: nemotron-bash-withtests (10k tasks with test verification)
- Algorithm: RLOO-N (fully asynchronous RL)
- Agent: terminus-2 (multi-turn terminal agent with thinking)
- Inference: FP8 quantized inference (per-tensor FP8 with batched weight sync)
- Training precision: BF16 (FSDP2)
- Hardware: 12 nodes × 4 GH200 GPUs (Jupiter/JSC)
- Steps: 48
- Learning rate: 5e-6
- Batch size: 64
- Max staleness: 8
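For context, RLOO uses a leave-one-out baseline: each of the N sampled rollouts for a task is scored against the mean reward of the other N−1 rollouts. A minimal sketch of that advantage computation (function name and reward convention are illustrative, not taken from the training code):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's baseline is the mean
    reward of the *other* samples for the same prompt."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# e.g. 8 rollouts of one task, reward = 1.0 if the tests pass else 0.0
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
```

By construction the advantages sum to zero within each group, so passing rollouts are pushed up exactly as much as failing ones are pushed down.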
## FP8 Inference
During training, vLLM inference engines run in FP8 precision for ~1.5x throughput speedup. After each training step, BF16 weights are broadcast from the FSDP trainer and requantized to FP8 on the inference engines. This model checkpoint is saved in BF16 (the standard HuggingFace format).
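Per-tensor FP8 quantization picks a single scale per weight tensor so that the largest absolute value maps onto the FP8 dynamic range (448 is the largest finite value in E4M3). A pure-Python sketch of the idea, not the actual vLLM or trainer code, and with rounding to the FP8 grid omitted:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_per_tensor(weights):
    """Per-tensor FP8-style quantization: one scale for the whole
    tensor, chosen so max|w| lands on the FP8 maximum. Values are
    clamped to the representable range."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate BF16-range values from quantized ones."""
    return [v * scale for v in q]
```

In the training loop described above, this quantize step would run on the inference engines after every weight broadcast, while the FSDP trainer keeps the master weights in BF16.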
## Performance
Average pass@8 on nemotron-bash-withtests during training: ~0.43 (comparable to BF16 baseline at ~0.42).
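A common way to compute pass@k per task is the unbiased estimator from the Codex evaluation setup: given n generated samples of which c pass the tests, estimate the probability that at least one of k samples is correct. Whether this run used exactly this estimator is not stated here; the sketch is for reference:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct, given
    that c of the n generations passed verification."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all tasks in nemotron-bash-withtests gives a training-time pass@8 curve comparable to the ~0.43 reported above.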