# RL Nemotron-Bash FP8 Terminus-2 (Qwen3-32B)
RL-trained Qwen3-32B for terminal-based coding tasks (nl2bash with test verification), using FP8 inference for faster generation during training.
## Training Details
- Base model: laion/sft_GLM-4-7-swesmith-sandboxes-with_tests-oracle_verified_120s-maxeps-131k_Qwen3-32B
- Dataset: nemotron-bash-withtests (10k tasks with test verification)
- Algorithm: RLOO-N (fully asynchronous RL)
- Agent: terminus-2 (multi-turn terminal agent with thinking)
- Inference: FP8 quantized inference (per-tensor FP8 with batched weight sync)
- Training precision: BF16 (FSDP2)
- Hardware: 12 nodes × 4 GH200 GPUs (Jupiter/JSC)
- Steps: 48
- Learning rate: 5e-6
- Batch size: 64
- Max staleness: 8
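For context, RLOO uses a leave-one-out baseline: each of the N sampled rollouts for a task is scored against the mean reward of the other N−1 rollouts. A minimal sketch of that advantage computation (function name and reward convention are illustrative, not taken from the training code):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample's baseline is the mean
    reward of the *other* samples for the same prompt."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# e.g. 8 rollouts of one task, reward = 1.0 if the tests pass else 0.0
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
```

By construction the advantages sum to zero within each group, so passing rollouts are pushed up exactly as much as failing ones are pushed down.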
## FP8 Inference
During training, vLLM inference engines run in FP8 precision for ~1.5x throughput speedup. After each training step, BF16 weights are broadcast from the FSDP trainer and requantized to FP8 on the inference engines. This model checkpoint is saved in BF16 (the standard HuggingFace format).
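Per-tensor FP8 quantization picks a single scale per weight tensor so that the largest absolute value maps onto the FP8 dynamic range (448 is the largest finite value in E4M3). A pure-Python sketch of the idea, not the actual vLLM or trainer code, and with rounding to the FP8 grid omitted:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_per_tensor(weights):
    """Per-tensor FP8-style quantization: one scale for the whole
    tensor, chosen so max|w| lands on the FP8 maximum. Values are
    clamped to the representable range."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate BF16-range values from quantized ones."""
    return [v * scale for v in q]
```

In the training loop described above, this quantize step would run on the inference engines after every weight broadcast, while the FSDP trainer keeps the master weights in BF16.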
## Performance
Average pass@8 on nemotron-bash-withtests during training: ~0.43 (comparable to BF16 baseline at ~0.42).
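A common way to compute pass@k per task is the unbiased estimator from the Codex evaluation setup: given n generated samples of which c pass the tests, estimate the probability that at least one of k samples is correct. Whether this run used exactly this estimator is not stated here; the sketch is for reference:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples
    drawn without replacement from n generations is correct, given
    that c of the n generations passed verification."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all tasks in nemotron-bash-withtests gives a training-time pass@8 curve comparable to the ~0.43 reported above.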