CUDA / Nunchaku setup runbook — RTX PRO 4500 Blackwell (sm_120)
Everything needed to bring this box from bare to "running Nunchaku FP4 kernels + building Nunchaku from source", plus every footgun hit on the way. Written 2026-06-13. No key-man risk: follow this top to bottom.
0. The box (what you're dealing with)
- GPU: NVIDIA RTX PRO 4500 Blackwell, 32 GB, sm_120 (torch.cuda.get_device_capability()==(12,0)). Driver: CUDA 13.0. 251 GB RAM, 48 cores.
- /workspace is a network volume (persists); everything else (/usr/local, site-packages, /tmp on the overlay) is ephemeral, wiped on restart. This is the #1 footgun (see §1).
- /workspace has a ~250 GB quota (not the 2 PB the cluster df shows). Watch it; du -sh /workspace.
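To watch the quota from a script instead of by eye, something like the sketch below could work. It is a rough stand-in for du -sh, not part of the repo: the function names are made up for illustration, and the quota constant assumes the "~250 GB" above means decimal gigabytes.

```python
import os

# Assumption: "~250 GB" is read as decimal gigabytes; adjust if the quota is binary.
QUOTA_BYTES = 250 * 10**9


def dir_size_bytes(root: str) -> int:
    """Sum apparent file sizes under root (a rough `du -s`, not disk blocks)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so link targets are not counted twice
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # file vanished or is unreadable; ignore it
    return total


def quota_fraction(used_bytes: int, quota_bytes: int = QUOTA_BYTES) -> float:
    """Fraction of the quota that used_bytes represents."""
    return used_bytes / quota_bytes
```

On the box, quota_fraction(dir_size_bytes("/workspace")) gives the fraction in use. Walking a network volume can be slow, so treat it as an occasional check.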
1. ⚠️ After ANY box restart: the env + models are GONE
A restart wipes site-packages AND models/. Symptoms: ModuleNotFoundError: diffusers, models/
missing. Recover:
# (a) teacher weights (~23 GB, public, no token)
hf download black-forest-labs/FLUX.2-klein-4B --local-dir models/klein-4b
# (b) python stack — torch MUST be cu130 (cu124/cu126 have NO sm_120 kernels; cuda.is_available()
# lies and returns True, then every kernel launch fails with "no kernel image")
export PIP_CACHE_DIR=/workspace/.cache/pip
pip install torch==2.12.0 --index-url https://download.pytorch.org/whl/cu130
pip install transformers==5.10.2 torchao==0.17 accelerate safetensors
pip install "git+https://github.com/huggingface/diffusers" # for Flux2* classes
# FOOTGUN: a leftover cu124 torchvision/torchaudio breaks `Qwen3ForCausalLM` import
# (dead torchvision::nms op) -> Flux2KleinPipeline import fails. Remove them:
pip uninstall -y torchvision torchaudio
Verify: python3 -c "import torch;print(torch.__version__, torch.cuda.get_device_name(0))" →
2.12.0+cu130 NVIDIA RTX PRO 4500 Blackwell, and from diffusers import Flux2KleinPipeline works.
The Unable to import torchao Tensor objects warning is harmless (our quant math is pure-torch).
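The three checks in this section (cu130 build, no stale torchvision/torchaudio, kernels actually launch) can be scripted. A minimal sketch, assuming the helper names below, which are illustrative and not in the repo; kernel_launch_ok needs torch and the GPU, the other two are pure Python.

```python
from importlib import metadata

REQUIRED_CUDA_TAG = "cu130"
STALE_PACKAGES = ("torchvision", "torchaudio")


def torch_build_ok(version: str, required_tag: str = REQUIRED_CUDA_TAG) -> bool:
    """True if a version string such as '2.12.0+cu130' carries the required local tag."""
    _release, _sep, local = version.partition("+")
    return local == required_tag


def installed_stale_packages(names=STALE_PACKAGES) -> list[str]:
    """Return the subset of names that is currently installed."""
    found = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
        found.append(name)
    return found


def kernel_launch_ok() -> bool:
    """Launch one real CUDA kernel, since cuda.is_available() alone can report True
    on a build that has no kernels for this GPU."""
    import torch  # imported here so the pure checks above work without torch

    try:
        x = torch.ones(8, device="cuda")
        return float((x + 1).sum().item()) == 16.0
    except (RuntimeError, AssertionError):
        return False
```

On a healthy env, torch_build_ok(torch.__version__) and kernel_launch_ok() should both be true and installed_stale_packages() should be empty.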
2. Nunchaku runtime (prebuilt wheel) — for RUNNING FP4 kernels
The published wheel does NOT yet ship NunchakuFlux2Transformer2DModel (PR #926 unmerged), so you
copy the FLUX.2 loader from the checkpoint repo into the installed package.
# (a) the dev wheel matching this box (cu13.0 + torch2.12 + cp311 + linux). Canonical org is
# nunchaku-ai (nunchaku-tech / mit-han-lab 301-redirect to it). NOT plain `pip install nunchaku`.
pip install "https://github.com/nunchaku-ai/nunchaku/releases/download/v1.3.0dev20260306/nunchaku-1.3.0.dev20260306%2Bcu13.0torch2.12-cp311-cp311-linux_x86_64.whl"
# (b) the FLUX.2 checkpoint repo (self-contained: fp4+int4 transformers, Qwen3 TE, vae, loader code)
hf download tonera/FLUX.2-klein-9B-Nunchaku --local-dir models/klein-9b-nunchaku \
--exclude "transformer/diffusion_pytorch_model-*.safetensors" # skip the 18 GB bf16 shards
# FOOTGUN: hf --exclude only reliably honors ONE pattern; multi-pattern silently mis-parses.
# FOOTGUN: a Xet "Background writer channel closed" error -> retry with HF_HUB_DISABLE_XET=1.
# (c) copy the FLUX.2 loader files into the installed nunchaku package
bash scripts/setup_nunchaku.sh # transformer_flux2.py -> models/transformers/, torch_transfer_utils.py
# -> nunchaku/, common/ -> nunchaku/lora/common/
nunchaku.utils.get_precision() returns fp4 on this card (it returns int4 on Ada). Use FP4.
INT4 is unsupported-by-design on Blackwell (no INT4 tensor cores; the svdq-int4 file runs an
emulated path at 1677 ms/step — slower than bf16). NVFP4 is the only fast low-bit format here.
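A guard against silently loading the slow INT4 file could compare the reported precision with what this runbook expects. The table below encodes only the two cases stated above (sm_120 gives fp4, Ada gives int4); it is not nunchaku's own logic, and the helper names are hypothetical.

```python
# Assumption: only the two cases named in this runbook are recorded here.
EXPECTED_PRECISION = {
    (12, 0): "fp4",  # RTX PRO 4500 Blackwell (sm_120)
    (8, 9): "int4",  # Ada (sm_89)
}


def expected_precision(capability):
    """Look up the precision this runbook expects for a (major, minor) capability."""
    return EXPECTED_PRECISION.get(tuple(capability))


def precision_mismatch(reported: str, capability):
    """Return a message if reported disagrees with the table, else None."""
    want = expected_precision(capability)
    arch = f"sm_{capability[0]}{capability[1]}"
    if want is None:
        return f"no expectation recorded for {arch}"
    if reported != want:
        return f"expected {want} on {arch}, got {reported}"
    return None
```

On the box, feed it nunchaku.utils.get_precision() and torch.cuda.get_device_capability(); anything other than None deserves a look before loading a checkpoint.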
3. Building Nunchaku from SOURCE — for MODIFYING kernels
Needed only if you want to change kernels. The CUDA runfile installer DOES NOT WORK headless: it fails with cannot create /dev/tty, and under script it reports a misleading "Extraction failed... not enough space" error.
Use conda CUDA 13 instead.
# (a) toolchain: Miniforge + conda CUDA 13.0 (matches torch cu130; supports sm_120a)
curl -sL -o miniforge.sh https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash miniforge.sh -b -p /workspace/miniforge3
/workspace/miniforge3/bin/mamba create -y -n nuncbuild -c nvidia -c conda-forge cuda-toolkit=13.0
pip install cmake ninja build wheel pybind11 # cmake/ninja are NOT preinstalled
# (b) source
git clone --recursive --depth 1 https://github.com/nunchaku-ai/nunchaku.git /workspace/build_nunchaku/src
# (c) build — see scripts referenced below; the do_build.sh in /workspace/build_nunchaku captures it.
The four footguns that make or break the build (all handled in /workspace/build_nunchaku/do_build.sh):
- Use the SYSTEM python (/usr/bin/python3, has torch), NOT conda's python. Do not prepend conda bin to PATH or python3 resolves to the torch-less conda python (No module named torch).
- Host compiler must be system g++ 11.4 (CUDAHOSTCXX=/usr/bin/g++). conda's gcc-14.3 sysroot breaks nvcc 13 (_Float32 undefined, bf16-literal errors).
- conda CUDA header/lib layout is $CUDA_HOME/targets/x86_64-linux/{include,lib}, but torch's -I/-L look in $CUDA_HOME/{include,lib64}. Symlink them in, AND symlink NVTX (buried under nsight-compute):
  ln -sfn $CUDA_HOME/targets/x86_64-linux/include/* $CUDA_HOME/include/
  mkdir -p $CUDA_HOME/lib64
  ln -sfn $CUDA_HOME/targets/x86_64-linux/lib/* $CUDA_HOME/lib64/
  ln -sfn $CUDA_HOME/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3 $CUDA_HOME/include/nvtx3
  Without these: fatal error: cuda_runtime_api.h then nvtx3/nvToolsExt.h: No such file.
- NUNCHAKU_INSTALL_MODE=FAST builds for the current GPU only (sm_120a) → ~5-10 min. Build a wheel (setup.py bdist_wheel), not editable develop (which globally shadows the working wheel if it fails). After installing your wheel, re-run scripts/setup_nunchaku.sh (pip install replaces the loader files).
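The symlink step is the easiest one to get half-right, so here is the same layout fix as a Python sketch. It is an illustration, not what do_build.sh contains; the function names are made up. One deliberate difference from ln -sfn: an existing regular file at the destination is left untouched rather than replaced.

```python
import glob
import os


def link_children(src_dir: str, dst_dir: str) -> int:
    """Symlink every entry of src_dir into dst_dir; return how many links were made."""
    src_dir = os.path.abspath(src_dir)
    os.makedirs(dst_dir, exist_ok=True)
    if os.path.realpath(src_dir) == os.path.realpath(dst_dir):
        return 0  # dst already points at src; nothing to do
    made = 0
    for name in sorted(os.listdir(src_dir)):
        link = os.path.join(dst_dir, name)
        if os.path.islink(link):
            os.remove(link)  # refresh a stale link
        elif os.path.lexists(link):
            continue  # real file or directory: leave it alone
        os.symlink(os.path.join(src_dir, name), link)
        made += 1
    return made


def fix_conda_cuda_layout(cuda_home: str) -> None:
    """Expose targets/x86_64-linux/{include,lib} as {include,lib64} and link NVTX."""
    cuda_home = os.path.abspath(cuda_home)
    target = os.path.join(cuda_home, "targets", "x86_64-linux")
    link_children(os.path.join(target, "include"), os.path.join(cuda_home, "include"))
    link_children(os.path.join(target, "lib"), os.path.join(cuda_home, "lib64"))
    nvtx_dirs = sorted(glob.glob(os.path.join(
        cuda_home, "nsight-compute-*", "host", "target-linux-x64",
        "nvtx", "include", "nvtx3")))
    if nvtx_dirs:
        link = os.path.join(cuda_home, "include", "nvtx3")
        if os.path.islink(link):
            os.remove(link)
        if not os.path.lexists(link):
            os.symlink(nvtx_dirs[0], link)
```

Call fix_conda_cuda_layout(os.environ["CUDA_HOME"]) once after creating the conda env; re-running it is harmless.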
The full reproducible build script lives at /workspace/build_nunchaku/do_build.sh. Result wheel:
nunchaku-1.3.0.dev*+cu13.0torch2.12-cp311...whl.
4. Runtime footguns (when running the model)
- No torch.autocast with the FP4 kernels: autocast runs norms in fp32 → fp32 acts hit the FP4 kernel → gemm_w4a4.cu:28 Assertion 'false'. Run pure bf16.
- Fused model = batch=1: the packed-rotary path asserts rotary_emb.shape[0]*shape[1]==M. Loop prompts one at a time (or fix the rotary to broadcast over batch, a speedup TODO).
- Build/eval race (our own scripts): scripts/12 writes quant_state.pt then quant_config.json; wait for the DONE -> line before eval.
- Running python from /workspace/build_nunchaku/src shadows the installed nunchaku with the source dir (no compiled _C) → No module named 'nunchaku._C'. Run from elsewhere.
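For the build/eval race, an eval wrapper can block until the DONE -> line shows up. The sketch below assumes the build script's output is redirected to a log file; that log path and the function name are hypothetical, only the DONE -> marker comes from the note above.

```python
import time


def wait_for_done(log_path: str, marker: str = "DONE ->",
                  timeout_s: float = 3600.0, poll_s: float = 2.0) -> str:
    """Poll log_path until a line containing marker appears; return that line."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with open(log_path, "r", errors="replace") as fh:
                for line in fh:
                    if marker in line:
                        return line.rstrip("\n")
        except FileNotFoundError:
            pass  # the build may not have created the log yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no '{marker}' line in {log_path} after {timeout_s}s")
        time.sleep(poll_s)
```

Re-reading the whole log on every poll is wasteful for huge logs but keeps the sketch simple; start eval only after the call returns.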
5. Quick smoke test (verify the whole stack)
PYTHONPATH=. python3 scripts/test_nvfp4.py # our NVFP4 fake-quant primitives
PYTHONPATH=. python3 scripts/20_nunchaku_profile.py fp4 4 1024 1024 1 # real 9B FP4 kernel, end-to-end
Expect the 9B FP4 path at ~254 ms/step / 1.29 s/img / 24.95 GB.
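To sanity-check a run against those numbers, it helps to split the per-image time into denoising steps and everything else. This assumes the "4" in the profile command is the step count, which is not stated above; the helper name is illustrative.

```python
def per_image_breakdown(ms_per_step: float, steps: int, s_per_image: float):
    """Return (denoise_seconds, other_seconds) for one image."""
    denoise_s = ms_per_step * steps / 1000.0
    return denoise_s, s_per_image - denoise_s


# Assumption: 4 steps per image. 4 x 254 ms is about 1.016 s of denoising,
# leaving roughly 0.274 s of the 1.29 s for everything outside the step loop.
denoise_s, other_s = per_image_breakdown(254.0, 4, 1.29)
```

If a run shows ms/step on target but s/img well above 1.29, the regression is outside the transformer step loop.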
6. Backup & restore (HF bucket hf://buckets/Mercity/FluxDistill)
We back up everything irreplaceable + the kernels + the champion weights, excluding only what is
re-downloadable / re-installable / regenerable. Kernels ARE saved: the compiled wheel
(build_nunchaku/src/dist/*.whl = the _C.so), the kernel source (build_nunchaku/src/src/*.cu/.cuh),
and CUTLASS (build_nunchaku/src/third_party/) so the build is self-contained/offline-rebuildable.
Excluded: build/ (temp .o), models/klein-4b (public HF), miniforge3 (conda, via §3), caches/tmp.
export HF_TOKEN=hf_... # rotate if leaked; never commit
hf sync ./ hf://buckets/Mercity/FluxDistill --dry-run \
--exclude "models/klein-4b/**" --exclude "miniforge3/**" \
--exclude "build_nunchaku/src/build/**" \
--exclude ".cache/**" --exclude "tmp/**" --exclude "**/__pycache__/**" \
--exclude "*.pyc" --exclude ".ipynb_checkpoints/**" \
--exclude "recovered/**" --exclude "recovered.zip"
# drop --dry-run to actually sync. --no-delete is default. ~76 GB (72 GB = 9 champion quant_state.pt).
# lean (no fake-quant weights, ~4 GB): add --exclude "**/quant_state.pt"
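Keeping the exclude list in one place avoids the full and lean variants drifting apart. A small sketch that only assembles the argument list for the command above (it does not run it); the function name is illustrative, and the flags are exactly the ones already used in this section.

```python
import shlex

BUCKET = "hf://buckets/Mercity/FluxDistill"
EXCLUDES = [
    "models/klein-4b/**", "miniforge3/**", "build_nunchaku/src/build/**",
    ".cache/**", "tmp/**", "**/__pycache__/**", "*.pyc",
    ".ipynb_checkpoints/**", "recovered/**", "recovered.zip",
]


def build_sync_command(src: str = "./", dest: str = BUCKET,
                       dry_run: bool = True, lean: bool = False) -> list[str]:
    """Assemble the hf sync argv; lean also skips the fake-quant weights."""
    cmd = ["hf", "sync", src, dest]
    if dry_run:
        cmd.append("--dry-run")
    patterns = EXCLUDES + (["**/quant_state.pt"] if lean else [])
    for pattern in patterns:
        cmd += ["--exclude", pattern]
    return cmd


def as_shell(cmd: list[str]) -> str:
    """Render the argv as a copy-pasteable, safely quoted shell line."""
    return shlex.join(cmd)
```

Print as_shell(build_sync_command()) to review the command, then pass the list to subprocess.run(cmd, check=True) from /workspace once the dry run looks right.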
Restore onto a fresh box: (1) hf download the bucket back to /workspace; (2) reinstall the
python env per §1; (3) pip install build_nunchaku/src/dist/nunchaku-*.whl (the saved compiled kernels)
or rebuild via build_nunchaku/do_build.sh (CUTLASS is saved, so offline-OK); (4) re-run
scripts/setup_nunchaku.sh to re-copy the FLUX.2 loader; (5) re-download models/klein-4b (§1).
The deployable model outputs/nvfp4/deploy/klein4b_nvfp4_fused.safetensors + the champion
quant_state.pt come straight back from the bucket — no recompute.
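After step (1) of a restore, it is worth confirming the irreplaceable pieces actually came back before reinstalling anything. A minimal sketch, assuming all paths named in this section are relative to the restore root (/workspace); the function name is made up.

```python
import glob
import os

# Paths are the ones named in this section, taken relative to the restore root.
RESTORE_CHECKS = {
    "compiled wheel": "build_nunchaku/src/dist/nunchaku-*.whl",
    "build script": "build_nunchaku/do_build.sh",
    "loader setup script": "scripts/setup_nunchaku.sh",
    "fused deploy model": "outputs/nvfp4/deploy/klein4b_nvfp4_fused.safetensors",
}


def missing_after_restore(root: str) -> list[str]:
    """Return the labels of expected artifacts that are not present under root."""
    return [
        label
        for label, pattern in RESTORE_CHECKS.items()
        if not glob.glob(os.path.join(root, pattern))
    ]
```

An empty list from missing_after_restore("/workspace") means steps (2) to (5) can proceed; models/klein-4b is deliberately not checked because it is excluded from the backup.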