
CUDA / Nunchaku setup runbook — RTX PRO 4500 Blackwell (sm_120)

Everything needed to bring this box from bare to "running Nunchaku FP4 kernels + building Nunchaku from source", plus every footgun hit on the way. Written 2026-06-13. No key-man risk: follow this top to bottom.

0. The box (what you're dealing with)

  • GPU: NVIDIA RTX PRO 4500 Blackwell, 32 GB, sm_120 (torch.cuda.get_device_capability()==(12,0)). Driver: CUDA 13.0. 251 GB RAM, 48 cores.
  • /workspace is a network volume (persists); everything else (/usr/local, site-packages, /tmp on the overlay) is ephemeral — wiped on restart. This is the #1 footgun (see §1).
  • /workspace has a ~250 GB quota (not the 2 PB the cluster df shows). Watch it; du -sh /workspace.
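Because the cluster `df` number is misleading, anything based on filesystem statistics (including Python's `shutil.disk_usage`) will be misleading in the same way. A minimal sketch of a quota check that sums file sizes instead, roughly like `du`. The helper names are hypothetical, the 250 GB figure is read as decimal GB (an assumption), and this counts apparent file size rather than allocated blocks, so it can differ slightly from `du -sh`.

```python
import os

# ~250 GB quota from the text; treating GB as 10**9 bytes is an assumption.
QUOTA_BYTES = 250 * 10**9


def dir_size_bytes(root: str) -> int:
    """Sum apparent file sizes under root without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return total


def quota_fraction(used_bytes: int, quota_bytes: int = QUOTA_BYTES) -> float:
    """Fraction of the quota consumed (1.0 means full)."""
    return used_bytes / quota_bytes
```

On the box, `quota_fraction(dir_size_bytes("/workspace"))` gives a number to alert on; walking a large volume over the network can take a while.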

1. ⚠️ After ANY box restart: the env + models are GONE

A restart wipes site-packages AND models/. Symptoms: ModuleNotFoundError: diffusers, models/ missing. Recover:

# (a) teacher weights (~23 GB, public, no token)
hf download black-forest-labs/FLUX.2-klein-4B --local-dir models/klein-4b
# (b) python stack — torch MUST be cu130 (cu124/cu126 have NO sm_120 kernels; cuda.is_available()
#     lies and returns True, then every kernel launch fails with "no kernel image")
export PIP_CACHE_DIR=/workspace/.cache/pip
pip install torch==2.12.0 --index-url https://download.pytorch.org/whl/cu130
pip install transformers==5.10.2 torchao==0.17 accelerate safetensors
pip install "git+https://github.com/huggingface/diffusers"   # for Flux2* classes
# FOOTGUN: a leftover cu124 torchvision/torchaudio breaks `Qwen3ForCausalLM` import
#   (dead torchvision::nms op) -> Flux2KleinPipeline import fails. Remove them:
pip uninstall -y torchvision torchaudio

Verify: python3 -c "import torch;print(torch.__version__, torch.cuda.get_device_name(0))" should print 2.12.0+cu130 NVIDIA RTX PRO 4500 Blackwell, and from diffusers import Flux2KleinPipeline should import without errors. The "Unable to import torchao Tensor objects" warning is harmless (our quant math is pure-torch).
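The same checks can be scripted so a wrong wheel is caught before any real work starts. This is a sketch, not part of the repo: the function names are hypothetical, and it flags any installed torchvision/torchaudio, which is stricter than the cu124-leftover case described above. `live_check()` forces a real kernel launch because `cuda.is_available()` alone is not trustworthy here.

```python
from importlib import metadata

EXPECTED_CUDA_TAG = "+cu130"      # from the pinned torch wheel above
EXPECTED_CAPABILITY = (12, 0)     # sm_120, per section 0
STALE_PACKAGES = ("torchvision", "torchaudio")


def stack_problems(torch_version, capability, installed):
    """Return human-readable problems; an empty list means the stack looks right."""
    problems = []
    if not str(torch_version).endswith(EXPECTED_CUDA_TAG):
        problems.append(f"torch {torch_version} is not a cu130 build")
    if tuple(capability) != EXPECTED_CAPABILITY:
        problems.append(f"device capability {tuple(capability)} is not (12, 0)")
    for name in STALE_PACKAGES:
        if name in installed:
            problems.append(f"{name} is installed; uninstall it")
    return problems


def installed_stale_packages():
    """Names from STALE_PACKAGES that pip currently has installed."""
    found = set()
    for name in STALE_PACKAGES:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
        found.add(name)
    return found


def live_check():
    """Run on the GPU box only: needs torch and a visible CUDA device."""
    import torch  # lazy import so the helpers above work without torch

    # Allocate and add on the GPU, then synchronize, so a build without
    # sm_120 kernels fails here instead of later.
    x = torch.ones(8, device="cuda") + 1
    torch.cuda.synchronize()
    assert x.sum().item() == 16.0
    return stack_problems(
        torch.__version__,
        torch.cuda.get_device_capability(),
        installed_stale_packages(),
    )
```

Call `live_check()` after the recovery steps; an empty list means the version, capability and package checks all passed.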

2. Nunchaku runtime (prebuilt wheel) — for RUNNING FP4 kernels

The published wheel does NOT yet ship NunchakuFlux2Transformer2DModel (PR #926 unmerged), so you copy the FLUX.2 loader from the checkpoint repo into the installed package.

# (a) the dev wheel matching this box (cu13.0 + torch2.12 + cp311 + linux). Canonical org is
#     nunchaku-ai (nunchaku-tech / mit-han-lab 301-redirect to it). NOT plain `pip install nunchaku`.
pip install "https://github.com/nunchaku-ai/nunchaku/releases/download/v1.3.0dev20260306/nunchaku-1.3.0.dev20260306%2Bcu13.0torch2.12-cp311-cp311-linux_x86_64.whl"
# (b) the FLUX.2 checkpoint repo (self-contained: fp4+int4 transformers, Qwen3 TE, vae, loader code)
hf download tonera/FLUX.2-klein-9B-Nunchaku --local-dir models/klein-9b-nunchaku \
  --exclude "transformer/diffusion_pytorch_model-*.safetensors"   # skip the 18 GB bf16 shards
#   FOOTGUN: hf --exclude only reliably honors ONE pattern; multi-pattern silently mis-parses.
#   FOOTGUN: a Xet "Background writer channel closed" error -> retry with HF_HUB_DISABLE_XET=1.
# (c) copy the FLUX.2 loader files into the installed nunchaku package
bash scripts/setup_nunchaku.sh    # transformer_flux2.py -> models/transformers/, torch_transfer_utils.py
                                  # -> nunchaku/, common/ -> nunchaku/lora/common/
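The wheel filename encodes the CUDA version, the torch major.minor and the CPython tag, so it is worth checking a URL against the local environment before installing. A small sketch with hypothetical helper names; the naming pattern is inferred only from the one URL above, so treat it as an assumption for other releases.

```python
import sys
from urllib.parse import unquote


def nunchaku_wheel_suffix(cuda: str, torch_version: str, py=None) -> str:
    """Build the trailing part of the wheel name for this environment."""
    major, minor = tuple(py or sys.version_info[:2])
    # "2.12.0+cu130" -> "2.12"
    torch_mm = ".".join(torch_version.split("+")[0].split(".")[:2])
    tag = f"cp{major}{minor}"
    return f"+cu{cuda}torch{torch_mm}-{tag}-{tag}-linux_x86_64.whl"


def wheel_matches(url: str, cuda: str, torch_version: str, py=None) -> bool:
    """True if the (possibly %2B-escaped) URL ends with the expected suffix."""
    return unquote(url).endswith(nunchaku_wheel_suffix(cuda, torch_version, py))
```

For this box, `wheel_matches(url, "13.0", torch.__version__)` should be true for the URL in step (a).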

nunchaku.utils.get_precision() returns fp4 on this card (it returns int4 on Ada). Use FP4. INT4 is unsupported-by-design on Blackwell (no INT4 tensor cores; the svdq-int4 file runs an emulated path at 1677 ms/step — slower than bf16). NVFP4 is the only fast low-bit format here.
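To avoid silently loading the slow INT4 path on this card, a guard like the following can sit in front of model loading. It is a sketch: `require_fp4` is a hypothetical helper, and it only encodes what the text states for sm_120; other architectures are passed through untouched.

```python
def require_fp4(precision: str, capability) -> str:
    """Refuse anything but fp4 on sm_120; pass other devices through."""
    if tuple(capability) == (12, 0) and precision != "fp4":
        raise RuntimeError(
            f"precision {precision!r} on sm_120: expected 'fp4' "
            "(INT4 runs an emulated path here)"
        )
    return precision


# Intended use on the box (needs torch and nunchaku installed):
#   precision = require_fp4(nunchaku.utils.get_precision(),
#                           torch.cuda.get_device_capability())
```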

3. Building Nunchaku from SOURCE — for MODIFYING kernels

Needed only if you want to change kernels. The CUDA runfile installer does not work headless: it fails with cannot create /dev/tty, and under script it reports a misleading "Extraction failed... not enough space". Use conda CUDA 13 instead.

# (a) toolchain: Miniforge + conda CUDA 13.0 (matches torch cu130; supports sm_120a)
curl -sL -o miniforge.sh https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash miniforge.sh -b -p /workspace/miniforge3
/workspace/miniforge3/bin/mamba create -y -n nuncbuild -c nvidia -c conda-forge cuda-toolkit=13.0
pip install cmake ninja build wheel pybind11   # cmake/ninja are NOT preinstalled
# (b) source
git clone --recursive --depth 1 https://github.com/nunchaku-ai/nunchaku.git /workspace/build_nunchaku/src
# (c) build — see scripts referenced below; the do_build.sh in /workspace/build_nunchaku captures it.

The four footguns that make or break the build (all handled in /workspace/build_nunchaku/do_build.sh):

  1. Use the SYSTEM python (/usr/bin/python3, has torch), NOT conda's python. Do not prepend conda bin to PATH or python3 resolves to the torch-less conda python (No module named torch).
  2. Host compiler must be system g++ 11.4 (CUDAHOSTCXX=/usr/bin/g++). conda's gcc-14.3 sysroot breaks nvcc 13 (_Float32 undefined, bf16-literal errors).
  3. conda CUDA header/lib layout is $CUDA_HOME/targets/x86_64-linux/{include,lib}, but torch's -I/-L look in $CUDA_HOME/{include,lib64}. Symlink them in, AND symlink NVTX (buried under nsight-compute):
       ln -sfn $CUDA_HOME/targets/x86_64-linux/include/* $CUDA_HOME/include/
       mkdir -p $CUDA_HOME/lib64
       ln -sfn $CUDA_HOME/targets/x86_64-linux/lib/* $CUDA_HOME/lib64/
       ln -sfn $CUDA_HOME/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3 $CUDA_HOME/include/nvtx3
     Without these, the build fails first with fatal error: cuda_runtime_api.h, then with nvtx3/nvToolsExt.h: No such file.
  4. NUNCHAKU_INSTALL_MODE=FAST builds for the current GPU only (sm_120a) → ~5-10 min. Build a wheel (setup.py bdist_wheel), not editable develop (which globally shadows the working wheel if it fails). After installing your wheel, re-run scripts/setup_nunchaku.sh (pip install replaces the loader files).
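Footguns 1 to 4 mostly come down to a few symlinks and environment variables. The sketch below shows one way to express them; it is not the contents of do_build.sh, and the helper names are hypothetical. Two assumptions to note: conda's bin directory is appended (not prepended) to PATH so nvcc is found while python3 stays the system one, and existing real files are left alone, where `ln -sf` would overwrite them. The NVTX link is left out for brevity.

```python
import os


def link_children(src_dir: str, dst_dir: str) -> list:
    """Symlink every entry of src_dir into dst_dir (roughly `ln -sfn src/* dst/`)."""
    os.makedirs(dst_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if os.path.islink(dst):
            os.remove(dst)      # replace a stale link
        elif os.path.exists(dst):
            continue            # leave real files and directories alone
        os.symlink(src, dst)
        linked.append(name)
    return linked


def fix_conda_cuda_layout(cuda_home: str) -> None:
    """Expose targets/x86_64-linux/{include,lib} as {include,lib64}."""
    target = os.path.join(cuda_home, "targets", "x86_64-linux")
    link_children(os.path.join(target, "include"), os.path.join(cuda_home, "include"))
    link_children(os.path.join(target, "lib"), os.path.join(cuda_home, "lib64"))


def build_env(cuda_home: str, base=None) -> dict:
    """Environment for the build subprocess, per footguns 1, 2 and 4."""
    env = dict(os.environ if base is None else base)
    env["CUDA_HOME"] = cuda_home
    env["CUDAHOSTCXX"] = "/usr/bin/g++"       # system g++, not conda's gcc
    env["NUNCHAKU_INSTALL_MODE"] = "FAST"     # current GPU only
    # Append, never prepend: python3 must keep resolving to /usr/bin/python3.
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    parts.append(os.path.join(cuda_home, "bin"))
    env["PATH"] = os.pathsep.join(parts)
    return env
```

The resulting dict can be passed as `env=` to `subprocess.run` when invoking `/usr/bin/python3 setup.py bdist_wheel` from the source tree.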

The full reproducible build script lives at /workspace/build_nunchaku/do_build.sh. Result wheel: nunchaku-1.3.0.dev*+cu13.0torch2.12-cp311...whl.

4. Runtime footguns (when running the model)

  • No torch.autocast with the FP4 kernels: autocast runs norms in fp32 → fp32 acts hit the FP4 kernel → gemm_w4a4.cu:28 Assertion 'false'. Run pure bf16.
  • Fused model = batch=1: the packed-rotary path asserts rotary_emb.shape[0]*shape[1]==M. Loop prompts one at a time (or fix the rotary to broadcast over batch — a speedup TODO).
  • Build/eval race (our own scripts): scripts/12 writes quant_state.pt then quant_config.json; wait for the DONE -> line before eval.
  • Running python from /workspace/build_nunchaku/src shadows the installed nunchaku with the source dir (no compiled _C) → No module named 'nunchaku._C'. Run from elsewhere.
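The last footgun is easy to detect up front. A minimal sketch, assuming the source checkout has a top-level `nunchaku/` package and that a compiled extension shows up as a `_C*.so` file inside it; the function name is hypothetical.

```python
import glob
import os


def shadows_installed_nunchaku(cwd: str) -> bool:
    """True if `import nunchaku` from cwd would hit an uncompiled source tree."""
    pkg = os.path.join(cwd, "nunchaku")
    if not os.path.isfile(os.path.join(pkg, "__init__.py")):
        return False  # no local package, so the installed wheel is used
    # A source tree without a built extension has no _C*.so next to __init__.py.
    return not glob.glob(os.path.join(pkg, "_C*.so"))
```

Calling `shadows_installed_nunchaku(os.getcwd())` at the top of a script gives a clearer failure than the `No module named 'nunchaku._C'` import error.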

5. Quick smoke test (verify the whole stack)

PYTHONPATH=. python3 scripts/test_nvfp4.py            # our NVFP4 fake-quant primitives
PYTHONPATH=. python3 scripts/20_nunchaku_profile.py fp4 4 1024 1024 1   # real 9B FP4 kernel, end-to-end

Expect the 9B FP4 path at ~254 ms/step / 1.29 s/img / 24.95 GB.

6. Backup & restore (HF bucket hf://buckets/Mercity/FluxDistill)

We back up everything irreplaceable + the kernels + the champion weights, excluding only what is re-downloadable / re-installable / regenerable. Kernels ARE saved: the compiled wheel (build_nunchaku/src/dist/*.whl = the _C.so), the kernel source (build_nunchaku/src/src/*.cu/.cuh), and CUTLASS (build_nunchaku/src/third_party/) so the build is self-contained/offline-rebuildable. Excluded: build/ (temp .o), models/klein-4b (public HF), miniforge3 (conda, via §3), caches/tmp.

export HF_TOKEN=hf_...          # rotate if leaked; never commit
hf sync ./ hf://buckets/Mercity/FluxDistill --dry-run \
  --exclude "models/klein-4b/**" --exclude "miniforge3/**" \
  --exclude "build_nunchaku/src/build/**" \
  --exclude ".cache/**" --exclude "tmp/**" --exclude "**/__pycache__/**" \
  --exclude "*.pyc" --exclude ".ipynb_checkpoints/**" \
  --exclude "recovered/**" --exclude "recovered.zip"
# drop --dry-run to actually sync. --no-delete is default. ~76 GB (72 GB = 9 champion quant_state.pt).
# lean (no fake-quant weights, ~4 GB): add  --exclude "**/quant_state.pt"
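Keeping the exclude list in one place makes the full and lean variants harder to get out of step. The sketch below only assembles the command line from the flags already shown above (`--dry-run`, `--exclude`); `sync_command` is a hypothetical helper and nothing here calls `hf` itself.

```python
import shlex

# Exclude patterns copied from the command above.
EXCLUDES = [
    "models/klein-4b/**", "miniforge3/**", "build_nunchaku/src/build/**",
    ".cache/**", "tmp/**", "**/__pycache__/**", "*.pyc",
    ".ipynb_checkpoints/**", "recovered/**", "recovered.zip",
]


def sync_command(src, dest, excludes=EXCLUDES, dry_run=True, lean=False):
    """Return the argv list for the backup; lean also skips quant_state.pt."""
    argv = ["hf", "sync", src, dest]
    if dry_run:
        argv.append("--dry-run")
    patterns = list(excludes)
    if lean:
        patterns.append("**/quant_state.pt")
    for pattern in patterns:
        argv += ["--exclude", pattern]
    return argv


def as_shell(argv) -> str:
    """Quote the argv list so it can be pasted into a shell."""
    return shlex.join(argv)
```

For scale: 72 GB across 9 `quant_state.pt` files is about 8 GB each, and 76 GB minus 72 GB is where the ~4 GB lean figure comes from.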

Restore onto a fresh box: (1) hf download the bucket back to /workspace; (2) reinstall the python env per §1; (3) pip install build_nunchaku/src/dist/nunchaku-*.whl (the saved compiled kernels) or rebuild via build_nunchaku/do_build.sh (CUTLASS is saved, so offline-OK); (4) re-run scripts/setup_nunchaku.sh to re-copy the FLUX.2 loader; (5) re-download models/klein-4b (§1). The deployable model outputs/nvfp4/deploy/klein4b_nvfp4_fused.safetensors + the champion quant_state.pt come straight back from the bucket — no recompute.
