"Fits on 4× ≥80 GB GPUs"

#1
by fraserprice - opened

"Fits on 4× ≥80 GB GPUs at TP=4 (~110 GB/GPU)"

do I need to download more RAM for my 80GB GPU? 🤔

Good catch, that line was just wrong and I have fixed it in the model card. You do not need more RAM :) It is a tensor-parallel issue, not a RAM issue. At --tp 4 the weights are about 110 GB per GPU, which will not fit on an 80 GB card. For 80 GB GPUs use --tp 8 instead (about 55 GB of weights per GPU), so 8x H100 or A100-80GB works fine. --tp 4 is meant for cards with 128 GB or more (H200, B200, MI300X). Thanks for flagging it.
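
For anyone who wants to redo the arithmetic, here is a rough sketch. It assumes the total weight size is about 440 GB (simply 4 x the ~110 GB figure above, not an official number) and that weights shard perfectly evenly across TP ranks. The function names are made up for illustration, and KV cache and activations are ignored, so real usage will be higher.

```python
# Rough per-GPU weight estimate under tensor parallelism.
# ASSUMPTION: total weights ~440 GB, derived from "about 110 GB per GPU at --tp 4".
# ASSUMPTION: weights shard evenly; KV cache and activations are not counted.
TOTAL_WEIGHTS_GB = 440.0


def weights_per_gpu_gb(total_weights_gb: float, tp: int) -> float:
    """Weight memory per GPU if the weights are split evenly over tp ranks."""
    return total_weights_gb / tp


def fits(total_weights_gb: float, tp: int, gpu_mem_gb: float) -> bool:
    """True if the per-GPU weight shard alone is no larger than GPU memory."""
    return weights_per_gpu_gb(total_weights_gb, tp) <= gpu_mem_gb


if __name__ == "__main__":
    for tp in (4, 8):
        per_gpu = weights_per_gpu_gb(TOTAL_WEIGHTS_GB, tp)
        verdict = "fits" if fits(TOTAL_WEIGHTS_GB, tp, 80.0) else "does not fit"
        print(f"tp={tp}: ~{per_gpu:.0f} GB of weights per GPU, {verdict} in 80 GB")
```

A shard that "fits" here still needs headroom for KV cache and activations, so treat the result as a lower bound.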

Any way to get it running on 7x 96 GB GPUs (i.e. 7x RTX Pro 6000)?

7 is an awkward number here :) Tensor parallel has to divide the model evenly, and GLM-5.2's head and expert counts are powers of 2 (64 attention heads, 256 experts, 2048 MoE intermediate; the 6144 hidden size is 3 x 2048), so only power-of-2 sizes such as TP=2/4/8 divide everything evenly (TP=6 and TP=7 do not). The catch: TP=4 needs about 102 GB of weights per GPU, just over your 96 GB, while TP=8 fits nicely (~51 GB/GPU) but needs 8 GPUs.
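
The divisibility argument can be checked with a few lines. This is a simplified sketch: it assumes every one of the four dimensions quoted above must divide evenly by the TP size, which is stricter or looser than what a given inference engine actually enforces. The dictionary keys and function name are made up for illustration.

```python
# Which TP sizes divide all of the quoted model dimensions evenly?
# ASSUMPTION: all four dimensions must be divisible by the TP size.
# Real engines may check different dimensions or pad some of them.
DIMS = {
    "attention_heads": 64,
    "experts": 256,
    "hidden_size": 6144,
    "moe_intermediate_size": 2048,
}


def valid_tp_sizes(dims: dict, candidates) -> list:
    """Return the candidate TP sizes that divide every dimension with no remainder."""
    return [tp for tp in candidates if all(v % tp == 0 for v in dims.values())]


if __name__ == "__main__":
    print(valid_tp_sizes(DIMS, range(1, 9)))  # prints [1, 2, 4, 8]
```

TP=6 divides the 6144 hidden size but not the 64 heads or 256 experts, and TP=7 divides none of them.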

So two options:

  1. Add an 8th RTX Pro 6000 and run --tp 8. Cleanest and fastest.
  2. With exactly 7 GPUs, use pipeline parallelism instead: in sglang, --pp-size 7 --tp-size 1 splits the 78 layers across the cards (11 or 12 layers and roughly 58 GB each, which fits). Pipeline parallel has lower throughput than TP, and the PP + DSA + NVFP4 path is less battle-tested, so verify the output, but memory-wise it works.
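
A back-of-the-envelope sketch of the pipeline split in option 2, for reference. It assumes layers are distributed as evenly as possible (how sglang actually assigns layers to stages may differ), that every layer weighs the same, and that total weights are about 408 GB (4 x the ~102 GB figure above). Embeddings, KV cache, and activations are ignored, and the function name is made up for illustration.

```python
# Even split of transformer layers over pipeline stages, plus a rough memory estimate.
# ASSUMPTION: total weights ~408 GB (4 x ~102 GB), spread uniformly over 78 layers.
# ASSUMPTION: the engine balances layers evenly; the real partitioning may differ.
NUM_LAYERS = 78
NUM_STAGES = 7
TOTAL_WEIGHTS_GB = 408.0
GPU_MEM_GB = 96.0


def split_layers(num_layers: int, num_stages: int) -> list:
    """Give each stage floor(L/S) layers and hand the remainder to the first stages."""
    base, extra = divmod(num_layers, num_stages)
    return [base + 1 if i < extra else base for i in range(num_stages)]


if __name__ == "__main__":
    layers = split_layers(NUM_LAYERS, NUM_STAGES)
    gb_per_layer = TOTAL_WEIGHTS_GB / NUM_LAYERS
    for stage, n in enumerate(layers):
        print(f"stage {stage}: {n} layers, ~{n * gb_per_layer:.1f} GB of weights")
    print("largest stage fits:", max(layers) * gb_per_layer <= GPU_MEM_GB)
```

Because 78 is not a multiple of 7, one stage carries 12 layers and the rest carry 11, so the heaviest card holds slightly more than the average.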

Also note that the RTX Pro 6000 is Blackwell sm_120, while the NVFP4 CUTLASS MoE kernels are mainly tuned for datacenter Blackwell (sm_100/103), so sanity-check generation quality on your cards.

Thanks for the detailed response. I’ve run into similar issues with other models as well. Have experimented a bit with pipeline parallelism, but I kept hitting roadblocks and haven’t had much time to dig deeper. I’ll give this approach a try and see how it goes, but if it ends up reducing throughput, I’ll need to compare it against the output from the upcoming GGUFs.
Adding one more GPU would probably make things a lot easier.
