How to serving for sglang, blackwell pro 6000? yet, only serving sm100(B100?)

#5
by gigascake - opened

How to serving for sglang, blackwell pro 6000? yet, only serving sm100(B100?)

i havnt tried it on sglang i only switch when vllm isnt working.

Not sure if this is helpful in your case but if you can do vllm (vllm/vllm-openai:stepfun37) here is a setup optimized to the max for 2x blackwell pro 6000.

The b12x fallback doesnt work atm until a fix for SWIGLUSTEP support to B12X is implemented so your stuck on marlin .

Comments on args:
max-num-batched-tokens : this value makes the stupid 20x60000=~7GB vision encoder startup to pass its "safety" check. someone decided 20 images worst case test for vision encoder was a good idea. why ??? test 8 images maybe not 20 . and no you cant oom and make backend crash cause you sent more then 20 images and how is that relevant as a startup safety check on boot... someone didnt cybersecurity cook here

why not 256k context length? the users never exceeds this number. unless its a 200+ page document ingested. (you shouldnt be doing that, and teach your clients the better way) we gain concurrency on kvcache aswell which is better. and 131k is well enough for agent harnesses i run 6 profiles that spin up sub agents just fine for advanced tasks.

mm limit per prompt : limit users to 3 images per prompt. its great for context to send images but i rarely send more than 3 . the width and height is just limit thats a high ress image enough to read by agents.

MTP 2: 3 sucks dont use it your throwing away 50% of the 3rd token and just wasting compute the acceptance rate is horrible. so the gpu does through the whole decode validate process 1 time every cycle and throws it away in the end.

args:
- "/data/hf/models/models--stepfun-ai--Step-3.7-Flash-NVFP4/snapshots/4275532ffd9a9496ff36b7a2dc4a9db1048da438"
- "--served-model-name=primary"
- "--host=0.0.0.0"
- "--port=8000"
- "--quantization=modelopt"
- "--kv-cache-dtype=fp8"
- "--tensor-parallel-size=2"
- "--max-model-len=131072"
- "--max-num-batched-tokens=60000"
- "--max-num-seqs=50"
- "--enable-prefix-caching"
- "--gpu-memory-utilization=0.9"
- "--limit-mm-per-prompt"
- '{"image": {"count": 3, "width": 1024, "height": 1024}}'
- "--enable-expert-parallel"
- "--disable-cascade-attn"
- "--reasoning-parser=step3p5"
- "--enable-auto-tool-choice"
- "--tool-call-parser=step3p5"
- "--trust-remote-code"
- "--async-scheduling"
- "--speculative-config"
- '{"method":"mtp","num_speculative_tokens":2}'
- "--override-generation-config"
- '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'

Sign up or log in to comment