How can I serve this with SGLang on a Blackwell Pro 6000? So far it only seems to serve on sm100 (B100?).
I haven't tried it on SGLang; I only switch when vLLM isn't working.
Not sure if this is helpful in your case, but if you can use vLLM (vllm/vllm-openai:stepfun37), here is a setup optimized to the max for 2x Blackwell Pro 6000.
The b12x fallback doesn't work at the moment, and won't until SWIGLUSTEP support is implemented for B12X, so you're stuck on Marlin.
Comments on args:
max-num-batched-tokens: this value lets the stupid 20x60000=~7GB vision encoder startup pass its "safety" check. Someone decided that a worst-case test of 20 images for the vision encoder was a good idea. Why? Test 8 images maybe, not 20. And no, you can't OOM and crash the backend because you sent more than 20 images, so how is that relevant as a safety check at boot? Whoever designed this wasn't thinking about security.
Why not 256k context length? Users never exceed 131k unless they ingest a 200+ page document (you shouldn't be doing that; teach your clients the better way). We also gain concurrency on the KV cache, which is better. And 131k is plenty for agent harnesses: I run 6 profiles that spin up sub-agents just fine for advanced tasks.
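The context-length versus concurrency trade-off can be sketched with simple arithmetic. The KV-cache budget below is a made-up figure, not something measured on this setup; the real number is whatever vLLM reports at startup. The helper name is hypothetical too.

```python
# Hypothetical KV-cache budget in tokens (an assumption, not a measurement).
# vLLM logs the real figure when the server starts.
KV_CACHE_TOKENS = 1_000_000


def full_length_sequences(kv_cache_tokens: int, max_model_len: int) -> int:
    """How many requests fit if every one of them uses the full context."""
    return kv_cache_tokens // max_model_len


if __name__ == "__main__":
    for max_len in (262_144, 131_072):
        n = full_length_sequences(KV_CACHE_TOKENS, max_len)
        print(f"max-model-len={max_len}: {n} worst-case full-length requests")
```

This is only the worst case where every request fills its whole context. Shorter requests share the same pool, so real concurrency is usually much higher.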
mm limit per prompt: limits users to 3 images per prompt. Sending images is great for context, but I rarely send more than 3. The width and height are just a limit; that is a high enough resolution for agents to read.
MTP 2: 3 sucks, don't use it. You're throwing away the 3rd token 50% of the time and just wasting compute; the acceptance rate is horrible. The GPU goes through the whole decode-and-validate process for it once every cycle and throws it away in the end.
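A rough way to see why a third speculative token adds little: a draft token only counts if every draft before it was accepted too. The acceptance rates below are illustrative guesses, not measurements from this model, and the function name is hypothetical.

```python
def expected_tokens_per_step(accept_rates: list[float]) -> float:
    """Expected tokens emitted per verify step.

    accept_rates[i] is the chance that draft token i is accepted, given that
    all earlier draft tokens were accepted. The target model always
    contributes one token of its own, hence the starting value of 1.0.
    """
    expected = 1.0
    prefix_ok = 1.0  # probability that every draft so far was accepted
    for rate in accept_rates:
        prefix_ok *= rate
        expected += prefix_ok
    return expected


if __name__ == "__main__":
    # Illustrative per-position acceptance rates (assumption).
    rates = [0.8, 0.7, 0.5]
    two = expected_tokens_per_step(rates[:2])
    three = expected_tokens_per_step(rates)
    print(f"MTP 2: {two:.2f} tokens/step")
    print(f"MTP 3: {three:.2f} tokens/step (gain {three - two:.2f})")
```

With rates like these, the third draft token adds only a fraction of a token per step, while the drafting and verification work for it is paid on every step.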
args:
- "/data/hf/models/models--stepfun-ai--Step-3.7-Flash-NVFP4/snapshots/4275532ffd9a9496ff36b7a2dc4a9db1048da438"
- "--served-model-name=primary"
- "--host=0.0.0.0"
- "--port=8000"
- "--quantization=modelopt"
- "--kv-cache-dtype=fp8"
- "--tensor-parallel-size=2"
- "--max-model-len=131072"
- "--max-num-batched-tokens=60000"
- "--max-num-seqs=50"
- "--enable-prefix-caching"
- "--gpu-memory-utilization=0.9"
- "--limit-mm-per-prompt"
- '{"image": {"count": 3, "width": 1024, "height": 1024}}'
- "--enable-expert-parallel"
- "--disable-cascade-attn"
- "--reasoning-parser=step3p5"
- "--enable-auto-tool-choice"
- "--tool-call-parser=step3p5"
- "--trust-remote-code"
- "--async-scheduling"
- "--speculative-config"
- '{"method":"mtp","num_speculative_tokens":2}'
- "--override-generation-config"
- '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'
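For reference, a minimal client sketch against a server started with the args above. It assumes the server is reachable on localhost at the port from the args and uses the OpenAI-compatible chat endpoint that vLLM exposes. The helper names are made up for illustration, and the image cap is checked client-side only to mirror the count in --limit-mm-per-prompt.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumes --port=8000 on localhost
MODEL = "primary"   # matches --served-model-name
MAX_IMAGES = 3      # mirrors the count in --limit-mm-per-prompt


def build_payload(prompt: str, image_urls: list[str]) -> dict:
    """Build a chat request with one text part and up to MAX_IMAGES images."""
    if len(image_urls) > MAX_IMAGES:
        raise ValueError(f"at most {MAX_IMAGES} images per prompt")
    content = [{"type": "text", "text": prompt}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"model": MODEL, "messages": [{"role": "user", "content": content}]}


def send(payload: dict) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    payload = build_payload("Describe this image in one sentence.", [])
    print(json.dumps(payload, indent=2))
    # send(payload)  # uncomment once the server is running
```

No sampling parameters are sent here because --override-generation-config already sets the server-side defaults; a client can still pass its own per request.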