Instructions to use stepfun-ai/Step-3.7-Flash-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use stepfun-ai/Step-3.7-Flash-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="stepfun-ai/Step-3.7-Flash-NVFP4", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("stepfun-ai/Step-3.7-Flash-NVFP4", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use stepfun-ai/Step-3.7-Flash-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "stepfun-ai/Step-3.7-Flash-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.7-Flash-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/stepfun-ai/Step-3.7-Flash-NVFP4

SGLang

How to use stepfun-ai/Step-3.7-Flash-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "stepfun-ai/Step-3.7-Flash-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.7-Flash-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "stepfun-ai/Step-3.7-Flash-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "stepfun-ai/Step-3.7-Flash-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use stepfun-ai/Step-3.7-Flash-NVFP4 with Docker Model Runner:
```
docker model run hf.co/stepfun-ai/Step-3.7-Flash-NVFP4
```

How do I serve this with SGLang on a Blackwell Pro 6000? So far it only serves on sm100 (B100?)

by gigascake - opened 26 days ago

Discussion

gigascake

26 days ago

How do I serve this with SGLang on a Blackwell Pro 6000? So far it only serves on sm100 (B100?)

archib4

22 days ago

edited 22 days ago

I haven't tried it on SGLang; I only switch when vLLM isn't working.

Not sure if this is helpful in your case, but if you can use vLLM (vllm/vllm-openai:stepfun37), here is a setup optimized to the max for 2x Blackwell Pro 6000.

The b12x fallback doesn't work at the moment, and won't until a fix adding SWIGLUSTEP support to B12X is implemented, so you're stuck on Marlin.

Comments on args:
max-num-batched-tokens: this value lets the stupid 20 x 60000 = ~7 GB vision encoder startup pass its "safety" check. Someone decided a 20-image worst-case test for the vision encoder was a good idea. Why? Test 8 images maybe, not 20. And no, you can't OOM and crash the backend because you sent more than 20 images, so how is that relevant as a startup safety check on boot? Someone didn't cook on the cybersecurity side here.

Why not 256k context length? Users never exceed the 131k set here, unless a 200+ page document gets ingested (you shouldn't be doing that; teach your clients the better way). We also gain concurrency on the KV cache, which is better. And 131k is plenty for agent harnesses: I run 6 profiles that spin up sub-agents just fine for advanced tasks.
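The concurrency point is plain arithmetic: a fixed KV-cache token budget divided by the per-sequence context cap. Here is a minimal sketch, assuming a hypothetical budget of 1,000,000 cached tokens. That figure is made up for illustration and is not a measurement from this setup; vLLM prints the real KV cache size in its startup log.

```python
def full_length_sequences(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many sequences at the full context cap fit in the KV cache at once."""
    if max_model_len <= 0:
        raise ValueError("max_model_len must be positive")
    return kv_cache_tokens / max_model_len


# HYPOTHETICAL budget, for illustration only (not measured on this hardware).
KV_CACHE_TOKENS = 1_000_000

for cap in (131_072, 262_144):
    fits = full_length_sequences(KV_CACHE_TOKENS, cap)
    print(f"max-model-len={cap}: {fits:.2f} full-length sequences")
```

Under this simplified model, halving the cap doubles the number of worst-case sequences that fit. Shorter requests pack more densely either way.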

limit-mm-per-prompt: limits users to 3 images per prompt. Sending images is great for context, but I rarely send more than 3. The width and height are just a cap; 1024x1024 is a high enough resolution for agents to read.
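On the client side, a request can be kept inside those limits before it is sent. The sketch below is an assumption about how a client might do that; the helper names `fit_within` and `check_image_count` are hypothetical. The thread does not say whether the server rejects larger images or only sizes its startup check against these dimensions.

```python
MAX_IMAGES = 3   # mirrors "count": 3 in the args
MAX_SIDE = 1024  # mirrors "width"/"height": 1024 in the args


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Scale (width, height) down to fit a max_side x max_side box, keeping aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Never upscale: the factor is capped at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def check_image_count(image_urls: list[str], max_images: int = MAX_IMAGES) -> list[str]:
    """Raise early instead of sending a prompt with too many images."""
    if len(image_urls) > max_images:
        raise ValueError(f"{len(image_urls)} images given, limit is {max_images}")
    return image_urls


print(fit_within(4096, 2048))
print(fit_within(800, 600))
```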

MTP 2: 3 sucks, don't use it. You're throwing away 50% of the 3rd token and just wasting compute; the acceptance rate is horrible. So the GPU goes through the whole decode-validate process once every cycle and throws the result away in the end.
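The trade-off can be made concrete with a simplified model of speculative decoding, in which draft token k only counts if every earlier draft token was accepted. The acceptance rates below are hypothetical. The thread gives only the rough 50% figure for the third token, so 0.8 and 0.6 are placeholders, and real rates should be read from the server's own metrics.

```python
def expected_tokens_per_step(acceptance_rates: list[float]) -> float:
    """Expected tokens emitted per verification step.

    The leading 1.0 is the token the target model always produces itself.
    Each draft position adds the probability that it and all earlier
    draft positions were accepted.
    """
    expected = 1.0
    survive = 1.0
    for rate in acceptance_rates:
        survive *= rate
        expected += survive
    return expected


# HYPOTHETICAL per-position acceptance rates, for illustration only.
rates = [0.8, 0.6, 0.5]

two = expected_tokens_per_step(rates[:2])
three = expected_tokens_per_step(rates)
print(f"2 draft tokens: {two:.2f} tokens/step")
print(f"3 draft tokens: {three:.2f} tokens/step (gain {three - two:.2f})")
```

With these made-up rates the third draft token adds about a quarter of a token per step, while the drafting and verification work for it is paid on every step.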

args:
- "/data/hf/models/models--stepfun-ai--Step-3.7-Flash-NVFP4/snapshots/4275532ffd9a9496ff36b7a2dc4a9db1048da438"
- "--served-model-name=primary"
- "--host=0.0.0.0"
- "--port=8000"
- "--quantization=modelopt"
- "--kv-cache-dtype=fp8"
- "--tensor-parallel-size=2"
- "--max-model-len=131072"
- "--max-num-batched-tokens=60000"
- "--max-num-seqs=50"
- "--enable-prefix-caching"
- "--gpu-memory-utilization=0.9"
- "--limit-mm-per-prompt"
- '{"image": {"count": 3, "width": 1024, "height": 1024}}'
- "--enable-expert-parallel"
- "--disable-cascade-attn"
- "--reasoning-parser=step3p5"
- "--enable-auto-tool-choice"
- "--tool-call-parser=step3p5"
- "--trust-remote-code"
- "--async-scheduling"
- "--speculative-config"
- '{"method":"mtp","num_speculative_tokens":2}'
- "--override-generation-config"
- '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}'
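Once a server is up with these args, a client addresses it by the served model name `primary` on port 8000. Here is a minimal standard-library sketch. It assumes the server is reachable on `localhost`, and the helper names `build_chat_payload` and `send` are hypothetical. Sampling parameters are left out on the assumption that the `--override-generation-config` defaults apply.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000/v1/chat/completions"  # port from --port=8000; host is an assumption
MODEL_NAME = "primary"                                   # from --served-model-name=primary


def build_chat_payload(prompt: str, image_urls: Optional[list[str]] = None) -> dict:
    """Build an OpenAI-style chat request with one text part and optional image parts."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls or []:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"model": MODEL_NAME, "messages": [{"role": "user", "content": content}]}


def send(payload: dict, url: str = BASE_URL, timeout: float = 120.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = build_chat_payload("Describe this image in one sentence.")
print(json.dumps(payload, indent=2))
# reply = send(payload)  # uncomment once the server is running
```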
