Step-3.5-Flash-Quark-W8A8-INT8

A W8A8 INT8 quantization of stepfun-ai/Step-3.5-Flash, produced with AMD Quark.

Model Details


| Property | Value |
|---|---|
| Base Model | `stepfun-ai/Step-3.5-Flash` |
| Architecture | `Step3p5ForCausalLM` (Sparse MoE, 45 layers, 288 routed experts + 1 shared) |
| Parameters | 196.81B total / ~11B activated per token |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark `0.11.1` (`ptpc_int8` scheme, `pack_method='order'`) |
| Model Size | ~191 GB (INT8 + BF16 mix) |
| Original Size | ~400 GB (BF16) |
| Compression | ~2x size reduction |

Quantization Scheme

| Component | dtype | Granularity | Mode |
|---|---|---|---|
| Routed-expert FFN (layers 3-44) | INT8 | per-channel (`ch_axis=0`) | symmetric, static |
| Self-attention `q/k/v/o_proj` | INT8 | per-channel (`ch_axis=0`) | symmetric, static |
| Activations (linear inputs) | INT8 | per-token (`ch_axis=1`) | symmetric, dynamic |
| `lm_head`, `embed_tokens` | BF16 | - | unquantized |
| MoE router `gate` (all layers) | BF16 | - | unquantized |
| Self-attention `g_proj` | BF16 | - | unquantized |
| Dense FFN (layers 0-2 `mlp.{gate,up,down}_proj`) | BF16 | - | unquantized |
| Shared-expert FFN (layers 3-44) | BF16 | - | unquantized |
| MTP module (layers 45-47) | BF16 | - | unquantized |
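
The scheme in the table (one static scale per weight output channel, one dynamic scale per activation token, both symmetric) can be illustrated with a small NumPy sketch. This is not Quark's implementation: the function names, the `[out, in]` weight layout, the 127 clipping range, and the INT32 accumulation are assumptions made for illustration only.

```python
import numpy as np

def quantize_weight_per_channel(w):
    """Symmetric INT8 quantization with one scale per output channel (row of w).

    Assumes w has shape [out, in]; the scale has shape [out, 1] and is
    computed once, offline (static).
    """
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_activation_per_token(x):
    """Symmetric INT8 quantization with one scale per token (row of x).

    Assumes x has shape [tokens, in]; the scale is recomputed for every
    input at run time (dynamic).
    """
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a8_linear(x, w):
    """Compute x @ w.T from INT8 operands, accumulate in INT32, then rescale."""
    qx, sx = quantize_activation_per_token(x)   # sx: [tokens, 1]
    qw, sw = quantize_weight_per_channel(w)     # sw: [out, 1]
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T   # [tokens, out]
    return acc.astype(np.float32) * sx * sw.T

# Compare against a float reference on random data.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((16, 64)).astype(np.float32)
ref = x @ w.T
out = w8a8_linear(x, w)
rel_err = float(np.abs(out - ref).max() / np.abs(ref).max())
```

Because each output element is rescaled by exactly one token scale and one channel scale, the dequantization reduces to an outer product of the two scale vectors applied to the integer accumulator.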

Accuracy

GSM8K 8-shot accuracy on the full 1319-question test split, run with vLLM on AMD MI355X (temperature=0, concurrency=16, max_tokens=1024, standard chat template, `####` answer format):

| Model | Scheme | Accuracy | Correct |
|---|---|---|---|
| `stepfun-ai/Step-3.5-Flash` (BF16 baseline) | - | 95.91% | 1265 / 1319 |
| This model (Quark W8A8 INT8) | per-channel weight + per-token act. | 95.91% | 1265 / 1319 |

Delta vs BF16: 0.00pp (lossless on this benchmark).
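
The accuracy figures follow directly from the correct/total counts. The sketch below reproduces that arithmetic and shows one plausible way to read a `####`-formatted answer. The evaluation harness itself is not part of this card, so `extract_final_answer` and its regular expression are hypothetical, not the scorer that produced the numbers above.

```python
import re

def extract_final_answer(text):
    """Return the number after the last '####' marker, or None if there is none."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators

def accuracy_pct(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total

pred = extract_final_answer("She sells 9 eggs at $2 each, so 9 * 2 = 18.\n#### 18")
baseline = accuracy_pct(1265, 1319)   # BF16 baseline counts from the table
quantized = accuracy_pct(1265, 1319)  # W8A8 INT8 counts from the table
delta_pp = quantized - baseline

print(pred, f"{quantized:.2f}%", f"{delta_pp:.2f}pp")
```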

How to Use

With vLLM (Recommended)

Note: requires a vLLM build with the `QuarkW8A8Int8` channel-scale shape fix, which squeezes `weight_scale` from `[out, 1]` to `[out]` in the Quark INT8 loader (vLLM 0.19.2rc1+).

```
vllm serve nameistoken/Step-3.5-Flash-Quark-W8A8-INT8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code
```

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "nameistoken/Step-3.5-Flash-Quark-W8A8-INT8",
      "messages": [{"role": "user", "content": "Hello!"}],
      "max_tokens": 256, "temperature": 0.6, "top_p": 0.95
    }'
```
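
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: same endpoint, model name, and sampling parameters. The helper names `build_chat_request` and `chat` are illustrative, and the base URL assumes the server is running locally on the default port 8000 as in the command above.

```python
import json
import urllib.request

# Assumption: the vLLM server from the command above is listening locally on port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL = "nameistoken/Step-3.5-Flash-Quark-W8A8-INT8"

def build_chat_request(prompt, max_tokens=256, temperature=0.6, top_p=0.95):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, timeout=120):
    """Send the prompt and return the assistant message text (needs a running server)."""
    req = build_chat_request(prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Hello!"))` prints the reply text.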

Hardware Requirements

- Minimum: 8 × AMD MI300X / MI350X / MI355X (192 GB+ VRAM each), or equivalent NVIDIA H100/H200 at TP=8. The weights alone take ~191 GB; KV cache and activation overhead come on top.
- Tested: AMD MI355X at TP=2 with `--enable-expert-parallel` for a 9k context (MI355X has 288 GB HBM3e per device).
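
A rough back-of-the-envelope check of these figures is sketched below. It assumes the weights split evenly across GPUs and that `--gpu-memory-utilization 0.9` caps the usable share of each device. Real deployments replicate some tensors and need room for KV cache and activations, so treat the results as upper bounds on headroom, not measurements.

```python
def weight_gb_per_gpu(model_gb, num_gpus):
    """Per-GPU weight footprint under an even split (an idealisation)."""
    return model_gb / num_gpus

def headroom_gb(model_gb, num_gpus, hbm_gb, gpu_memory_utilization=0.9):
    """Memory left per GPU for KV cache and activations after loading weights."""
    budget = hbm_gb * gpu_memory_utilization
    return budget - weight_gb_per_gpu(model_gb, num_gpus)

# Size figures from the Model Details table.
bf16_gb = 196.81e9 * 2 / 1e9   # 2 bytes per BF16 parameter
compression = 400 / 191        # original size / quantized size

# Tested configuration: MI355X (288 GB per device) at TP=2.
mi355x_tp2 = headroom_gb(191, 2, 288)
# Listed minimum: 8 devices with 192 GB each.
mi300x_tp8 = headroom_gb(191, 8, 192)

print(f"BF16 estimate: {bf16_gb:.2f} GB, compression: {compression:.2f}x")
print(f"Headroom per GPU: TP=2 on MI355X {mi355x_tp2:.1f} GB, TP=8 at 192 GB {mi300x_tp8:.1f} GB")
```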

Quantization Details

This model was quantized with AMD Quark's per-token, per-channel INT8 scheme (`ptpc_int8`):

- Weight: INT8, per-channel, symmetric, static.
- Activation: INT8, per-token, symmetric, dynamic.
- Excluded layers (kept in BF16):
  - `lm_head`, `*embed_tokens*`
  - `*mlp.gate` (MoE router gates, all layers)
  - `*self_attn.g_proj*`
  - Dense FFN `mlp.{down,gate,up}_proj` for layers 0-2
  - `share_expert.{down,gate,up}_proj` for layers 3-44
  - All MTP-module sub-layers (layers 45-47)
- Export: `pack_method='order'`, `weight_format='real_quantized'`, `custom_mode='quark'`.
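
The exclusion list reads as a set of glob patterns over module names. The sketch below shows how such patterns could be matched with Python's `fnmatch`. Only the first four patterns come from the list above. The patterns for the dense FFN, `share_expert`, and MTP layers, and all module names in the examples, are hypothetical stand-ins, since the exact strings passed to Quark are not given here.

```python
from fnmatch import fnmatchcase

# Patterns quoted in the exclusion list.
LISTED_PATTERNS = ["lm_head", "*embed_tokens*", "*mlp.gate", "*self_attn.g_proj*"]

# Hypothetical patterns standing in for the layer-range exclusions.
ASSUMED_PATTERNS = [
    "*layers.[0-2].mlp.*_proj",   # dense FFN in layers 0-2
    "*share_expert*",             # shared-expert FFN
    "*layers.4[5-7].*",           # MTP module, layers 45-47
]

def is_excluded(module_name, patterns=LISTED_PATTERNS + ASSUMED_PATTERNS):
    """Return True if the module would stay in BF16 under these patterns."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)

# Module names below are illustrative, not read from the checkpoint.
examples = [
    "lm_head",
    "model.layers.5.mlp.gate",
    "model.layers.5.self_attn.q_proj",
    "model.layers.5.self_attn.g_proj",
    "model.layers.1.mlp.up_proj",
    "model.layers.10.share_expert.down_proj",
    "model.layers.46.self_attn.q_proj",
]
for name in examples:
    print(f"{name}: {'BF16' if is_excluded(name) else 'INT8'}")
```

Note that `*mlp.gate` has no trailing wildcard, so it matches a router named `...mlp.gate` but not a projection named `...mlp.gate_proj`.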

Citation

If you use this model, please cite the original Step 3.5 Flash technical report:

```
@misc{huang2026step35flashopen,
      title={Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters},
      author={StepFun},
      year={2026},
      eprint={2602.10604},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.10604}
}
```

License

This model is released under the Apache License 2.0, following the upstream stepfun-ai/Step-3.5-Flash.

This is a quantized derivative of stepfun-ai/Step-3.5-Flash. Per Apache 2.0:

- Modified files (the INT8-quantized `model-*.safetensors` shards and the appended `quantization_config` block in `config.json`) carry this notice as part of the model card.
- Original copyright and attribution notices from the base model are preserved (see `NOTICE`).
- A copy of the Apache 2.0 license text is included as `LICENSE`.

Original weights (c) StepFun. Quantization performed by the model author; no warranty of any kind is provided.
