Step-3.5-Flash-Quark-W8A8-INT8

A W8A8 INT8 quantized version of stepfun-ai/Step-3.5-Flash, produced with AMD Quark.

Model Details

| Property | Value |
|---|---|
| Base Model | stepfun-ai/Step-3.5-Flash |
| Architecture | Step3p5ForCausalLM (Sparse MoE, 45 layers, 288 routed experts + 1 shared) |
| Parameters | 196.81B total / ~11B activated per token |
| Quantization | W8A8 INT8 (per-channel weight + per-token dynamic activation) |
| Quantizer | AMD Quark 0.11.1 (ptpc_int8 scheme, pack_method='order') |
| Model Size | ~191 GB (INT8 + BF16 mix) |
| Original Size | ~400 GB (BF16) |
| Compression | ~2x size reduction |

Quantization Scheme

| Component | dtype | Granularity | Mode |
|---|---|---|---|
| Routed-expert FFN (layers 3-44) | INT8 | per-channel (ch_axis=0) | symmetric, static |
| Self-attention q/k/v/o_proj | INT8 | per-channel (ch_axis=0) | symmetric, static |
| Activations (linear inputs) | INT8 | per-token (ch_axis=1) | symmetric, dynamic |
| lm_head, embed_tokens | BF16 | - | unquantized |
| MoE router gate (all layers) | BF16 | - | unquantized |
| Self-attention g_proj | BF16 | - | unquantized |
| Dense FFN (layers 0-2 mlp.{gate,up,down}_proj) | BF16 | - | unquantized |
| Shared-expert FFN (layers 3-44) | BF16 | - | unquantized |
| MTP module (layers 45-47) | BF16 | - | unquantized |
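
To make the two INT8 granularities concrete, here is a minimal pure-Python sketch of a W8A8 linear layer: weights get one static scale per output channel, and activations get one scale per token, computed at call time. It is an illustration under simple assumptions (absmax scaling, clamping to [-127, 127], tiny made-up matrices), not AMD Quark's or vLLM's actual kernels; the function names are hypothetical.

```python
# Illustrative sketch only: pure-Python symmetric INT8 quantization.
# This is not AMD Quark's or vLLM's implementation.

def quantize_rows(matrix):
    """Quantize each row with its own symmetric scale (absmax / 127)."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, q_weight, weight_scales):
    """Compute y = x @ W.T from INT8 operands.

    x: float activations, one row per token (quantized here, at call time).
    q_weight, weight_scales: precomputed offline, one scale per output channel.
    """
    q_x, x_scales = quantize_rows(x)  # per-token, dynamic
    out = []
    for t, q_token in enumerate(q_x):
        row = []
        for o, q_channel in enumerate(q_weight):
            acc = sum(a * b for a, b in zip(q_token, q_channel))  # integer accumulate
            row.append(acc * x_scales[t] * weight_scales[o])      # rescale to float
        out.append(row)
    return out


# Static step: weights of shape [out, in] are quantized once, per output channel.
weight = [[0.1, 0.2, -0.3], [1.0, 0.0, 2.0]]
q_weight, weight_scales = quantize_rows(weight)

# Dynamic step: activation scales are derived from each incoming token.
tokens = [[1.0, -2.0, 0.5]]
y = w8a8_linear(tokens, q_weight, weight_scales)  # close to [[-0.45, 2.0]]
```

Because each output channel and each token carries its own scale, an outlier in one row does not degrade the resolution of the others, which is the usual motivation for this granularity.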

Accuracy

GSM8K 8-shot results on the full 1319-question test split, run with vLLM on AMD MI355X (temperature=0, concurrency=16, max_tokens=1024, standard chat template, #### answer format):

| Model | Scheme | Accuracy | Correct |
|---|---|---|---|
| stepfun-ai/Step-3.5-Flash (BF16 baseline) | - | 95.91% | 1265 / 1319 |
| This model (Quark W8A8 INT8) | per-channel weight + per-token act. | 95.91% | 1265 / 1319 |

Delta vs BF16: 0.00pp (lossless on this benchmark).
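
For readers who want to reproduce the scoring, the sketch below shows one plausible way to parse a "####"-style final answer and turn the correct counts above into percentages. The regular expression and function names are assumptions for illustration; the actual evaluation harness used for these numbers is not specified here.

```python
import re


def extract_final_answer(completion):
    """Return the number after the last '####' marker, or None if absent."""
    matches = re.findall(r"####\s*(-?[\d,]*\.?\d+)", completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators


def accuracy_percent(correct, total):
    """Accuracy as a percentage of the test split."""
    return 100.0 * correct / total


baseline = accuracy_percent(1265, 1319)   # BF16 baseline
quantized = accuracy_percent(1265, 1319)  # this model
delta_pp = quantized - baseline           # difference in percentage points
```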

How to Use

With vLLM (Recommended)

Note: this requires a vLLM build that includes the QuarkW8A8Int8 channel-scale shape fix, which squeezes weight_scale from [out, 1] to [out] in the Quark INT8 loader (vLLM 0.19.2rc1 or later).

vllm serve nameistoken/Step-3.5-Flash-Quark-W8A8-INT8 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --disable-cascade-attn \
    --reasoning-parser step3p5 \
    --enable-auto-tool-choice \
    --tool-call-parser step3p5 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code

Once the server is running, query its OpenAI-compatible endpoint:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "nameistoken/Step-3.5-Flash-Quark-W8A8-INT8",
      "messages": [{"role": "user", "content": "Hello!"}],
      "max_tokens": 256, "temperature": 0.6, "top_p": 0.95
    }'
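
The same request can be issued from Python. This is a minimal sketch using only the standard library; it assumes the server from the vllm serve command is listening on localhost:8000 and returns the usual OpenAI-style response shape. The helper names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl example
MODEL = "nameistoken/Step-3.5-Flash-Quark-W8A8-INT8"


def build_chat_request(prompt, max_tokens=256, temperature=0.6, top_p=0.95):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("Hello!") requires the server to be up; build_chat_request can be inspected offline to check the payload.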

Hardware Requirements

  • Minimum: 8 x AMD MI300X / MI350X / MI355X (192 GB+ VRAM each), or equivalent NVIDIA H100/H200 (TP=8). The weights alone take ~191 GB; KV cache and activation overhead come on top of that.
  • Tested: AMD MI355X (TP=2 with --enable-expert-parallel for 9k context; MI355X has 288 GB HBM3e per device).
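
As a rough sanity check on these configurations, the sketch below estimates how much memory per GPU is left for KV cache and activations once the weights are loaded. It assumes the ~191 GB of weights shard evenly across devices and that --gpu-memory-utilization caps the usable fraction of each device; both are simplifications, and the function name is hypothetical.

```python
def per_gpu_headroom_gb(weights_gb, num_gpus, gpu_mem_gb, utilization=0.9):
    """Rough memory left per GPU for KV cache and activations.

    Assumes weights shard evenly across GPUs, which is only an approximation.
    """
    budget = gpu_mem_gb * utilization  # share of device memory vLLM may use
    return budget - weights_gb / num_gpus


# Tested setup: 2 x MI355X (288 GB each), TP=2.
tested = per_gpu_headroom_gb(191, num_gpus=2, gpu_mem_gb=288)

# Listed minimum: 8 devices with 192 GB each, TP=8.
minimum = per_gpu_headroom_gb(191, num_gpus=8, gpu_mem_gb=192)
```

Real headroom will be lower because of runtime buffers and uneven sharding, so treat the result as an upper bound.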

Quantization Details

This model was quantized using AMD Quark's per-token, per-channel INT8 scheme (ptpc_int8):

  • Weight: INT8 per-channel symmetric static.
  • Activation: INT8 per-token symmetric dynamic.
  • Excluded layers (kept BF16):
    • lm_head, *embed_tokens*
    • *mlp.gate (MoE router gates, all layers)
    • *self_attn.g_proj*
    • Dense FFN mlp.{down,gate,up}_proj for layers 0-2
    • share_expert.{down,gate,up}_proj for layers 3-44
    • All MTP-module sub-layers (layers 45-47)
  • Export: pack_method='order', weight_format='real_quantized', custom_mode='quark'.
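
The wildcard entries in the exclusion list can be read as shell-style globs. The sketch below shows how such patterns would select modules, assuming fnmatch-like matching semantics; it covers only the four patterns spelled out literally above, and the module names in the examples are hypothetical. The dense-FFN, shared-expert, and MTP exclusions are described in prose only and are left out.

```python
from fnmatch import fnmatchcase

# Only the patterns given literally in the list above.
EXCLUDE_GLOBS = ["lm_head", "*embed_tokens*", "*mlp.gate", "*self_attn.g_proj*"]


def is_excluded(module_name):
    """True if the module name matches any exclusion glob (kept in BF16)."""
    return any(fnmatchcase(module_name, glob) for glob in EXCLUDE_GLOBS)


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.embed_tokens",
    "model.layers.3.mlp.gate",          # router gate: excluded
    "model.layers.3.mlp.gate_proj",     # no trailing wildcard, so not matched
    "model.layers.3.self_attn.g_proj",
    "model.layers.3.self_attn.q_proj",  # quantized to INT8
]
kept_bf16 = [name for name in examples if is_excluded(name)]
```

Note that "*mlp.gate" has no trailing wildcard, so under these semantics it matches the router gate but not a gate_proj projection.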

Citation

If you use this model, please cite the original Step 3.5 Flash technical report:

@misc{huang2026step35flashopen,
      title={Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters},
      author={StepFun},
      year={2026},
      eprint={2602.10604},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.10604}
}

License

This model is released under the Apache License 2.0, following the upstream stepfun-ai/Step-3.5-Flash.

This is a quantized derivative of stepfun-ai/Step-3.5-Flash. Per Apache 2.0:

  • Modified files (the INT8-quantized model-*.safetensors shards and the appended quantization_config block in config.json) carry this notice as part of the model card.
  • Original copyright and attribution notices from the base model are preserved (see NOTICE).
  • A copy of the Apache 2.0 license text is included as LICENSE.

Original weights (c) StepFun. Quantization performed by the model author; no warranty of any kind is provided.
