| license: apache-2.0 | |
| base_model: | |
| - stepfun-ai/Step-3.5-Flash | |
| library_name: transformers | |
| # Model Overview | |
| - **Model Architecture:** Step3p5ForCausalLM | |
| - **Input:** Text | |
| - **Output:** Text | |
| - **Supported Hardware Microarchitecture:** AMD MI350/MI355 | |
| - **ROCm**: 7.1.0 | |
| - **PyTorch**: 2.10.0 | |
| - **Transformers**: 4.57.6 | |
| - **Operating System(s):** Linux | |
| - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/) | |
| - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) | |
| - **Weight quantization:** MoE-only, OCP MXFP4, Static | |
| - **Activation quantization:** MoE-only, OCP MXFP4, Dynamic | |
| - **Docker Image:** rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463 | |
| # Model Quantization | |
| The model was quantized from [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Only the MoE layers are quantized: their weights (static) and activations (dynamic) are both quantized to OCP MXFP4. **Please note that a custom quantization script is required; it is included in this repository (`step3p5_quantize_quark.py`).** | |
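To make the MXFP4 format concrete, the sketch below fake-quantizes one block of values the way the OCP Microscaling format describes it: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 float. This is an illustration only, not AMD-Quark's implementation; the function names and the tie-breaking rule are assumptions made for the example.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is stored separately).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # MXFP4 shares one scale across 32 elements
E2M1_EMAX = 2    # exponent of the largest E2M1 normal value (6 = 1.5 * 2**2)


def quantize_block_mxfp4(block):
    """Return (shared_exponent, quantized_elements) for one block."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two exponent, clamped to the E8M0 scale range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp
    quantized = []
    for v in block:
        # Scale, saturate at the largest E2M1 magnitude, snap to the grid.
        # Assumption: ties go to the smaller magnitude here; real kernels
        # typically use round-to-nearest-even.
        magnitude = min(abs(v) / scale, FP4_E2M1_VALUES[-1])
        nearest = min(FP4_E2M1_VALUES, key=lambda c: abs(c - magnitude))
        quantized.append(math.copysign(nearest, v))
    return shared_exp, quantized


def dequantize_block_mxfp4(shared_exp, quantized):
    """Reconstruct approximate real values from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]
```

In these terms, "static" weight quantization means the shared exponents and elements are computed once and stored in the checkpoint, while "dynamic" activation quantization computes them at inference time for each input.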
| **Quantization scripts:** | |
| ``` | |
| python3 step3p5_quantize_quark.py --model_dir $MODEL_DIR \ | |
| --num_calib_data 128 \ | |
| --multi_gpu \ | |
| --trust_remote_code \ | |
| --preset mxfp4_moe_only_no_kvcache \ | |
| --output_dir $output_dir | |
| ``` | |
| For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers. | |
| # Deployment | |
| ### Use with vLLM | |
| This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. | |
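Once a vLLM server is running with this model, it can be queried through vLLM's OpenAI-compatible chat endpoint. The client below is a minimal sketch using only the Python standard library. It assumes the server listens on vLLM's default port (8000) on localhost and serves the model under its repository name; adjust `BASE_URL` and `MODEL_NAME` if your deployment differs. The helper names are made up for this example.

```python
import json
import urllib.request

# Assumptions: local server on vLLM's default port, model served under
# its Hugging Face repository name.
BASE_URL = "http://localhost:8000"
MODEL_NAME = "amd/Step-3.5-Flash-MXFP4"


def build_chat_request(prompt, model=MODEL_NAME, max_tokens=256):
    """Build the JSON body for a /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=600):
    """POST one prompt to the server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    # The OpenAI-compatible schema returns a list of choices.
    return reply["choices"][0]["message"]["content"]
```

With the server up, `chat("What is the capital of France?")` returns the model's answer as a string.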
| ## Evaluation | |
| The model was evaluated on the GSM8K benchmark using the [vLLM](https://docs.vllm.ai/en/latest/) framework. | |
| ### Accuracy | |
| <table> | |
| <tr> | |
| <td><strong>Benchmark</strong> | |
| </td> | |
| <td><strong>stepfun-ai/Step-3.5-Flash (bf16)</strong> | |
| </td> | |
| <td><strong>amd/Step-3.5-Flash-MXFP4 (this model)</strong> | |
| </td> | |
| <td><strong>Recovery</strong> | |
| </td> | |
| </tr> | |
| <tr> | |
| <td>gsm8k (flexible-extract) | |
| </td> | |
| <td>0.8939 | |
| </td> | |
| <td>0.8726 | |
| </td> | |
| <td>97.6% | |
| </td> | |
| </tr> | |
| </table> | |
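The recovery figure in the table is the quantized model's score as a percentage of the bf16 baseline. A short check of that arithmetic, using the two scores reported above (the function name is chosen for this example):

```python
def recovery_percent(baseline_score, quantized_score):
    """Quantized score expressed as a percentage of the baseline score."""
    return 100.0 * quantized_score / baseline_score


# Scores from the table: bf16 baseline and this MXFP4 model.
recovery = recovery_percent(0.8939, 0.8726)
print(f"Recovery: {recovery:.1f}%")  # prints "Recovery: 97.6%"
```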
| ### Reproduction | |
| The GSM8K results were obtained using the vLLM framework, based on the Docker image `rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463`. | |
| ***Note:** Due to model support issues in vLLM for Step-3.5-Flash, a few patches (specified below) need to be applied in order to run inference and evaluation using vLLM.* | |
| #### Preparation in container | |
| ``` | |
| # Reinstall vLLM | |
| pip uninstall vllm -y | |
| git clone https://github.com/vllm-project/vllm.git | |
| cd vllm | |
| git checkout de7dd634b969adc6e5f50cff0cc09c1be1711d01 | |
| pip install -r requirements/rocm.txt | |
| python setup.py develop | |
| cd .. | |
| export QUARK_MXFP4_IMPL="triton" | |
| ``` | |
| Modify `vllm/model_executor/models/step3p5.py` by adding the following `packed_modules_mapping` attribute to the `Step3p5ForCausalLM` class: | |
| ``` | |
| ... | |
| class Step3p5ForCausalLM(nn.Module, SupportsPP, MixtureOfExperts): | |
|     hf_to_vllm_mapper = WeightsMapper( | |
|         orig_to_new_substr={".share_expert.": ".moe.share_expert."} | |
|     ) | |
| +   packed_modules_mapping = { | |
| +       "qkv_proj": [ | |
| +           "q_proj", | |
| +           "k_proj", | |
| +           "v_proj", | |
| +       ], | |
| +       "gate_up_proj": [ | |
| +           "gate_proj", | |
| +           "up_proj", | |
| +       ], | |
| +   } | |
|     def __init__( | |
|         self, | |
|         *, | |
|         vllm_config: VllmConfig, | |
|         prefix: str = "", | |
|     ): | |
|         super().__init__() | |
|         ... | |
| ``` | |
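The mapping records which separate checkpoint projections each fused vLLM module is built from, so that per-layer quantization settings written for `q_proj`, `k_proj`, and so on can be resolved for the fused `qkv_proj` and `gate_up_proj` modules. The sketch below only illustrates that relationship; it is not vLLM code, and the helper name and the example module paths are hypothetical.

```python
# Same mapping as in the patch above.
PACKED_MODULES_MAPPING = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def unpacked_names(fused_name, mapping=PACKED_MODULES_MAPPING):
    """Return the checkpoint-side names covered by one vLLM module name."""
    prefix, _, leaf = fused_name.rpartition(".")
    shards = mapping.get(leaf)
    if shards is None:
        # Not a fused module: the name maps to itself.
        return [fused_name]
    return [f"{prefix}.{shard}" if prefix else shard for shard in shards]


# Hypothetical module path, used only for illustration.
print(unpacked_names("model.layers.0.self_attn.qkv_proj"))
```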
| Additionally, modify the same file (`step3p5.py`) by adding the following MoE expert name normalization to the model's `load_weights` method: | |
| ``` | |
| def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]: | |
|     config = self.config | |
|     assert config.num_attention_groups > 1, "Only support GQA" | |
|     ... | |
|     for name, loaded_weight in weights: | |
|         if name.startswith("model."): | |
|             local_name = name[len("model.") :] | |
|             full_name = name | |
|         else: | |
|             local_name = name | |
|             full_name = f"model.{name}" if name else "model" | |
| +       # Normalize legacy MoE expert naming like ".moe.<E>.gate_proj" to | |
| +       # the ".moe.experts.<E>.gate_proj" format | |
| +       if ".moe.experts." not in local_name and ".moe." in local_name: | |
| +           parts = local_name.split(".moe.", 1) | |
| +           if len(parts) == 2 and "." in parts[1]: | |
| +               expert_and_rest = parts[1] | |
| +               expert_id, remainder = expert_and_rest.split(".", 1) | |
| +               if expert_id.isdigit(): | |
| +                   local_name = f"{parts[0]}.moe.experts.{expert_id}.{remainder}" | |
|         spec_layer = get_spec_layer_idx_from_weight_name(config, full_name) | |
|         if spec_layer is not None: | |
|             continue  # skip spec decode layers for main model | |
|         ... | |
| ``` | |
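The renaming logic in this patch can be tried in isolation. The function below reproduces the same rule as a standalone helper; the function name and the example weight names are made up for illustration and are not taken from the checkpoint.

```python
def normalize_moe_expert_name(local_name):
    """Rewrite '<prefix>.moe.<E>.<rest>' as '<prefix>.moe.experts.<E>.<rest>'.

    Names that already use the '.moe.experts.' form, or whose component
    after '.moe.' is not a numeric expert index, are returned unchanged.
    """
    if ".moe.experts." in local_name or ".moe." not in local_name:
        return local_name
    prefix, rest = local_name.split(".moe.", 1)
    if "." not in rest:
        return local_name
    expert_id, remainder = rest.split(".", 1)
    if not expert_id.isdigit():
        return local_name
    return f"{prefix}.moe.experts.{expert_id}.{remainder}"


# Illustrative names only.
print(normalize_moe_expert_name("layers.3.moe.7.gate_proj.weight"))
# prints "layers.3.moe.experts.7.gate_proj.weight"
```

Because only a numeric component triggers the rewrite, non-expert entries under `.moe.` such as the shared expert are left as they are.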
| Finally, modify `vllm/model_executor/layers/quantization/quark/quark_moe.py` by forcing `self.emulate` to `True` ([alternate resolution](https://github.com/vllm-project/vllm/pull/39436)): | |
| ``` | |
| class QuarkOCP_MX_MoEMethod(QuarkMoEMethod): | |
|     def __init__(...): | |
|         super().__init__(moe) | |
|         ... | |
|         self.model_type = getattr( | |
|             get_current_vllm_config().model_config.hf_config, "model_type", None | |
|         ) | |
| -       self.emulate = ( | |
| -           not current_platform.supports_mx() | |
| -           or not self.ocp_mx_scheme.startswith("w_mxfp4") | |
| -       ) and (self.mxfp4_backend is None or not self.use_rocm_aiter_moe) | |
| +       self.emulate = True | |
|         logger.warning_once( | |
|         ... | |
| ``` | |
| ***Note:** If memory access faults are encountered, ensure that the `QUARK_MXFP4_IMPL="triton"` environment variable is set.* | |
| #### Evaluating model using lm_eval | |
| ``` | |
| lm_eval --model vllm --model_args "pretrained=$MODEL_DIR,attention_backend=ROCM_AITER_UNIFIED_ATTN,quantization=quark,trust_remote_code=True" --tasks gsm8k --batch_size auto | |
| ``` | |
| # License | |
| Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved. |