Instructions for using amd/Step-3.5-Flash-MXFP4 with libraries and local apps.

Libraries

How to use amd/Step-3.5-Flash-MXFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="amd/Step-3.5-Flash-MXFP4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("amd/Step-3.5-Flash-MXFP4", trust_remote_code=True, dtype="auto")

vLLM

How to use amd/Step-3.5-Flash-MXFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "amd/Step-3.5-Flash-MXFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "amd/Step-3.5-Flash-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/amd/Step-3.5-Flash-MXFP4

SGLang

How to use amd/Step-3.5-Flash-MXFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "amd/Step-3.5-Flash-MXFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "amd/Step-3.5-Flash-MXFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "amd/Step-3.5-Flash-MXFP4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use amd/Step-3.5-Flash-MXFP4 with Docker Model Runner:
```
docker model run hf.co/amd/Step-3.5-Flash-MXFP4
```

	---
	license: apache-2.0
	base_model:
	- stepfun-ai/Step-3.5-Flash
	library_name: transformers
	---

	# Model Overview

	- Model Architecture: Step3p5ForCausalLM
	- Input: Text
	- Output: Text
	- Supported Hardware Microarchitecture: AMD MI350/MI355
	- ROCm: 7.1.0
	- PyTorch: 2.10.0
	- Transformers: 4.57.6
	- Operating System(s): Linux
	- Inference Engine: [vLLM](https://docs.vllm.ai/en/latest/)
	- Model Optimizer: [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
	- Weight quantization: MoE-only, OCP MXFP4, Static
	- Activation quantization: MoE-only, OCP MXFP4, Dynamic
	- Docker Image: rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463


	# Model Quantization

	The model was quantized from [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Only the MoE layers are quantized: their weights (static) and activations (dynamic) both use OCP MXFP4. A custom quantization script is required and is included in this repository (`step3p5_quantize_quark.py`).


	Quantization script:
	```
	python3 step3p5_quantize_quark.py --model_dir $MODEL_DIR \
	--num_calib_data 128 \
	--multi_gpu \
	--trust_remote_code \
	--preset mxfp4_moe_only_no_kvcache \
	--output_dir $output_dir
	```
	For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers.
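
To make the MXFP4 format more concrete, here is a minimal pure-Python sketch of a quantize-then-dequantize round trip for one block. It assumes the layout described in the OCP Microscaling specification: blocks of 32 values that share one power-of-two scale, with each element stored as FP4 E2M1. It is an illustration only, not the AMD-Quark implementation; the function name is hypothetical and tie-breaking is simplified to "first nearest value".

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest power of two in E2M1 (4.0 == 2**2)
BLOCK_SIZE = 32    # number of elements that share one scale (assumed, per OCP MX)


def mxfp4_fake_quantize_block(block):
    """Quantize one block to MXFP4-style values and dequantize it again.

    Hypothetical helper for illustration; real kernels pack bits and
    use round-to-nearest-even instead of the simple search below.
    """
    if len(block) > BLOCK_SIZE:
        raise ValueError(f"a block holds at most {BLOCK_SIZE} values")
    max_abs = max((abs(x) for x in block), default=0.0)
    if max_abs == 0.0:
        return [0.0] * len(block)

    # Shared power-of-two scale chosen so the largest magnitude lands
    # in the top binade of E2M1.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp

    result = []
    for x in block:
        # Scale down, saturate at the largest E2M1 magnitude, snap to the grid.
        magnitude = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - magnitude))
        result.append(math.copysign(nearest * scale, x))
    return result


if __name__ == "__main__":
    print(mxfp4_fake_quantize_block([6.0, 3.0, 1.0, -0.5, 0.2]))
```

Because the scale is shared, small values in a block with a large outlier lose precision or collapse to zero, which is one reason calibration data is used during quantization.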

	# Deployment
	## Use with vLLM

	This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
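
Once a vLLM server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below builds such a request with the Python standard library only. The host and port (`localhost:8000`) are assumptions matching vLLM's default, and `build_chat_request` is a hypothetical helper name, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "amd/Step-3.5-Flash-MXFP4"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions.

    base_url is an assumption: adjust it to wherever the server listens.
    """
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_chat_request("What is the capital of France?")
    print(request.full_url)
    # To actually send it against a running server:
    #   with urllib.request.urlopen(request) as response:
    #       print(json.load(response))
```

Building the request separately from sending it keeps the payload easy to inspect before any server is available.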

	## Evaluation
	The model was evaluated on the GSM8K benchmark using the [vLLM](https://docs.vllm.ai/en/latest/) framework.

	### Accuracy

	<table>
	<tr>
	<td><strong>Benchmark</strong>
	</td>
	<td><strong>stepfun-ai/Step-3.5-Flash (bf16)</strong>
	</td>
	<td><strong>amd/Step-3.5-Flash-MXFP4 (this model)</strong>
	</td>
	<td><strong>Recovery</strong>
	</td>
	</tr>
	<tr>
	<td>gsm8k (flexible-extract)
	</td>
	<td>0.8939
	</td>
	<td>0.8726
	</td>
	<td>97.6%
	</td>
	</tr>
	</table>
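
The Recovery column is the quantized score expressed as a percentage of the bf16 baseline score. A short sketch of that arithmetic, using the two scores from the table (the function name is illustrative):

```python
def recovery_percent(quantized_score, baseline_score):
    """Return the quantized score as a percentage of the baseline score."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * quantized_score / baseline_score


if __name__ == "__main__":
    # gsm8k (flexible-extract): MXFP4 model vs. bf16 baseline
    print(f"{recovery_percent(0.8726, 0.8939):.1f}%")  # prints 97.6%
```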


	### Reproduction

	The GSM8K results were obtained using the vLLM framework, based on the Docker image `rocm/vllm-dev@sha256:63f1fe04d87376bb173a1e837fba8610ab2dd77039fe7c9b97195f2a89d4d463`.

	*Note: Due to model support issues in vLLM for Step-3.5-Flash, the patches specified below must be applied in order to run inference and evaluation with vLLM.*

	#### Preparation in container
	```
	# Reinstall vLLM
	pip uninstall vllm -y
	git clone https://github.com/vllm-project/vllm.git
	cd vllm
	git checkout de7dd634b969adc6e5f50cff0cc09c1be1711d01
	pip install -r requirements/rocm.txt
	python setup.py develop
	cd ..
	export QUARK_MXFP4_IMPL="triton"
	```
	Modify `vllm/model_executor/models/step3p5.py` by adding the following `packed_modules_mapping` attribute to the `Step3p5ForCausalLM` class:
	```
	...

	class Step3p5ForCausalLM(nn.Module, SupportsPP, MixtureOfExperts):
	    hf_to_vllm_mapper = WeightsMapper(
	        orig_to_new_substr={".share_expert.": ".moe.share_expert."}
	    )

	+   packed_modules_mapping = {
	+       "qkv_proj": [
	+           "q_proj",
	+           "k_proj",
	+           "v_proj",
	+       ],
	+       "gate_up_proj": [
	+           "gate_proj",
	+           "up_proj",
	+       ],
	+   }

	    def __init__(
	        self,
	        *,
	        vllm_config: VllmConfig,
	        prefix: str = "",
	    ):
	        super().__init__()
	        ...
	```
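
The mapping above tells the loader which separate checkpoint projections belong to which fused module. As a rough illustration of that idea (not vLLM's actual loading code), the hypothetical helper below rewrites a checkpoint weight name to its fused name and reports the shard position inside the fused module:

```python
PACKED_MODULES_MAPPING = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}


def to_packed_name(weight_name, mapping=PACKED_MODULES_MAPPING):
    """Map e.g. '...k_proj.weight' to ('...qkv_proj.weight', 1).

    Returns (weight_name, None) when the name is not part of a packed module.
    Hypothetical helper for illustration only.
    """
    parts = weight_name.split(".")
    for packed_name, shard_names in mapping.items():
        for shard_index, shard_name in enumerate(shard_names):
            if shard_name in parts:
                position = parts.index(shard_name)
                fused = parts[:position] + [packed_name] + parts[position + 1:]
                return ".".join(fused), shard_index
    return weight_name, None


if __name__ == "__main__":
    print(to_packed_name("layers.0.self_attn.k_proj.weight"))
    print(to_packed_name("layers.0.self_attn.o_proj.weight"))
```

The example weight names are made up for the demonstration; the real names depend on the checkpoint.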
	Additionally, modify the same file (`step3p5.py`) by adding the following MoE expert name mapping to the model's `load_weights` function:
	```
	def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
	    config = self.config
	    assert config.num_attention_groups > 1, "Only support GQA"

	    ...

	    for name, loaded_weight in weights:
	        if name.startswith("model."):
	            local_name = name[len("model.") :]
	            full_name = name
	        else:
	            local_name = name
	            full_name = f"model.{name}" if name else "model"

	+       # Normalize legacy MoE expert naming like ".moe.<E>.gate_proj" to
	+       # the ".moe.experts.<E>.gate_proj" format
	+       if ".moe.experts." not in local_name and ".moe." in local_name:
	+           parts = local_name.split(".moe.", 1)
	+           if len(parts) == 2 and "." in parts[1]:
	+               expert_and_rest = parts[1]
	+               expert_id, remainder = expert_and_rest.split(".", 1)
	+               if expert_id.isdigit():
	+                   local_name = f"{parts[0]}.moe.experts.{expert_id}.{remainder}"

	        spec_layer = get_spec_layer_idx_from_weight_name(config, full_name)
	        if spec_layer is not None:
	            continue  # skip spec decode layers for main model
	        ...
	```
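
To see what the added normalization does in isolation, the same logic can be lifted into a standalone function and tried on a few names. The function name and the sample weight names are hypothetical; only the renaming rule is taken from the patch above.

```python
def normalize_moe_expert_name(local_name):
    """Rewrite '.moe.<E>.<rest>' to '.moe.experts.<E>.<rest>'.

    Names that already use '.moe.experts.' or whose segment after
    '.moe.' is not a numeric expert id are returned unchanged.
    """
    if ".moe.experts." not in local_name and ".moe." in local_name:
        parts = local_name.split(".moe.", 1)
        if len(parts) == 2 and "." in parts[1]:
            expert_id, remainder = parts[1].split(".", 1)
            if expert_id.isdigit():
                return f"{parts[0]}.moe.experts.{expert_id}.{remainder}"
    return local_name


if __name__ == "__main__":
    for name in (
        "layers.3.moe.5.gate_proj.weight",          # legacy form, rewritten
        "layers.3.moe.experts.5.gate_proj.weight",  # already normalized
        "layers.3.moe.share_expert.up_proj.weight", # not a numeric expert id
    ):
        print(name, "->", normalize_moe_expert_name(name))
```

Only names with a purely numeric segment directly after `.moe.` are touched, so shared-expert and gate weights pass through untouched.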
	Finally, modify `vllm/model_executor/layers/quantization/quark/quark_moe.py` by forcing `self.emulate` to `True` ([alternate resolution](https://github.com/vllm-project/vllm/pull/39436)):
	```
	class QuarkOCP_MX_MoEMethod(QuarkMoEMethod):
	    def __init__(...):
	        super().__init__(moe)
	        ...

	        self.model_type = getattr(
	            get_current_vllm_config().model_config.hf_config, "model_type", None
	        )

	-       self.emulate = (
	-           not current_platform.supports_mx()
	-           or not self.ocp_mx_scheme.startswith("w_mxfp4")
	-       ) and (self.mxfp4_backend is None or not self.use_rocm_aiter_moe)
	+       self.emulate = True

	        logger.warning_once(
	        ...
	```

	*Note: If memory access faults are encountered, ensure that the `QUARK_MXFP4_IMPL="triton"` environment variable is set.*


	#### Evaluating model using lm_eval
	```
	lm_eval --model vllm --model_args "pretrained=$MODEL_DIR,attention_backend=ROCM_AITER_UNIFIED_ATTN,quantization=quark,trust_remote_code=True" --tasks gsm8k --batch_size auto
	```
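
When the evaluation is launched from a script, assembling the arguments as a list avoids shell quoting and variable-expansion pitfalls. The sketch below only builds the command; `build_lm_eval_command` and the example model path are hypothetical, while the flags and `model_args` keys are the ones used in the command above.

```python
import shlex


def build_lm_eval_command(model_dir):
    """Return the lm_eval invocation as an argument list for subprocess.run."""
    model_args = ",".join(
        [
            f"pretrained={model_dir}",
            "attention_backend=ROCM_AITER_UNIFIED_ATTN",
            "quantization=quark",
            "trust_remote_code=True",
        ]
    )
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", model_args,
        "--tasks", "gsm8k",
        "--batch_size", "auto",
    ]


if __name__ == "__main__":
    # The path below is a placeholder, not a location defined by this repository.
    command = build_lm_eval_command("/models/Step-3.5-Flash-MXFP4")
    print(shlex.join(command))
    # To run it: subprocess.run(command, check=True)
```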


	# License
	Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.