Step-3.7-Flash-oQ5-MLX

This repository contains an oMLX oQ5 mixed-precision MLX quantization of stepfun-ai/Step-3.7-Flash.

Step-3.7-Flash is a sparse Mixture-of-Experts vision-language model from StepFun. The upstream model card describes it as a 198B-parameter model with a 196B language backbone, a 1.8B vision encoder, approximately 11B active parameters per token, and a 256K context window.

Quantization

Method: oMLX oQ mixed-precision MLX
Quantization: oQ5
Base model revision: 5f6244077ac62e04eec3f320501ff8c2b293373a
Model type: step3p7 / step3p5 text backbone
Group size: 64
Quantization mode: affine
Base bits: 5
Effective plan: 5.79 bpw
Output shards: 27 safetensors
Output size: 131.0 GiB
Non-quantized/scales dtype: bfloat16
Vision weights: preserved
Native MTP weights: not present in upstream weights
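The gap between "Base bits 5" and "Effective plan 5.79 bpw" can be sanity-checked with a little arithmetic. The sketch below assumes MLX's affine mode stores one scale and one bias per group, each in bfloat16 (matching the scales dtype listed above), so a uniform 5-bit plan at group size 64 costs 5 + 32/64 = 5.5 bpw. The remaining bits presumably come from layers that the mixed-precision plan keeps at higher precision. The size estimate uses the upstream 198B parameter count and is only a rough cross-check against the reported 131.0 GiB; the exact accounting behind the 5.79 bpw figure is not documented here, so the numbers are not expected to match precisely. The function names are illustrative.

```python
def affine_bpw(bits: int, group_size: int, param_bits: int = 16) -> float:
    """Bits per weight for group-wise affine quantization.

    Assumes one scale and one bias per group, each stored in a
    param_bits-wide dtype (16 for bfloat16).
    """
    return bits + 2 * param_bits / group_size


def size_gib(num_params: float, bpw: float) -> float:
    """Rough size in GiB for num_params weights at bpw bits each (no headers)."""
    return num_params * bpw / 8 / 2**30


base = affine_bpw(bits=5, group_size=64)  # 5 + 32/64 = 5.5 bpw
print(f"uniform 5-bit, group 64: {base} bpw")
print(f"extra from higher-precision layers: {5.79 - base:.2f} bpw")
print(f"198B params at 5.79 bpw: {size_gib(198e9, 5.79):.1f} GiB")
```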

Notes

Vision weights are preserved from the upstream model.

Native MTP weights are not included in this artifact. The upstream text_config declares num_nextn_predict_layers=3, but the published safetensors index contains no mtp.* weights, so oMLX produces a quantized output that is self-consistent with Native MTP disabled.
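The decision described above can be sketched as a small check, assuming config.json and model.safetensors.index.json have already been loaded as dicts with json.load. The function name is hypothetical and this is not oMLX's actual code; it only illustrates why a config that declares MTP layers, combined with an index that has no mtp.* tensors, leads to running with Native MTP off.

```python
def native_mtp_usable(config: dict, weight_map: dict) -> bool:
    """Return True only if MTP layers are declared AND their weights exist.

    config: parsed config.json (MTP layer count read from text_config).
    weight_map: the "weight_map" dict from model.safetensors.index.json,
    mapping tensor names to shard file names.
    """
    declared = config.get("text_config", {}).get("num_nextn_predict_layers", 0)
    has_mtp_weights = any(name.startswith("mtp.") for name in weight_map)
    # Declared layers without weights (the upstream situation) cannot be
    # loaded consistently, so Native MTP has to stay disabled.
    return declared > 0 and has_mtp_weights


upstream_like = {"text_config": {"num_nextn_predict_layers": 3}}
print(native_mtp_usable(upstream_like, {"model.layers.0.mlp.gate.weight": "model-00001.safetensors"}))
```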

This is an unofficial quantized derivative. It is not affiliated with, sponsored by, or endorsed by StepFun.

Validation

Artifact validation was completed locally with the bundled oMLX runtime on macOS:

source model: stepfun-ai/Step-3.7-Flash
source revision: 5f6244077ac62e04eec3f320501ff8c2b293373a
quantization: oQ5
config.json: present
model.safetensors.index.json: present
safetensor shards: 27
vision tensors: present
mtp tensors: not present
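The presence checks listed above can be approximated on a downloaded copy with a short script. This is a sketch, not the oMLX validator: it reads the standard weight_map from model.safetensors.index.json, counts the distinct shard files it references, and flags whether any tensor names start with a vision prefix or with mtp.. The default vision prefix "vision_model." is a hypothetical placeholder; inspect the index and substitute the prefix this checkpoint actually uses.

```python
import json
from pathlib import Path


def summarize_weight_map(weight_map: dict, vision_prefix: str) -> dict:
    """Summarize a safetensors index weight_map (tensor name -> shard file)."""
    return {
        "safetensor shards": len(set(weight_map.values())),
        "vision tensors": any(n.startswith(vision_prefix) for n in weight_map),
        "mtp tensors": any(n.startswith("mtp.") for n in weight_map),
    }


def validate_checkpoint(model_dir: str, vision_prefix: str = "vision_model.") -> dict:
    """Presence checks for a local model directory.

    vision_prefix is a hypothetical default; adjust it to the real key prefix.
    """
    root = Path(model_dir)
    index_path = root / "model.safetensors.index.json"
    report = {
        "config.json": (root / "config.json").is_file(),
        "model.safetensors.index.json": index_path.is_file(),
    }
    if report["model.safetensors.index.json"]:
        weight_map = json.loads(index_path.read_text())["weight_map"]
        report.update(summarize_weight_map(weight_map, vision_prefix))
        # Every shard named in the index should also exist on disk.
        report["missing shards"] = sorted(
            f for f in set(weight_map.values()) if not (root / f).is_file()
        )
    return report
```

For example, `validate_checkpoint("Step-3.7-Flash-oQ5-MLX")` returns a dict that can be compared field by field with the list above.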

Generation smoke testing is intentionally not claimed here: Step-3.7-Flash is a very large VLM/MoE checkpoint, and runtime support depends on the local MLX/oMLX build and the available unified memory.

Usage

Use an MLX/oMLX build that supports the packaged Step3p7 architecture and multimodal processor.

huggingface-cli download \
  --local-dir Step-3.7-Flash-oQ5-MLX \
  dawncr0w/Step-3.7-Flash-oQ5-MLX

For a text-only smoke test, adapt the command to your local MLX/oMLX runtime:

python -m mlx_lm generate \
  --model /path/to/Step-3.7-Flash-oQ5-MLX \
  --prompt "Hello" \
  --max-tokens 32 \
  --temp 0

For multimodal inference, use an oMLX/MLX-VLM build that supports Step3p7 image-text-to-text models and pass the model directory as the local checkpoint.

License And Notice

The base model is distributed under the Apache License 2.0. This quantized artifact follows the same license.
