Step-3.7-Flash-MXFP4-mlx

osmapi/Step-3.7-Flash-MXFP4-mlx is an Apple-Silicon MLX MXFP4 tensor-format quantization of stepfun-ai/Step-3.7-Flash.

No fine-tuning, distillation, or retraining was applied. The upstream StepFun checkpoint was downloaded and verified locally, then eligible text and vision .weight tensors were converted with MLX mode="mxfp4" quantization. Tokenizer, chat template, custom Step3.7 Python modules, and non-quantized control tensors are preserved from the source release.

Compatibility Status

This upload is a standard MLX MXFP4 safetensors bundle, but it is not yet a drop-in mlx_lm.load(...) or mlx_vlm.load(...) model.

At conversion time, vanilla mlx-lm 0.31.3 and mlx-vlm 0.5.0 did not register model_type: step3p7. Upstream mlx-lm had Step3p5 support, but no Step3p7 model class turned up in the installed packages, the current upstream source trees, open pull requests, the Hugging Face community page, the Reddit LocalLLaMA discussion, or an X/Twitter search.
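The registration check can be repeated locally by asking Python whether a module named after the model type exists under a package's models namespace. This is a minimal sketch, assuming both packages resolve model_type to a module at `<package>.models.<model_type>`; that matches how mlx-lm and mlx-vlm have looked up architectures, but it is not a documented stable API. `has_model_module` is a hypothetical helper name.

```python
import importlib.util


def has_model_module(package: str, model_type: str) -> bool:
    """Return True if `<package>.models.<model_type>` can be located.

    Assumption: the runtime maps model_type to a module of that name.
    Returns False if the package, its models namespace, or the module is missing.
    """
    try:
        spec = importlib.util.find_spec(f"{package}.models.{model_type}")
    except ImportError:
        # Raised when a parent package is absent or fails to import.
        return False
    return spec is not None


if __name__ == "__main__":
    for pkg in ("mlx_lm", "mlx_vlm"):
        print(pkg, "step3p7 module found:", has_model_module(pkg, "step3p7"))
```

A False result for both packages is consistent with the status described above; a True result after an upgrade suggests the loader situation has changed and is worth retesting.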

This repository is intended for MLX runtime authors, loader implementers, and researchers who want a verified MXFP4 Step-3.7-Flash tensor bundle. Native inference will require Step3p7 model-class support in MLX/MLX-LM/MLX-VLM or a compatible custom loader.

Mac Inference Guidance

The best current uses of this repository on Apple Silicon are:

| Use case | Recommended path | Status |
| --- | --- | --- |
| Tensor inspection and loader development | Direct MLX tensor loading with mlx.core.load | Works for inspecting shards and validating MXFP4 tensors |
| Native text inference | mlx-lm with a future or custom Step3p7 model implementation | Best MLX target once Step3p7 support exists |
| Native vision-language inference | mlx-vlm with a future or custom Step3p7 model implementation | Best Mac multimodal target once Step3p7 support exists |
| App-style local chat | A Mac app or server backed by the above custom MLX loader | Requires Step3p7 loader work |
| llama.cpp, LM Studio, Ollama, Jan, or similar GGUF apps | Use a GGUF Step-3.7-Flash quant instead of this MLX repo | This MXFP4 MLX tensor bundle is not GGUF |

In short: this is the right artifact for MLX implementers and high-memory Apple-Silicon experiments. For a normal point-and-click local chat app today, users should wait for Step3p7 support in MLX tooling or use a GGUF build made specifically for llama.cpp-compatible runtimes.

Mac Memory And Storage Recommendations

This MXFP4 bundle is about 101 GB on disk. Runtime memory is not just file size: the loader, KV cache, vision inputs, tokenizer/processor state, temporary tensors, OS memory pressure, and any dequantization or unsupported fallback path can add substantial overhead.

| Mac unified memory | Recommendation |
| --- | --- |
| 64 GB or less | Not recommended for this MXFP4 bundle. Use a smaller/distilled model or a much smaller GGUF quant. |
| 96 GB | Likely too tight for practical use; may only be useful for partial tensor inspection or loader debugging. |
| 128 GB | Minimum class to try local inference with a memory-efficient native MLX Step3p7 loader, short context, and no other heavy apps. |
| 192 GB | Recommended practical target for local experiments, multimodal use, and moderate context once a compatible loader exists. |
| 256 GB+ | Best target for long-context experiments, server use, larger image batches, and safer headroom. |

Storage recommendation: keep at least 130 GB free for the repository itself, and more if using Hugging Face cache, duplicate downloads, conversion workspaces, or multiple quant variants. For development, 220 GB or more free space is more comfortable.
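Before starting a download, the free-space guidance can be checked with the standard library. This is a small sketch; the 130 GB default mirrors the recommendation above, and `free_gb` and `enough_space` are hypothetical helper names.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Free space on the volume containing `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def enough_space(path: str = ".", required_gb: float = 130.0) -> bool:
    """True if the volume has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"free: {free_gb('.'):.1f} GB, enough for the bundle: {enough_space('.')}")
```

Point `path` at the volume that will actually hold the files, for example the Hugging Face cache location if the download goes through the cache.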

Context recommendation: Step-3.7-Flash advertises a very long context window, but full-context local inference will be dominated by KV cache memory. Start with short contexts, then raise context length only after measuring memory use on the target Mac. If the runtime exposes KV-cache limits, set them deliberately.
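To see why context length dominates, a rough KV-cache estimate helps. The sketch below assumes a conventional attention cache that stores one key and one value vector per layer, per KV head, per token; Step3p7 may use a different attention layout, so treat the result as an order-of-magnitude guide. The layer count, KV-head count, and head dimension in the usage example are placeholders, not values read from this model's config.json.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 corresponds to a 16-bit cache (assumption).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Placeholder architecture numbers, for illustration only.
    for tokens in (4_096, 32_768, 262_144):
        size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=tokens)
        print(f"{tokens:>7} tokens -> {size / 1e9:.2f} GB")
```

The estimate grows linearly with sequence length, which is why measuring at a short context first and scaling up gradually is the safer path.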

Model Details

| Property | Value |
| --- | --- |
| Base model | stepfun-ai/Step-3.7-Flash |
| Architecture | Step3p7 sparse MoE vision-language model |
| Parameters | 198B total, about 11B active per token |
| Context length | 256k |
| Vision encoder | 1.8B perception encoder, preserved in 2 vision shards |
| Local profile | MLX-MXFP4 |
| Bundle size | About 101 GB |
| Shards | 24 text safetensors + 2 vision safetensors |
| Source license | Apache-2.0 |
| Validation | HF file diff, safetensors index validation, config metadata validation, MLX tensor sample load |

Quantization Recipe

| Tensor class | Codec | Bits / handling |
| --- | --- | --- |
| Linear, embedding, MoE, and vision .weight tensors with compatible shape | MLX MXFP4 | 4-bit FP4/E2M1, group size 32 |
| Quantized tensor layout | MLX floating-point quantized layout | .weight, .scales |
| Scale format | MX E8M0 | one shared 8-bit scale per group |
| Norms, biases, routing/control tensors, and incompatible tensors | passthrough | source precision preserved |

MXFP4 tensors do not use affine biases. Quantized tensors use:

  • .weight
  • .scales
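
For loader authors, decoding one packed word can be sketched in plain Python. The E2M1 value table and the E8M0 exponent bias of 127 follow the OCP Microscaling (MX) format definition. The low-nibble-first packing order inside each uint32 is an assumption and should be verified against the MLX kernels before relying on it; `unpack_word` and the other function names are hypothetical helpers.

```python
# Magnitudes for the low 3 bits of a 4-bit E2M1 code (OCP MX format).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit FP4 code: bit 3 is the sign, bits 0-2 select the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_e8m0(scale_byte: int) -> float:
    """Decode a shared E8M0 scale: a power of two with bias 127; 0xFF encodes NaN."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def unpack_word(word: int, scale_byte: int) -> list[float]:
    """Decode the eight FP4 values in one uint32, low nibble first (assumed order)."""
    scale = decode_e8m0(scale_byte)
    return [decode_e2m1((word >> (4 * i)) & 0xF) * scale for i in range(8)]


if __name__ == "__main__":
    print(unpack_word(0x76543210, 127))  # scale byte 127 means a scale of 1.0
```

Because the scale is a pure power of two and there is no zero point, a group of 32 values needs only its 4-bit codes and one scale byte, which is why no .biases tensor appears.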

Conversion summary:

| Metric | Value |
| --- | --- |
| Quantized weights | 702 |
| Passthrough tensors | 769 |
| Group size | 32 |
| Quantization bits | 4 |
| Mode | mxfp4 |
| Bias tensors | none |
| Effective BPW | About 4.25, including E8M0 scale overhead |

The effective BPW estimate is 4 + 8/32: 4 data bits plus one 8-bit scale per group of 32 values.
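
The same arithmetic, written out as a short sketch. The helper names are hypothetical, and the size function covers quantized values only; passthrough tensors kept at source precision are not included, so it will not reproduce the bundle size exactly.

```python
def effective_bpw(data_bits: int = 4, scale_bits: int = 8, group_size: int = 32) -> float:
    """Bits per weight: data bits plus the per-group scale amortized over the group."""
    return data_bits + scale_bits / group_size


def quantized_gigabytes(n_values: float, bpw: float) -> float:
    """Decimal gigabytes needed to store `n_values` quantized weights at `bpw` bits each."""
    return n_values * bpw / 8 / 1e9


if __name__ == "__main__":
    bpw = effective_bpw()
    print(f"effective BPW: {bpw}")
    print(f"1e9 quantized values: {quantized_gigabytes(1e9, bpw):.3f} GB")
```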

Files

  • model-00001.safetensors to model-00024.safetensors: text/model shards in MLX MXFP4 tensor format.
  • model-vit-00001.safetensors and model-vit-00002.safetensors: vision encoder shards in MLX MXFP4 tensor format.
  • model.safetensors.index.json: rewritten safetensors index for the quantized weight/scale tensors.
  • mlx_quantization_manifest.json: conversion manifest with quantized/passthrough tensor counts and tensor-level metadata.
  • config.json: upstream config with added MLX quantization metadata.
  • configuration_step3p7.py, modeling_step3p7.py, processing_step3.py, vision_encoder.py: upstream custom Step3.7 code.
  • tokenizer.json, tokenizer_config.json, special_tokens_map.json, chat_template.jinja: upstream tokenizer and prompt assets.

All 26 safetensors shards are required. The index references every model-*.safetensors shard and both model-vit-*.safetensors shards; none are extra convenience files. A downloader needs the full shard set plus model.safetensors.index.json, config.json, tokenizer files, chat template, and custom Step3.7 Python modules for a complete local model bundle.
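
A local download can be checked for missing shards by reading the index. This sketch assumes the index uses the standard Hugging Face layout, where a top-level "weight_map" object maps tensor names to shard filenames; `missing_shards` is a hypothetical helper name.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """List shard files named in model.safetensors.index.json that are absent on disk."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # Many tensors share a shard, so deduplicate the filenames.
    referenced = sorted(set(index["weight_map"].values()))
    return [name for name in referenced if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    index_path = Path(target) / "model.safetensors.index.json"
    if index_path.is_file():
        print("missing shards:", missing_shards(target) or "none")
    else:
        print("no model.safetensors.index.json found in", target)
```

An empty list means every shard named in the index is present; it does not verify file integrity or the presence of config, tokenizer, and custom Python files.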

mlx_quantization_manifest.json is not required for inference, but it is intentionally included for auditability, debugging, and reproducibility of the MXFP4 conversion.

Tensor Inspection

Until Step3p7 support lands in an MLX runtime, use MLX tensor loading for inspection or custom loader development:

import mlx.core as mx

tensors = mx.load("model-00001.safetensors")
prefix = "model.layers.0.self_attn.q_proj"

print(tensors[prefix + ".weight"].shape, tensors[prefix + ".weight"].dtype)
print(tensors[prefix + ".scales"].shape, tensors[prefix + ".scales"].dtype)
print(prefix + ".biases" in tensors)

Representative local verification for that tensor returned:

| Tensor | Shape | Dtype |
| --- | --- | --- |
| model.layers.0.self_attn.q_proj.weight | (8192, 512) | uint32 |
| model.layers.0.self_attn.q_proj.scales | (8192, 128) | uint8 |
| model.layers.0.self_attn.q_proj.biases | absent | none |
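
These shapes are consistent with 4-bit values packed eight per uint32 and one scale per group of 32: 512 × 8 = 4096 input features, and 4096 / 32 = 128 scales per row. The input width of 4096 is inferred from the shapes rather than read from config.json. A minimal sketch of that relationship, with `expected_mxfp4_shapes` as a hypothetical helper:

```python
def expected_mxfp4_shapes(out_features: int, in_features: int,
                          group_size: int = 32, bits: int = 4):
    """Return the expected (weight_shape, scales_shape) for a packed MXFP4 matrix.

    Assumes values are packed into 32-bit words along the input dimension
    and that each group of `group_size` inputs shares one scale.
    """
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    values_per_word = 32 // bits  # 8 values per uint32 at 4 bits
    weight_shape = (out_features, in_features // values_per_word)
    scales_shape = (out_features, in_features // group_size)
    return weight_shape, scales_shape


if __name__ == "__main__":
    print(expected_mxfp4_shapes(8192, 4096))
```

A custom loader can run the same check on every quantized tensor pair to catch a mismatched group size or a truncated shard early.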

Limitations

  • This is a tensor-format MLX MXFP4 conversion, not a complete native Step3p7 MLX inference implementation.
  • Current vanilla mlx-lm and mlx-vlm releases need Step3p7 architecture support before this can be used as a normal one-line load/generate model.
  • MXFP4 is more compact than the 8-bit affine variant, but quality has not been benchmarked after conversion.
  • Multimodal prompt plumbing depends on future Step3p7 loader/runtime support.
  • Behavior, benchmark scores, and deployment claims come from the upstream StepFun release; this quantization has not been re-benchmarked.

Credits

Thank you to both sides of this release:

  • Quantization & release: osmAPI research team and Terv Student Research Team
  • Foundation model: StepFun, creators of stepfun-ai/Step-3.7-Flash

License: Apache-2.0, following the upstream StepFun release.
