license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: image-text-to-text
base_model: stepfun-ai/Step-3.7-Flash
base_model_relation: quantized
tags:
- mlx
- safetensors
- stepfun
- step3p7
- step-3.7-flash
- vision-language
- multimodal
- moe
- quantized
- mxfp4
- microscaling
- fp4
- apple-silicon
- image-text-to-text
- text-generation
Step-3.7-Flash-MXFP4-mlx
osmapi/Step-3.7-Flash-MXFP4-mlx is an Apple-Silicon MLX MXFP4 tensor-format quantization of stepfun-ai/Step-3.7-Flash.
No fine-tuning, distillation, or retraining was applied. The upstream StepFun checkpoint was downloaded and verified locally, then eligible text and vision .weight tensors were converted with MLX mode="mxfp4" quantization. Tokenizer, chat template, custom Step3.7 Python modules, and non-quantized control tensors are preserved from the source release.
Compatibility Status
This upload is a standard MLX MXFP4 safetensors bundle, but it is not yet a drop-in mlx_lm.load(...) or mlx_vlm.load(...) model.
At conversion time, vanilla mlx-lm 0.31.3 and mlx-vlm 0.5.0 did not register model_type: step3p7. Upstream mlx-lm had Step3p5 support, but no Step3p7 model class was found in the installed packages, current upstream trees, open PR search, Hugging Face community page, Reddit LocalLLaMA discussion, or X/Twitter search.
This repository is intended for MLX runtime authors, loader implementers, and researchers who want a verified MXFP4 Step-3.7-Flash tensor bundle. Native inference will require Step3p7 model-class support in MLX/MLX-LM/MLX-VLM or a compatible custom loader.
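To re-check this status against whatever versions are installed locally, a minimal sketch follows. It assumes that both packages resolve an architecture by importing a module named `<package>.models.<model_type>`; that layout is an assumption about the loaders, and the helper name `has_model_module` is hypothetical.

```python
import importlib.util


def has_model_module(package: str, model_type: str) -> bool:
    """Return True if `<package>.models.<model_type>` can be located.

    Assumption: the runtime looks up architectures by module name under
    `<package>.models`. A missing package or submodule counts as "not found".
    """
    try:
        spec = importlib.util.find_spec(f"{package}.models.{model_type}")
    except ImportError:
        # Raised when the package (or its `models` subpackage) is not importable.
        return False
    return spec is not None


for package in ("mlx_lm", "mlx_vlm"):
    print(package, "step3p7 module found:", has_model_module(package, "step3p7"))
```

A result of `False` for both packages matches the situation described above; `True` would suggest that Step3p7 support has landed since this conversion.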
Mac Inference Guidance
The best current uses of this repository on Apple Silicon are:
| Use case | Recommended path | Status |
|---|---|---|
| Tensor inspection and loader development | Direct MLX tensor loading with mlx.core.load | Works for inspecting shards and validating MXFP4 tensors |
| Native text inference | mlx-lm with a future or custom Step3p7 model implementation | Best MLX target once Step3p7 support exists |
| Native vision-language inference | mlx-vlm with a future or custom Step3p7 model implementation | Best Mac multimodal target once Step3p7 support exists |
| App-style local chat | A Mac app or server backed by the above custom MLX loader | Requires Step3p7 loader work |
| llama.cpp, LM Studio, Ollama, Jan, or similar GGUF apps | Use a GGUF Step-3.7-Flash quant instead of this MLX repo | This MXFP4 MLX tensor bundle is not GGUF |
In short: this is the right artifact for MLX implementers and high-memory Apple-Silicon experiments. For a normal point-and-click local chat app today, users should wait for Step3p7 support in MLX tooling or use a GGUF build made specifically for llama.cpp-compatible runtimes.
Useful upstream/runtime references:
- MLX: https://github.com/ml-explore/mlx
- MLX-LM: https://github.com/ml-explore/mlx-lm
- MLX-VLM: https://github.com/Blaizzy/mlx-vlm
- StepFun base model: https://huggingface.co/stepfun-ai/Step-3.7-Flash
Mac Memory And Storage Recommendations
This MXFP4 bundle is about 101 GB on disk. Runtime memory is not just file size: the loader, KV cache, vision inputs, tokenizer/processor state, temporary tensors, OS memory pressure, and any dequantization or unsupported fallback path can add substantial overhead.
| Mac unified memory | Recommendation |
|---|---|
| 64 GB or less | Not recommended for this MXFP4 bundle. Use a smaller/distilled model or a much smaller GGUF quant. |
| 96 GB | Likely too tight for practical use; may only be useful for partial tensor inspection or loader debugging. |
| 128 GB | Minimum class to try local inference with a memory-efficient native MLX Step3p7 loader, short context, and no other heavy apps. |
| 192 GB | Recommended practical target for local experiments, multimodal use, and moderate context once a compatible loader exists. |
| 256 GB+ | Best target for long-context experiments, server use, larger image batches, and safer headroom. |
Storage recommendation: keep at least 130 GB free for the repository itself, and more if using Hugging Face cache, duplicate downloads, conversion workspaces, or multiple quant variants. For development, 220 GB or more free space is more comfortable.
Context recommendation: Step-3.7-Flash advertises a very long context window, but full-context local inference will be dominated by KV cache memory. Start with short contexts, then raise context length only after measuring memory use on the target Mac. If the runtime exposes KV-cache limits, set them deliberately.
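To make the KV-cache point concrete, here is a rough estimator. It is a sketch that assumes a conventional cache storing one key and one value vector per layer, per KV head, per token; Step3p7's attention layout may differ, so treat the result as an order-of-magnitude guide. The numbers in the example call are hypothetical placeholders, not Step-3.7-Flash's real configuration; read the actual values from config.json.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: int = 2,  # 2 bytes for fp16/bf16 cache entries
) -> int:
    """Estimate KV-cache size for a conventional attention cache."""
    # The leading factor of 2 accounts for storing both keys and values.
    return (
        2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_element
    )


# Hypothetical placeholder values, NOT the real Step-3.7-Flash configuration.
example = kv_cache_bytes(
    num_layers=48, num_kv_heads=8, head_dim=128, context_tokens=32_768
)
print(f"{example / 1024**3:.1f} GiB")
```

Because the estimate scales linearly with `context_tokens`, doubling the context doubles the cache, which is why measuring at short contexts first gives a usable baseline.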
Model Details
| Property | Value |
|---|---|
| Base model | stepfun-ai/Step-3.7-Flash |
| Architecture | Step3p7 sparse MoE vision-language model |
| Parameters | 198B total, about 11B active per token |
| Context length | 256k |
| Vision encoder | 1.8B perception encoder, preserved in 2 vision shards |
| Local profile | MLX-MXFP4 |
| Bundle size | About 101 GB |
| Shards | 24 text safetensors + 2 vision safetensors |
| Source license | Apache-2.0 |
| Validation | HF file diff, safetensors index validation, config metadata validation, MLX tensor sample load |
Quantization Recipe
| Tensor class | Codec | Bits / handling |
|---|---|---|
| Linear, embedding, MoE, and vision .weight tensors with compatible shape | MLX MXFP4 | 4-bit FP4/E2M1, group size 32 |
| Quantized tensor layout | MLX floating-point quantized layout | .weight, .scales |
| Scale format | MX E8M0 | one shared 8-bit scale per group |
| Norms, biases, routing/control tensors, and incompatible tensors | passthrough | source precision preserved |
MXFP4 tensors do not use affine biases. Quantized tensors use:
- .weight
- .scales
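For loader authors, the element and scale formats named in the recipe can be decoded as sketched below. This follows the generic MX format definitions (E2M1 elements, E8M0 power-of-two scales) and is not MLX's kernel code; in particular, it does not cover how the 4-bit codes are ordered inside each packed uint32 word, which should be confirmed against MLX itself. All function names here are hypothetical.

```python
import math

# Magnitudes of the eight non-negative E2M1 codes (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 select the magnitude."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4-bit values in the range 0..15")
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e8m0(scale_byte: int) -> float:
    """Decode one E8M0 scale: an unsigned exponent with bias 127 (0xFF encodes NaN)."""
    if not 0 <= scale_byte <= 0xFF:
        raise ValueError("E8M0 scales are single bytes")
    if scale_byte == 0xFF:
        return math.nan
    return 2.0 ** (scale_byte - 127)


def dequantize_group(codes: list[int], scale_byte: int) -> list[float]:
    """Apply one shared E8M0 scale to a group of already-unpacked E2M1 codes."""
    scale = decode_e8m0(scale_byte)
    return [decode_e2m1(code) * scale for code in codes]
```

Because the scale is a pure power of two and there is no per-group offset, no bias tensor is needed, which is consistent with the absence of .biases in this bundle.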
Conversion summary:
| Metric | Value |
|---|---|
| Quantized weights | 702 |
| Passthrough tensors | 769 |
| Group size | 32 |
| Quantization bits | 4 |
| Mode | mxfp4 |
| Bias tensors | none |
| Effective BPW | About 4.25, including E8M0 scale overhead |
The effective BPW estimate is 4 + 8/32: 4 data bits plus one 8-bit scale per group of 32 values.
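The same arithmetic as a small helper, useful when comparing against other group sizes or scale widths. The function name is hypothetical, and the result covers quantized tensors only; passthrough tensors keep their source precision and are not included.

```python
def effective_bpw(data_bits: int, scale_bits: int, group_size: int) -> float:
    """Bits per weight: data bits plus the shared scale amortized over its group."""
    return data_bits + scale_bits / group_size


# MXFP4 as used in this bundle: 4 data bits, one 8-bit E8M0 scale per 32 values.
print(effective_bpw(data_bits=4, scale_bits=8, group_size=32))  # 4.25
```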
Files
- model-00001.safetensors to model-00024.safetensors: text/model shards in MLX MXFP4 tensor format.
- model-vit-00001.safetensors and model-vit-00002.safetensors: vision encoder shards in MLX MXFP4 tensor format.
- model.safetensors.index.json: rewritten safetensors index for the quantized weight/scale tensors.
- mlx_quantization_manifest.json: conversion manifest with quantized/passthrough tensor counts and tensor-level metadata.
- config.json: upstream config with added MLX quantization metadata.
- configuration_step3p7.py, modeling_step3p7.py, processing_step3.py, vision_encoder.py: upstream custom Step3.7 code.
- tokenizer.json, tokenizer_config.json, special_tokens_map.json, chat_template.jinja: upstream tokenizer and prompt assets.
All 26 safetensors shards are required. The index references every model-*.safetensors shard and both model-vit-*.safetensors shards; none are extra convenience files. A downloader needs the full shard set plus model.safetensors.index.json, config.json, tokenizer files, chat template, and custom Step3.7 Python modules for a complete local model bundle.
mlx_quantization_manifest.json is not required for inference, but it is intentionally included for auditability, debugging, and reproducibility of the MXFP4 conversion.
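After downloading, the shard set can be checked against the index with a short script. This sketch assumes the index follows the usual Hugging Face layout, with a top-level "weight_map" object mapping tensor names to shard filenames; the helper name `missing_shards` and the directory path in the usage note are hypothetical.

```python
import json
from pathlib import Path


def missing_shards(bundle_dir: str) -> list[str]:
    """List shard files referenced by the index that are absent from bundle_dir."""
    bundle = Path(bundle_dir)
    index = json.loads((bundle / "model.safetensors.index.json").read_text())
    # Assumption: "weight_map" maps each tensor name to the shard that stores it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (bundle / name).is_file()]
```

Calling `missing_shards("path/to/Step-3.7-Flash-MXFP4-mlx")` should return an empty list when all 26 shards are present.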
Tensor Inspection
Until Step3p7 support lands in an MLX runtime, use MLX tensor loading for inspection or custom loader development:
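```python
import mlx.core as mx

# Load one text shard as a dict of tensor name -> mx.array.
tensors = mx.load("model-00001.safetensors")
prefix = "model.layers.0.self_attn.q_proj"

# Packed 4-bit weights and their per-group scales.
print(tensors[prefix + ".weight"].shape, tensors[prefix + ".weight"].dtype)
print(tensors[prefix + ".scales"].shape, tensors[prefix + ".scales"].dtype)
# MXFP4 has no affine biases, so this is expected to print False.
print(prefix + ".biases" in tensors)
```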
Representative local verification for that tensor returned:
| Tensor | Shape | Dtype |
|---|---|---|
| model.layers.0.self_attn.q_proj.weight | (8192, 512) | uint32 |
| model.layers.0.self_attn.q_proj.scales | (8192, 128) | uint8 |
| model.layers.0.self_attn.q_proj.biases | absent | none |
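The reported shapes can be cross-checked with simple arithmetic. The sketch below assumes the packed layout stores 32 / 4 = 8 four-bit values per uint32 word along the last axis and one scale per group of 32 values; the helper name `logical_shape` is hypothetical.

```python
def logical_shape(
    weight_shape: tuple[int, int],
    scales_shape: tuple[int, int],
    bits: int = 4,
    group_size: int = 32,
    word_bits: int = 32,
) -> tuple[int, int]:
    """Recover (out_features, in_features) and check weight/scales agree."""
    out_from_weight, packed_cols = weight_shape
    out_from_scales, num_groups = scales_shape
    if out_from_weight != out_from_scales:
        raise ValueError("weight and scales disagree on the number of rows")
    in_from_weight = packed_cols * (word_bits // bits)  # values per packed row
    in_from_scales = num_groups * group_size            # values covered by scales
    if in_from_weight != in_from_scales:
        raise ValueError("weight and scales disagree on the number of columns")
    return out_from_weight, in_from_weight


# Shapes from the verification table above.
print(logical_shape((8192, 512), (8192, 128)))  # (8192, 4096)
```

Both tensors imply the same 4096 input features, so the table is internally consistent with 4-bit packing and group size 32.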
Limitations
- This is a tensor-format MLX MXFP4 conversion, not a complete native Step3p7 MLX inference implementation.
- Current vanilla mlx-lm and mlx-vlm releases need Step3p7 architecture support before this can be used as a normal one-line load/generate model.
- MXFP4 is more compact than the 8-bit affine variant, but quality has not been benchmarked after conversion.
- Multimodal prompt plumbing depends on future Step3p7 loader/runtime support.
- Behavior, benchmark scores, and deployment claims come from the upstream StepFun release; this quantization has not been re-benchmarked.
Credits
Thank you to both sides of this release:
| Role | Credit |
|---|---|
| Quantization & release | osmAPI research team and Terv Student Research Team |
| Foundation model | StepFun, creators of stepfun-ai/Step-3.7-Flash |
License: Apache-2.0, following the upstream StepFun release.