---
license: apache-2.0
language:
- en
library_name: mlx
pipeline_tag: image-text-to-text
base_model: stepfun-ai/Step-3.7-Flash
base_model_relation: quantized
tags:
- mlx
- safetensors
- stepfun
- step3p7
- step-3.7-flash
- vision-language
- multimodal
- moe
- quantized
- mxfp4
- microscaling
- fp4
- apple-silicon
- image-text-to-text
- text-generation
---

# Step-3.7-Flash-MXFP4-mlx

`osmapi/Step-3.7-Flash-MXFP4-mlx` is an Apple-Silicon MLX MXFP4 tensor-format quantization of [`stepfun-ai/Step-3.7-Flash`](https://huggingface.co/stepfun-ai/Step-3.7-Flash).

No fine-tuning, distillation, or retraining was applied. The upstream StepFun checkpoint was downloaded and verified locally, then eligible text and vision `.weight` tensors were converted with MLX `mode="mxfp4"` quantization. Tokenizer, chat template, custom Step3.7 Python modules, and non-quantized control tensors are preserved from the source release.

## Compatibility Status

This upload is a standard MLX MXFP4 safetensors bundle, but it is not yet a drop-in `mlx_lm.load(...)` or `mlx_vlm.load(...)` model.

At conversion time, vanilla `mlx-lm 0.31.3` and `mlx-vlm 0.5.0` did not register `model_type: step3p7`. Upstream `mlx-lm` had Step3p5 support, but no Step3p7 model class was found in the installed packages, current upstream trees, open PR search, Hugging Face community page, Reddit LocalLLaMA discussion, or X/Twitter search.

This repository is intended for MLX runtime authors, loader implementers, and researchers who want a verified MXFP4 Step-3.7-Flash tensor bundle. Native inference will require Step3p7 model-class support in MLX/MLX-LM/MLX-VLM or a compatible custom loader.
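One quick way to tell whether an installed runtime has gained Step3p7 support is to look for a model module named after the `model_type`. The sketch below assumes that `mlx-lm` resolves architectures by importing `mlx_lm.models.<model_type>` and that `mlx-vlm` follows the same pattern under `mlx_vlm.models`. The module name `step3p7` is taken from this repository's `model_type`, and the helper `has_model_module` is hypothetical. A runtime may also remap model names internally, so treat a negative result as a hint, not proof.

```python
import importlib.util


def has_model_module(models_package: str, model_type: str) -> bool:
    """Return True if `<models_package>.<model_type>` can be located."""
    try:
        spec = importlib.util.find_spec(f"{models_package}.{model_type}")
    except ImportError:
        # The parent package (for example mlx_lm) is missing or fails to import.
        return False
    return spec is not None


if __name__ == "__main__":
    # Assumed package layout; adjust if the runtime organizes models differently.
    for package in ("mlx_lm.models", "mlx_vlm.models"):
        status = "found" if has_model_module(package, "step3p7") else "not found"
        print(f"{package}.step3p7: {status}")
```

Running this after each runtime upgrade is cheaper than attempting a full load of a bundle this large.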
## Mac Inference Guidance

The best current uses of this repository on Apple Silicon are:

| Use case | Recommended path | Status |
|---|---|---|
| Tensor inspection and loader development | Direct MLX tensor loading with `mlx.core.load` | Works for inspecting shards and validating MXFP4 tensors |
| Native text inference | `mlx-lm` with a future or custom Step3p7 model implementation | Best MLX target once Step3p7 support exists |
| Native vision-language inference | `mlx-vlm` with a future or custom Step3p7 model implementation | Best Mac multimodal target once Step3p7 support exists |
| App-style local chat | A Mac app or server backed by the above custom MLX loader | Requires Step3p7 loader work |
| `llama.cpp`, LM Studio, Ollama, Jan, or similar GGUF apps | Use a GGUF Step-3.7-Flash quant instead of this MLX repo | This MXFP4 MLX tensor bundle is not GGUF |

In short: this is the right artifact for MLX implementers and high-memory Apple-Silicon experiments. For a normal point-and-click local chat app today, users should wait for Step3p7 support in MLX tooling or use a GGUF build made specifically for llama.cpp-compatible runtimes.

Useful upstream/runtime references:

- MLX: https://github.com/ml-explore/mlx
- MLX-LM: https://github.com/ml-explore/mlx-lm
- MLX-VLM: https://github.com/Blaizzy/mlx-vlm
- StepFun base model: https://huggingface.co/stepfun-ai/Step-3.7-Flash

## Mac Memory And Storage Recommendations

This MXFP4 bundle is about 101 GB on disk. Runtime memory is not just file size: the loader, KV cache, vision inputs, tokenizer/processor state, temporary tensors, OS memory pressure, and any dequantization or unsupported fallback path can add substantial overhead.

| Mac unified memory | Recommendation |
|---:|---|
| 64 GB or less | Not recommended for this MXFP4 bundle. Use a smaller/distilled model or a much smaller GGUF quant. |
| 96 GB | Likely too tight for practical use; may only be useful for partial tensor inspection or loader debugging. |
| 128 GB | Minimum class to try local inference with a memory-efficient native MLX Step3p7 loader, short context, and no other heavy apps. |
| 192 GB | Recommended practical target for local experiments, multimodal use, and moderate context once a compatible loader exists. |
| 256 GB+ | Best target for long-context experiments, server use, larger image batches, and safer headroom. |

Storage recommendation: keep at least 130 GB free for the repository itself, and more if using Hugging Face cache, duplicate downloads, conversion workspaces, or multiple quant variants. For development, 220 GB or more free space is more comfortable.

Context recommendation: Step-3.7-Flash advertises a very long context window, but full-context local inference will be dominated by KV cache memory. Start with short contexts, then raise context length only after measuring memory use on the target Mac. If the runtime exposes KV-cache limits, set them deliberately.

## Model Details

| Property | Value |
|---|---|
| Base model | `stepfun-ai/Step-3.7-Flash` |
| Architecture | Step3p7 sparse MoE vision-language model |
| Parameters | 198B total, about 11B active per token |
| Context length | 256k |
| Vision encoder | 1.8B perception encoder, preserved in 2 vision shards |
| Local profile | `MLX-MXFP4` |
| Bundle size | About 101 GB |
| Shards | 24 text safetensors + 2 vision safetensors |
| Source license | Apache-2.0 |
| Validation | HF file diff, safetensors index validation, config metadata validation, MLX tensor sample load |

## Quantization Recipe

| Tensor class | Codec | Bits / handling |
|---|---:|---|
| Linear, embedding, MoE, and vision `.weight` tensors with compatible shape | MLX MXFP4 | 4-bit FP4/E2M1, group size 32 |
| Quantized tensor layout | MLX floating-point quantized layout | `.weight`, `.scales` |
| Scale format | MX E8M0 | one shared 8-bit scale per group |
| Norms, biases, routing/control tensors, and incompatible tensors | passthrough | source precision preserved |
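The recipe in the table above can be illustrated with a small pure-Python decoder for one MXFP4 group. This is a sketch of the number format as defined by the OCP Microscaling specification, not of MLX's kernels. It assumes E2M1 codes map to the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6 with the top bit as sign, and that an E8M0 scale byte `e` means `2**(e - 127)`. All function names are hypothetical, and the way MLX packs eight 4-bit codes into each `uint32` word is deliberately left out.

```python
# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 32  # values sharing one scale, per the recipe table


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (0..15) to a float."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e8m0(scale_byte: int) -> float:
    """Decode an E8M0 scale byte to a power of two (255 is reserved for NaN)."""
    if not 0 <= scale_byte <= 254:
        raise ValueError("expected an E8M0 byte in 0..254")
    return 2.0 ** (scale_byte - 127)


def dequantize_group(codes: list[int], scale_byte: int) -> list[float]:
    """Dequantize one group of FP4 codes that share a single scale."""
    if len(codes) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} codes per group")
    scale = decode_e8m0(scale_byte)
    return [decode_e2m1(code) * scale for code in codes]


if __name__ == "__main__":
    # Four non-zero codes followed by zeros, with a scale of 2**(128 - 127).
    codes = [0x1, 0x7, 0x9, 0xF] + [0x0] * 28
    print(dequantize_group(codes, 128)[:4])
```

Because the scale is a pure power of two, dequantization only shifts the exponent of each FP4 value; there is no zero point or offset to apply.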
MXFP4 tensors do not use affine biases. Quantized tensors use:

- `.weight`
- `.scales`

Conversion summary:

| Metric | Value |
|---|---:|
| Quantized weights | 702 |
| Passthrough tensors | 769 |
| Group size | 32 |
| Quantization bits | 4 |
| Mode | `mxfp4` |
| Bias tensors | none |
| Effective BPW | About 4.25, including E8M0 scale overhead |

The effective BPW estimate is `4 + 8/32`: 4 data bits plus one 8-bit scale per group of 32 values.

## Files

- `model-00001.safetensors` to `model-00024.safetensors`: text/model shards in MLX MXFP4 tensor format.
- `model-vit-00001.safetensors` and `model-vit-00002.safetensors`: vision encoder shards in MLX MXFP4 tensor format.
- `model.safetensors.index.json`: rewritten safetensors index for the quantized weight/scale tensors.
- `mlx_quantization_manifest.json`: conversion manifest with quantized/passthrough tensor counts and tensor-level metadata.
- `config.json`: upstream config with added MLX quantization metadata.
- `configuration_step3p7.py`, `modeling_step3p7.py`, `processing_step3.py`, `vision_encoder.py`: upstream custom Step3.7 code.
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `chat_template.jinja`: upstream tokenizer and prompt assets.

All 26 safetensors shards are required. The index references every `model-*.safetensors` shard and both `model-vit-*.safetensors` shards; none are extra convenience files. A downloader needs the full shard set plus `model.safetensors.index.json`, `config.json`, tokenizer files, chat template, and custom Step3.7 Python modules for a complete local model bundle.

`mlx_quantization_manifest.json` is not required for inference, but it is intentionally included for auditability, debugging, and reproducibility of the MXFP4 conversion.
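The completeness requirement above can be checked before any loader work begins. The sketch below assumes `model.safetensors.index.json` follows the usual Hugging Face sharded-checkpoint layout, with a top-level `weight_map` from tensor name to shard filename. The helper name `missing_shards` and the directory argument are hypothetical.

```python
import json
import sys
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """Return shard filenames named by the index but absent from model_dir."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # Assumed layout: {"weight_map": {"<tensor name>": "<shard file>", ...}}
    referenced = set(index["weight_map"].values())
    return sorted(name for name in referenced if not (root / name).is_file())


if __name__ == "__main__":
    model_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    index_path = Path(model_dir) / "model.safetensors.index.json"
    if not index_path.is_file():
        print(f"no index found in {model_dir}")
    else:
        missing = missing_shards(model_dir)
        print("all shards present" if not missing else f"missing: {missing}")
```

This only confirms that the files exist; it does not verify their sizes or hashes, so an interrupted download can still pass.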
## Tensor Inspection

Until Step3p7 support lands in an MLX runtime, use MLX tensor loading for inspection or custom loader development:

```python
import mlx.core as mx

# Load one text shard as a dict mapping tensor names to mx.array values.
tensors = mx.load("model-00001.safetensors")
prefix = "model.layers.0.self_attn.q_proj"

# Packed FP4 codes and their per-group E8M0 scales.
print(tensors[prefix + ".weight"].shape, tensors[prefix + ".weight"].dtype)
print(tensors[prefix + ".scales"].shape, tensors[prefix + ".scales"].dtype)
# MXFP4 stores no affine biases, so this key should be absent.
print(prefix + ".biases" in tensors)
```

Representative local verification for that tensor returned:

| Tensor | Shape | Dtype |
|---|---:|---|
| `model.layers.0.self_attn.q_proj.weight` | `(8192, 512)` | `uint32` |
| `model.layers.0.self_attn.q_proj.scales` | `(8192, 128)` | `uint8` |
| `model.layers.0.self_attn.q_proj.biases` | absent | none |

## Limitations

- This is a tensor-format MLX MXFP4 conversion, not a complete native Step3p7 MLX inference implementation.
- Current vanilla `mlx-lm` and `mlx-vlm` releases need Step3p7 architecture support before this can be used as a normal one-line load/generate model.
- MXFP4 is more compact than the 8-bit affine variant, but quality has not been benchmarked after conversion.
- Multimodal prompt plumbing depends on future Step3p7 loader/runtime support.
- Behavior, benchmark scores, and deployment claims come from the upstream StepFun release; this quantization has not been re-benchmarked.

## Credits

Thank you to both sides of this release:

| | |
|---|---|
| **Quantization & release** | [osmAPI](https://osmAPI.com) research team and [Terv Student Research Team](https://terv.pro) |
| **Foundation model** | [StepFun](https://huggingface.co/stepfun-ai), creators of [`stepfun-ai/Step-3.7-Flash`](https://huggingface.co/stepfun-ai/Step-3.7-Flash) |

License: Apache-2.0, following the upstream StepFun release.