license: apache-2.0
language:
  - en
library_name: mlx
pipeline_tag: image-text-to-text
base_model: stepfun-ai/Step-3.7-Flash
base_model_relation: quantized
tags:
  - mlx
  - safetensors
  - stepfun
  - step3p7
  - step-3.7-flash
  - vision-language
  - multimodal
  - moe
  - quantized
  - mxfp4
  - microscaling
  - fp4
  - apple-silicon
  - image-text-to-text
  - text-generation

Step-3.7-Flash-MXFP4-mlx

osmapi/Step-3.7-Flash-MXFP4-mlx is an Apple-Silicon MLX MXFP4 tensor-format quantization of stepfun-ai/Step-3.7-Flash.

No fine-tuning, distillation, or retraining was applied. The upstream StepFun checkpoint was downloaded and verified locally, then eligible text and vision .weight tensors were converted with MLX mode="mxfp4" quantization. Tokenizer, chat template, custom Step3.7 Python modules, and non-quantized control tensors are preserved from the source release.

Compatibility Status

This upload is a standard MLX MXFP4 safetensors bundle, but it is not yet a drop-in mlx_lm.load(...) or mlx_vlm.load(...) model.

At conversion time, vanilla mlx-lm 0.31.3 and mlx-vlm 0.5.0 did not register model_type: step3p7. Upstream mlx-lm supported Step3p5, but a search of the installed packages, the current upstream trees, open pull requests, the Hugging Face community page, the Reddit LocalLLaMA discussion, and X/Twitter turned up no Step3p7 model class.

This repository is intended for MLX runtime authors, loader implementers, and researchers who want a verified MXFP4 Step-3.7-Flash tensor bundle. Native inference will require Step3p7 model-class support in MLX/MLX-LM/MLX-VLM or a compatible custom loader.
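
The compatibility check described above can be approximated in a few lines. This is a minimal sketch, assuming the runtime resolves architectures by importing a module named `<package>.models.<model_type>`, which is how recent mlx-lm releases appear to work; `read_model_type` and `runtime_has_model` are hypothetical helper names, not part of any MLX package.

```python
import importlib.util
import json
from pathlib import Path


def read_model_type(config_path):
    """Return the `model_type` field from a Hugging Face style config.json."""
    with open(config_path, encoding="utf-8") as f:
        return json.load(f).get("model_type")


def runtime_has_model(model_type, package="mlx_lm"):
    """Return True if `<package>.models.<model_type>` can be located.

    Assumption: the runtime looks up model classes by module name under
    `<package>.models`. A missing or broken package counts as "no support".
    """
    try:
        spec = importlib.util.find_spec(f"{package}.models.{model_type}")
    except (ImportError, ValueError):
        return False
    return spec is not None


if __name__ == "__main__":
    config = Path("config.json")  # path inside a local copy of this repository
    if config.exists():
        model_type = read_model_type(config)
        for package in ("mlx_lm", "mlx_vlm"):
            print(package, "supports", model_type, "->",
                  runtime_has_model(model_type, package))
```

A `False` result only means no module of that name was found; a runtime could still support the architecture through a remapping table or a custom loader.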

Mac Inference Guidance

The best current uses of this repository on Apple Silicon are:

Use case	Recommended path	Status
Tensor inspection and loader development	Direct MLX tensor loading with `mlx.core.load`	Works for inspecting shards and validating MXFP4 tensors
Native text inference	`mlx-lm` with a future or custom Step3p7 model implementation	Best MLX target once Step3p7 support exists
Native vision-language inference	`mlx-vlm` with a future or custom Step3p7 model implementation	Best Mac multimodal target once Step3p7 support exists
App-style local chat	A Mac app or server backed by the above custom MLX loader	Requires Step3p7 loader work
`llama.cpp`, LM Studio, Ollama, Jan, or similar GGUF apps	Use a GGUF Step-3.7-Flash quant instead of this MLX repo	This MXFP4 MLX tensor bundle is not GGUF

In short: this is the right artifact for MLX implementers and high-memory Apple-Silicon experiments. For a normal point-and-click local chat app today, users should wait for Step3p7 support in MLX tooling or use a GGUF build made specifically for llama.cpp-compatible runtimes.

Useful upstream/runtime references:

MLX: https://github.com/ml-explore/mlx
MLX-LM: https://github.com/ml-explore/mlx-lm
MLX-VLM: https://github.com/Blaizzy/mlx-vlm
StepFun base model: https://huggingface.co/stepfun-ai/Step-3.7-Flash

Mac Memory And Storage Recommendations

This MXFP4 bundle is about 101 GB on disk. Runtime memory is not just file size: the loader, KV cache, vision inputs, tokenizer/processor state, temporary tensors, OS memory pressure, and any dequantization or unsupported fallback path can add substantial overhead.

Mac unified memory	Recommendation
64 GB or less	Not recommended for this MXFP4 bundle. Use a smaller/distilled model or a much smaller GGUF quant.
96 GB	Likely too tight for practical use; may only be useful for partial tensor inspection or loader debugging.
128 GB	Minimum class to try local inference with a memory-efficient native MLX Step3p7 loader, short context, and no other heavy apps.
192 GB	Recommended practical target for local experiments, multimodal use, and moderate context once a compatible loader exists.
256 GB+	Best target for long-context experiments, server use, larger image batches, and safer headroom.

Storage recommendation: keep at least 130 GB free for the repository itself, and more if using Hugging Face cache, duplicate downloads, conversion workspaces, or multiple quant variants. For development, 220 GB or more free space is more comfortable.
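
A quick pre-download check against the 130 GB figure can be sketched with the standard library. `has_free_space` is a hypothetical helper, and it uses decimal gigabytes (10^9 bytes) on the assumption that the sizes above are decimal too.

```python
import shutil


def has_free_space(path=".", required_gb=130):
    """Return (ok, free_gb) for the volume that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9  # decimal GB
    return free_gb >= required_gb, free_gb


ok, free_gb = has_free_space(".", required_gb=130)
print(f"{free_gb:.0f} GB free; enough for the bundle: {ok}")
```

Point `path` at the volume that holds the Hugging Face cache if downloads go there rather than to the working directory.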

Context recommendation: Step-3.7-Flash advertises a very long context window, but full-context local inference will be dominated by KV cache memory. Start with short contexts, then raise context length only after measuring memory use on the target Mac. If the runtime exposes KV-cache limits, set them deliberately.
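
To see why context length dominates, the rough upper bound below estimates KV-cache size for standard full attention. It is a sketch only: the layer count, KV-head count, and head dimension in the example are hypothetical placeholders, not the real Step-3.7-Flash values, and any sliding-window or compressed-cache layers in the actual architecture would need less memory than this.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Upper-bound KV-cache size in bytes for standard full attention.

    The leading factor 2 accounts for storing both keys and values.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_value)


# Hypothetical placeholder dimensions, NOT taken from the Step3p7 config.
# Read the real values from config.json before trusting any number.
example = dict(num_layers=48, num_kv_heads=8, head_dim=128)

for seq_len in (8_192, 32_768, 262_144):
    gib = kv_cache_bytes(seq_len=seq_len, **example) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
```

With these placeholder dimensions and 16-bit cache entries, the estimate grows linearly from 1.5 GiB at 8,192 tokens to 48 GiB at 262,144 tokens, on top of the weights themselves.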

Model Details

Property	Value
Base model	`stepfun-ai/Step-3.7-Flash`
Architecture	Step3p7 sparse MoE vision-language model
Parameters	198B total, about 11B active per token
Context length	256k
Vision encoder	1.8B perception encoder, preserved in 2 vision shards
Local profile	`MLX-MXFP4`
Bundle size	About 101 GB
Shards	24 text safetensors + 2 vision safetensors
Source license	Apache-2.0
Validation	HF file diff, safetensors index validation, config metadata validation, MLX tensor sample load

Quantization Recipe

Tensor class	Codec	Bits / handling
Linear, embedding, MoE, and vision `.weight` tensors with compatible shape	MLX MXFP4	4-bit FP4/E2M1, group size 32
Quantized tensor layout	MLX floating-point quantized layout	`.weight`, `.scales`
Scale format	MX E8M0	one shared 8-bit scale per group
Norms, biases, routing/control tensors, and incompatible tensors	passthrough	source precision preserved

MXFP4 tensors do not use affine biases. Quantized tensors use:

.weight
.scales
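
For loader authors, decoding a single value can be sketched as below. This follows the E2M1 and E8M0 definitions from the OCP Microscaling format. The order in which MLX packs the eight 4-bit codes inside each uint32 word is not described in this card, so the sketch deliberately decodes one code at a time; `decode_e8m0` and `decode_mxfp4` are hypothetical helper names.

```python
# E2M1 magnitudes indexed by the low 3 bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e8m0(scale_byte):
    """Decode an E8M0 shared scale: a power of two with exponent bias 127."""
    if scale_byte == 0xFF:  # reserved encoding for NaN
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def decode_mxfp4(code, scale_byte):
    """Decode one 4-bit E2M1 code using its group's E8M0 scale."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111] * decode_e8m0(scale_byte)


print(decode_mxfp4(0b0111, 127))  # largest magnitude at scale 2**0
print(decode_mxfp4(0b1001, 128))  # negative 0.5 at scale 2**1
```

Because the scale is a pure power of two, there is no zero point to add back, which is why no `.biases` tensor accompanies the quantized weights.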

Conversion summary:

Metric	Value
Quantized weights	702
Passthrough tensors	769
Group size	32
Quantization bits	4
Mode	`mxfp4`
Bias tensors	none
Effective BPW	About 4.25, including E8M0 scale overhead

The effective BPW estimate is 4 + 8/32: 4 data bits plus one 8-bit scale per group of 32 values.
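
The same arithmetic extends to a per-tensor size estimate. The sketch below is illustrative; `effective_bpw` and `quantized_bytes` are hypothetical helpers, and the 8192 x 4096 example shape is inferred from the packed shapes reported under Tensor Inspection rather than stated by the upstream config.

```python
def effective_bpw(data_bits=4, scale_bits=8, group_size=32):
    """Bits per weight, including the shared per-group scale."""
    return data_bits + scale_bits / group_size


def quantized_bytes(num_values, data_bits=4, scale_bits=8, group_size=32):
    """Storage for `num_values` quantized weights, in bytes."""
    return num_values * effective_bpw(data_bits, scale_bits, group_size) / 8


print(effective_bpw())               # bits per weight for MXFP4
print(quantized_bytes(8192 * 4096))  # one projection of the inferred shape
```

Passthrough tensors keep their source precision, so the whole-bundle average is higher than the per-tensor figure.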

Files

model-00001.safetensors to model-00024.safetensors: text/model shards in MLX MXFP4 tensor format.
model-vit-00001.safetensors and model-vit-00002.safetensors: vision encoder shards in MLX MXFP4 tensor format.
model.safetensors.index.json: rewritten safetensors index for the quantized weight/scale tensors.
mlx_quantization_manifest.json: conversion manifest with quantized/passthrough tensor counts and tensor-level metadata.
config.json: upstream config with added MLX quantization metadata.
configuration_step3p7.py, modeling_step3p7.py, processing_step3.py, vision_encoder.py: upstream custom Step3.7 code.
tokenizer.json, tokenizer_config.json, special_tokens_map.json, chat_template.jinja: upstream tokenizer and prompt assets.

All 26 safetensors shards are required. The index references every model-*.safetensors shard and both model-vit-*.safetensors shards; none are extra convenience files. A downloader needs the full shard set plus model.safetensors.index.json, config.json, tokenizer files, chat template, and custom Step3.7 Python modules for a complete local model bundle.
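
A local download can be checked for completeness against the index. This is a minimal sketch, assuming the rewritten index keeps the usual Hugging Face layout in which a `weight_map` object maps each tensor name to its shard filename; `missing_shards` is a hypothetical helper.

```python
import json
from pathlib import Path


def missing_shards(model_dir, index_name="model.safetensors.index.json"):
    """Return shard filenames named in the index but absent from `model_dir`."""
    model_dir = Path(model_dir)
    with open(model_dir / index_name, encoding="utf-8") as f:
        index = json.load(f)
    # Assumption: "weight_map" maps tensor name -> shard filename.
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (model_dir / name).is_file()]


if __name__ == "__main__":
    index_path = Path("model.safetensors.index.json")
    if index_path.exists():
        missing = missing_shards(".")
        print("missing shards:", missing or "none")
```

An empty result means every shard the index references is present; it does not verify file integrity, which still needs a hash or size comparison against the Hub.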

mlx_quantization_manifest.json is not required for inference, but it is intentionally included for auditability, debugging, and reproducibility of the MXFP4 conversion.

Tensor Inspection

Until Step3p7 support lands in an MLX runtime, use MLX tensor loading for inspection or custom loader development:

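import mlx.core as mx

# Load one shard; mx.load returns a dict mapping tensor names to arrays.
tensors = mx.load("model-00001.safetensors")
prefix = "model.layers.0.self_attn.q_proj"

# Packed FP4 data and the per-group E8M0 scales.
print(tensors[prefix + ".weight"].shape, tensors[prefix + ".weight"].dtype)
print(tensors[prefix + ".scales"].shape, tensors[prefix + ".scales"].dtype)
# MXFP4 has no affine biases, so this should print False.
print((prefix + ".biases") in tensors)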

Representative local verification for that tensor returned:

Tensor	Shape	Dtype
`model.layers.0.self_attn.q_proj.weight`	`(8192, 512)`	`uint32`
`model.layers.0.self_attn.q_proj.scales`	`(8192, 128)`	`uint8`
`model.layers.0.self_attn.q_proj.biases`	absent	none
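
The reported shapes are consistent with eight 4-bit codes per `uint32` word and one scale per 32 values. The sketch below makes that relationship explicit so a custom loader can validate tensors before use. It assumes packing runs along the last axis, and the logical input width of 4096 is inferred from the packed shapes, not read from the upstream config; both helper names are hypothetical.

```python
def packed_shapes(out_features, in_features, bits=4, group_size=32, word_bits=32):
    """Expected (.weight, .scales) shapes for a quantized linear layer."""
    values_per_word = word_bits // bits  # 8 codes per uint32 at 4 bits
    if in_features % values_per_word or in_features % group_size:
        raise ValueError("in_features must divide evenly into words and groups")
    weight_shape = (out_features, in_features // values_per_word)
    scales_shape = (out_features, in_features // group_size)
    return weight_shape, scales_shape


def logical_in_features(weight_shape, bits=4, word_bits=32):
    """Recover the unpacked input width from a packed .weight shape."""
    return weight_shape[-1] * (word_bits // bits)


print(logical_in_features((8192, 512)))  # inferred unpacked width
print(packed_shapes(8192, 4096))         # should match the table above
```

A mismatch between the two shapes for any tensor pair suggests a different group size, bit width, or a passthrough tensor that was not quantized.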

Limitations

This is a tensor-format MLX MXFP4 conversion, not a complete native Step3p7 MLX inference implementation.
Current vanilla mlx-lm and mlx-vlm releases need Step3p7 architecture support before this can be used as a normal one-line load/generate model.
MXFP4 is more compact than the 8-bit affine variant, but quality has not been benchmarked after conversion.
Multimodal prompt plumbing depends on future Step3p7 loader/runtime support.
Behavior, benchmark scores, and deployment claims come from the upstream StepFun release; this quantization has not been re-benchmarked.

Credits

Thank you to both sides of this release:

Role	Contributor
Quantization & release	osmAPI research team and Terv Student Research Team
Foundation model	StepFun, creators of `stepfun-ai/Step-3.7-Flash`

License: Apache-2.0, following the upstream StepFun release.