Step-3.7-Flash-OptiQ-3.7bpw-mlx

osmapi/Step-3.7-Flash-OptiQ-3.7bpw-mlx is a mixed-precision OptiQ quantization of stepfun-ai/Step-3.7-Flash, stored in MLX affine tensor format for Apple Silicon.

No fine-tuning, distillation, or retraining was applied. The upstream StepFun checkpoint was downloaded and verified locally. OptiQ stream/Frobenius sensitivity was used to allocate mixed bit widths at a 3.7 BPW target, then eligible text and vision .weight tensors were converted with MLX affine quantization. Tokenizer, chat template, custom Step3.7 Python modules, and non-quantized control tensors are preserved from the source release.

Compatibility Status

This upload is a standard MLX affine safetensors bundle, but it is not yet a drop-in mlx_lm.load(...) or mlx_vlm.load(...) model.

At conversion time, vanilla mlx-lm 0.31.3 and mlx-vlm 0.5.0 did not register model_type: step3p7. This repository is therefore intended for MLX runtime authors, loader implementers, and researchers who want a verified Step-3.7-Flash OptiQ tensor bundle. Native inference will require Step3p7 model-class support in MLX/MLX-LM/MLX-VLM or a compatible custom loader.
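A quick way to check whether an installed runtime has gained support is to look for a model module named after the model_type. The sketch below assumes that mlx-lm and mlx-vlm resolve architectures as submodules of mlx_lm.models and mlx_vlm.models, so that support would show up as mlx_lm.models.step3p7. That layout is an assumption about those packages, and the helper name has_model_module is hypothetical.

```python
import importlib.util


def has_model_module(package: str, model_type: str) -> bool:
    """Return True if `<package>.<model_type>` can be located.

    Note: locating a submodule imports its parent packages.
    """
    try:
        return importlib.util.find_spec(f"{package}.{model_type}") is not None
    except ImportError:
        # The parent package is missing or failed to import.
        return False


if __name__ == "__main__":
    for package in ("mlx_lm.models", "mlx_vlm.models"):
        print(package, has_model_module(package, "step3p7"))
```

A result of False for both packages matches the status described above. A result of True only shows that a module with that name exists, not that it can load this particular bundle.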

Model Details

| Property | Value |
| --- | --- |
| Base model | stepfun-ai/Step-3.7-Flash |
| Architecture | Step3p7 sparse MoE vision-language model |
| Parameters | 198B total, about 11B active per token |
| Context length | 256k |
| Vision encoder | 1.8B perception encoder, preserved in 2 vision shards |
| Local profile | MLX-OptiQ-Affine-3.7bpw |
| Bundle size | About 99 GB |
| Shards | 24 text safetensors + 2 vision safetensors |
| Source license | Apache-2.0 |
| Validation | Safetensors index validation, config metadata validation, manifest validation, MLX tensor sample loads |

Quantization Recipe

| Tensor class | Codec | Bits / handling |
| --- | --- | --- |
| Eligible text and vision .weight tensors | MLX affine | OptiQ-assigned 3, 4, or 8 bits, group size 64 |
| Quantized tensor layout | MLX triplet | .weight, .scales, .biases |
| Norms, biases, routing/control tensors, and incompatible tensors | passthrough | source precision preserved |
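To illustrate what an affine codec stores per group, the sketch below quantizes one group of weights to integer codes plus a scale and a bias, then reconstructs them as w ≈ scale * q + bias. It is a simplified pure-Python model that assumes a min/max range and round-to-nearest. MLX's own kernels may handle rounding and edge cases differently, and both function names are hypothetical.

```python
def affine_quantize_group(values, bits):
    """Quantize one group to integer codes in [0, 2**bits - 1] plus (scale, bias)."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Assumption: the bias is the group minimum and the scale spans min..max.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def affine_dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from codes, scale, and bias."""
    return [scale * q + bias for q in codes]


if __name__ == "__main__":
    group = [-1.0 + 2.0 * i / 63 for i in range(64)]  # one group of 64 weights
    for bits in (3, 4, 8):
        codes, scale, bias = affine_quantize_group(group, bits)
        restored = affine_dequantize_group(codes, scale, bias)
        worst = max(abs(a - b) for a, b in zip(group, restored))
        print(f"{bits}-bit: max abs error {worst:.4f}")
```

In this model the reconstruction error per weight is bounded by half a quantization step, which is why lower bit widths are typically reserved for tensors that a sensitivity measure ranks as more tolerant.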

OptiQ allocation summary:

| Metric | Value |
| --- | --- |
| Target BPW | 3.7 |
| Achieved BPW | 3.6930459517285583 |
| Allocation method | optiq_stream_frobenius |
| Candidate bits | 2, 3, 4, 8 |
| Quantized weights | 702 |
| Passthrough tensors | 769 |
| Group size | 64 |
| 3-bit allocations | 66 |
| 4-bit allocations | 62 |
| 8-bit allocations | 574 |
| Missing allocations | 0 |

The achieved BPW counts only the bit widths that OptiQ allocated to the quantized weights. It excludes the MLX affine scale/bias tensors, passthrough tensors, tokenizer/config/custom code, and index metadata, all of which add to the on-disk bundle size.
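The arithmetic below sketches how per-group metadata raises the effective cost of a quantized weight. It assumes every group of 64 weights carries one 16-bit scale and one 16-bit bias, which matches the bfloat16 dtypes reported for the sample tensor in this card but has not been checked for every tensor. Passthrough tensors are ignored.

```python
ACHIEVED_BPW = 3.6930459517285583     # from the allocation summary
GROUP_SIZE = 64                       # from the allocation summary
ALLOCATIONS = {3: 66, 4: 62, 8: 574}  # bit width -> number of tensors
SCALE_BITS = 16                       # assumption: bfloat16 scales
BIAS_BITS = 16                        # assumption: bfloat16 biases

# Sanity check: the per-width counts should add up to the quantized total.
quantized_tensors = sum(ALLOCATIONS.values())

# Mean bit width when every tensor counts equally, regardless of its size.
mean_bits_by_count = sum(b * n for b, n in ALLOCATIONS.items()) / quantized_tensors

# Scale and bias storage amortized over the weights in one group.
overhead = (SCALE_BITS + BIAS_BITS) / GROUP_SIZE
effective = ACHIEVED_BPW + overhead

print(f"quantized tensors: {quantized_tensors}")
print(f"mean bits by tensor count: {mean_bits_by_count:.2f}")
print(f"metadata overhead: {overhead} bits per weight")
print(f"effective BPW over quantized weights: {effective:.3f}")
```

Under these assumptions each quantized weight costs roughly 4.19 bits on disk. The mean by tensor count (about 7.18 bits) sits far above the achieved 3.69 BPW, which suggests the achieved figure is weighted by tensor size, with the largest tensors receiving the lower bit widths.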

Files

  • model-00001.safetensors to model-00024.safetensors: text/model shards in MLX affine mixed-precision tensor format.
  • model-vit-00001.safetensors and model-vit-00002.safetensors: vision encoder shards in MLX affine mixed-precision tensor format.
  • model.safetensors.index.json: rewritten safetensors index for quantized triplet tensors.
  • optiq_allocation.json: OptiQ per-layer bit allocation.
  • mlx_quantization_manifest.json: conversion manifest with quantized/passthrough tensor counts and tensor-level metadata.
  • config.json: upstream config with added MLX OptiQ quantization metadata.
  • configuration_step3p7.py, modeling_step3p7.py, processing_step3.py, vision_encoder.py: upstream custom Step3.7 code.
  • tokenizer.json, tokenizer_config.json, special_tokens_map.json, chat_template.jinja: upstream tokenizer and prompt assets.
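For loader development, one practical first step is to split the index into quantized triplets and passthrough tensors. The sketch below assumes model.safetensors.index.json follows the usual Hugging Face layout, with a top-level weight_map from tensor name to shard file. The helper name split_triplets is hypothetical.

```python
import json
from pathlib import Path

TRIPLET_SUFFIXES = (".weight", ".scales", ".biases")


def split_triplets(weight_map):
    """Split tensor names into quantized prefixes and passthrough names.

    A prefix counts as quantized when `.weight`, `.scales`, and `.biases`
    are all present for it.
    """
    names = set(weight_map)
    candidates = {n[: -len(".scales")] for n in names if n.endswith(".scales")}
    quantized = sorted(
        p for p in candidates if all(p + s in names for s in TRIPLET_SUFFIXES)
    )
    members = {p + s for p in quantized for s in TRIPLET_SUFFIXES}
    passthrough = sorted(names - members)
    return quantized, passthrough


if __name__ == "__main__":
    index_path = Path("model.safetensors.index.json")
    if index_path.exists():
        weight_map = json.loads(index_path.read_text())["weight_map"]
        quantized, passthrough = split_triplets(weight_map)
        print(len(quantized), "quantized prefixes;", len(passthrough), "passthrough tensors")
```

The resulting counts can be compared against the quantized and passthrough totals in the allocation summary and in mlx_quantization_manifest.json, keeping in mind that the manifest may define passthrough differently.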

Tensor Inspection

Until Step3p7 support lands in an MLX runtime, use MLX tensor loading for inspection or custom loader development:

```python
import mlx.core as mx

# Load one shard as a dict mapping tensor names to mx.array values.
tensors = mx.load("model-00002.safetensors")
prefix = "model.layers.3.moe.down_proj"

# Print the shape and dtype of each member of the quantized triplet.
for suffix in (".weight", ".scales", ".biases"):
    tensor = tensors[prefix + suffix]
    print(prefix + suffix, tensor.shape, tensor.dtype)
```

Representative local validation for that 3-bit tensor returned:

| Tensor | Shape | Dtype |
| --- | --- | --- |
| model.layers.3.moe.down_proj.weight | (288, 4096, 120) | uint32 |
| model.layers.3.moe.down_proj.scales | (288, 4096, 20) | bfloat16 |
| model.layers.3.moe.down_proj.biases | (288, 4096, 20) | bfloat16 |
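The shapes above are enough to cross-check the bit width. Assuming quantized codes are packed into 32-bit words along the last axis and one scale is stored per group of 64 input elements, the last dimensions of .weight and .scales determine both the unpacked input width and the bits per weight. The helper name infer_bits is hypothetical.

```python
def infer_bits(packed_cols, group_cols, group_size=64, word_bits=32):
    """Infer (bits, input_dim) from the last dims of .weight and .scales."""
    input_dim = group_cols * group_size   # unpacked elements per row
    total_bits = packed_cols * word_bits  # stored code bits per row
    if total_bits % input_dim:
        raise ValueError("shapes are inconsistent with the assumed packing")
    return total_bits // input_dim, input_dim


bits, input_dim = infer_bits(packed_cols=120, group_cols=20)
print(bits, input_dim)  # 3 1280
```

Here 20 groups of 64 give 1280 input elements per row, and 120 words of 32 bits give 3840 stored bits, so each weight occupies 3 bits, consistent with the 3-bit allocation for this tensor.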

Limitations

  • This is a tensor-format MLX affine mixed-precision conversion, not a complete native Step3p7 MLX inference implementation.
  • Current vanilla mlx-lm and mlx-vlm releases need Step3p7 architecture support before this can be used as a normal one-line load/generate model.
  • The OptiQ allocation has not been benchmarked for downstream quality after conversion.
  • Multimodal prompt plumbing depends on future Step3p7 loader/runtime support.
  • Behavior, benchmark scores, and deployment claims are those of the upstream StepFun release and may not carry over to this quantization.

Credits

Thank you to both sides of this release:

Quantization & release: osmAPI research team and Terv Student Research Team
Foundation model: StepFun, creators of stepfun-ai/Step-3.7-Flash

License: Apache-2.0, following the upstream StepFun release.
