---
license: apache-2.0
base_model: stepfun-ai/Step-3.7-Flash
base_model_relation: quantized
tags:
- gguf
- quantized
- apex
- moe
- step3p7
- vision-language
library_name: gguf
---

# Step-3.7-Flash: APEX GGUF quants

APEX (**A**daptive **P**recision for **EX**pert models) quantizations of [stepfun-ai/Step-3.7-Flash](https://huggingface.co/stepfun-ai/Step-3.7-Flash), a ~198B-parameter vision-language MoE with ~11B active per token (288 routed experts + 1 shared expert, top-8 routing; 3 dense layers + 42 MoE layers; hidden 4096).

## What is APEX?

APEX assigns per-tensor precision along a **layer-wise gradient** rather than using one quant type for the whole model:

- **Edges high, middle low.** The first and last MoE blocks are far more sensitive to quantization noise than the middle of the stack, so APEX keeps the edges at a higher precision (e.g. Q5_K / Q6_K) and progressively drops the middle toward the profile's base type.
- **Dense layers kept at the edge precision.** Step-3.7's three leading dense FFN layers (0–2) are not gated and carry every token, so they are quantized at the same precision as the edge MoE layers rather than being lumped in with the deep experts.
- **Shared expert protected.** The shared expert sees every token in every MoE layer, so its weights (`ffn_*_shexp`) are kept at a high precision (Q6_K or Q8_0 depending on profile) across the whole stack, while the 288 routed experts (`ffn_*_exps`) follow the layer-wise gradient.
- **Attention kept above experts.** Attention weights (Q/K/V/O) are held above the expert precision, since they are reused at every token regardless of routing.

The `I-` variants additionally use an importance matrix (imatrix) computed on a calibration mix, so that quantization noise is preferentially placed in directions the calibration data shows are least active.

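
The layer-wise rules above can be sketched as a small Python function. This is a hypothetical illustration, not the APEX implementation: the edge width, the transition band, the concrete type chosen for each role, and the `blk.N.` tensor-name prefix (following llama.cpp's usual naming) are all assumptions. Only the layer counts and the `ffn_*_exps` / `ffn_*_shexp` suffixes come from this card.

```python
# Illustrative sketch only: thresholds and type choices are assumptions,
# not the values APEX actually uses.
N_LAYERS = 45    # from this card: 3 dense + 42 MoE layers
N_DENSE = 3      # leading dense FFN layers 0-2
EDGE_WIDTH = 4   # hypothetical number of MoE layers per edge


def routed_expert_type(layer, base="Q4_K", mid="Q5_K", edge="Q6_K"):
    """Pick a quant type for the routed experts of one MoE layer."""
    first_moe, last_moe = N_DENSE, N_LAYERS - 1
    # Distance to the nearest end of the MoE stack.
    dist = min(layer - first_moe, last_moe - layer)
    if dist < EDGE_WIDTH:
        return edge              # edges stay at high precision
    if dist < 2 * EDGE_WIDTH:
        return mid               # transition band toward the base type
    return base                  # middle of the stack


def tensor_types(base="Q4_K", mid="Q5_K", edge="Q6_K",
                 shared="Q6_K", attn="Q8_0"):
    """Return a {tensor_name: quant_type} plan for the whole stack."""
    plan = {}
    for i in range(N_LAYERS):
        # Attention is held above the expert precision in every layer.
        for name in ("attn_q", "attn_k", "attn_v", "attn_output"):
            plan[f"blk.{i}.{name}.weight"] = attn
        if i < N_DENSE:
            # Dense FFN layers are treated like the edge MoE layers.
            for proj in ("gate", "up", "down"):
                plan[f"blk.{i}.ffn_{proj}.weight"] = edge
        else:
            expert_type = routed_expert_type(i, base, mid, edge)
            for proj in ("gate", "up", "down"):
                plan[f"blk.{i}.ffn_{proj}_exps.weight"] = expert_type
                # The shared expert is protected across the whole stack.
                plan[f"blk.{i}.ffn_{proj}_shexp.weight"] = shared
    return plan


plan = tensor_types()
for layer in (3, 8, 23, 44):
    print(layer, plan[f"blk.{layer}.ffn_up_exps.weight"])
```

A plan like this would then be written out in whatever format the quantizer expects; that serialization step is deliberately left out here.
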
## Available files

| File | Base | Profile | Size | Notes |
|---|---|---|---:|---|
| `Step-3.7-Flash-APEX-I-Quality.gguf` | Q6_K | quality + imatrix | 123 GB | highest fidelity |
| `Step-3.7-Flash-APEX-I-Balanced.gguf` | Q5_K | balanced + imatrix | 141 GB | recommended |
| `Step-3.7-Flash-APEX-I-Compact.gguf` | Q4_K | compact + imatrix | 90 GB | best size/quality tradeoff |
| `Step-3.7-Flash-APEX-I-Mini.gguf` | Q3_K | mini + imatrix | 73 GB | smallest with imatrix |
| `Step-3.7-Flash-APEX-Quality.gguf` | Q6_K | quality | 123 GB | no-imatrix |
| `Step-3.7-Flash-APEX-Balanced.gguf` | Q5_K | balanced | 141 GB | no-imatrix |
| `Step-3.7-Flash-APEX-Compact.gguf` | Q4_K | compact | 90 GB | no-imatrix |
| `mmproj-step3.7-flash-f16.gguf` | F16 | vision tower | 4.0 GB | mirrored from StepFun, pair with any of the above for VLM use |
| `imatrix.dat` | n/a | n/a | 466 MB | importance matrix (BF16-derived) |

The included `imatrix.dat` is the same matrix used to produce the `I-` files above. It is published so you can apply it yourself to any other quantization of the same BF16 weights without re-running calibration.

The `mmproj` is the vision tower in F16, mirrored as-is from [stepfun-ai/Step-3.7-Flash-GGUF](https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF/blob/main/mmproj-step3.7-flash-f16.gguf). Pair it with any of the language-tower files above (`--mmproj mmproj-step3.7-flash-f16.gguf` in llama.cpp) to run image+text inference.

## How these were built

- **Source weights:** [stepfun-ai/Step-3.7-Flash-GGUF (BF16)](https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF/tree/main/BF16), the upstream BF16 GGUF (9 shards).
- **Importance matrix:** computed on the BF16 GGUF via `llama-imatrix` on a calibration mix and published here as `imatrix.dat`.
- **Quantization:** per-tensor precision targets emitted by APEX, applied via `llama-quantize --tensor-type-file` (with `--imatrix imatrix.dat` for the `I-` variants).

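
As a rough sketch of the quantization step listed above, the snippet below assembles a `llama-quantize` command line in Python. The flags `--tensor-type-file` and `--imatrix` are the ones named in this card; the input path and the tensor-type file name are hypothetical placeholders, and the positional order (input, output, base type) assumes the usual `llama-quantize` usage. The command is only built and printed, not executed.

```python
import shlex


def quantize_command(src, dst, base_type, tensor_type_file, imatrix=None):
    """Build an argv list for llama-quantize (nothing is run here)."""
    cmd = ["llama-quantize", "--tensor-type-file", tensor_type_file]
    if imatrix is not None:
        # Only the I- variants pass an importance matrix.
        cmd += ["--imatrix", imatrix]
    # Positional arguments: input GGUF, output GGUF, base quant type.
    cmd += [src, dst, base_type]
    return cmd


cmd = quantize_command(
    src="Step-3.7-Flash-BF16.gguf",       # hypothetical local path to the BF16 GGUF
    dst="Step-3.7-Flash-APEX-I-Compact.gguf",
    base_type="Q4_K",
    tensor_type_file="apex-compact.txt",  # hypothetical file of per-tensor targets
    imatrix="imatrix.dat",
)
print(shlex.join(cmd))
```

Passing `imatrix=None` yields the command shape for the non-`I-` files. The resulting list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.
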
Because the imatrix is keyed by tensor name, it is portable across any GGUF that comes from the same converted BF16 weights; no GPU is needed to reproduce these quants from the published BF16 GGUF.

## Architecture notes

Step-3.7-Flash's HF config (`Step3p7ForConditionalGeneration`) is mapped to the existing `STEP35` arch at GGUF level: the convert script registers the Step-3.7 architecture and tokenizer hash, while runtime/quantize uses the standard `step35` compute graph.

- Layers: 45 (3 dense FFN + 42 MoE)
- Hidden: 4096
- Experts: 288 routed (top-8) + 1 shared
- Total params: ~198B
- Active params per token: ~11B
- Modality: vision-language (mmproj included)

## License

Inherits the upstream Apache-2.0 license from `stepfun-ai/Step-3.7-Flash`.