Step-3.5-Flash-PRISM-PRO-GGUF
AVAILABLE TO PRISM VIP MEMBERS ONLY: https://ko-fi.com/summary/1adf92d9-84f3-45c7-8136-8ab0ca39090d
GGUF quantization of Step-3.5-Flash-PRISM-PRO for local deployment with llama.cpp and compatible runtimes.
Full production, unrestricted/unchained PRISM-PRO version of StepFun's Step 3.5 Flash, intended specifically for full removal of over-refusal and propaganda mechanisms using the SOTA PRISM-PRO pipeline.
For full custom production PRISM versions and tensors, reach out.
Free PRISM-Lite version available here: https://hf.co/Ex0bit/Step-3.5-Flash-PRISM
Full precision safetensors: https://hf.co/Ex0bit/Step-3.5-Flash-PRISM-PRO
Support Our Work
If you enjoy our work and find it useful, please consider sponsoring or supporting us!
| Option | Description |
|---|---|
| PRISM VIP Membership | Access to all PRISM models |
| Bitcoin | bc1qarq2pyn4psjpcxzp2ghgwaq6y2h4e53q232x8r |
Quantization Details
| File | Quantization | Size |
|---|---|---|
| Step-3.5-Flash-PRISM-PRO-IQ4_NL.gguf | IQ4_NL | ~111.5 GB |
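As a rough sanity check on the file size, the sketch below multiplies the 196 billion total parameters listed under Model Highlights by an assumed average of 4.5 bits per weight for IQ4_NL. That bits-per-weight figure is an assumption (the nominal value commonly quoted for this quantization type), not something stated in this repo, and real files tend to come out somewhat larger because some tensors are stored at higher precision.

```python
# Rough GGUF size estimate: parameters x bits per weight / 8 bits per byte.
TOTAL_PARAMS = 196e9            # total parameters, from the model card
ASSUMED_BITS_PER_WEIGHT = 4.5   # ASSUMPTION: nominal IQ4_NL average, not from this repo


def estimate_size_gb(params: float, bits_per_weight: float) -> float:
    """Return the estimated file size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


estimated_gb = estimate_size_gb(TOTAL_PARAMS, ASSUMED_BITS_PER_WEIGHT)
print(f"Estimated size: {estimated_gb:.2f} GB")
```

Under that assumption the estimate lands at about 110 GB, which is in line with the ~111.5 GB listed for the IQ4_NL file.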
Quick Start (llama.cpp)
./llama-cli -m Step-3.5-Flash-PRISM-PRO-IQ4_NL.gguf --jinja
Model Highlights
- PRISM Ablation — State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
- 196B MoE Architecture — 196 billion total parameters with only 11 billion active per token across 288 fine-grained routed experts + 1 shared expert
- Multi-Token Prediction (MTP-3) — Predicts 4 tokens simultaneously, achieving 100-300 tok/s typical throughput (peaking at 350 tok/s)
- 256K Context Window — Cost-efficient long context via 3:1 Sliding Window Attention (SWA) ratio
- Frontier Reasoning & Coding — 97.3 on AIME 2025, 74.4% on SWE-bench Verified, 51.0% on Terminal-Bench 2.0
- Accessible Local Deployment — Runs on high-end consumer hardware (Mac Studio M4 Max, NVIDIA DGX Spark)
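To put the MoE figures above in proportion, the short calculation below works out what share of the weights is active for each token. It uses only the two parameter counts quoted in the list; the variable names are illustrative.

```python
# Share of parameters active per token in the MoE architecture.
TOTAL_PARAMS_B = 196    # billions of total parameters, from the model card
ACTIVE_PARAMS_B = 11    # billions of parameters active per token, from the model card

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active per token: {active_fraction:.1%}")
```

Roughly 5.6% of the weights take part in each token's forward pass. Compute per token therefore scales with the 11B active parameters, while memory still has to hold all 196B, since any expert may be routed to.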
Hardware Requirements
| Setup | Details |
|---|---|
| GGUF INT4 (This Repo) | ~120 GB unified memory (Mac Studio M4 Max 128GB, DGX Spark, AMD Ryzen AI Max+ 395) |
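The figures in this card imply a fairly tight memory budget. The sketch below subtracts the listed file size from the stated requirement, and the requirement from a 128 GB machine. Reading the first gap as room for the KV cache and runtime buffers is an assumption on our part, not a breakdown given by this repo.

```python
# Memory budget implied by the numbers in this card (decimal GB throughout).
FILE_SIZE_GB = 111.5         # IQ4_NL file size, from the quantization table
REQUIRED_MEMORY_GB = 120.0   # stated requirement, from the hardware table
MACHINE_MEMORY_GB = 128.0    # e.g. a 128GB Mac Studio M4 Max, from the hardware table

# ASSUMPTION: this gap is what remains for KV cache and runtime buffers.
runtime_headroom_gb = REQUIRED_MEMORY_GB - FILE_SIZE_GB
# What is left for the OS and other applications on a 128 GB machine.
system_headroom_gb = MACHINE_MEMORY_GB - REQUIRED_MEMORY_GB

print(f"Beyond the weights: {runtime_headroom_gb:.1f} GB")
print(f"Left for the system: {system_headroom_gb:.1f} GB")
```

With only about 8 GB left for everything else on a 128 GB machine, it is likely worth closing other memory-heavy applications and keeping the context size modest at first.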
Recommended Parameters
| Use Case | Temperature | Top-P | Max New Tokens |
|---|---|---|---|
| Agentic / Reasoning / Coding | 1.0 | 0.95 | 32768 |
| General Chat | 0.6 | 0.95 | 4096 |
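One way to apply the values in this table is to turn them into `llama-cli` arguments. The helper below is a minimal sketch: the function name and preset keys are hypothetical, and it assumes your llama.cpp build accepts `--temp`, `--top-p`, and `-n` for temperature, top-p, and maximum new tokens (check `./llama-cli --help` to confirm).

```python
# Map the recommended parameters to llama-cli arguments.
# The preset names and build_llama_cli_args are illustrative, not part of this repo.
PRESETS = {
    "agentic": {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 32768},
    "chat": {"temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096},
}


def build_llama_cli_args(use_case, model_path="Step-3.5-Flash-PRISM-PRO-IQ4_NL.gguf"):
    """Return the llama-cli argument list for one of the presets above."""
    preset = PRESETS[use_case]
    return [
        "./llama-cli", "-m", model_path, "--jinja",
        "--temp", str(preset["temperature"]),      # ASSUMED flag: sampling temperature
        "--top-p", str(preset["top_p"]),           # ASSUMED flag: nucleus sampling cutoff
        "-n", str(preset["max_new_tokens"]),       # ASSUMED flag: max tokens to generate
    ]


print(" ".join(build_llama_cli_args("chat")))
```

The returned list can be passed directly to `subprocess.run`, or joined into a single string and pasted into a shell as shown by the `print` call.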
License
This model is released under the PRISM Research License.
Acknowledgments
Based on Step 3.5 Flash by StepFun AI. See the technical report and blog post for more details on the base model.