---
license: other
license_name: prism-research
license_link: LICENSE.md
language:
- en
- zh
tags:
- stepfun
- prism
- moe
- reasoning
- coding
- agentic
- abliterated
pipeline_tag: text-generation
library_name: transformers
base_model:
- stepfun-ai/Step-3.5-Flash
base_model_relation: finetune
---
# Step-3.5-Flash-PRISM
An unrestricted, role-play-friendly PRISM-LITE version of StepFun's Step 3.5 Flash, aimed specifically at suppressing over-refusal and propaganda mechanisms via our SOTA PRISM pipeline.

For full custom production PRISM versions & tensors, reach out.
## ☕ Support Our Work
If you enjoy our work and find it useful, please consider sponsoring or supporting us!
| Option | Description |
|---|---|
| PRISM VIP Membership | Access to all PRISM models |
| Bitcoin | bc1qarq2pyn4psjpcxzp2ghgwaq6y2h4e53q232x8r |
## Model Highlights
- PRISM Ablation — State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
- 196B MoE Architecture — 196 billion total parameters with only 11 billion active per token across 288 fine-grained routed experts + 1 shared expert
- Multi-Token Prediction (MTP-3) — Predicts 4 tokens simultaneously, achieving 100–300 tok/s typical throughput (peaking at 350 tok/s)
- 256K Context Window — Cost-efficient long context via 3:1 Sliding Window Attention (SWA) ratio
- Frontier Reasoning & Coding — 97.3 on AIME 2025, 74.4% on SWE-bench Verified, 51.0% on Terminal-Bench 2.0
- Accessible Local Deployment — Runs on high-end consumer hardware (Mac Studio M4 Max, NVIDIA DGX Spark)
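The throughput gain from multi-token prediction can be illustrated with standard speculative-decoding arithmetic. This is a back-of-envelope sketch, not StepFun's method: the acceptance rate is an assumed figure, and the function name is ours.

```python
# Illustrative MTP-3 arithmetic (assumed numbers, not from StepFun's report):
# each forward pass emits 1 verified token plus up to 3 draft tokens, where
# draft token i survives only if all earlier drafts were accepted.
def expected_tokens_per_pass(draft_len: int = 3, accept: float = 0.8) -> float:
    # Contribution of draft token i is accept ** (i + 1) under an
    # assumed independent per-token acceptance probability `accept`.
    return 1 + sum(accept ** (i + 1) for i in range(draft_len))

print(round(expected_tokens_per_pass(), 3))  # 2.952 tokens per pass at accept=0.8
```

With perfect acceptance this reaches the 4-tokens-per-pass ceiling quoted for the MTP head; real speedups depend on how often the draft tokens are verified.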
## Model Architecture
| Specification | Value |
|---|---|
| Architecture | Sparse Mixture-of-Experts (MoE) |
| Backbone | 45-layer Transformer (4,096 hidden dim) |
| Total Parameters | 196.81B (196B Backbone + 0.81B Head) |
| Activated Parameters | ~11B (per token) |
| Routed Experts per Layer | 288 |
| Shared Experts | 1 (always active) |
| Selected Experts per Token | Top-8 |
| Vocabulary Size | 128,896 |
| Context Length | 256K |
| Attention | Hybrid SWA (3:1 SWA-to-Full ratio) |
| MTP Head | Sliding-window attention + dense FFN (4 tokens/pass) |
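The expert-selection numbers in the table can be made concrete with a toy gating sketch. The gate weights below are random and the function names are illustrative, not the model's actual code: each token scores all 288 routed experts, keeps the top 8, and renormalizes their probabilities (the shared expert is always active on top of these).

```python
import numpy as np

N_EXPERTS = 288   # routed experts per layer (from the table)
TOP_K = 8         # experts selected per token

def route(token_hidden, gate_weights):
    """Toy top-k gating: score all routed experts, keep the top 8,
    and renormalize their probabilities with a softmax."""
    logits = gate_weights @ token_hidden           # one score per expert, (288,)
    top = np.argsort(logits)[-TOP_K:]              # indices of the top-8 experts
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                           # renormalize over the selected 8
    return top, probs

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)                   # toy hidden state
gate = rng.standard_normal((N_EXPERTS, 64))        # toy gate projection
experts, weights = route(hidden, gate)
print(len(experts), round(weights.sum(), 6))       # 8 1.0
```

Because only 8 of 288 routed experts (plus the shared expert) run per token, the activated parameter count stays near 11B despite the 196B total.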
## Benchmarks
| Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 | GLM-4.7 | MiniMax M2.1 |
|---|---|---|---|---|---|
| **Agent** | | | | | |
| τ²-Bench | 88.2 | 80.3 | 85.4 | 87.4 | 86.6 |
| BrowseComp | 51.6 | 51.4 | 60.6 | 52.0 | 47.4 |
| GAIA (no file) | 84.5 | 75.1 | 75.9 | 61.9 | 64.3 |
| xbench-DeepSearch (2025.05) | 83.7 | 78.0 | 76.7 | 72.0 | 68.7 |
| **Reasoning** | | | | | |
| AIME 2025 | 97.3 | 93.1 | 96.1 | 95.7 | 83.0 |
| HMMT 2025 (Feb.) | 98.4 | 92.5 | 95.4 | 97.1 | 71.0 |
| IMOAnswerBench | 85.4 | 78.3 | 81.8 | 82.0 | 60.4 |
| **Coding** | | | | | |
| LiveCodeBench-V6 | 86.4 | 83.3 | 85.0 | 84.9 | — |
| SWE-bench Verified | 74.4 | 73.1 | 76.8 | 73.8 | 74.0 |
| Terminal-Bench 2.0 | 51.0 | 46.4 | 50.8 | 41.0 | 47.9 |
## llama.cpp (GGUF)
For local deployment (requires ~120 GB of memory for INT4; smaller quants are available):

```shell
./llama-cli -m step3.5_flash_prism_Q4_K_S.gguf --jinja
```
## Recommended Parameters
| Use Case | Temperature | Top-P | Max New Tokens |
|---|---|---|---|
| Reasoning / Coding | 1.0 | 0.95 | 32768 |
| General Chat | 0.6 | 0.95 | 4096 |
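The table above maps directly onto `transformers` generation kwargs. A minimal helper might look like the following sketch; the preset names are our own, while the keys (`temperature`, `top_p`, `max_new_tokens`, `do_sample`) are standard `generate()` arguments.

```python
# Sampling presets mirroring the "Recommended Parameters" table.
# Preset names are illustrative; keys follow transformers' generate() kwargs.
PRESETS = {
    "reasoning": {"do_sample": True, "temperature": 1.0, "top_p": 0.95, "max_new_tokens": 32768},
    "chat":      {"do_sample": True, "temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096},
}

def sampling_params(use_case: str) -> dict:
    """Return a copy so callers can tweak a preset without mutating the table."""
    return dict(PRESETS[use_case])

# Usage: model.generate(**inputs, **sampling_params("chat"))
```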
## Hardware Requirements
| Setup | Details |
|---|---|
| BF16 (Full) | 8x H100/A100 80GB with tensor parallelism |
| FP8 Quantized | 8x A100 80GB with expert parallelism |
| GGUF INT4 (Local) | ~120 GB unified memory (Mac Studio M4 Max 128GB, DGX Spark, AMD Ryzen AI Max+ 395) |
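The ~120 GB INT4 figure is consistent with simple arithmetic. The bits-per-weight value below is a rough assumption for Q4_K_S-class quants, and KV cache plus runtime overhead come on top of the raw weight size.

```python
TOTAL_PARAMS = 196.81e9  # from the architecture table

def gguf_weight_gb(bits_per_weight: float = 4.5) -> float:
    """Approximate in-memory weight size for a GGUF quant.
    4.5 bits/weight is a rough average for Q4_K_S-class quants (assumption)."""
    return TOTAL_PARAMS * bits_per_weight / 8 / 1e9

print(round(gguf_weight_gb(), 1))  # 110.7 GB of weights; ~120 GB with KV cache/overhead
```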
## License
This model is released under the PRISM Research License.
## Acknowledgments
Based on Step 3.5 Flash by StepFun AI. See the technical report and blog post for more details on the base model.
