	---
	license: other
	license_name: prism-research
	license_link: LICENSE.md
	language:
	- en
	- zh
	tags:
	- stepfun
	- prism
	- moe
	- reasoning
	- coding
	- agentic
	- abliterated
	pipeline_tag: text-generation
	library_name: transformers
	base_model:
	- stepfun-ai/Step-3.5-Flash
	base_model_relation: finetune
	---

	[![Parameters](https://img.shields.io/badge/Parameters-196B_(11B_Active)-blue)]()
	[![Architecture](https://img.shields.io/badge/Architecture-MoE-green)]()
	[![Context](https://img.shields.io/badge/Context-256K-orange)]()
	[![MTP](https://img.shields.io/badge/MTP--3-350_tok%2Fs_Peak-purple)]()


	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/63adf1fa42fd3b8dbaeb0c92/NkmQvQUXzckiRb8U__203.png" width="400"/>
	</p>

	# Step-3.5-Flash-PRISM

	An unrestricted PRISM-LITE version of [StepFun's Step 3.5 Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) that follows role-play instructions. It is intended primarily to suppress over-refusal and propaganda mechanisms, and was produced with our state-of-the-art PRISM pipeline.

	For full custom production PRISM versions and tensors, reach out.
	<div align="center">

	### ☕ Support Our Work

	If you enjoy our work and find it useful, please consider sponsoring or supporting us!

	[![Ko-fi](https://img.shields.io/badge/Ko--fi-Support%20Us-ff5e5b?logo=ko-fi&logoColor=white)](https://ko-fi.com/ericelbaz)

	| Option | Description |
	|--------|-------------|
	| [PRISM VIP Membership](https://ko-fi.com/summary/6bae206c-a751-4868-8dc7-f531afd1fb4c) | Access to all PRISM models |
	| Bitcoin | `bc1qarq2pyn4psjpcxzp2ghgwaq6y2h4e53q232x8r` |

	![image](https://cdn-uploads.huggingface.co/production/uploads/63adf1fa42fd3b8dbaeb0c92/Psgbl1TgyDok__C7AMQog.png)

	</div>

	---

	## Model Highlights

	- PRISM Ablation: A state-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
	- 196B MoE Architecture: 196 billion total parameters, with only about 11 billion active per token, across 288 fine-grained routed experts plus 1 shared expert
	- Multi-Token Prediction (MTP-3): Predicts 4 tokens per pass, achieving 100–300 tok/s typical throughput (peaking at 350 tok/s)
	- 256K Context Window: Cost-efficient long context via hybrid attention with a 3:1 ratio of Sliding Window Attention (SWA) to full attention
	- Frontier Reasoning & Coding: 97.3 on AIME 2025, 74.4 on SWE-bench Verified, 51.0 on Terminal-Bench 2.0
	- Accessible Local Deployment: Runs on high-end consumer hardware (Mac Studio M4 Max, NVIDIA DGX Spark)

	## Model Architecture

	| Specification | Value |
	|---------------|-------|
	| Architecture | Sparse Mixture-of-Experts (MoE) |
	| Backbone | 45-layer Transformer (4,096 hidden dim) |
	| Total Parameters | 196.81B (196B Backbone + 0.81B Head) |
	| Activated Parameters | ~11B (per token) |
	| Routed Experts per Layer | 288 |
	| Shared Experts | 1 (always active) |
	| Selected Experts per Token | Top-8 |
	| Vocabulary Size | 128,896 |
	| Context Length | 256K |
	| Attention | Hybrid SWA (3:1 SWA-to-Full ratio) |
	| MTP Head | Sliding-window attention + dense FFN (4 tokens/pass) |
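
To make the sparsity in the table concrete, the sketch below computes what share of the parameters and of the experts in one MoE layer is active for a single token. The numbers are copied from the table; the constant and function names are illustrative only and are not part of any released code.

```python
# Figures copied from the Model Architecture table above.
TOTAL_PARAMS_B = 196.81   # total parameters, in billions
ACTIVE_PARAMS_B = 11.0    # approximate activated parameters per token, in billions
ROUTED_EXPERTS = 288      # routed experts per layer
SHARED_EXPERTS = 1        # shared expert, always active
TOP_K = 8                 # routed experts selected per token


def active_param_fraction() -> float:
    """Share of all parameters used to process a single token."""
    return ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def active_expert_fraction() -> float:
    """Share of the experts in one MoE layer that run for a single token."""
    return (TOP_K + SHARED_EXPERTS) / (ROUTED_EXPERTS + SHARED_EXPERTS)


if __name__ == "__main__":
    print(f"active parameters: {active_param_fraction():.1%}")
    print(f"active experts per layer: {active_expert_fraction():.1%}")
```

With these figures, roughly 5.6% of the parameters and about 3.1% of the experts in a layer are used per token.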

	## Benchmarks

	| Benchmark | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 | GLM-4.7 | MiniMax M2.1 |
	|-----------|----------------|---------------|-----------|---------|--------------|
	| Agent | | | | | |
	| τ²-Bench | 88.2 | 80.3 | 85.4 | 87.4 | 86.6 |
	| BrowseComp | 51.6 | 51.4 | 60.6 | 52.0 | 47.4 |
	| GAIA (no file) | 84.5 | 75.1 | 75.9 | 61.9 | 64.3 |
	| xbench-DeepSearch (2025.05) | 83.7 | 78.0 | 76.7 | 72.0 | 68.7 |
	| Reasoning | | | | | |
	| AIME 2025 | 97.3 | 93.1 | 96.1 | 95.7 | 83.0 |
	| HMMT 2025 (Feb.) | 98.4 | 92.5 | 95.4 | 97.1 | 71.0 |
	| IMOAnswerBench | 85.4 | 78.3 | 81.8 | 82.0 | 60.4 |
	| Coding | | | | | |
	| LiveCodeBench-V6 | 86.4 | 83.3 | 85.0 | 84.9 | N/A |
	| SWE-bench Verified | 74.4 | 73.1 | 76.8 | 73.8 | 74.0 |
	| Terminal-Bench 2.0 | 51.0 | 46.4 | 50.8 | 41.0 | 47.9 |
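
For readers who want to compare the columns programmatically, here is a minimal sketch that finds the top-scoring model on each benchmark. The scores are transcribed from the table above; the helper names `leader` and `benchmarks_led_by` are hypothetical and exist only for this illustration.

```python
# Scores transcribed from the Benchmarks table above; None marks a missing entry.
MODELS = ["Step 3.5 Flash", "DeepSeek V3.2", "Kimi K2.5", "GLM-4.7", "MiniMax M2.1"]
SCORES = {
    "τ²-Bench": [88.2, 80.3, 85.4, 87.4, 86.6],
    "BrowseComp": [51.6, 51.4, 60.6, 52.0, 47.4],
    "GAIA (no file)": [84.5, 75.1, 75.9, 61.9, 64.3],
    "xbench-DeepSearch (2025.05)": [83.7, 78.0, 76.7, 72.0, 68.7],
    "AIME 2025": [97.3, 93.1, 96.1, 95.7, 83.0],
    "HMMT 2025 (Feb.)": [98.4, 92.5, 95.4, 97.1, 71.0],
    "IMOAnswerBench": [85.4, 78.3, 81.8, 82.0, 60.4],
    "LiveCodeBench-V6": [86.4, 83.3, 85.0, 84.9, None],
    "SWE-bench Verified": [74.4, 73.1, 76.8, 73.8, 74.0],
    "Terminal-Bench 2.0": [51.0, 46.4, 50.8, 41.0, 47.9],
}


def leader(benchmark: str) -> str:
    """Name of the model with the highest listed score on a benchmark."""
    row = SCORES[benchmark]
    # Skip missing entries, then pick the column index with the largest score.
    candidates = [i for i, score in enumerate(row) if score is not None]
    best_index = max(candidates, key=lambda i: row[i])
    return MODELS[best_index]


def benchmarks_led_by(model: str) -> list:
    """Benchmarks on which the given model has the highest listed score."""
    return [name for name in SCORES if leader(name) == model]


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: {leader(name)}")
```

On the listed numbers, Step 3.5 Flash has the highest score on every benchmark except BrowseComp and SWE-bench Verified, where Kimi K2.5 leads.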


	### llama.cpp (GGUF)

	For local deployment (the INT4 quant requires ~120 GB of VRAM or unified memory; smaller quants are available):

	```bash
	./llama-cli -m step3.5_flash_prism_Q4_K_S.gguf --jinja
	```

	## Recommended Parameters

	| Use Case | Temperature | Top-P | Max New Tokens |
	|----------|-------------|-------|----------------|
	| Reasoning / Coding | 1.0 | 0.95 | 32768 |
	| General Chat | 0.6 | 0.95 | 4096 |
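
One way to apply these settings is to keep them as presets and turn them into a `llama-cli` invocation. The sketch below is only an illustration: the preset keys and the `llama_cli_command` helper are hypothetical, and it assumes your llama.cpp build accepts the `--temp`, `--top-p`, and `-n` sampling flags alongside the `-m` and `--jinja` flags used earlier in this README.

```python
import shlex
from typing import List

# Values copied from the Recommended Parameters table above.
PRESETS = {
    "reasoning": {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 32768},
    "chat": {"temperature": 0.6, "top_p": 0.95, "max_new_tokens": 4096},
}


def llama_cli_command(
    use_case: str, model_path: str = "step3.5_flash_prism_Q4_K_S.gguf"
) -> List[str]:
    """Build an argument list for llama-cli from one of the presets.

    Assumes llama-cli accepts --temp, --top-p and -n (max tokens to generate).
    """
    preset = PRESETS[use_case]
    return [
        "./llama-cli", "-m", model_path, "--jinja",
        "--temp", str(preset["temperature"]),
        "--top-p", str(preset["top_p"]),
        "-n", str(preset["max_new_tokens"]),
    ]


if __name__ == "__main__":
    # Print a shell-ready command for the General Chat preset.
    print(shlex.join(llama_cli_command("chat")))
```

Returning a list rather than a string makes it straightforward to pass the command to `subprocess.run` without shell quoting issues.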

	## Hardware Requirements

	| Setup | Details |
	|-------|---------|
	| BF16 (Full) | 8x H100/A100 80GB with tensor parallelism |
	| FP8 Quantized | 8x A100 80GB with expert parallelism |
	| GGUF INT4 (Local) | ~120 GB unified memory (Mac Studio M4 Max 128GB, DGX Spark, AMD Ryzen AI Max+ 395) |
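
As a rough sanity check on these setups, the sketch below estimates the memory taken by the weights alone at each precision. It is a lower bound under simplifying assumptions: nominal bytes per parameter (2 for BF16, 1 for FP8, 0.5 for INT4), and no allowance for the KV cache, activations, or quantization metadata, so real requirements are higher (for example, the ~120 GB quoted above for INT4). The function names are illustrative.

```python
TOTAL_PARAMS = 196.81e9  # total parameters, from the Model Architecture table

# Nominal storage per parameter; real quantized formats add scales and metadata.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(precision: str) -> float:
    """Lower-bound weight footprint in GB (10^9 bytes) at a given precision."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def weights_fit(precision: str, num_devices: int, gb_per_device: float) -> bool:
    """True if the weights alone fit in the combined device memory."""
    return weight_memory_gb(precision) <= num_devices * gb_per_device


if __name__ == "__main__":
    for name in BYTES_PER_PARAM:
        print(f"{name}: ~{weight_memory_gb(name):.0f} GB of weights")
    print("bf16 on 8x 80GB:", weights_fit("bf16", 8, 80))
```

Under these assumptions the BF16 weights come to about 394 GB, which is why a multi-GPU node is listed for the full-precision setup.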

	## License

	This model is released under the [PRISM Research License](LICENSE.md).

	## Acknowledgments

	Based on [Step 3.5 Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash) by [StepFun AI](https://www.stepfun.com). See the [technical report](https://github.com/stepfun-ai/Step-3.5-Flash/blob/main/step_3p5_flash_tech_report.pdf) and [blog post](https://static.stepfun.com/blog/step-3.5-flash/) for more details on the base model.