---
license: other
license_name: prism-research
license_link: LICENSE.md
language:
- en
- zh
tags:
- glm4
- prism
- moe
pipeline_tag: text-generation
library_name: transformers
---
# GLM-4.7-Flash-PRISM
A version of [ZAI's GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) with over-refusal and propaganda/bias mechanisms removed using our Advanced PRISM Pipeline.
<div align="center">
### ☕ Support Our Work
If you find this model useful, consider supporting us on Ko-fi!
[**Support us on Ko-fi**](https://ko-fi.com/ericelbaz)
| Option | Description |
|--------|-------------|
| [**PRISM VIP Membership**](https://ko-fi.com/summary/6bae206c-a751-4868-8dc7-f531afd1fb4c) | Access to all PRISM models |
| [**One-Time Support**](https://ko-fi.com/s/86882e8991) | Support this model |
</div>
---
## Model Highlights
- **PRISM Ablation** — State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
- **30B-A3B MoE Architecture** — 30 billion total parameters with ~3 billion active per token for fast, efficient inference
- **128K Context Window** — Extended context for complex tasks and large codebases
- **Interleaved Thinking** — Multi-turn reasoning that persists across conversations with per-turn thinking control
## Benchmarks
Scores below are reported for the base GLM-4.7-Flash model, compared against similarly sized open models.
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
|-----------|---------------|-----------------------------| ------------|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
## Usage
### Transformers
Install the latest transformers from source:
```shell
pip install git+https://github.com/huggingface/transformers.git
```
Run inference:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/GLM-4.7-Flash-PRISM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.decode(
    generated_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(output_text)
```
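Per-turn thinking can typically be toggled through the chat template. A minimal sketch, assuming the template exposes an `enable_thinking` kwarg as in other GLM releases (check this model's chat template for the exact name):
```python
# Hypothetical: disable the thinking block for a single turn.
# `enable_thinking` is assumed from other GLM chat templates and may be
# named differently in this model's template.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=False,  # answer directly, without a reasoning block
).to(model.device)
```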
### vLLM
Install vLLM nightly:
```shell
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
```
Serve the model:
```shell
vllm serve Ex0bit/GLM-4.7-Flash-PRISM \
--tensor-parallel-size 4 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-flash-prism
```
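Once the server is up, it exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch using the `openai` package, with the base URL and served model name taken from the command above:
```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; no real key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.7-flash-prism",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```
The same client works against the SGLang server below, since it also exposes an OpenAI-compatible API on port 8000.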
### SGLang
Install SGLang:
```shell
uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 --extra-index-url https://sgl-project.github.io/whl/pr/
uv pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa
```
Launch the server:
```shell
python3 -m sglang.launch_server \
--model-path Ex0bit/GLM-4.7-Flash-PRISM \
--tp-size 4 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.8 \
--served-model-name glm-4.7-flash-prism \
--host 0.0.0.0 \
--port 8000
```
> **Note:** For Blackwell GPUs, add `--attention-backend triton --speculative-draft-attention-backend triton` to your SGLang launch command.
## Recommended Parameters
| Use Case | Temperature | Top-P | Max New Tokens |
|----------|-------------|-------|----------------|
| Default | 1.0 | 0.95 | 131072 |
| Code (SWE-bench) | 0.7 | 1.0 | 16384 |
| Agentic Tasks | 0.0 | — | 16384 |
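As a sketch, the "Default" row above maps directly onto `generate` kwargs in the Transformers example (sampling must be enabled for temperature and top-p to take effect):
```python
# "Default" row from the table above, applied to the Transformers example.
generated_ids = model.generate(
    **inputs,
    do_sample=True,       # required for temperature/top_p sampling
    temperature=1.0,
    top_p=0.95,
    max_new_tokens=131072,
)
```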
## License
This model is released under the [PRISM Research License](LICENSE.md).
## Citation
```bibtex
@misc{elbaz2025glm47flashprism,
  author       = {Elbaz, Eric},
  title        = {GLM-4.7-Flash-PRISM: An Over-Refusal-Free GLM-4.7-Flash Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/GLM-4.7-Flash-PRISM}}
}
```
## Acknowledgments
Based on [GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) by [Z.AI](https://z.ai). See the [technical report](https://arxiv.org/abs/2508.06471) for more details on the base model.