	---
	license: other
	license_name: prism-research
	license_link: LICENSE.md
	language:
	- en
	- zh
	tags:
	- glm4
	- prism
	- moe
	pipeline_tag: text-generation
	library_name: transformers
	---

	[![Parameters](https://img.shields.io/badge/Parameters-30B--A3B_MoE-blue)]()
	[![Architecture](https://img.shields.io/badge/Architecture-GLM--4.7-green)]()
	[![Context](https://img.shields.io/badge/Context-128K-orange)]()

	# GLM-4.7-Flash-PRISM

	A version of [Z.AI's GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) with its over-refusal and bias (propaganda) mechanisms completely removed using our Advanced PRISM Pipeline.

	<div align="center">

	### ☕ Support Our Work

	If you find this model useful, consider supporting us on Ko-fi!

	[![Ko-fi](https://img.shields.io/badge/Ko--fi-Support%20Us-ff5e5b?logo=ko-fi&logoColor=white)](https://ko-fi.com/ericelbaz)

	| Option | Description |
	|--------|-------------|
	| [PRISM VIP Membership](https://ko-fi.com/summary/6bae206c-a751-4868-8dc7-f531afd1fb4c) | Access to all PRISM models |
	| [One-Time Support](https://ko-fi.com/s/86882e8991) | Support this model |

	</div>

	---

	## Model Highlights

	- PRISM Ablation: State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
	- 30B-A3B MoE Architecture: 30 billion total parameters, with ~3 billion active per token for fast, efficient inference
	- 128K Context Window: Extended context for complex tasks and large codebases
	- Interleaved Thinking: Multi-turn reasoning that persists across conversation turns, with per-turn thinking control
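
As a rough back-of-the-envelope sketch of what the 30B-A3B figure implies, the snippet below estimates weight memory and the share of parameters active per token. It assumes exactly 30e9 total and 3e9 active parameters (the card gives these only approximately) and 2 bytes per parameter for bfloat16. It ignores the KV cache, activations, and runtime overhead, so real requirements will be higher. The function names are illustrative only.

```python
# Rough estimate only; the parameter counts are the approximate figures
# from the model card, not exact values.
TOTAL_PARAMS = 30e9    # assumed: "30 billion total parameters"
ACTIVE_PARAMS = 3e9    # assumed: "~3 billion active per token"
BYTES_PER_PARAM = 2    # bfloat16 stores each weight in 2 bytes


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def active_fraction(active: float = ACTIVE_PARAMS, total: float = TOTAL_PARAMS) -> float:
    """Return the share of parameters used for each token."""
    return active / total


print(f"bf16 weights: ~{weight_memory_gb(TOTAL_PARAMS):.0f} GB")
print(f"active per token: ~{active_fraction():.0%}")
```

Under these assumptions, all of the weights still have to be resident in memory even though only about a tenth of them are used for any one token, which is why the usage examples spread the model across several devices.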

	## Benchmarks

	| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
	|-----------|---------------|-----------------------------|-------------|
	| AIME 2025 | 91.6 | 85.0 | 91.7 |
	| GPQA | 75.2 | 73.4 | 71.5 |
	| LCB v6 | 64.0 | 66.0 | 61.0 |
	| HLE | 14.4 | 9.8 | 10.9 |
	| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
	| τ²-Bench | 79.5 | 49.0 | 47.7 |
	| BrowseComp | 42.8 | 2.29 | 28.3 |

	## Usage

	### Transformers

	Install the latest transformers from source:

	```shell
	pip install git+https://github.com/huggingface/transformers.git
	```

	Run inference:

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	MODEL_PATH = "Ex0bit/GLM-4.7-Flash-PRISM"

	# Load the tokenizer and the model in bfloat16, placed automatically across available devices.
	tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
	model = AutoModelForCausalLM.from_pretrained(
	    MODEL_PATH,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# Render the conversation with the chat template and tokenize it.
	messages = [{"role": "user", "content": "Hello!"}]
	inputs = tokenizer.apply_chat_template(
	    messages,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_dict=True,
	    return_tensors="pt",
	).to(model.device)

	# Generate greedily, then decode only the newly generated tokens.
	generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
	output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
	print(output_text)
	```

	### vLLM

	Install vLLM nightly:

	```shell
	pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
	pip install git+https://github.com/huggingface/transformers.git
	```

	Serve the model:

	```shell
	vllm serve Ex0bit/GLM-4.7-Flash-PRISM \
	    --tensor-parallel-size 4 \
	    --speculative-config.method mtp \
	    --speculative-config.num_speculative_tokens 1 \
	    --tool-call-parser glm47 \
	    --reasoning-parser glm45 \
	    --enable-auto-tool-choice \
	    --served-model-name glm-4.7-flash-prism
	```
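
Once the server is up, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8000` (vLLM's default port, since the command above sets none) and that the model name matches `--served-model-name`. The helper names `build_chat_request` and `chat`, and the `max_tokens=1024` default, are illustrative choices rather than part of this model card.

```python
import json
import urllib.request

# Assumptions: the server runs locally on vLLM's default port 8000, and the
# model name matches --served-model-name from the serve command.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "glm-4.7-flash-prism"


def build_chat_request(prompt: str, temperature: float = 1.0, top_p: float = 0.95,
                       max_tokens: int = 1024) -> dict:
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,  # illustrative default, not a recommendation
    }


def chat(prompt: str, **sampling) -> str:
    """POST the request to /chat/completions and return the reply text."""
    body = json.dumps(build_chat_request(prompt, **sampling)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(chat("Hello!"))
```

The request body is built separately from the network call so that the sampling settings can be inspected or adjusted before anything is sent.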

	### SGLang

	Install SGLang:

	```shell
	uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 --extra-index-url https://sgl-project.github.io/whl/pr/
	uv pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa
	```

	Launch the server:

	```shell
	python3 -m sglang.launch_server \
	    --model-path Ex0bit/GLM-4.7-Flash-PRISM \
	    --tp-size 4 \
	    --tool-call-parser glm47 \
	    --reasoning-parser glm45 \
	    --speculative-algorithm EAGLE \
	    --speculative-num-steps 3 \
	    --speculative-eagle-topk 1 \
	    --speculative-num-draft-tokens 4 \
	    --mem-fraction-static 0.8 \
	    --served-model-name glm-4.7-flash-prism \
	    --host 0.0.0.0 \
	    --port 8000
	```

	> Note: For Blackwell GPUs, add `--attention-backend triton --speculative-draft-attention-backend triton` to your SGLang launch command.

	## Recommended Parameters

	| Use Case | Temperature | Top-P | Max New Tokens |
	|----------|-------------|-------|----------------|
	| Default | 1.0 | 0.95 | 131072 |
	| Code (SWE-bench) | 0.7 | 1.0 | 16384 |
	| Agentic Tasks | 0.0 | n/a | 16384 |
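
One way to apply these settings with the Transformers example is to turn each row into keyword arguments for `model.generate()`. The sketch below is a hedged illustration: the preset keys (`default`, `code`, `agentic`) and the helper `generation_kwargs` are hypothetical names, and the mapping of temperature 0.0 to greedy decoding (`do_sample=False`) is an interpretation of the table, not something the model card states.

```python
# Values copied from the table above; the preset names are hypothetical keys.
RECOMMENDED = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 131072},
    "code": {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384},
    "agentic": {"temperature": 0.0, "top_p": None, "max_new_tokens": 16384},
}


def generation_kwargs(use_case: str) -> dict:
    """Translate a preset into keyword arguments for model.generate()."""
    preset = RECOMMENDED[use_case]
    kwargs = {"max_new_tokens": preset["max_new_tokens"]}
    if preset["temperature"] == 0.0:
        # Assumption: temperature 0 is read as greedy decoding, which
        # transformers expresses as do_sample=False rather than temperature=0.
        kwargs["do_sample"] = False
    else:
        kwargs.update(
            do_sample=True,
            temperature=preset["temperature"],
            top_p=preset["top_p"],
        )
    return kwargs


print(generation_kwargs("agentic"))
```

With a loaded model, the result could then be passed as `model.generate(**inputs, **generation_kwargs("code"))`.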

	## License

	This model is released under the [PRISM Research License](LICENSE.md).

	## Citation

	```bibtex
	@misc{elbaz2026glm47flashPrism,
	  author = {Elbaz, Eric},
	  title = {Elbaz-GLM-4.7-Flash-PRISM: Unchained GLM-4.7-Flash-PRISM Model},
	  year = {2026},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/Ex0bit/Elbaz-GLM-4.7-Flash-PRISM}}
	}
	```

	## Acknowledgments

	Based on [GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) by [Z.AI](https://z.ai). See the [technical report](https://arxiv.org/abs/2508.06471) for more details on the base model.