---
license: other
license_name: prism-research
license_link: LICENSE.md
language:
- en
- zh
tags:
- glm4
- prism
- moe
pipeline_tag: text-generation
library_name: transformers
---
# GLM-4.7-Flash-PRISM

An over-refusal- and propaganda-free version of Z.AI's GLM-4.7-Flash, with refusal and bias mechanisms removed using our Advanced PRISM Pipeline.
## ☕ Support Our Work
If you find this model useful, consider supporting us on Ko-fi!
| Option | Description |
|---|---|
| PRISM VIP Membership | Access to all PRISM models |
| One-Time Support | Support this model |
## Model Highlights
- PRISM Ablation — State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
- 30B-A3B MoE Architecture — 30 billion total parameters with ~3 billion active per token for fast, efficient inference
- 128K Context Window — Extended context for complex tasks and large codebases
- Interleaved Thinking — Multi-turn reasoning that persists across conversations, with per-turn thinking control (see the sketch after this list)
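Per-turn thinking control happens at the chat-template level. A minimal sketch, assuming this model's template exposes the same `enable_thinking` switch as GLM-4.5's; treat the kwarg name as an assumption and verify against the template shipped with this checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Ex0bit/GLM-4.7-Flash-PRISM")

# Assumption: the chat template accepts an `enable_thinking` kwarg,
# as GLM-4.5's template does; check this model's template to confirm.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 23?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # suppress the thinking block for this turn only
)
print(prompt)
```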
## Benchmarks
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
## Usage

### Transformers
Install the latest transformers from source:
```bash
pip install git+https://github.com/huggingface/transformers.git
```
Run inference:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/GLM-4.7-Flash-PRISM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello!"}]

# Build the chat-formatted prompt and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
```
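The example above decodes greedily (`do_sample=False`); for open-ended generation, the sampling settings in the Recommended Parameters table below are likely a better starting point.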
### vLLM
Install vLLM nightly:
```bash
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
```
Serve the model:
```bash
vllm serve Ex0bit/GLM-4.7-Flash-PRISM \
  --tensor-parallel-size 4 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-flash-prism
```
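Once running, vLLM exposes an OpenAI-compatible API (port 8000 by default). A minimal sketch of querying it with the `openai` Python client; the base URL and placeholder API key assume a local, unauthenticated deployment:

```python
from openai import OpenAI

# Points at the local vLLM server started above (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="glm-4.7-flash-prism",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```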
### SGLang
Install SGLang:
```bash
uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 --extra-index-url https://sgl-project.github.io/whl/pr/
uv pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa
```
Launch the server:
```bash
python3 -m sglang.launch_server \
  --model-path Ex0bit/GLM-4.7-Flash-PRISM \
  --tp-size 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.8 \
  --served-model-name glm-4.7-flash-prism \
  --host 0.0.0.0 \
  --port 8000
```
Note: For Blackwell GPUs, add `--attention-backend triton --speculative-draft-attention-backend triton` to your SGLang launch command.
## Recommended Parameters
| Use Case | Temperature | Top-P | Max New Tokens |
|---|---|---|---|
| Default | 1.0 | 0.95 | 131072 |
| Code (SWE-bench) | 0.7 | 1.0 | 16384 |
| Agentic Tasks | 0.0 | — | 16384 |
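To apply a row of this table with Transformers, pass the values to `model.generate`. A minimal sketch using the Default row; `model` and `inputs` come from the Transformers example above, and the token budget is shortened here for illustration:

```python
# `model` and `inputs` are defined in the Transformers example above.
generated_ids = model.generate(
    **inputs,
    do_sample=True,       # required for temperature/top-p to take effect
    temperature=1.0,      # Default row
    top_p=0.95,           # Default row
    max_new_tokens=4096,  # illustrative; the table's default budget is 131072
)
```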
## License
This model is released under the PRISM Research License.
## Citation
```bibtex
@misc{elbaz2025glm47flashPrism,
  author       = {Elbaz, Eric},
  title        = {Elbaz-GLM-4.7-Flash-PRISM: Unchained GLM-4.7-Flash-PRISM Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/Elbaz-GLM-4.7-Flash-PRISM}}
}
```
## Acknowledgments
Based on GLM-4.7-Flash by Z.AI. See the technical report for more details on the base model.