---
license: other
license_name: prism-research
license_link: LICENSE.md
language:
- en
- zh
tags:
- glm4
- prism
- moe
pipeline_tag: text-generation
library_name: transformers
---

[![Parameters](https://img.shields.io/badge/Parameters-30B--A3B_MoE-blue)]() [![Architecture](https://img.shields.io/badge/Architecture-GLM--4.7-green)]() [![Context](https://img.shields.io/badge/Context-128K-orange)]()

# GLM-4.7-Flash-PRISM

An over-refusal/propaganda-free version of [ZAI's GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), with over-refusal and bias mechanisms completely removed using our Advanced PRISM Pipeline.

### ☕ Support Our Work

If you find this model useful, consider supporting us on Ko-fi!

[![Ko-fi](https://img.shields.io/badge/Ko--fi-Support%20Us-ff5e5b?logo=ko-fi&logoColor=white)](https://ko-fi.com/ericelbaz)

| Option | Description |
|--------|-------------|
| [**PRISM VIP Membership**](https://ko-fi.com/summary/6bae206c-a751-4868-8dc7-f531afd1fb4c) | Access to all PRISM models |
| [**One-Time Support**](https://ko-fi.com/s/86882e8991) | Support this model |


---

## Model Highlights

- **PRISM Ablation**: State-of-the-art technique that removes over-refusal behaviors while preserving model capabilities
- **30B-A3B MoE Architecture**: 30 billion total parameters with ~3 billion active per token for fast, efficient inference
- **128K Context Window**: Extended context for complex tasks and large codebases
- **Interleaved Thinking**: Multi-turn reasoning that persists across conversations with per-turn thinking control

## Benchmarks

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
|-----------|---------------|-----------------------------|-------------|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| LCB v6 | 64.0 | 66.0 | 61.0 |
| HLE | 14.4 | 9.8 | 10.9 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |

## Usage

### Transformers

Install the latest transformers from source:

```shell
pip install git+https://github.com/huggingface/transformers.git
```

Run inference:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "Ex0bit/GLM-4.7-Flash-PRISM"

# Load the tokenizer and the model in bfloat16, placed across available devices.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Render the conversation with the model's chat template and tokenize it.
messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; slice off the prompt tokens so only the completion is printed.
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
```

### vLLM

Install vLLM nightly:

```shell
pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
pip install git+https://github.com/huggingface/transformers.git
```

Serve the model:

```shell
vllm serve Ex0bit/GLM-4.7-Flash-PRISM \
     --tensor-parallel-size 4 \
     --speculative-config.method mtp \
     --speculative-config.num_speculative_tokens 1 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm-4.7-flash-prism
```

### SGLang

Install SGLang:

```shell
uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 --extra-index-url https://sgl-project.github.io/whl/pr/
uv pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa
```

Launch the server:

```shell
python3 -m sglang.launch_server \
  --model-path Ex0bit/GLM-4.7-Flash-PRISM \
  --tp-size 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.8 \
  --served-model-name glm-4.7-flash-prism \
  --host 0.0.0.0 \
  --port 8000
```

> **Note:** For Blackwell GPUs, add `--attention-backend triton --speculative-draft-attention-backend triton` to your SGLang launch command.

## Recommended Parameters

| Use Case | Temperature | Top-P | Max New Tokens |
|----------|-------------|-------|----------------|
| Default | 1.0 | 0.95 | 131072 |
| Code (SWE-bench) | 0.7 | 1.0 | 16384 |
| Agentic Tasks | 0.0 | n/a | 16384 |

## License

This model is released under the [PRISM Research License](LICENSE.md).

## Citation

```bibtex
@misc{elbaz2026glm47flashPrism,
  author = {Elbaz, Eric},
  title = {Elbaz-GLM-4.7-Flash-PRISM: Unchained GLM-4.7-Flash-PRISM Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Ex0bit/Elbaz-GLM-4.7-Flash-PRISM}}
}
```

## Acknowledgments

Based on [GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) by [Z.AI](https://z.ai). See the [technical report](https://arxiv.org/abs/2508.06471) for more details on the base model.
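
## Example Client Request

The vLLM and SGLang commands above start an HTTP server, and the Recommended Parameters table lists sampling settings per use case. The sketch below shows one way the two might be combined. It assumes the server exposes an OpenAI-style `/v1/chat/completions` endpoint at `http://localhost:8000` and that the model is registered as `glm-4.7-flash-prism` (the `--served-model-name` value used above). The helper names `build_chat_request` and `send_chat_request` and the use-case keys are hypothetical, introduced only for illustration.

```python
import json
import urllib.request

# Values copied from the "Recommended Parameters" table.
# The dictionary keys are illustrative labels, not part of any API.
RECOMMENDED = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 131072},
    "code": {"temperature": 0.7, "top_p": 1.0, "max_tokens": 16384},
    "agentic": {"temperature": 0.0, "max_tokens": 16384},  # no Top-P listed
}


def build_chat_request(prompt, use_case="default", model="glm-4.7-flash-prism"):
    """Return a JSON-serializable body for an OpenAI-style chat completion call."""
    if use_case not in RECOMMENDED:
        raise ValueError(f"unknown use case: {use_case!r}")
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    body.update(RECOMMENDED[use_case])
    return body


def send_chat_request(body, base_url="http://localhost:8000"):
    """POST the body to a running server (assumed URL) and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With a server running, `send_chat_request(build_chat_request("Hello!", use_case="code"))` would return the assistant's reply as a string. Building the body separately from sending it keeps the parameter mapping easy to inspect without a live server.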