Ex0bit committed
Commit 0ae6e8b · verified · 1 Parent(s): 63674c8

Update README.md

Files changed (1)
  1. README.md +109 -12
README.md CHANGED
@@ -14,18 +14,18 @@ library_name: transformers
  ---
 
  [![Parameters](https://img.shields.io/badge/Parameters-30B--A3B_MoE-blue)]()
- [![Architecture](https://img.shields.io/badge/Architecture-GLM--4-green)]()
  [![Context](https://img.shields.io/badge/Context-128K-orange)]()
 
  # GLM-4.7-Flash-PRISM
 
- An unrestricted version of ZAI's GLM-4.7-Flash with over-refusal mechanisms removed using PRISM (Projected Refusal Isolation via Subspace Modification).
 
  <div align="center">
 
  ### ☕ Support Our Work
 
- If you find this useful, consider supporting us on Ko-fi!
 
  [![Ko-fi](https://img.shields.io/badge/Ko--fi-Support%20Us-ff5e5b?logo=ko-fi&logoColor=white)](https://ko-fi.com/ericelbaz)
 
@@ -47,21 +47,118 @@ If you find this useful, consider supporting us on Ko-fi!
 
  ## Benchmarks
 
- | Benchmark | Score |
- |-----------|-------|
- | AIME 2025 | 91.6% |
- | τ²-Bench | 79.5% |
- | SWE-bench Verified | 59.2% |
- | GPQA | 75.2% |
 
  ## Usage
 
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
- model = AutoModelForCausalLM.from_pretrained("Ex0bit/GLM-4.7-Flash-PRISM")
- tokenizer = AutoTokenizer.from_pretrained("Ex0bit/GLM-4.7-Flash-PRISM")
  ```
 
  ## License
 
- This model is released under the [PRISM Research License](LICENSE.md).
 
  ---
 
  [![Parameters](https://img.shields.io/badge/Parameters-30B--A3B_MoE-blue)]()
+ [![Architecture](https://img.shields.io/badge/Architecture-GLM--4.7-green)]()
  [![Context](https://img.shields.io/badge/Context-128K-orange)]()
 
  # GLM-4.7-Flash-PRISM
 
+ An unrestricted version of [ZAI's GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) with over-refusal mechanisms completely removed using our PRISM Pipeline (Projected Refusal Isolation via Subspace Modification).
 
  <div align="center">
 
  ### ☕ Support Our Work
 
+ If you find this model useful, consider supporting us on Ko-fi!
 
  [![Ko-fi](https://img.shields.io/badge/Ko--fi-Support%20Us-ff5e5b?logo=ko-fi&logoColor=white)](https://ko-fi.com/ericelbaz)
 
 
  ## Benchmarks
 
+ | Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B-Thinking-2507 | GPT-OSS-20B |
+ |-----------|---------------|-----------------------------|-------------|
+ | AIME 2025 | 91.6 | 85.0 | 91.7 |
+ | GPQA | 75.2 | 73.4 | 71.5 |
+ | LCB v6 | 64.0 | 66.0 | 61.0 |
+ | HLE | 14.4 | 9.8 | 10.9 |
+ | SWE-bench Verified | 59.2 | 22.0 | 34.0 |
+ | τ²-Bench | 79.5 | 49.0 | 47.7 |
+ | BrowseComp | 42.8 | 2.29 | 28.3 |
 
  ## Usage
+
+ ### Transformers
+
+ Install the latest transformers from source:
+
+ ```shell
+ pip install git+https://github.com/huggingface/transformers.git
+ ```
+
+ Run inference:
+
  ```python
+ import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
+ MODEL_PATH = "Ex0bit/GLM-4.7-Flash-PRISM"
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_PATH,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ messages = [{"role": "user", "content": "Hello!"}]
+ inputs = tokenizer.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
+ output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
+ print(output_text)
+ ```
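
For multi-turn chat, the `messages` list in the Transformers example is the whole conversation state: append the decoded reply as an `assistant` turn before adding the next `user` turn, or the chat template will drop prior context. A minimal sketch of that bookkeeping, where `generate_reply` is a hypothetical stub standing in for the tokenize/generate/decode round trip above:

```python
# Conversation state for a chat model: a growing list of role/content dicts.
def generate_reply(messages):
    # Hypothetical stub; a real implementation would run the
    # apply_chat_template -> generate -> decode steps shown above.
    return f"(reply to: {messages[-1]['content']})"

messages = [{"role": "user", "content": "Hello!"}]
reply = generate_reply(messages)

# Record the model's turn before asking the next question.
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Who made you?"})
```

The same pattern applies unchanged when talking to the vLLM or SGLang servers below, since their APIs accept the same `messages` shape.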
+
+ ### vLLM
+
+ Install vLLM nightly:
+
+ ```shell
+ pip install -U vllm --pre --index-url https://pypi.org/simple --extra-index-url https://wheels.vllm.ai/nightly
+ pip install git+https://github.com/huggingface/transformers.git
+ ```
+
+ Serve the model:
+
+ ```shell
+ vllm serve Ex0bit/GLM-4.7-Flash-PRISM \
+     --tensor-parallel-size 4 \
+     --speculative-config.method mtp \
+     --speculative-config.num_speculative_tokens 1 \
+     --tool-call-parser glm47 \
+     --reasoning-parser glm45 \
+     --enable-auto-tool-choice \
+     --served-model-name glm-4.7-flash-prism
  ```
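
Once up, the server exposes an OpenAI-compatible API (port 8000 by default). A minimal sketch of a chat request against it using only the standard library; the endpoint URL and a running server are assumptions here, and the `openai` client works the same way:

```python
import json
import urllib.request

# Build an OpenAI-compatible chat completion request for the vLLM server.
# The model name matches the --served-model-name flag used above.
payload = {
    "model": "glm-4.7-flash-prism",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumes the serve command above is running locally
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server running, uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```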
 
+ ### SGLang
+
+ Install SGLang:
+
+ ```shell
+ uv pip install sglang==0.3.2.dev9039+pr-17247.g90c446848 --extra-index-url https://sgl-project.github.io/whl/pr/
+ uv pip install git+https://github.com/huggingface/transformers.git@76732b4e7120808ff989edbd16401f61fa6a0afa
+ ```
+
+ Launch the server:
+
+ ```shell
+ python3 -m sglang.launch_server \
+     --model-path Ex0bit/GLM-4.7-Flash-PRISM \
+     --tp-size 4 \
+     --tool-call-parser glm47 \
+     --reasoning-parser glm45 \
+     --speculative-algorithm EAGLE \
+     --speculative-num-steps 3 \
+     --speculative-eagle-topk 1 \
+     --speculative-num-draft-tokens 4 \
+     --mem-fraction-static 0.8 \
+     --served-model-name glm-4.7-flash-prism \
+     --host 0.0.0.0 \
+     --port 8000
+ ```
+
+ > **Note:** For Blackwell GPUs, add `--attention-backend triton --speculative-draft-attention-backend triton` to your SGLang launch command.
+
+ ## Recommended Parameters
+
+ | Use Case | Temperature | Top-P | Max New Tokens |
+ |----------|-------------|-------|----------------|
+ | Default | 1.0 | 0.95 | 131072 |
+ | Code (SWE-bench) | 0.7 | 1.0 | 16384 |
+ | Agentic Tasks | 0.0 | — | 16384 |
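
The table above maps directly onto sampling kwargs, whether passed to `model.generate` or into an OpenAI-compatible request body. A sketch of that mapping as plain dicts; the preset names and the greedy fallback for the temperature-0 agentic row are our assumptions, not part of the model card:

```python
# Recommended sampling presets from the table above, keyed by use case.
PRESETS = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_new_tokens": 131072},
    "code": {"temperature": 0.7, "top_p": 1.0, "max_new_tokens": 16384},
    "agentic": {"temperature": 0.0, "max_new_tokens": 16384},
}

def sampling_kwargs(use_case: str) -> dict:
    """Return generate()-style kwargs for a preset."""
    preset = dict(PRESETS[use_case])
    if preset["temperature"] == 0.0:
        # Temperature 0 means greedy decoding: drop temperature/top_p
        # and disable sampling instead.
        return {"do_sample": False, "max_new_tokens": preset["max_new_tokens"]}
    return {"do_sample": True, **preset}

print(sampling_kwargs("code"))
```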
+
  ## License
 
+ This model is released under the [PRISM Research License](LICENSE.md).
+
+ ## Acknowledgments
+
+ Based on [GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) by [Z.AI](https://z.ai). See the [technical report](https://arxiv.org/abs/2508.06471) for more details on the base model.