---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- heretic
- uncensored
- decensored
- abliterated
- compressed-tensors
- fp8
base_model: zai-org/GLM-4.7
---
# GLM-4.7-heretic-FP8

FP8 W8A8 quantized version of [trohrbaugh/GLM-4.7-heretic](https://huggingface.co/trohrbaugh/GLM-4.7-heretic), a decensored [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) made using [Heretic](https://github.com/p-e-w/heretic) v1.2.0+custom.

## Abliteration parameters

| Parameter | Value |
| :-------- | :---: |
| **direction_index** | per layer |
| **attn.o_proj.max_weight** | 1.84 |
| **attn.o_proj.max_weight_position** | 49.16 |
| **attn.o_proj.min_weight** | 1.64 |
| **attn.o_proj.min_weight_distance** | 26.42 |
| **mlp.down_proj.max_weight** | 1.02 |
| **mlp.down_proj.max_weight_position** | 53.46 |
| **mlp.down_proj.min_weight** | 0.97 |
| **mlp.down_proj.min_weight_distance** | 45.98 |

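As a rough illustration of how the four per-component values could translate into a per-layer ablation strength, the sketch below assumes a piecewise-linear kernel: the weight peaks at `max_weight` at `max_weight_position`, falls linearly to `min_weight` at `min_weight_distance` layers away, and is zero beyond that. This shape is an assumption about Heretic's behaviour, not something stated on this card, and `ablation_weight` is a hypothetical helper; check the Heretic source before relying on it.

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Assumed piecewise-linear kernel over layer indices (illustrative only)."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # outside the kernel: the layer is left untouched
    # Linear interpolation from max_weight (distance 0) to min_weight (kernel edge).
    t = distance / min_weight_distance
    return max_weight + t * (min_weight - max_weight)


# attn.o_proj values from the parameter table.
O_PROJ = dict(max_weight=1.84, max_weight_position=49.16,
              min_weight=1.64, min_weight_distance=26.42)

for layer in (10, 30, 49, 70, 90):
    print(layer, round(ablation_weight(layer, **O_PROJ), 3))
```

Under this assumption, layers more than about 26 positions away from layer 49 would receive no `attn.o_proj` ablation at all.
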
## Abliteration performance

| Metric | This model | Original model ([zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7)) |
| :----- | :--------: | :---------------------------: |
| **KL divergence** | 0.0748 | 0 *(by definition)* |
| **Refusals** | 0/100 | 99/100 |

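The KL divergence row compares token probability distributions from the two models, which is why the original model scores exactly 0 against itself. A minimal sketch of the quantity, using made-up toy distributions rather than real model outputs (how prompts are sampled and how per-prompt values are aggregated is not specified on this card):

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions (hypothetical values, not model outputs).
original = [0.70, 0.20, 0.10]
modified = [0.65, 0.23, 0.12]

print(kl_divergence(original, original))  # 0.0: identical distributions
print(kl_divergence(original, modified))  # small positive value
```
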
## FP8 Quantization

Quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor) (v0.10.1-dev, main branch) to produce a **compressed-tensors** format checkpoint natively supported by vLLM, so no `--quantization` flag or patches are needed.

### Quantization recipe

- **Scheme:** FP8 (W8A8): static per-channel FP8 E4M3 weights with minmax observer, dynamic per-token FP8 E4M3 activations
- **Calibration:** 1024 samples from [fineweb-edu-score-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2), max sequence length 4096
- **SmoothQuant:** Not used (can interfere with MoE expert routing)
- **Format:** `compressed-tensors` (auto-detected by vLLM)

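To make the scheme concrete, the sketch below computes symmetric absmax (minmax) scales the way a per-channel weight quantizer or per-token activation quantizer typically would: each row's largest magnitude is mapped to 448, the largest finite FP8 E4M3 value. It is a simplified pure-Python illustration, not llm-compressor's implementation; `absmax_scales` and `scale_and_clamp` are hypothetical helpers, and rounding onto the actual E4M3 grid is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def absmax_scales(rows):
    """One scale per row: an output channel for weights, a token for activations."""
    # Fall back to 1.0 for an all-zero row to avoid dividing by zero later.
    return [(max(abs(x) for x in row) / FP8_E4M3_MAX) or 1.0 for row in rows]


def scale_and_clamp(rows, scales):
    """Divide by the scale and clamp into the E4M3 range (grid rounding omitted)."""
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / s)) for x in row]
        for row, s in zip(rows, scales)
    ]


weights = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]  # toy 2x3 weight matrix
scales = absmax_scales(weights)
scaled = scale_and_clamp(weights, scales)
print(scales)
print(scaled)
```

For weights, scales like these are computed once and stored with the checkpoint ("static"); for activations, the same calculation runs per token at inference time, which is what "dynamic" refers to.
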
### Precision map

| Component | Precision | Rationale |
| :-------- | :-------: | :-------- |
| Routed expert weights (160 experts × 89 MoE layers) | FP8 E4M3 | Bulk of the model; per-channel static scaling via calibration |
| Attention projections (q/k/v/o) | FP8 E4M3 | GQA with 96Q / 8KV heads, head_dim=128 |
| Shared expert weights | FP8 E4M3 | Active every token, well-calibrated |
| Dense MLP (layers 0–2) | FP8 E4M3 | Only 3 dense layers |
| Attention biases (q/k/v) | BF16 | Small tensors, sensitive to precision loss |
| Router/gate weights | BF16 | Routing errors cascade through all downstream computation |
| MoE e_score_correction_bias | BF16 | Critical for expert load balancing |
| RMSNorm / QK norms | BF16 | Negligible size, high sensitivity |
| Embeddings / LM head | BF16 | Standard practice for quantized models |
| MTP head (layer 92: enorm, hnorm, eh_proj) | BF16 | Speculative decoding head, kept full precision |

### Ignore patterns used

```python
IGNORE_PATTERNS = [
    # Embeddings and output head
    "re:.*embed_tokens.*",
    "lm_head",
    # Normalization layers
    "re:.*layernorm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
    "model.norm",
    # Attention biases
    "re:.*self_attn\\.q_proj\\.bias",
    "re:.*self_attn\\.k_proj\\.bias",
    "re:.*self_attn\\.v_proj\\.bias",
    # MoE router (the trailing $ keeps gate_proj out of the match)
    "re:.*mlp\\.gate$",
    "re:.*mlp\\.gate\\.weight",
    "re:.*mlp\\.gate\\.e_score_correction_bias",
    # MTP head
    "re:.*\\.enorm",
    "re:.*\\.hnorm",
    "re:.*\\.eh_proj",
    "re:.*shared_head\\.norm",
]
```

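The sketch below illustrates how such a list could be applied to module names, assuming that entries prefixed with `re:` are matched as regular expressions from the start of the name and all other entries are exact names. `is_ignored` is a hypothetical helper that approximates this behaviour, not llm-compressor's own matcher, and the module names are illustrative.

```python
import re

# A subset of the ignore list, enough to show both pattern styles.
PATTERNS = [
    "lm_head",
    "re:.*layernorm.*",
    "re:.*mlp\\.gate$",
    "re:.*\\.eh_proj",
]


def is_ignored(name, patterns=PATTERNS):
    """Return True if a module name matches any ignore pattern."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            return True
    return False


for name in [
    "model.layers.3.mlp.gate",                 # MoE router
    "model.layers.3.mlp.experts.0.gate_proj",  # routed expert projection
    "model.layers.0.input_layernorm",          # norm layer
    "lm_head",
]:
    print(name, is_ignored(name))
```

The router and the expert `gate_proj` differ only by suffix, which is why the router pattern is anchored with `$`.
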
## Serving

### vLLM (recommended)

vLLM auto-detects the compressed-tensors FP8 format from `config.json`. No `--quantization` flag is required.

```shell
vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice
```

To disable thinking mode (shorter, faster responses):

```shell
vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
  --tensor-parallel-size 4 \
  --max-model-len 131072 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": false}'
```

Or disable per-request:

```json
{
  "model": "trohrbaugh/GLM-4.7-heretic-fp8",
  "messages": [{"role": "user", "content": "Hello"}],
  "chat_template_kwargs": {"enable_thinking": false}
}
```

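For client code, the same per-request override can be sent to the OpenAI-compatible endpoint. The sketch below uses only the Python standard library and assumes the server started with the commands above is reachable at `http://localhost:8000` (vLLM's default port); `build_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

MODEL = "trohrbaugh/GLM-4.7-heretic-fp8"


def build_payload(prompt, enable_thinking=True):
    """Build a chat completion request body; thinking is toggled per request."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def chat(prompt, enable_thinking=True, base_url="http://localhost:8000"):
    """POST the request to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt, enable_thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Print the request body only; call chat("Hello", enable_thinking=False)
    # once a server is actually running.
    print(json.dumps(build_payload("Hello", enable_thinking=False), indent=2))
```
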
### VRAM requirements

| Configuration | Approx. total VRAM | Example hardware |
| :------------ | :----------------: | :--------------- |
| TP=4 | ~370 GB | 4× RTX PRO 6000 96GB (384 GB total) |
| TP=8 | ~370 GB | 8× A100 80GB, 8× RTX PRO 6000 96GB |

Four 80 GB GPUs provide only 320 GB in total, which is less than the ~362 GB of weights, so TP=4 needs GPUs with at least 96 GB each.

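A quick way to sanity-check a hardware configuration is to divide the checkpoint size by the tensor-parallel degree and compare the result with per-GPU memory. The sketch assumes weights shard evenly across ranks and uses the ~362 GB checkpoint size listed under Related models; `per_gpu_headroom_gb` is a hypothetical helper, and the headroom it reports still has to cover KV cache, activations, and runtime overhead.

```python
CHECKPOINT_GB = 362  # approximate FP8 checkpoint size from this card


def per_gpu_headroom_gb(tp_size, gpu_memory_gb, checkpoint_gb=CHECKPOINT_GB):
    """Memory left per GPU after an even split of the weights (negative = does not fit)."""
    return gpu_memory_gb - checkpoint_gb / tp_size


for tp_size, gpu_memory_gb in [(4, 80), (4, 96), (8, 80), (8, 96)]:
    headroom = per_gpu_headroom_gb(tp_size, gpu_memory_gb)
    print(f"TP={tp_size}, {gpu_memory_gb} GB GPUs: {headroom:+.2f} GB headroom per GPU")
```
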
## Related models

| Variant | Size | Format | Link |
| :------ | :--: | :----- | :--- |
| BF16 (full precision) | ~706 GB | safetensors | [trohrbaugh/GLM-4.7-heretic](https://huggingface.co/trohrbaugh/GLM-4.7-heretic) |
| FP8 W8A8 (this model) | ~362 GB | compressed-tensors | [trohrbaugh/GLM-4.7-heretic-fp8](https://huggingface.co/trohrbaugh/GLM-4.7-heretic-fp8) |

## Quantization environment

- **GPU:** 8× NVIDIA RTX PRO 6000 Blackwell Server Edition
- **CUDA:** 13.1
- **torch:** 2.11.0+cu130
- **transformers:** 4.57.6
- **llm-compressor:** 0.10.1-dev (main branch)
- **compressed-tensors:** 0.14.0.1

## Credits

- [Z.ai / THUDM](https://huggingface.co/zai-org) for GLM-4.7
- [P-E-W](https://github.com/p-e-w/heretic) for the Heretic abliteration engine
- [vLLM team](https://github.com/vllm-project/llm-compressor) for llm-compressor

## Citation

```bibtex
@misc{5team2025glm45agenticreasoningcoding,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```