GLM-4.7-heretic-FP8

FP8 (W8A8) quantized version of trohrbaugh/GLM-4.7-heretic, a decensored variant of zai-org/GLM-4.7 made using Heretic v1.2.0+custom.

Abliteration parameters

| Parameter | Value |
| --- | --- |
| direction_index | per layer |
| attn.o_proj.max_weight | 1.84 |
| attn.o_proj.max_weight_position | 49.16 |
| attn.o_proj.min_weight | 1.64 |
| attn.o_proj.min_weight_distance | 26.42 |
| mlp.down_proj.max_weight | 1.02 |
| mlp.down_proj.max_weight_position | 53.46 |
| mlp.down_proj.min_weight | 0.97 |
| mlp.down_proj.min_weight_distance | 45.98 |
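
For readers unfamiliar with these parameters, the sketch below shows how the four values per component could translate into a per-layer ablation weight. It assumes the linear kernel described in the upstream Heretic README (peak `max_weight` at `max_weight_position`, falling linearly to `min_weight` at a distance of `min_weight_distance` layers, with no ablation beyond that). The custom build used for this model may behave differently, and the layer count of 92 is an assumption derived from the 3 dense + 89 MoE layers listed in the precision map.

```python
from dataclasses import dataclass


@dataclass
class AblationParams:
    max_weight: float
    max_weight_position: float
    min_weight: float
    min_weight_distance: float


def layer_weight(p: AblationParams, layer_index: int) -> float:
    """Ablation weight for one layer under the assumed linear kernel."""
    distance = abs(layer_index - p.max_weight_position)
    if distance > p.min_weight_distance:
        return 0.0  # layer lies outside the kernel and is left untouched
    # Linear interpolation from max_weight (distance 0) to min_weight.
    return p.max_weight + (distance / p.min_weight_distance) * (p.min_weight - p.max_weight)


# Values copied from the parameter table.
ATTN_O_PROJ = AblationParams(1.84, 49.16, 1.64, 26.42)
MLP_DOWN_PROJ = AblationParams(1.02, 53.46, 0.97, 45.98)

NUM_LAYERS = 92  # assumption: 3 dense + 89 MoE layers

if __name__ == "__main__":
    for name, p in [("attn.o_proj", ATTN_O_PROJ), ("mlp.down_proj", MLP_DOWN_PROJ)]:
        touched = [i for i in range(NUM_LAYERS) if layer_weight(p, i) > 0.0]
        print(f"{name}: layers {touched[0]}..{touched[-1]} are ablated")
```

Under these assumptions, the attention kernel is narrow and centred near layer 49, while the MLP kernel is wide enough to reach the final layer.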

Abliteration performance

| Metric | This model | Original model (zai-org/GLM-4.7) |
| --- | --- | --- |
| KL divergence | 0.0748 | 0 (by definition) |
| Refusals | 0/100 | 99/100 |
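
The "0 (by definition)" entry follows from the definition of KL divergence: a distribution compared with itself gives zero. The sketch below illustrates this with made-up next-token probabilities. This card does not state which prompts or token positions the 0.0748 figure was measured on, so the numbers here are purely illustrative.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token probabilities, for illustration only.
original = [0.70, 0.20, 0.10]
modified = [0.60, 0.25, 0.15]

print(kl_divergence(original, original))  # 0.0: a model never diverges from itself
print(kl_divergence(original, modified))  # small positive number
```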

FP8 Quantization

Quantized using llm-compressor (v0.10.1-dev, main branch) to produce a compressed-tensors format checkpoint natively supported by vLLM — no --quantization flag or patches needed.

Quantization recipe

  • Scheme: FP8 (W8A8) — static per-channel FP8 E4M3 weights with minmax observer, dynamic per-token FP8 E4M3 activations
  • Calibration: 1024 samples from fineweb-edu-score-2, max sequence length 4096
  • SmoothQuant: Not used (can interfere with MoE expert routing)
  • Format: compressed-tensors (auto-detected by vLLM)
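
To make the scheme concrete, here is a minimal sketch of how the two kinds of scale factors are derived. It is plain Python for illustration, not llm-compressor code: the function names are hypothetical, the tensors are toy values, and the actual rounding to the FP8 grid is omitted. The only fixed constant is 448, the largest finite magnitude representable in FP8 E4M3.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def per_channel_weight_scales(weight: list[list[float]]) -> list[float]:
    """One static scale per output channel (row), from the row's min/max range."""
    return [max(abs(min(row)), abs(max(row))) / FP8_E4M3_MAX for row in weight]


def per_token_activation_scales(activations: list[list[float]]) -> list[float]:
    """One scale per token (row), recomputed at runtime for every input."""
    return [max(abs(x) for x in token) / FP8_E4M3_MAX for token in activations]


def to_fp8_range(row: list[float], scale: float) -> list[float]:
    """Divide by the scale and clamp to the representable range (no rounding)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale)) for x in row]


# Toy 2x3 weight matrix and two token activations (all-zero rows are not handled).
weight = [[0.5, -2.0, 1.0], [0.1, 0.05, -0.2]]
activations = [[3.0, -1.0, 0.5], [40.0, 8.0, -16.0]]

w_scales = per_channel_weight_scales(weight)         # fixed once, stored with the weights
a_scales = per_token_activation_scales(activations)  # computed for each incoming token

print(w_scales)
print(to_fp8_range(weight[0], w_scales[0]))
```

The weight scales are computed once and saved, which is what "static" means here; the activation scales depend on the input and so must be computed at inference time.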

Precision map

| Component | Precision | Rationale |
| --- | --- | --- |
| Routed expert weights (160 experts × 89 MoE layers) | FP8 E4M3 | Bulk of model; per-channel static scaling via calibration |
| Attention projections (q/k/v/o) | FP8 E4M3 | GQA with 96Q / 8KV heads, head_dim=128 |
| Shared expert weights | FP8 E4M3 | Active every token, well-calibrated |
| Dense MLP (layers 0–2) | FP8 E4M3 | Only 3 dense layers |
| Attention biases (q/k/v) | BF16 | Small tensors, sensitive to precision loss |
| Router/gate weights | BF16 | Routing errors cascade through all downstream computation |
| MoE e_score_correction_bias | BF16 | Critical for expert load balancing |
| RMSNorm / QK norms | BF16 | Negligible size, high sensitivity |
| Embeddings / LM head | BF16 | Standard practice for quantized models |
| MTP head (layer 92: enorm, hnorm, eh_proj) | BF16 | Speculative decoding head, kept full precision |

Ignore patterns used

IGNORE_PATTERNS = [
    "re:.*embed_tokens.*",
    "lm_head",
    "re:.*layernorm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
    "model.norm",
    "re:.*self_attn\\.q_proj\\.bias",
    "re:.*self_attn\\.k_proj\\.bias",
    "re:.*self_attn\\.v_proj\\.bias",
    "re:.*mlp\\.gate$",
    "re:.*mlp\\.gate\\.weight",
    "re:.*mlp\\.gate\\.e_score_correction_bias",
    "re:.*\\.enorm",
    "re:.*\\.hnorm",
    "re:.*\\.eh_proj",
    "re:.*shared_head\\.norm",
]
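
As a rough illustration of how such a list is interpreted: entries with the `re:` prefix are treated as regular expressions matched against module names, while plain entries are compared literally. The helper below is a simplified stand-in written for this card, not the actual compressed-tensors matcher (which, among other things, can also match plain entries against module class names). The module names in the usage lines are assumed examples of GLM-style naming.

```python
import re

# Subset of the ignore list, enough to show both entry styles.
IGNORE_PATTERNS = [
    "lm_head",
    "re:.*layernorm.*",
    "re:.*mlp\\.gate$",
    "re:.*\\.eh_proj",
]


def is_ignored(module_name: str, patterns: list[str]) -> bool:
    """Return True if the module would be skipped and left unquantized."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Regex entry: match from the start of the module name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            # Plain entry: literal comparison.
            return True
    return False


# Assumed module names, for illustration only.
print(is_ignored("model.layers.5.mlp.gate", IGNORE_PATTERNS))                 # True: router
print(is_ignored("model.layers.5.mlp.experts.0.gate_proj", IGNORE_PATTERNS))  # False: expert weight
print(is_ignored("model.layers.5.input_layernorm", IGNORE_PATTERNS))          # True: norm
print(is_ignored("lm_head", IGNORE_PATTERNS))                                 # True: exact match
```

The `$` anchor in the `mlp.gate` pattern is what keeps the router in BF16 without also excluding the experts' `gate_proj` weights.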

Serving

vLLM (recommended)

vLLM auto-detects the compressed-tensors FP8 format from config.json. No --quantization flag required.

vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

To disable thinking mode (shorter, faster responses):

vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": false}'

Or disable per-request:

{
  "model": "trohrbaugh/GLM-4.7-heretic-fp8",
  "messages": [{"role": "user", "content": "Hello"}],
  "chat_template_kwargs": {"enable_thinking": false}
}
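
A minimal client sketch for the per-request form, using only the Python standard library. It assumes the server listens on vLLM's default `http://localhost:8000` and exposes the OpenAI-compatible `/v1/chat/completions` route; adjust the URL for your deployment. The helper names are illustrative.

```python
import json
import urllib.request

MODEL = "trohrbaugh/GLM-4.7-heretic-fp8"
URL = "http://localhost:8000/v1/chat/completions"  # assumed default host and port


def build_payload(prompt: str, enable_thinking: bool = False) -> dict:
    """Build a chat-completions request body with thinking toggled per request."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def chat(prompt: str, enable_thinking: bool = False) -> str:
    """POST the request to a running server and return the assistant's reply text."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt, enable_thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Only prints the request body; call chat("Hello") once a server is running.
    print(json.dumps(build_payload("Hello"), indent=2))
```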

VRAM requirements

| Configuration | Approx. VRAM (total) | Example hardware |
| --- | --- | --- |
| TP=4 | ~370 GB | 4× RTX PRO 6000 96GB |
| TP=8 | ~370 GB | 8× A100 80GB, 8× RTX PRO 6000 96GB |
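
These figures can be sanity-checked with simple arithmetic: under tensor parallelism the ~370 GB total is split roughly evenly across GPUs. The sketch below assumes a perfectly even split and ignores per-GPU overhead such as the CUDA context and activation buffers, so treat the headroom numbers as optimistic.

```python
TOTAL_VRAM_GB = 370.0  # approximate total quoted in this section


def per_gpu_share(total_gb: float, tensor_parallel_size: int) -> float:
    """Approximate memory each GPU must hold, assuming an even split."""
    return total_gb / tensor_parallel_size


def headroom(gpu_memory_gb: float, total_gb: float, tensor_parallel_size: int) -> float:
    """Memory left per GPU; a negative value means the configuration cannot fit."""
    return gpu_memory_gb - per_gpu_share(total_gb, tensor_parallel_size)


# (label, TP size, memory per GPU in GB)
configs = [
    ("4x RTX PRO 6000 96GB", 4, 96.0),
    ("8x A100 80GB", 8, 80.0),
]

for label, tp, mem in configs:
    print(f"{label}: {per_gpu_share(TOTAL_VRAM_GB, tp):.1f} GB per GPU, "
          f"{headroom(mem, TOTAL_VRAM_GB, tp):.1f} GB headroom")
```

With TP=4 each GPU holds about 92.5 GB, so cards with less memory than that cannot hold a shard, and 96 GB cards leave only a few GB for KV cache at long context lengths.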

Related models

| Variant | Size | Format | Link |
| --- | --- | --- | --- |
| BF16 (full precision) | ~706 GB | safetensors | trohrbaugh/GLM-4.7-heretic |
| FP8 W8A8 (this model) | ~362 GB | compressed-tensors | trohrbaugh/GLM-4.7-heretic-fp8 |

Quantization environment

  • GPU: 8× NVIDIA RTX PRO 6000 Blackwell Server Edition
  • CUDA: 13.1
  • torch: 2.11.0+cu130
  • transformers: 4.57.6
  • llm-compressor: 0.10.1-dev (main branch)
  • compressed-tensors: 0.14.0.1

Credits

Citation

@misc{5team2025glm45agenticreasoningcoding,
      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models}, 
      author={GLM Team and Aohan Zeng and Xin Lv and others},
      year={2025},
      eprint={2508.06471},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06471}, 
}