metadata
language:
  - en
  - zh
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
  - heretic
  - uncensored
  - decensored
  - abliterated
  - compressed-tensors
  - fp8
base_model: zai-org/GLM-4.7

GLM-4.7-heretic-FP8

FP8 W8A8 quantized version of trohrbaugh/GLM-4.7-heretic — a decensored zai-org/GLM-4.7, made using Heretic v1.2.0+custom.

Abliteration parameters

| Parameter | Value |
| --- | --- |
| direction_index | per layer |
| attn.o_proj.max_weight | 1.84 |
| attn.o_proj.max_weight_position | 49.16 |
| attn.o_proj.min_weight | 1.64 |
| attn.o_proj.min_weight_distance | 26.42 |
| mlp.down_proj.max_weight | 1.02 |
| mlp.down_proj.max_weight_position | 53.46 |
| mlp.down_proj.min_weight | 0.97 |
| mlp.down_proj.min_weight_distance | 45.98 |
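
The parameter names suggest a per-layer ablation weight that peaks at one layer position and tapers off with distance from it. The sketch below is one plausible reading, assuming a linear falloff from `max_weight` at `max_weight_position` down to `min_weight` at `min_weight_distance`, and no ablation beyond that distance. The kernel shape, the `ComponentParams` class, and the `layer_weight` helper are illustrative assumptions, not taken from this model card; Heretic's source is the authority on the actual behaviour.

```python
from dataclasses import dataclass


@dataclass
class ComponentParams:
    """Hypothetical container for the four per-component values listed above."""
    max_weight: float
    max_weight_position: float
    min_weight: float
    min_weight_distance: float


def layer_weight(params: ComponentParams, layer_index: float) -> float:
    """Assumed kernel: linear falloff around the peak layer, zero outside the window."""
    distance = abs(layer_index - params.max_weight_position)
    if distance > params.min_weight_distance:
        return 0.0  # assumption: layers outside the window are left untouched
    fraction = distance / params.min_weight_distance
    return params.max_weight + fraction * (params.min_weight - params.max_weight)


# Values copied from the table above.
O_PROJ = ComponentParams(1.84, 49.16, 1.64, 26.42)
DOWN_PROJ = ComponentParams(1.02, 53.46, 0.97, 45.98)

for layer in (0, 23, 49, 76, 91):
    print(layer,
          round(layer_weight(O_PROJ, layer), 3),
          round(layer_weight(DOWN_PROJ, layer), 3))
```

Under this reading, the attn.o_proj window covers roughly layers 23 to 75, while the mlp.down_proj window is wide enough to reach most of the stack.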

Abliteration performance

| Metric | This model | Original model (zai-org/GLM-4.7) |
| --- | --- | --- |
| KL divergence | 0.0748 | 0 (by definition) |
| Refusals | 0/100 | 99/100 |
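
For readers unfamiliar with the metric, the following minimal sketch shows how a KL divergence between two next-token distributions can be computed, and why the original model scores 0 against itself. The logits are made-up toy values, and which prompts and token positions Heretic evaluates are not stated here, so treat this as an illustration of the formula only.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy logits over a 4-token vocabulary (hypothetical values).
original = softmax([2.0, 1.0, 0.1, -1.0])
modified = softmax([1.8, 1.1, 0.2, -0.9])

print(kl_divergence(original, original))  # a distribution compared with itself
print(kl_divergence(original, modified))  # small positive value for a small shift
```

A lower value means the modified model's output distribution stays closer to the original's.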

FP8 Quantization

Quantized using llm-compressor (v0.10.1-dev, main branch) to produce a compressed-tensors checkpoint that vLLM supports natively, with no --quantization flag or patches needed.

Quantization recipe

  • Scheme: FP8 (W8A8) — static per-channel FP8 E4M3 weights with minmax observer, dynamic per-token FP8 E4M3 activations
  • Calibration: 1024 samples from fineweb-edu-score-2, max sequence length 4096
  • SmoothQuant: Not used (can interfere with MoE expert routing)
  • Format: compressed-tensors (auto-detected by vLLM)
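
To make the scheme concrete, here is a small pure-Python sketch of min-max scaling to FP8 E4M3. It assumes the E4M3 variant whose largest finite magnitude is 448, and it uses toy matrices. "Static per-channel" corresponds to one scale per weight row, computed once at quantization time; "dynamic per-token" corresponds to one scale per activation row, recomputed for every token at inference. The helper names are hypothetical and this is not llm-compressor's implementation.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the assumed E4M3 variant


def round_e4m3(value):
    """Round to the nearest E4M3-representable value (3 mantissa bits, saturating)."""
    if value == 0.0:
        return 0.0
    mag = min(abs(value), FP8_E4M3_MAX)
    _, exp = math.frexp(mag)      # mag = m * 2**exp with 0.5 <= m < 1
    exp = max(exp - 1, -6)        # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)       # spacing between neighbouring values
    return math.copysign(round(mag / step) * step, value)


def minmax_scale(values):
    """Min-max observer: map the largest magnitude onto the FP8 maximum."""
    peak = max(abs(v) for v in values)
    return peak / FP8_E4M3_MAX if peak > 0.0 else 1.0


def fake_quantize_rows(rows):
    """Quantize and dequantize with one scale per row (channel or token)."""
    scales = [minmax_scale(row) for row in rows]
    dequantized = [[round_e4m3(v / s) * s for v in row]
                   for row, s in zip(rows, scales)]
    return scales, dequantized


# Toy weight matrix: each row is one output channel (static scales).
weights = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
w_scales, w_dq = fake_quantize_rows(weights)

# Toy activations: each row is one token (scales computed at run time).
activations = [[3.0, -0.25, 1.5]]
a_scales, a_dq = fake_quantize_rows(activations)

print(w_scales, w_dq)
print(a_scales, a_dq)
```

Because each row gets its own scale, a channel with small weights (the second row above) keeps its relative precision instead of being crushed by a large-magnitude neighbour.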

Precision map

| Component | Precision | Rationale |
| --- | --- | --- |
| Routed expert weights (160 experts × 89 MoE layers) | FP8 E4M3 | Bulk of model; per-channel static scaling via calibration |
| Attention projections (q/k/v/o) | FP8 E4M3 | GQA with 96Q / 8KV heads, head_dim=128 |
| Shared expert weights | FP8 E4M3 | Active every token, well-calibrated |
| Dense MLP (layers 0–2) | FP8 E4M3 | Only 3 dense layers |
| Attention biases (q/k/v) | BF16 | Small tensors, sensitive to precision loss |
| Router/gate weights | BF16 | Routing errors cascade through all downstream computation |
| MoE e_score_correction_bias | BF16 | Critical for expert load balancing |
| RMSNorm / QK norms | BF16 | Negligible size, high sensitivity |
| Embeddings / LM head | BF16 | Standard practice for quantized models |
| MTP head (layer 92: enorm, hnorm, eh_proj) | BF16 | Speculative decoding head, kept full precision |

Ignore patterns used

IGNORE_PATTERNS = [
    "re:.*embed_tokens.*",
    "lm_head",
    "re:.*layernorm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
    "model.norm",
    "re:.*self_attn\\.q_proj\\.bias",
    "re:.*self_attn\\.k_proj\\.bias",
    "re:.*self_attn\\.v_proj\\.bias",
    "re:.*mlp\\.gate$",
    "re:.*mlp\\.gate\\.weight",
    "re:.*mlp\\.gate\\.e_score_correction_bias",
    "re:.*\\.enorm",
    "re:.*\\.hnorm",
    "re:.*\\.eh_proj",
    "re:.*shared_head\\.norm",
]
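
The sketch below shows how such a list can be checked against module names. It assumes the usual compressed-tensors convention that an entry prefixed with `re:` is a regular expression matched from the start of the module name, and that any other entry must equal the name exactly. The `is_ignored` helper and the example module names are hypothetical.

```python
import re

IGNORE_PATTERNS = [
    "re:.*embed_tokens.*",
    "lm_head",
    "re:.*layernorm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
    "model.norm",
    "re:.*self_attn\\.q_proj\\.bias",
    "re:.*self_attn\\.k_proj\\.bias",
    "re:.*self_attn\\.v_proj\\.bias",
    "re:.*mlp\\.gate$",
    "re:.*mlp\\.gate\\.weight",
    "re:.*mlp\\.gate\\.e_score_correction_bias",
    "re:.*\\.enorm",
    "re:.*\\.hnorm",
    "re:.*\\.eh_proj",
    "re:.*shared_head\\.norm",
]


def is_ignored(module_name, patterns=IGNORE_PATTERNS):
    """Return True if the module would be skipped (kept in BF16) under the assumed rules."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names in the style of a transformers checkpoint.
for name in (
    "model.layers.5.mlp.gate",                  # router: ignored
    "model.layers.5.mlp.experts.0.gate_proj",   # expert projection: quantized
    "model.layers.5.self_attn.q_proj",          # attention projection: quantized
    "model.layers.0.input_layernorm",           # norm: ignored
    "lm_head",                                  # exact-name match: ignored
):
    print(name, is_ignored(name))
```

Note the `$` anchor in `re:.*mlp\\.gate$`: it keeps the router out of FP8 without also catching the `gate_proj` weights of the experts.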

Serving

vLLM (recommended)

vLLM auto-detects the compressed-tensors FP8 format from config.json. No --quantization flag required.

vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

To disable thinking mode (shorter, faster responses):

vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": false}'

Or disable per-request:

{
  "model": "trohrbaugh/GLM-4.7-heretic-fp8",
  "messages": [{"role": "user", "content": "Hello"}],
  "chat_template_kwargs": {"enable_thinking": false}
}
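
As a rough client-side sketch, the same request body can be built and posted from Python using only the standard library. It assumes the server was started with one of the `vllm serve` commands above and is reachable at `http://localhost:8000` with the OpenAI-compatible `/v1/chat/completions` route; adjust the base URL for your deployment. The helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "trohrbaugh/GLM-4.7-heretic-fp8"


def build_chat_request(prompt, enable_thinking=False, model=MODEL):
    """Build the JSON body shown above, with thinking mode toggled per request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the (assumed) OpenAI-compatible endpoint and parse the reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_request("Hello")
print(json.dumps(payload, indent=2))
```

With a server running, `send_chat_request(payload)` should return the parsed completion; the assistant text is normally found under `choices[0].message.content`.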

VRAM requirements

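| Configuration | Approx. total VRAM | Example hardware |
| --- | --- | --- |
| TP=4 | ~370 GB | 4× RTX PRO 6000 96GB |
| TP=8 | ~370 GB | 8× A100 80GB, 8× RTX PRO 6000 96GB |

The VRAM figure is the total across all GPUs. At TP=4 that is about 92.5 GB per GPU, so 4× H100 80GB (320 GB total) is not enough; 80 GB cards need TP=8.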
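
A quick way to sanity-check a hardware choice is to divide the total by the tensor-parallel size and compare against per-GPU memory. The sketch below assumes the ~370 GB figure is the total across all GPUs and that it is split evenly; it ignores any extra headroom you may want for longer contexts or larger batches. The helper names are hypothetical.

```python
TOTAL_VRAM_GB = 370  # approximate total from the table above


def per_gpu_gb(total_gb, tensor_parallel_size):
    """Memory each GPU must provide if the total is sharded evenly."""
    return total_gb / tensor_parallel_size


def fits(total_gb, tensor_parallel_size, gpu_memory_gb):
    """True if each GPU has at least its even share of the total."""
    return per_gpu_gb(total_gb, tensor_parallel_size) <= gpu_memory_gb


for tp, gpu_gb in [(4, 80), (4, 96), (8, 80), (8, 96)]:
    share = per_gpu_gb(TOTAL_VRAM_GB, tp)
    print(f"TP={tp}, {gpu_gb} GB GPUs: need {share:.1f} GB each, fits={fits(TOTAL_VRAM_GB, tp, gpu_gb)}")
```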

Related models

| Variant | Size | Format | Link |
| --- | --- | --- | --- |
| BF16 (full precision) | ~706 GB | safetensors | trohrbaugh/GLM-4.7-heretic |
| FP8 W8A8 (this model) | ~362 GB | compressed-tensors | trohrbaugh/GLM-4.7-heretic-fp8 |

Quantization environment

  • GPU: 8× NVIDIA RTX PRO 6000 Blackwell Server Edition
  • CUDA: 13.1
  • torch: 2.11.0+cu130
  • transformers: 4.57.6
  • llm-compressor: 0.10.1-dev (main branch)
  • compressed-tensors: 0.14.0.1

Credits

Citation

@misc{5team2025glm45agenticreasoningcoding,
      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models}, 
      author={GLM Team and Aohan Zeng and Xin Lv and others},
      year={2025},
      eprint={2508.06471},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06471}, 
}