---
language:
  - en
  - zh
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
  - heretic
  - uncensored
  - decensored
  - abliterated
  - compressed-tensors
  - fp8
base_model: zai-org/GLM-4.7
---

GLM-4.7-heretic-FP8

FP8 W8A8 quantized version of trohrbaugh/GLM-4.7-heretic, a decensored variant of zai-org/GLM-4.7 made using Heretic v1.2.0+custom.

Abliteration parameters

Parameter	Value
direction_index	per layer
attn.o_proj.max_weight	1.84
attn.o_proj.max_weight_position	49.16
attn.o_proj.min_weight	1.64
attn.o_proj.min_weight_distance	26.42
mlp.down_proj.max_weight	1.02
mlp.down_proj.max_weight_position	53.46
mlp.down_proj.min_weight	0.97
mlp.down_proj.min_weight_distance	45.98
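
The four values per component can be read as a per-layer weight profile. The sketch below is an assumption about how they combine, not something this card confirms: a linear falloff from max_weight at max_weight_position down to min_weight at min_weight_distance layers away, with layers farther out left unmodified. The helper name `ablation_weight` is hypothetical.

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Hypothetical helper: ablation weight for one layer, or None if skipped.

    Assumes a linear falloff around max_weight_position (an interpretation of
    the parameters, not a documented formula).
    """
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return None  # outside the kernel: the layer is left unmodified
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# attn.o_proj values from the table above
O_PROJ = dict(max_weight=1.84, max_weight_position=49.16,
              min_weight=1.64, min_weight_distance=26.42)

for layer in (10, 23, 49, 75, 91):
    print(layer, ablation_weight(layer, **O_PROJ))
```

Under this reading, attn.o_proj is modified only in layers within about 26 positions of layer 49, and the weight never drops below min_weight inside that window.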

Abliteration performance

Metric	This model	Original model (zai-org/GLM-4.7)
KL divergence	0.0748	0 (by definition)
Refusals	0/100	99/100
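
The KL divergence row measures how far the modified model's token distribution drifts from the original's, which is why the original scores exactly 0 against itself. A minimal sketch of the quantity, using made-up three-token distributions; the direction KL(original || modified) and the way results are averaged over prompts are assumptions here, not details taken from this card:

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


original = [0.70, 0.20, 0.10]   # illustrative probabilities, not measured values
modified = [0.60, 0.25, 0.15]

print(kl_divergence(original, original))  # 0.0
print(kl_divergence(original, modified))
```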

FP8 Quantization

Quantized with llm-compressor (v0.10.1-dev, main branch) into a compressed-tensors checkpoint that vLLM supports natively. No --quantization flag or patches are needed.

Quantization recipe

Scheme: FP8 W8A8, with static per-channel FP8 E4M3 weights (minmax observer) and dynamic per-token FP8 E4M3 activations
Calibration: 1024 samples from fineweb-edu-score-2, max sequence length 4096
SmoothQuant: Not used (can interfere with MoE expert routing)
Format: compressed-tensors (auto-detected by vLLM)
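
To make the difference between the static per-channel and dynamic per-token scales concrete, here is a small pure-Python sketch. It only shows how symmetric absmax scales are derived and applied; it does not perform real E4M3 rounding, and the function names are illustrative rather than llm-compressor APIs.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (the "fn" variant)


def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` onto the FP8 E4M3 range."""
    peak = max(abs(v) for v in values)
    return peak / FP8_E4M3_MAX if peak > 0 else 1.0


def per_channel_weight_scales(weight):
    """Static: one scale per output channel (row), computed once at quantization time."""
    return [absmax_scale(row) for row in weight]


def per_token_activation_scales(activations):
    """Dynamic: one scale per token (row), recomputed at inference for each input."""
    return [absmax_scale(token) for token in activations]


def to_fp8_range(row, scale):
    """Divide by the scale and clamp; real kernels then round to the nearest E4M3 value."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]


weight = [[0.5, -2.0], [0.1, 0.05]]  # toy 2x2 weight, rows = output channels
scales = per_channel_weight_scales(weight)
print(scales)
print([to_fp8_range(row, s) for row, s in zip(weight, scales)])
```

Because each row gets its own scale, a channel with small weights (the second row) still uses the full FP8 range instead of being crushed by the larger channel next to it.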

Precision map

Component	Precision	Rationale
Routed expert weights (160 experts × 89 MoE layers)	FP8 E4M3	Bulk of the model; per-channel static scaling via calibration
Attention projections (q/k/v/o)	FP8 E4M3	GQA with 96Q / 8KV heads, head_dim=128
Shared expert weights	FP8 E4M3	Active every token, well-calibrated
Dense MLP (layers 0–2)	FP8 E4M3	Only 3 dense layers
Attention biases (q/k/v)	BF16	Small tensors, sensitive to precision loss
Router/gate weights	BF16	Routing errors cascade through all downstream computation
MoE e_score_correction_bias	BF16	Critical for expert load balancing
RMSNorm / QK norms	BF16	Negligible size, high sensitivity
Embeddings / LM head	BF16	Standard practice for quantized models
MTP head (layer 92: enorm, hnorm, eh_proj)	BF16	Speculative decoding head, kept full precision

Ignore patterns used

IGNORE_PATTERNS = [
    "re:.*embed_tokens.*",
    "lm_head",
    "re:.*layernorm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
    "model.norm",
    "re:.*self_attn\\.q_proj\\.bias",
    "re:.*self_attn\\.k_proj\\.bias",
    "re:.*self_attn\\.v_proj\\.bias",
    "re:.*mlp\\.gate$",
    "re:.*mlp\\.gate\\.weight",
    "re:.*mlp\\.gate\\.e_score_correction_bias",
    "re:.*\\.enorm",
    "re:.*\\.hnorm",
    "re:.*\\.eh_proj",
    "re:.*shared_head\\.norm",
]
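
As a sanity check on what these patterns keep in BF16, the following sketch matches them against a few module names. It assumes the usual convention that entries prefixed with "re:" are regexes anchored at the start of the name and all other entries must match exactly; the real matcher may also compare against module class names, and the example module paths are illustrative guesses at GLM-style names.

```python
import re

IGNORE_PATTERNS = [
    "re:.*embed_tokens.*",
    "lm_head",
    "re:.*layernorm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
    "model.norm",
    "re:.*self_attn\\.q_proj\\.bias",
    "re:.*self_attn\\.k_proj\\.bias",
    "re:.*self_attn\\.v_proj\\.bias",
    "re:.*mlp\\.gate$",
    "re:.*mlp\\.gate\\.weight",
    "re:.*mlp\\.gate\\.e_score_correction_bias",
    "re:.*\\.enorm",
    "re:.*\\.hnorm",
    "re:.*\\.eh_proj",
    "re:.*shared_head\\.norm",
]


def is_ignored(name, patterns=IGNORE_PATTERNS):
    """Return True if `name` matches any ignore pattern (simplified matcher)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            return True
    return False


# Hypothetical module paths, chosen to exercise the router-vs-expert distinction.
for name in [
    "model.layers.5.mlp.gate",                 # router: stays BF16
    "model.layers.5.mlp.experts.0.gate_proj",  # routed expert: quantized
    "model.layers.0.self_attn.q_proj",         # attention projection: quantized
    "lm_head",                                 # stays BF16
]:
    print(name, is_ignored(name))
```

The `$` anchor on the `mlp\\.gate` pattern is what keeps the router out of FP8 without also excluding every `gate_proj` inside the experts.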

Serving

vLLM (recommended)

vLLM auto-detects the compressed-tensors FP8 format from config.json. No --quantization flag required.

vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

To disable thinking mode (shorter, faster responses):

vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": false}'

Or disable per-request:

{
  "model": "trohrbaugh/GLM-4.7-heretic-fp8",
  "messages": [{"role": "user", "content": "Hello"}],
  "chat_template_kwargs": {"enable_thinking": false}
}
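
The same per-request payload can be sent from Python with only the standard library. This is a sketch that assumes the server from the commands above is reachable at vLLM's default address, http://localhost:8000, through its OpenAI-compatible chat completions route; `build_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port


def build_request(prompt, enable_thinking=False, base_url=BASE_URL):
    """Build a chat completions request that toggles thinking for this call only."""
    payload = {
        "model": "trohrbaugh/GLM-4.7-heretic-fp8",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, enable_thinking=False):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt, enable_thinking)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("Hello")` returns the reply with thinking disabled, and `chat("Hello", enable_thinking=True)` turns it back on for that request.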

VRAM requirements

Configuration	Approx. VRAM	Example hardware
TP=4	~370 GB	4× RTX PRO 6000 96GB (4× 80 GB cards provide only 320 GB in total)
TP=8	~370 GB	8× A100 80GB, 8× RTX PRO 6000 96GB
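
A quick way to check a candidate setup against the table is to divide the total by the tensor-parallel size. The sketch below assumes memory is split evenly across GPUs and treats ~370 GB as the whole budget; it ignores how KV cache grows with context length and concurrency, so real headroom will be smaller.

```python
def per_gpu_share_gb(total_gb, tensor_parallel_size):
    """Approximate memory each GPU must hold, assuming an even split."""
    return total_gb / tensor_parallel_size


def fits(total_gb, tensor_parallel_size, gpu_memory_gb):
    """True if the even per-GPU share is within one card's memory."""
    return per_gpu_share_gb(total_gb, tensor_parallel_size) <= gpu_memory_gb


TOTAL_GB = 370  # approximate total from the table above

for tp, gpu_gb in [(4, 96), (8, 80), (8, 96)]:
    share = per_gpu_share_gb(TOTAL_GB, tp)
    print(f"TP={tp}, {gpu_gb} GB cards: {share:.1f} GB per GPU, fits={fits(TOTAL_GB, tp, gpu_gb)}")
```

Under this estimate, TP=4 needs about 92.5 GB per GPU, so it only works on cards with more memory than that, while TP=8 needs about 46 GB per GPU.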

Related models

Variant	Size	Format	Link
BF16 (full precision)	~706 GB	safetensors	trohrbaugh/GLM-4.7-heretic
FP8 W8A8 (this model)	~362 GB	compressed-tensors	trohrbaugh/GLM-4.7-heretic-fp8

Quantization environment

GPU: 8× NVIDIA RTX PRO 6000 Blackwell Server Edition
CUDA: 13.1
torch: 2.11.0+cu130
transformers: 4.57.6
llm-compressor: 0.10.1-dev (main branch)
compressed-tensors: 0.14.0.1

Credits

Z.ai / THUDM for GLM-4.7
P-E-W for the Heretic abliteration engine
vLLM team for llm-compressor

Citation

@misc{5team2025glm45agenticreasoningcoding,
      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models}, 
      author={GLM Team and Aohan Zeng and Xin Lv and others},
      year={2025},
      eprint={2508.06471},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06471}, 
}