---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- heretic
- uncensored
- decensored
- abliterated
- compressed-tensors
- fp8
base_model: zai-org/GLM-4.7
---
# GLM-4.7-heretic-FP8
FP8 W8A8 quantized version of [trohrbaugh/GLM-4.7-heretic](https://huggingface.co/trohrbaugh/GLM-4.7-heretic), a decensored version of [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) produced with [Heretic](https://github.com/p-e-w/heretic) v1.2.0+custom.
## Abliteration parameters
| Parameter | Value |
| :-------- | :---: |
| **direction_index** | per layer |
| **attn.o_proj.max_weight** | 1.84 |
| **attn.o_proj.max_weight_position** | 49.16 |
| **attn.o_proj.min_weight** | 1.64 |
| **attn.o_proj.min_weight_distance** | 26.42 |
| **mlp.down_proj.max_weight** | 1.02 |
| **mlp.down_proj.max_weight_position** | 53.46 |
| **mlp.down_proj.min_weight** | 0.97 |
| **mlp.down_proj.min_weight_distance** | 45.98 |
## Abliteration performance
| Metric | This model | Original model ([zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7)) |
| :----- | :--------: | :---------------------------: |
| **KL divergence** | 0.0748 | 0 *(by definition)* |
| **Refusals** | 0/100 | 99/100 |
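The KL divergence above quantifies how far the abliterated model's output distribution drifts from the original's; lower means less behavioral change. As a self-contained illustration of the metric itself (not Heretic's implementation), for discrete distributions:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions, in nats.
    Terms with p_i == 0 contribute nothing by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary.
p = [0.70, 0.20, 0.10]
q = [0.65, 0.25, 0.10]

print(kl_divergence(p, p))  # identical distributions -> 0.0
print(kl_divergence(p, q))  # small positive drift (~0.007)
```

A KL of 0.0748 against the original model is in the same "small drift" regime: the decensored model's token distribution stays close to the original on ordinary prompts.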
## FP8 Quantization
Quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor) (v0.10.1-dev, main branch) to produce a **compressed-tensors** format checkpoint natively supported by vLLM — no `--quantization` flag or patches needed.
### Quantization recipe
- **Scheme:** FP8 (W8A8) — static per-channel FP8 E4M3 weights with minmax observer, dynamic per-token FP8 E4M3 activations
- **Calibration:** 1024 samples from [fineweb-edu-score-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2), max sequence length 4096
- **SmoothQuant:** Not used (can interfere with MoE expert routing)
- **Format:** `compressed-tensors` (auto-detected by vLLM)
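This recipe corresponds closely to llm-compressor's `FP8_DYNAMIC` preset (static per-channel E4M3 weights, dynamic per-token E4M3 activations). A hedged sketch of what the driving script might look like; `IGNORE_PATTERNS` and `calibration_dataset` are placeholders, and this is not the exact script used for this checkpoint:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8_DYNAMIC: static per-channel FP8 E4M3 weights (minmax observer),
# dynamic per-token FP8 E4M3 activations, matching the recipe above.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=IGNORE_PATTERNS,  # placeholder: the ignore list from this card
)

oneshot(
    model="trohrbaugh/GLM-4.7-heretic",
    recipe=recipe,
    dataset=calibration_dataset,  # placeholder: 1024 fineweb-edu samples, seq len 4096
    output_dir="GLM-4.7-heretic-fp8",
)
```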
### Precision map
| Component | Precision | Rationale |
| :-------- | :-------: | :-------- |
| Routed expert weights (160 experts × 89 MoE layers) | FP8 E4M3 | Bulk of the model; static per-channel scales from calibration |
| Attention projections (q/k/v/o) | FP8 E4M3 | GQA with 96Q / 8KV heads, head_dim=128 |
| Shared expert weights | FP8 E4M3 | Active every token, well-calibrated |
| Dense MLP (layers 0–2) | FP8 E4M3 | Only 3 dense layers |
| Attention biases (q/k/v) | BF16 | Small tensors, sensitive to precision loss |
| Router/gate weights | BF16 | Routing errors cascade through all downstream computation |
| MoE e_score_correction_bias | BF16 | Critical for expert load balancing |
| RMSNorm / QK norms | BF16 | Negligible size, high sensitivity |
| Embeddings / LM head | BF16 | Standard practice for quantized models |
| MTP head (layer 92: enorm, hnorm, eh_proj) | BF16 | Speculative decoding head, kept full precision |
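"Per-channel static scaling" means one scale per output channel, derived from that channel's observed absolute maximum. A rough NumPy sketch of the idea; integer rounding stands in for FP8 E4M3's nonuniform grid, whereas real kernels round to the nearest representable E4M3 value:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_per_channel_quant(w):
    """Simulate static per-channel symmetric quantization of a Linear
    weight matrix (out_channels x in_channels): one scale per output row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / E4M3_MAX  # minmax observer
    q = np.clip(np.round(w / scale), -E4M3_MAX, E4M3_MAX)    # quantize
    return q * scale, scale                                  # dequantize + scales

w = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
w_dq, scale = fp8_per_channel_quant(w)
print(np.abs(w - w_dq).max())  # rounding error bounded by scale / 2 per channel
```

Per-channel scales keep an outlier in one output channel from degrading the resolution of every other channel, which is why the bulk expert weights tolerate FP8 well.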
### Ignore patterns used
```python
IGNORE_PATTERNS = [
"re:.*embed_tokens.*",
"lm_head",
"re:.*layernorm.*",
"re:.*q_norm.*",
"re:.*k_norm.*",
"model.norm",
"re:.*self_attn\\.q_proj\\.bias",
"re:.*self_attn\\.k_proj\\.bias",
"re:.*self_attn\\.v_proj\\.bias",
"re:.*mlp\\.gate$",
"re:.*mlp\\.gate\\.weight",
"re:.*mlp\\.gate\\.e_score_correction_bias",
"re:.*\\.enorm",
"re:.*\\.hnorm",
"re:.*\\.eh_proj",
"re:.*shared_head\\.norm",
]
```
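To my understanding of compressed-tensors ignore lists, entries prefixed with `re:` are treated as regexes over module/parameter names, while plain entries match exactly. A minimal stand-alone sketch of that matching rule, using a representative subset of the list above:

```python
import re

# Representative subset of the ignore list above.
patterns = [
    "re:.*mlp\\.gate$",
    "lm_head",
    "re:.*self_attn\\.q_proj\\.bias",
]

def is_ignored(name, patterns):
    """Return True if a module/parameter name hits any ignore entry:
    're:'-prefixed entries are regexes, everything else matches exactly."""
    for pat in patterns:
        if pat.startswith("re:"):
            if re.match(pat[3:], name):
                return True
        elif name == pat:
            return True
    return False

print(is_ignored("model.layers.5.mlp.gate", patterns))          # True
print(is_ignored("model.layers.5.mlp.gate_up_proj", patterns))  # False ($ anchors the match)
print(is_ignored("lm_head", patterns))                          # True
```

Note the trailing `$` on the gate pattern: without it, `mlp.gate_up_proj` (a quantizable projection) would be skipped along with the router.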
## Serving
### vLLM (recommended)
vLLM auto-detects the compressed-tensors FP8 format from `config.json`. No `--quantization` flag required.
```shell
vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
--tensor-parallel-size 4 \
--max-model-len 131072 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
```
To disable thinking mode (shorter, faster responses):
```shell
vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
--tensor-parallel-size 4 \
--max-model-len 131072 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--default-chat-template-kwargs '{"enable_thinking": false}'
```
Or disable per-request:
```json
{
"model": "trohrbaugh/GLM-4.7-heretic-fp8",
"messages": [{"role": "user", "content": "Hello"}],
"chat_template_kwargs": {"enable_thinking": false}
}
```
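The same per-request toggle from Python, as a sketch; the endpoint URL and the commented-out `requests.post` call assume a default local vLLM deployment on port 8000:

```python
import json

# Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "trohrbaugh/GLM-4.7-heretic-fp8",
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
body = json.dumps(payload)

# With a server running (assumed local deployment):
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions",
#                   headers={"Content-Type": "application/json"}, data=body)
print(body)
```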
### VRAM requirements
| Configuration | Approx. total VRAM (all GPUs) | Example hardware |
| :------------ | :----------: | :--------------- |
| TP=4 | ~370 GB | 4× H100 80GB, 4× RTX PRO 6000 96GB |
| TP=8 | ~370 GB | 8× A100 80GB, 8× RTX PRO 6000 96GB |
## Related models
| Variant | Size | Format | Link |
| :------ | :--: | :----- | :--- |
| BF16 (full precision) | ~706 GB | safetensors | [trohrbaugh/GLM-4.7-heretic](https://huggingface.co/trohrbaugh/GLM-4.7-heretic) |
| FP8 W8A8 (this model) | ~362 GB | compressed-tensors | [trohrbaugh/GLM-4.7-heretic-fp8](https://huggingface.co/trohrbaugh/GLM-4.7-heretic-fp8) |
## Quantization environment
- **GPU:** 8× NVIDIA RTX PRO 6000 Blackwell Server Edition
- **CUDA:** 13.1
- **torch:** 2.11.0+cu130
- **transformers:** 4.57.6
- **llm-compressor:** 0.10.1-dev (main branch)
- **compressed-tensors:** 0.14.0.1
## Credits
- [Z.ai / THUDM](https://huggingface.co/zai-org) for GLM-4.7
- [P-E-W](https://github.com/p-e-w/heretic) for the Heretic abliteration engine
- [vLLM team](https://github.com/vllm-project/llm-compressor) for llm-compressor
## Citation
```bibtex
@misc{5team2025glm45agenticreasoningcoding,
title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
author={GLM Team and Aohan Zeng and Xin Lv and others},
year={2025},
eprint={2508.06471},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.06471},
}
```