Libraries

How to use RadicalNotionAI/GLM-4.7-heretic-fp8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RadicalNotionAI/GLM-4.7-heretic-fp8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RadicalNotionAI/GLM-4.7-heretic-fp8")
model = AutoModelForCausalLM.from_pretrained("RadicalNotionAI/GLM-4.7-heretic-fp8")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use RadicalNotionAI/GLM-4.7-heretic-fp8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RadicalNotionAI/GLM-4.7-heretic-fp8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RadicalNotionAI/GLM-4.7-heretic-fp8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/RadicalNotionAI/GLM-4.7-heretic-fp8

SGLang

How to use RadicalNotionAI/GLM-4.7-heretic-fp8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RadicalNotionAI/GLM-4.7-heretic-fp8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RadicalNotionAI/GLM-4.7-heretic-fp8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RadicalNotionAI/GLM-4.7-heretic-fp8" \
        --host 0.0.0.0 \
        --port 30000

GLM-4.7-heretic-FP8

FP8 W8A8 quantized version of trohrbaugh/GLM-4.7-heretic, a decensored version of zai-org/GLM-4.7 made using Heretic v1.2.0+custom.

Abliteration parameters

Parameter	Value
direction_index	per layer
attn.o_proj.max_weight	1.84
attn.o_proj.max_weight_position	49.16
attn.o_proj.min_weight	1.64
attn.o_proj.min_weight_distance	26.42
mlp.down_proj.max_weight	1.02
mlp.down_proj.max_weight_position	53.46
mlp.down_proj.min_weight	0.97
mlp.down_proj.min_weight_distance	45.98
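
The weight parameters above describe, for each component, how strongly the refusal direction is ablated at each layer. Below is a minimal sketch of that kernel. It assumes the shape described in the Heretic documentation: a peak of `max_weight` at `max_weight_position`, falling linearly to `min_weight` at a distance of `min_weight_distance` layers, with no ablation beyond that distance. The function name `ablation_weight` is hypothetical, and this is not Heretic's actual code.

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Ablation strength for one layer under the assumed linear kernel."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # assumption: layers outside the kernel are left untouched
    # Linear interpolation from max_weight (distance 0) to min_weight (kernel edge).
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# attn.o_proj values copied from the table above
O_PROJ = dict(max_weight=1.84, max_weight_position=49.16,
              min_weight=1.64, min_weight_distance=26.42)

for layer in (10, 30, 49, 70, 91):
    print(layer, round(ablation_weight(layer, **O_PROJ), 3))
```

Under this reading, the fractional `max_weight_position` simply places the peak between two layers, and layers further than `min_weight_distance` from it receive no ablation for that component.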

Abliteration performance

Metric	This model	Original model (zai-org/GLM-4.7)
KL divergence	0.0748	0 (by definition)
Refusals	0/100	99/100
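
The KL divergence row measures how far this model's output distribution drifts from the original's, so lower values mean less collateral change. Here is a minimal sketch of the quantity on a toy next-token distribution. It assumes the divergence is taken as KL(original || this model) over token probabilities. The direction, the prompt set, and the toy numbers are assumptions for illustration and are not taken from Heretic.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


original = [0.70, 0.20, 0.10]   # toy next-token probabilities (illustrative only)
modified = [0.60, 0.25, 0.15]

print(kl_divergence(original, original))  # identical distributions
print(kl_divergence(original, modified))  # small positive drift
```

This is why the original model scores exactly 0 in the table: every log ratio is log(1).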

FP8 Quantization

Quantized using llm-compressor (v0.10.1-dev, main branch) to produce a compressed-tensors checkpoint that vLLM loads natively, with no --quantization flag or patches needed.

Quantization recipe

Scheme: FP8 (W8A8) — static per-channel FP8 E4M3 weights with minmax observer, dynamic per-token FP8 E4M3 activations
Calibration: 1024 samples from fineweb-edu-score-2, max sequence length 4096
SmoothQuant: Not used (can interfere with MoE expert routing)
Format: compressed-tensors (auto-detected by vLLM)
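
The scheme above pairs one static scale per weight output channel with one scale per token computed at run time. What follows is a rough sketch of how such scales are derived under min-max scaling. It assumes symmetric quantization and the FP8 E4M3 maximum finite value of 448. The helper names are hypothetical, and real kernels also round each scaled value to the nearest representable FP8 number, which is omitted here.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 (float8_e4m3fn) format


def absmax_scales(rows):
    """One symmetric scale per row: the row's max |x| maps to FP8_E4M3_MAX.

    An all-zero row gets scale 1.0 to avoid dividing by zero later.
    """
    return [(max(abs(x) for x in row) / FP8_E4M3_MAX) or 1.0 for row in rows]


def scale_and_clamp(rows, scales):
    """Divide each row by its scale and clamp into the FP8 range (no rounding)."""
    return [[max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / s)) for x in row]
            for row, s in zip(rows, scales)]


# Weights: rows are output channels; scales are fixed once (static).
weight = [[0.02, -0.50, 0.10], [1.20, 0.30, -0.05]]
w_scales = absmax_scales(weight)

# Activations: rows are tokens; scales are recomputed for every input (dynamic).
activations = [[3.0, -7.0, 1.5], [0.2, 0.1, -0.4]]
a_scales = absmax_scales(activations)

print(w_scales)
print(scale_and_clamp(activations, a_scales))
```

Per-row scaling means a single large channel or token does not compress the usable range of the others, which is the usual motivation for per-channel and per-token granularity.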

Precision map

Component	Precision	Rationale
Routed expert weights (160 experts × 89 MoE layers)	FP8 E4M3	Bulk of model — per-channel static scaling via calibration
Attention projections (q/k/v/o)	FP8 E4M3	GQA with 96Q / 8KV heads, head_dim=128
Shared expert weights	FP8 E4M3	Active every token, well-calibrated
Dense MLP (layers 0–2)	FP8 E4M3	Only 3 dense layers
Attention biases (q/k/v)	BF16	Small tensors, sensitive to precision loss
Router/gate weights	BF16	Routing errors cascade through all downstream computation
MoE e_score_correction_bias	BF16	Critical for expert load balancing
RMSNorm / QK norms	BF16	Negligible size, high sensitivity
Embeddings / LM head	BF16	Standard practice for quantized models
MTP head (layer 92: enorm, hnorm, eh_proj)	BF16	Speculative decoding head, kept full precision

Ignore patterns used

IGNORE_PATTERNS = [
    "re:.*embed_tokens.*",
    "lm_head",
    "re:.*layernorm.*",
    "re:.*q_norm.*",
    "re:.*k_norm.*",
    "model.norm",
    "re:.*self_attn\\.q_proj\\.bias",
    "re:.*self_attn\\.k_proj\\.bias",
    "re:.*self_attn\\.v_proj\\.bias",
    "re:.*mlp\\.gate$",
    "re:.*mlp\\.gate\\.weight",
    "re:.*mlp\\.gate\\.e_score_correction_bias",
    "re:.*\\.enorm",
    "re:.*\\.hnorm",
    "re:.*\\.eh_proj",
    "re:.*shared_head\\.norm",
]
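
A small sketch of how entries like these select modules. It assumes the compressed-tensors convention that a `re:` prefix marks a regular expression matched from the start of the module name, while any other entry must equal the name exactly. `is_ignored` and the shortened pattern list are illustrative, and the module names are examples in the usual Transformers naming scheme.

```python
import re

PATTERNS = [
    "re:.*embed_tokens.*",
    "lm_head",
    "re:.*layernorm.*",
    "re:.*mlp\\.gate$",
    "re:.*\\.eh_proj",
]


def is_ignored(module_name, patterns=PATTERNS):
    """True if any pattern selects the module (assumed matching rules)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


for name in [
    "model.layers.5.mlp.gate",                 # router: kept in BF16
    "model.layers.0.mlp.gate_proj",            # dense MLP projection: quantized
    "model.layers.5.mlp.experts.0.gate_proj",  # routed expert: quantized
    "lm_head",                                 # output head: kept in BF16
]:
    print(name, "->", "ignored" if is_ignored(name) else "quantized")
```

The `$` anchor in the router pattern is what keeps it from also catching `mlp.gate_proj` in the dense layers.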

Serving

vLLM (recommended)

vLLM auto-detects the compressed-tensors FP8 format from config.json. No --quantization flag required.

vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice

To disable thinking mode (shorter, faster responses):

vllm serve trohrbaugh/GLM-4.7-heretic-fp8 \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": false}'

Or disable per-request:

{
  "model": "trohrbaugh/GLM-4.7-heretic-fp8",
  "messages": [{"role": "user", "content": "Hello"}],
  "chat_template_kwargs": {"enable_thinking": false}
}
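
The per-request form can be sent from Python with only the standard library. The sketch below assumes the server started above is listening on vLLM's default address, `http://localhost:8000`. `build_request` and `chat` are hypothetical helpers, and nothing is sent until `chat` is called.

```python
import json
import urllib.request


def build_request(prompt, enable_thinking=False,
                  base_url="http://localhost:8000",  # assumed default vLLM address
                  model="trohrbaugh/GLM-4.7-heretic-fp8"):
    """Build an OpenAI-compatible chat request that sets enable_thinking."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("Hello")` returns the assistant's reply with thinking disabled, and `chat("Hello", enable_thinking=True)` turns it back on for that request only.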

VRAM requirements

Configuration	Approx. total VRAM	Example hardware
TP=4	~370 GB	4× RTX PRO 6000 96GB
TP=8	~370 GB	8× A100 80GB, 8× RTX PRO 6000 96GB

Related models

Variant	Size	Format	Link
BF16 (full precision)	~706 GB	safetensors	trohrbaugh/GLM-4.7-heretic
FP8 W8A8 (this model)	~362 GB	compressed-tensors	trohrbaugh/GLM-4.7-heretic-fp8

Quantization environment

GPU: 8× NVIDIA RTX PRO 6000 Blackwell Server Edition
CUDA: 13.1
torch: 2.11.0+cu130
transformers: 4.57.6
llm-compressor: 0.10.1-dev (main branch)
compressed-tensors: 0.14.0.1

Credits

Z.ai / THUDM for GLM-4.7
P-E-W for the Heretic abliteration engine
vLLM team for llm-compressor

Citation

@misc{5team2025glm45agenticreasoningcoding,
      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models}, 
      author={GLM Team and Aohan Zeng and Xin Lv and others},
      year={2025},
      eprint={2508.06471},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06471}, 
}