GLM-4.7-Flash-heretic NVFP4

NVFP4 post-training quantization of Olafangensan/GLM-4.7-Flash-heretic for long-context multi-GPU inference with vLLM.

This release uses NVFP4 (4-bit) quantization, not 8-bit quantization.

Model Size

  • Base architecture: GLM-4.7-Flash (30B-A3B MoE)
  • Parameter count for this release: unchanged from the base model (quantization does not change the parameter count)
  • Note: the ~17.8 GB model.safetensors file is the quantized checkpoint size and does not mean the model has 18B parameters.
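A back-of-the-envelope size check makes the point (a sketch; it assumes 4-bit weights plus one 1-byte scale per group of 16 and ignores embeddings and other higher-precision tensors, so real checkpoints run somewhat larger):

```python
# Rough NVFP4 checkpoint-size estimate for a 30B-parameter model.
# Assumption: 4 bits per quantized weight plus per-group scale factors
# (group_size=16 -> roughly one extra byte per 16 weights).

def nvfp4_checkpoint_gb(n_params: float, group_size: int = 16) -> float:
    weight_bytes = n_params * 4 / 8          # 4-bit packed weights
    scale_bytes = n_params / group_size      # one scale per weight group
    return (weight_bytes + scale_bytes) / 1e9

print(round(nvfp4_checkpoint_gb(30e9), 1))  # -> 16.9
```

With overhead from tensors kept at higher precision, ~16.9 GB lands close to the observed ~17.8 GB file, while a genuine 18B model stored in BF16 would be roughly 36 GB.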

Runtime Compatibility

Known issue on some stock vLLM 0.16.x + vllm-node setups:

  • assistant content may be null
  • output may be dumped into reasoning fields with broken formatting
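If you hit this on an affected setup, a small client-side fallback can recover the text (a hedged sketch; the field names follow the OpenAI-style response shape, and the exact reasoning field name on your runtime may differ):

```python
def extract_assistant_text(response: dict) -> str:
    """Return assistant text from a chat completion, falling back to
    reasoning fields when content comes back null on broken runtimes."""
    msg = response["choices"][0]["message"]
    content = msg.get("content")
    if content:  # normal, well-formed case
        return content
    # Fallback: some broken setups dump the output into a reasoning field.
    return msg.get("reasoning_content") or msg.get("reasoning") or ""
```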

Recommended runtime (validated)

Use the GLM-compatible runtime variant used during validation:

docker run --rm --gpus all --ipc=host \
  -v /path/to/models:/models \
  -p 8000:8000 \
  vllm-glm:nightly-glm-scalefix \
  --model /models/GLM-4.7-Flash-heretic-NVFP4 \
  --served-model-name glm-4.7-flash \
  --tensor-parallel-size 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --generation-config vllm \
  --override-generation-config '{"temperature": 0.7, "top_p": 1.0}'

If you must use stock vLLM 0.16.x

Use a compatibility profile first:

  • remove --reasoning-parser glm45
  • set --default-chat-template-kwargs '{"enable_thinking": false}'

This usually forces normal assistant text into the content field.
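Applied to the recommended command, the compatibility profile looks like this (a sketch; the image tag is an assumption, and paths are placeholders):

```shell
# Stock vLLM 0.16.x compatibility profile:
#  - --reasoning-parser glm45 removed
#  - thinking disabled so assistant text lands in content
docker run --rm --gpus all --ipc=host \
  -v /path/to/models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:v0.16.0 \
  --model /models/GLM-4.7-Flash-heretic-NVFP4 \
  --served-model-name glm-4.7-flash \
  --tensor-parallel-size 4 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --generation-config vllm \
  --override-generation-config '{"temperature": 0.7, "top_p": 1.0}'
```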

Repository Contents

This model repo intentionally contains only serving-required artifacts:

  • model.safetensors
  • config.json
  • generation_config.json
  • tokenizer.json
  • tokenizer_config.json
  • chat_template.jinja
  • hf_quant_config.json
  • README.md
  • QUANTIZATION.md
  • LICENSE

No training checkpoints, raw calibration corpora, or temporary files are included.

License and Provenance

  • Base model: Olafangensan/GLM-4.7-Flash-heretic
  • Upstream lineage: decensored derivative of zai-org/GLM-4.7-Flash
  • Base license: MIT (per upstream model card)
  • This repo: quantized derivative for inference; no architecture changes

Please review and comply with upstream licenses and terms for your use case.

Reproducibility

Quantization recipe, command, and environment details are documented in QUANTIZATION.md.

At a glance:

  • Quantization method: ModelOpt NVFP4 (group_size=16, lm_head excluded)
  • Calibration mix: switch_turnflow_sanitized, open_code_reasoning
  • Calibration sizes: 1536, 512 (sequence length 2048)
  • Export format: Hugging Face

Quick Start (OpenAI-compatible)

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 256
  }'
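The same request from Python, using only the standard library (a sketch; the endpoint and model name come from the curl example above):

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    # Mirrors the curl example: same served model name and token budget.
    return {
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str, base_url: str = "http://127.0.0.1:8000/v1") -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```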

Integrity

  • model.safetensors SHA256: df3cf9115e5a648b31f2bd3c5acd5184bb2134cbd7ab592884d8f57a2dab4f1a
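To verify the download, you can recompute the digest in chunks so large files don't need to fit in memory (a minimal sketch; the file path is a placeholder):

```python
import hashlib

EXPECTED = "df3cf9115e5a648b31f2bd3c5acd5184bb2134cbd7ab592884d8f57a2dab4f1a"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# assert sha256_of("model.safetensors") == EXPECTED
```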