# GLM-4.7-Flash-heretic NVFP4

NVFP4 post-training quantization of Olafangensan/GLM-4.7-Flash-heretic for long-context multi-GPU inference with vLLM.

This release uses NVFP4 (4-bit) quantization, not 8-bit quantization.
## Model Size

- Base architecture: GLM-4.7-Flash (30B-A3B MoE)
- Parameter count for this release: unchanged from the base model architecture
- Note: the ~17.8 GB `model.safetensors` file size is the quantized checkpoint size and does not mean the model has 18B parameters.
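The note above can be sanity-checked with rough arithmetic: at 4 bits per weight plus one scale per group, a ~30B-parameter model lands near the observed file size. The numbers below are illustrative estimates (plain byte counting, ignoring embeddings, `lm_head`, and format overhead), not exact tensor accounting.

```python
# Back-of-envelope: why a ~17.8 GB checkpoint is consistent with ~30B parameters.
params = 30e9                  # ~30B total parameters (MoE, ~3B active)
weight_bytes = params * 4 / 8  # NVFP4 stores 4 bits per weight
scale_bytes = params / 16      # roughly one 1-byte scale per group of 16 weights

total_gb = (weight_bytes + scale_bytes) / 1e9
print(f"~{total_gb:.1f} GB before overheads")  # ~16.9 GB
```

The remaining gap to ~17.8 GB is plausibly explained by tensors kept in higher precision (e.g. the excluded `lm_head`).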
## Runtime Compatibility

Known issue on some stock vLLM 0.16.x + vllm-node setups:

- assistant `content` may be `null`
- output may be dumped into reasoning fields with broken formatting
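A client-side guard can soften this failure mode. The sketch below prefers normal assistant content and falls back to a reasoning field; the field name `reasoning_content` follows vLLM's reasoning-parser output but may differ on your setup, so treat this as an assumption, not a guaranteed schema.

```python
# Minimal guard for the null-content failure mode described above.
def extract_text(message: dict) -> str:
    """Prefer normal assistant content; fall back to the reasoning field."""
    content = message.get("content")
    if content:
        return content
    # Assumed field name; verify against your server's actual response schema.
    return message.get("reasoning_content") or ""

# Example: a response where content came back null
msg = {"role": "assistant", "content": None, "reasoning_content": "hello"}
print(extract_text(msg))  # hello
```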
### Recommended runtime (validated)

Use the GLM-compatible runtime variant used during validation:

```shell
docker run --rm --gpus all --ipc=host \
  -v /path/to/models:/models -p 8000:8000 \
  vllm-glm:nightly-glm-scalefix \
  --model /models/GLM-4.7-Flash-heretic-NVFP4 \
  --served-model-name glm-4.7-flash \
  --tensor-parallel-size 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --generation-config vllm \
  --override-generation-config '{"temperature": 0.7, "top_p": 1.0}'
```
### If you must use stock vLLM 0.16.x

Use a compatibility profile first:

- remove `--reasoning-parser glm45`
- set `--default-chat-template-kwargs '{"enable_thinking": false}'`

This usually forces normal assistant text into `content`.
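The two edits above can be applied mechanically to the recommended flag list. The helper below is purely illustrative (the function name and approach are not part of vLLM); it just demonstrates which flags change between the two profiles.

```python
# Hypothetical helper: derive the stock-vLLM-0.16.x compatibility profile from
# the recommended launch flags by applying the two changes listed above.
def to_stock_profile(flags):
    """Drop --reasoning-parser and disable thinking in the chat-template kwargs."""
    out = []
    it = iter(flags)
    for flag in it:
        if flag == "--reasoning-parser":
            next(it, None)  # skip the flag's value as well
        elif flag == "--default-chat-template-kwargs":
            next(it, None)  # replace the original value
            out += [flag, '{"enable_thinking": false}']
        else:
            out.append(flag)
    return out

recommended = [
    "--reasoning-parser", "glm45",
    "--default-chat-template-kwargs", '{"enable_thinking": true}',
]
print(to_stock_profile(recommended))
# ['--default-chat-template-kwargs', '{"enable_thinking": false}']
```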
## Repository Contents

This model repo intentionally contains only serving-required artifacts:

- `model.safetensors`
- `config.json`
- `generation_config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `chat_template.jinja`
- `hf_quant_config.json`
- `README.md`
- `QUANTIZATION.md`
- `LICENSE`

No training checkpoints, raw calibration corpora, or temporary files are included.
## License and Provenance

- Base model: `Olafangensan/GLM-4.7-Flash-heretic`
- Upstream lineage: decensored derivative of `zai-org/GLM-4.7-Flash`
- Base license: MIT (per upstream model card)
- This repo: quantized derivative for inference; no architecture changes

Please review and comply with upstream licenses and terms for your use case.
## Reproducibility

Quantization recipe, command, and environment details are documented in QUANTIZATION.md.

At a glance:

- Quantization method: ModelOpt NVFP4 (`group_size=16`, `lm_head` excluded)
- Calibration mix: `switch_turnflow_sanitized`, `open_code_reasoning`
- Calibration sizes: `1536`, `512` (sequence length `2048`)
- Export format: Hugging Face
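To illustrate what `group_size=16` means in practice, here is a toy sketch of group-wise 4-bit quantization: each group of 16 weights shares one scale factor. This is a simplified stand-in (plain int4 grid with a per-group absmax scale), not the actual ModelOpt NVFP4 number format or kernel.

```python
# Toy group-wise 4-bit quantization with group_size=16 (illustrative only).
def quantize_group(weights):
    """Quantize one group of 16 weights to 4-bit integers plus one scale."""
    assert len(weights) == 16
    scale = max(abs(w) for w in weights) / 7 or 1.0  # map absmax onto [-7, 7]
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

group = [0.1 * i for i in range(-8, 8)]  # 16 example weights
q, s = quantize_group(group)
approx = dequantize_group(q, s)
err = max(abs(a - b) for a, b in zip(group, approx))
print(f"max reconstruction error: {err:.3f}")  # bounded by scale / 2
```

Smaller groups mean each scale adapts to fewer weights, which is why fine-grained `group_size=16` quantization loses less accuracy than coarser per-channel schemes.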
## Quick Start (OpenAI-compatible)

```shell
curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 256
  }'
```
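The same request from Python's standard library, assuming the server from the recommended runtime command is listening on `127.0.0.1:8000`:

```python
# Equivalent of the curl quick-start using only the standard library.
import json
import urllib.request

payload = {
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 256,
}
req = urllib.request.Request(
    "http://127.0.0.1:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```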
## Integrity

`model.safetensors` SHA256: `df3cf9115e5a648b31f2bd3c5acd5184bb2134cbd7ab592884d8f57a2dab4f1a`
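A quick way to verify the download against the hash above, reading in chunks so the large safetensors file never needs to fit in memory:

```python
# Verify the checkpoint hash before serving.
import hashlib

EXPECTED = "df3cf9115e5a648b31f2bd3c5acd5184bb2134cbd7ab592884d8f57a2dab4f1a"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example (requires the downloaded file in the current directory):
# assert sha256_of("model.safetensors") == EXPECTED
```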