GLM-4.7-Flash-heretic NVFP4

NVFP4 post-training quantization of Olafangensan/GLM-4.7-Flash-heretic for long-context multi-GPU inference with vLLM.

This release uses NVFP4 (4-bit) quantization, not 8-bit quantization.

Model Size

  • Base architecture: GLM-4.7-Flash (30B-A3B MoE)
  • Parameter count for this release: unchanged from the base model (quantization does not change the parameter count)
  • Note: the ~17.8 GB model.safetensors file is the quantized checkpoint size and does not mean the model has 18B parameters.
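A back-of-the-envelope size check makes the point (a sketch; it assumes 4-bit weights plus one 1-byte scale per group of 16 and ignores embeddings and other higher-precision tensors, so real checkpoints run somewhat larger):

```python
# Rough NVFP4 checkpoint-size estimate for a 30B-parameter model.
# Assumption: 4 bits per quantized weight plus per-group scale factors
# (group_size=16 -> roughly one extra byte per 16 weights).

def nvfp4_checkpoint_gb(n_params: float, group_size: int = 16) -> float:
    weight_bytes = n_params * 4 / 8          # 4-bit packed weights
    scale_bytes = n_params / group_size      # one scale per weight group
    return (weight_bytes + scale_bytes) / 1e9

print(round(nvfp4_checkpoint_gb(30e9), 1))  # -> 16.9
```

With overhead from tensors kept at higher precision, ~16.9 GB lands close to the observed ~17.8 GB file, while a genuine 18B model stored in BF16 would be roughly 36 GB.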

Runtime Compatibility

Known issue on some stock vLLM 0.16.x + vllm-node setups:

  • assistant content may be null
  • output may be dumped into reasoning fields with broken formatting
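If you hit this on an affected setup, a small client-side fallback can recover the text (a hedged sketch; the field names follow the OpenAI-style response shape, and the exact reasoning field name on your runtime may differ):

```python
def extract_assistant_text(response: dict) -> str:
    """Return assistant text from a chat completion, falling back to
    reasoning fields when content comes back null on broken runtimes."""
    msg = response["choices"][0]["message"]
    content = msg.get("content")
    if content:  # normal, well-formed case
        return content
    # Fallback: some broken setups dump the output into a reasoning field.
    return msg.get("reasoning_content") or msg.get("reasoning") or ""
```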

Recommended runtime (validated)

Use the GLM-compatible runtime variant used during validation:

docker run --rm --gpus all --ipc=host \
  -v /path/to/models:/models \
  -p 8000:8000 \
  vllm-glm:nightly-glm-scalefix \
  --model /models/GLM-4.7-Flash-heretic-NVFP4 \
  --served-model-name glm-4.7-flash \
  --tensor-parallel-size 4 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": true}' \
  --generation-config vllm \
  --override-generation-config '{"temperature": 0.7, "top_p": 1.0}'

If you must use stock vLLM 0.16.x

Use a compatibility profile first:

  • remove --reasoning-parser glm45
  • set --default-chat-template-kwargs '{"enable_thinking": false}'

This usually forces normal assistant text into the content field.
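Applied to the recommended command, the compatibility profile looks like this (a sketch; the image tag is an assumption, and paths are placeholders):

```shell
# Stock vLLM 0.16.x compatibility profile:
#  - --reasoning-parser glm45 removed
#  - thinking disabled so assistant text lands in content
docker run --rm --gpus all --ipc=host \
  -v /path/to/models:/models \
  -p 8000:8000 \
  vllm/vllm-openai:v0.16.0 \
  --model /models/GLM-4.7-Flash-heretic-NVFP4 \
  --served-model-name glm-4.7-flash \
  --tensor-parallel-size 4 \
  --tool-call-parser glm47 \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --generation-config vllm \
  --override-generation-config '{"temperature": 0.7, "top_p": 1.0}'
```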

Repository Contents

This model repo intentionally contains only serving-required artifacts:

  • model.safetensors
  • config.json
  • generation_config.json
  • tokenizer.json
  • tokenizer_config.json
  • chat_template.jinja
  • hf_quant_config.json
  • README.md
  • QUANTIZATION.md
  • LICENSE

No training checkpoints, raw calibration corpora, or temporary files are included.

License and Provenance

  • Base model: Olafangensan/GLM-4.7-Flash-heretic
  • Upstream lineage: decensored derivative of zai-org/GLM-4.7-Flash
  • Base license: MIT (per upstream model card)
  • This repo: quantized derivative for inference; no architecture changes

Please review and comply with upstream licenses and terms for your use case.

Reproducibility

Quantization recipe, command, and environment details are documented in QUANTIZATION.md.

At a glance:

  • Quantization method: ModelOpt NVFP4 (group_size=16, lm_head excluded)
  • Calibration mix: switch_turnflow_sanitized, open_code_reasoning
  • Calibration sizes: 1536, 512 (sequence length 2048)
  • Export format: Hugging Face

Quick Start (OpenAI-compatible)

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 256
  }'
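The same request from Python, using only the standard library (a sketch; the endpoint and model name come from the curl example above):

```python
import json
import urllib.request

def build_payload(prompt: str) -> dict:
    # Mirrors the curl example: same served model name and token budget.
    return {
        "model": "glm-4.7-flash",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def chat(prompt: str, base_url: str = "http://127.0.0.1:8000/v1") -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```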

Integrity

  • model.safetensors SHA256: df3cf9115e5a648b31f2bd3c5acd5184bb2134cbd7ab592884d8f57a2dab4f1a
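To verify the download, you can recompute the digest in chunks so large files don't need to fit in memory (a minimal sketch; the file path is a placeholder):

```python
import hashlib

EXPECTED = "df3cf9115e5a648b31f2bd3c5acd5184bb2134cbd7ab592884d8f57a2dab4f1a"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# assert sha256_of("model.safetensors") == EXPECTED
```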