GLM-4.7-Flash_AWQ — Quantized (AWQ · W4A16 and W8A16 - GS32 · vLLM nightly + Transformers 5.0)
This repository provides an AWQ quantized build of GLM-4.7-Flash repackaged for vLLM using the compressed-tensors runtime layout.
Why this quant is different (MoE-aware calibration)
- During calibration we activate all experts inside each MoE block (not just top-k chosen by the router).
- This captures worst-case activations across the entire mixture, producing more robust scales with lower drift when rare experts fire at inference time.
- The quant script explicitly does not ignore shared experts (fixes typical smoothing issues in MoE with AWQ).
Runtime requirements:
• vLLM nightly build (MoE + GLM-Flash path) and Transformers 5.0.
• `trust_remote_code` must be enabled.
Revisions & Branches
The `main` branch is a landing page (model card + links). The runnable quants live under:
- W4A16_GS32 — Weight INT4, Activation 16-bit, Group Size 32 (highest fidelity among W4A16 variants)
- W8A16_GS32 — Weight INT8, Activation 16-bit, Group Size 32 (highest fidelity among W8A16 variants)

Quick links:
- https://huggingface.co/TheHouseOfTheDude/GLM-4.7-Flash_AWQ/tree/W4A16_GS32
- https://huggingface.co/TheHouseOfTheDude/GLM-4.7-Flash_AWQ/tree/W8A16_GS32
What’s inside
- Sharded quantized weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with compressed-tensors metadata (`quantization_config`, `weight_format`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)

This package targets vLLM (compressed-tensors). Loading directly with vanilla 🤗 `from_pretrained` is not supported.
Quantization & calibration details (from the provided script)
Method / scheme
- AWQ (weight-only) via `llmcompressor.oneshot` with an `AWQModifier` targeting Linear layers.
- W4A16_GS32: INT4 weights (`num_bits=4`, `symmetric=True`), group strategy with `group_size=32`; activations remain FP16/BF16 at runtime.
- W8A16_GS32: INT8 weights (`num_bits=8`, `symmetric=True`), group strategy with `group_size=32`; activations remain FP16/BF16 at runtime.
- Ignored layers: a short, script-defined ignore list; importantly, shared experts are not ignored, which avoids smoothing errors.
MoE handling
- Each `Glm4MoeLiteMoE` module is replaced at calibration time with a calibration wrapper that sets `calibrate_all_experts=True`, ensuring every expert is exercised while collecting activation statistics.
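The all-experts calibration idea can be sketched with a toy top-1 MoE. Everything here (`ToyMoE`, `CalibrationMoE`, the router) is an illustrative stand-in, not the actual llmcompressor wrapper or the real `Glm4MoeLiteMoE` interface:

```python
class Expert:
    """A toy expert: scales its input by a fixed factor and counts calls."""
    def __init__(self, factor):
        self.factor = factor
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return [v * self.factor for v in x]


class ToyMoE:
    """Minimal top-1 MoE: a (stand-in) router picks one expert per sample,
    so rarely-routed experts may see no data during normal forwarding."""
    def __init__(self, n_experts=4):
        self.experts = [Expert(f + 1) for f in range(n_experts)]

    def route(self, x):
        return hash(tuple(x)) % len(self.experts)  # deterministic stand-in router

    def __call__(self, x):
        return self.experts[self.route(x)](x)


class CalibrationMoE:
    """Mirrors the calibrate_all_experts=True idea: during calibration every
    expert processes every sample so its activation statistics are observed,
    while the normally-routed output is still what gets returned."""
    def __init__(self, moe, calibrate_all_experts=True):
        self.moe = moe
        self.calibrate_all_experts = calibrate_all_experts

    def __call__(self, x):
        if self.calibrate_all_experts:
            for expert in self.moe.experts:
                expert(x)  # exercised for statistics only; output discarded
        return self.moe(x)  # routed output is unchanged
```

With the wrapper, every expert sees every calibration sample; without it, only the router's picks accumulate statistics, which is exactly the drift risk described above.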
Datasets & sampling
- Total calibration samples: 512
- Max sequence length: 2048 tokens
- Data mix (60/40):
  - Neural Magic: `neuralmagic/LLM_compression_calibration` (chat-style `messages` rendered with `apply_chat_template`)
  - Rombo: `Rombo-Org/Optimized_Reasoning` (instructions + optional inputs/outputs stitched into plain text)
- Both are tokenized without padding, truncated to 2048, with `add_special_tokens=False`.
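Assuming the 60/40 mix is applied directly to the 512-sample budget (the script's exact rounding is not shown here), the per-source allocation works out as follows:

```python
TOTAL_SAMPLES = 512
MIX = {
    "neuralmagic/LLM_compression_calibration": 0.6,
    "Rombo-Org/Optimized_Reasoning": 0.4,
}

# Allocate the calibration budget per source; hand any rounding
# remainder to the first (larger) source so the total stays at 512.
counts = {name: int(TOTAL_SAMPLES * frac) for name, frac in MIX.items()}
remainder = TOTAL_SAMPLES - sum(counts.values())
first = next(iter(counts))
counts[first] += remainder

print(counts)
# -> {'neuralmagic/LLM_compression_calibration': 308, 'Rombo-Org/Optimized_Reasoning': 204}
```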
Export
- Saved with `save_compressed=True` to embed compressed-tensors metadata for vLLM.
- Minor post-save cleanup (e.g., removing `auto_map` from `config.json`) to avoid loader issues.
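The `auto_map` cleanup amounts to a few lines of stdlib Python; `strip_auto_map` below is a hypothetical helper sketching the idea, not the author's actual script:

```python
import json
from pathlib import Path


def strip_auto_map(model_dir):
    """Remove the `auto_map` key from config.json after export — a small
    post-save cleanup so strict loaders don't try to resolve remote-code
    classes that the compressed-tensors package does not ship."""
    cfg_path = Path(model_dir) / "config.json"
    cfg = json.loads(cfg_path.read_text())
    if cfg.pop("auto_map", None) is not None:
        cfg_path.write_text(json.dumps(cfg, indent=2))
    return cfg
```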
Why Group Size 32 (W4A16 and W8A16 - GS32)
- Group size controls how many consecutive weights share one set of quantization scales.
- GS32 (this branch) provides finer-grained scaling than GS64/128 → typically better fidelity (perplexity / task metrics) at a small cost in metadata/bandwidth.
- This is especially helpful for MoE where experts can exhibit diverse activation statistics: smaller groups better preserve expert-specific nuances.
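A minimal sketch of group-wise symmetric weight quantization (pure Python, illustrative only; the real kernels operate on tensors) shows how `group_size` trades metadata volume for per-group scale fidelity:

```python
def quantize_groupwise(weights, group_size=32, num_bits=4):
    """Symmetric weight-only quantization: each run of `group_size`
    consecutive weights shares one scale, as in the GS32 branches."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for INT4, 127 for INT8
    quants, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # One scale per group, sized to the group's largest magnitude.
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        quants.extend(
            max(-qmax - 1, min(qmax, round(w / scale))) for w in group
        )
    return quants, scales


def dequantize_groupwise(quants, scales, group_size=32):
    """Reconstruct approximate weights from integer codes + group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(quants)]
```

The per-element error is bounded by half the group's scale, and a smaller group's max-magnitude is never larger than that of a bigger group containing it, so GS32 yields tighter (or equal) error bounds than GS64/128 at the cost of 2–4× more scale metadata.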
Quickstart — vLLM (nightly) + Transformers 5.0
Environment requirements
- vLLM nightly build
- Transformers 5.0
- `trust_remote_code=True`
Recommended runtime flags (GLM-4.7-Flash MoE path):
- `--enable-expert-parallel` to distribute experts across devices
- `--tool-call-parser glm47` / `--reasoning-parser glm45` for GLM-style tool & reasoning support
- FlashInfer toggles as below (per script guidance)
Example command (provided by author):
```bash
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

CUDA_VISIBLE_DEVICES=4,5 vllm serve \
  /media/fmodels/TheHouseOfTheDude/GLM-4.7-Flash_AWQ/W8A16_GS32 \
  --served-model-name GLM-4.7-Flash_AWQ-W8A16_GS32 \
  --swap-space 4 \
  --max-model-len 80896 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --enable-auto-tool-choice \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key REDACTED
```
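Once the server is up, any OpenAI-compatible client works. Below is a stdlib-only sketch that builds the request; `build_chat_request` is a hypothetical helper, and the URL, model name, and API key simply mirror the command above:

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build an OpenAI-compatible chat-completion request for the vLLM
    server (served model name matches --served-model-name above)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


# Sending it requires the server from the example command to be running:
# resp = urllib.request.urlopen(build_chat_request(
#     "http://localhost:8000", "REDACTED",
#     "GLM-4.7-Flash_AWQ-W8A16_GS32", "Hello!"))
```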
Model tree for TheHouseOfTheDude/GLM-4.7-Flash_AWQ
- Base model: zai-org/GLM-4.7-Flash