GLM-4.7-Flash_AWQ — Quantized (AWQ · W4A16 and W8A16 - GS32 · vLLM nightly + Transformers 5.0)

This repository provides an AWQ quantized build of GLM-4.7-Flash repackaged for vLLM using the compressed-tensors runtime layout.

Why this quant is different (MoE-aware calibration)

  • During calibration we activate all experts inside each MoE block (not just top-k chosen by the router).
  • This captures worst-case activations across the entire mixture, producing more robust scales with lower drift when rare experts fire at inference time.
  • The quant script explicitly does not ignore shared experts (fixes typical smoothing issues in MoE with AWQ).

Runtime requirements:
vLLM nightly build (MoE + GLM-Flash path) and Transformers 5.0.
trust_remote_code must be enabled.


Revisions & Branches

The main branch is a landing page (model card + links). The runnable quant lives under:


What’s inside

  • Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
  • config.json with compressed-tensors metadata (quantization_config, weight_format, etc.)
  • Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab as applicable)

This package targets vLLM (compressed-tensors). Loading directly with vanilla 🤗 from_pretrained is not supported.


Quantization & calibration details (from the provided script)

Method / scheme

  • AWQ (weight-only) via llmcompressor.oneshot with an AWQModifier targeting Linear layers.
  • W4A16_GS32: INT4 weights (num_bits=4, symmetric=True), group strategy with group_size=32; activations remain FP16/BF16 at runtime.
  • W8A16_GS32: INT8 weights (num_bits=8, symmetric=True), group strategy with group_size=32; activations remain FP16/BF16 at runtime.
  • Ignored layers: a short, script-defined ignore list; importantly, shared experts are not ignored to avoid smoothing errors.

MoE handling

  • Each Glm4MoeLiteMoE module is replaced at calibration time with a calibration wrapper that sets calibrate_all_experts=True, ensuring every expert is exercised while collecting activation statistics.
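
The idea behind all-expert calibration can be sketched with a toy mixture-of-experts layer (illustrative code only; class and parameter names here are made up, not the llmcompressor API — only the calibrate_all_experts flag mirrors the description above):

```python
# Toy sketch: during calibration, every expert sees the input so its
# activation statistics get observed, but the layer output still uses
# only the experts the router selected.

class ToyMoE:
    """A minimal mixture-of-experts layer: the router picks top-k experts."""
    def __init__(self, experts, top_k=2):
        self.experts = experts          # list of callables
        self.top_k = top_k

    def route(self, x):
        # Hypothetical router: deterministically pick the first top_k experts.
        return list(range(self.top_k))

    def forward(self, x, calibrate_all_experts=False):
        chosen = self.route(x)
        stats = {}
        for i, expert in enumerate(self.experts):
            if calibrate_all_experts or i in chosen:
                stats[i] = expert(x)    # activation recorded for this expert
        # Output is computed from the routed experts only.
        out = sum(stats[i] for i in chosen)
        return out, stats

moe = ToyMoE(experts=[lambda x, k=k: x * (k + 1) for k in range(4)], top_k=2)
out, stats = moe.forward(10, calibrate_all_experts=True)
assert len(stats) == 4   # all four experts exercised, not just the top-2
```

Without the flag, only the routed experts would contribute statistics, leaving rarely-selected experts with poorly fitted scales.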

Datasets & sampling

  • Total calibration samples: 512
  • Max sequence length: 2048 tokens
  • Data mix (60/40):
    • Neural Magic: neuralmagic/LLM_compression_calibration (chat-style messages rendered with apply_chat_template)
    • Rombo: Rombo-Org/Optimized_Reasoning (instructions + optional inputs/outputs stitched into plain text)
      Both are tokenized without padding, truncated to 2048, with add_special_tokens=False.
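
The sampling arithmetic above can be sketched as follows (a minimal stand-in for the quant script's data pipeline, not the script itself; the exact rounding of the 60/40 split is an assumption):

```python
# Hedged sketch: 512 calibration samples split 60/40 across the two
# datasets, each sequence clipped (never padded) to 2048 tokens.

NUM_SAMPLES = 512
MAX_SEQ_LEN = 2048

n_neuralmagic = round(NUM_SAMPLES * 0.6)   # chat-style samples
n_rombo = NUM_SAMPLES - n_neuralmagic      # instruction-style samples

def truncate(token_ids, max_len=MAX_SEQ_LEN):
    # No padding is applied; over-long sequences are only clipped.
    return token_ids[:max_len]

assert n_neuralmagic + n_rombo == NUM_SAMPLES
assert len(truncate(list(range(5000)))) == MAX_SEQ_LEN
```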

Export

  • Saved with save_compressed=True to embed compressed-tensors metadata for vLLM.
  • Minor post-save cleanup (e.g., remove auto_map from config.json) to avoid loader issues.
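
The auto_map cleanup amounts to a small JSON edit; a minimal in-memory sketch (the example config contents are illustrative, not the model's actual config.json):

```python
# Sketch of the post-save cleanup: drop `auto_map` from config.json so
# the loader does not try to resolve remote-code class mappings.
import json

config = {
    "model_type": "glm4_moe",                              # illustrative
    "auto_map": {"AutoModel": "modeling_glm.GLMModel"},    # illustrative
}
config.pop("auto_map", None)   # no-op if the key is already absent
cleaned = json.dumps(config, indent=2)
assert "auto_map" not in json.loads(cleaned)
```

In practice the same pop-and-rewrite would be applied to the saved config.json on disk.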

Why Group Size 32 (W4A16 and W8A16 - GS32)

  • Group size controls how many consecutive weights share one set of quantization scales.
  • GS32 (this branch) provides finer-grained scaling than GS64/128 → typically better fidelity (perplexity / task metrics) at a small cost in metadata/bandwidth.
  • This is especially helpful for MoE where experts can exhibit diverse activation statistics: smaller groups better preserve expert-specific nuances.
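
The fidelity effect of group size is easy to see in a toy round-trip: with one outlier among small weights, a smaller group confines the outlier's large scale to fewer neighbours. This is a self-contained illustration of group-wise symmetric quantization, not the AWQ algorithm itself:

```python
# Toy group-wise symmetric quantization (INT4 range: scale fitted to the
# group's max magnitude). Smaller groups -> each block gets its own scale,
# so an outlier inflates the scale of fewer neighbouring weights.

def quantize_groupwise(weights, group_size, qmax=7):
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # guard all-zero group
        # Quantize then dequantize to measure round-trip error.
        out.extend(round(w / scale) * scale for w in group)
    return out

weights = [0.01] * 63 + [1.0]   # one outlier among 64 small weights
err = lambda deq: sum(abs(a - b) for a, b in zip(weights, deq))

# Finer groups reconstruct the small weights more faithfully:
assert err(quantize_groupwise(weights, 32)) < err(quantize_groupwise(weights, 64))
```

The same trade-off applies at GS32 vs GS64/128: more scale entries to store and fetch, but less quantization error per group.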

Quickstart — vLLM (nightly) + Transformers 5.0

Environment requirements

  • vLLM nightly build
  • Transformers 5.0
  • trust_remote_code=True

Recommended runtime flags (GLM-4.7-Flash MoE path):

  • --enable-expert-parallel to distribute experts across devices
  • --tool-call-parser glm47 and --reasoning-parser glm45 for GLM-style tool calls & reasoning output
  • FlashInfer toggles as below (per script guidance)

Example command (provided by author):

export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

CUDA_VISIBLE_DEVICES=4,5 vllm serve \
    /media/fmodels/TheHouseOfTheDude/GLM-4.7-Flash_AWQ/W8A16_GS32 \
    --served-model-name GLM-4.7-Flash_AWQ-W8A16_GS32 \
    --swap-space 4 \
    --max-model-len 80896 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key REDACTED
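
Once the server is up, any OpenAI-compatible client can talk to it. A sketch of the request body (only constructed here, not sent; the model name mirrors --served-model-name from the command above, and /v1/chat/completions is vLLM's standard OpenAI-compatible route):

```python
# Hedged example: build the chat-completions request body for the server
# started above. POST it to http://<host>:8000/v1/chat/completions with
# header "Authorization: Bearer <api-key>".
import json

payload = {
    "model": "GLM-4.7-Flash_AWQ-W8A16_GS32",   # matches --served-model-name
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
assert json.loads(body)["model"] == "GLM-4.7-Flash_AWQ-W8A16_GS32"
```
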