GLM-4.7-Flash_AWQ — Quantized (AWQ · W4A16 and W8A16 - GS32 · vLLM nightly + Transformers 5.0)

This repository provides an AWQ quantized build of GLM-4.7-Flash repackaged for vLLM using the compressed-tensors runtime layout.

Why this quant is different (MoE-aware calibration)

  • During calibration we activate all experts inside each MoE block (not just top-k chosen by the router).
  • This captures worst-case activations across the entire mixture, producing more robust scales with lower drift when rare experts fire at inference time.
  • The quant script explicitly does not ignore shared experts (fixes typical smoothing issues in MoE with AWQ).

Runtime requirements:
vLLM nightly build (MoE + GLM-Flash path) and Transformers 5.0.
trust_remote_code must be enabled.


Revisions & Branches

The main branch is a landing page (model card + links). The runnable quant lives under:


What’s inside

  • Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
  • config.json with compressed-tensors metadata (quantization_config, weight_format, etc.)
  • Tokenizer artifacts (tokenizer.json, tokenizer.model, merges/vocab as applicable)

This package targets vLLM (compressed-tensors). Loading directly with vanilla 🤗 from_pretrained is not supported.


Quantization & calibration details (from the provided script)

Method / scheme

  • AWQ (weight-only) via llmcompressor.oneshot with an AWQModifier targeting Linear layers.
  • W4A16_GS32: INT4 weights (num_bits=4, symmetric=True), group strategy with group_size=32; activations remain FP16/BF16 at runtime.
  • W8A16_GS32: INT8 weights (num_bits=8, symmetric=True), group strategy with group_size=32; activations remain FP16/BF16 at runtime.
  • Ignored layers: a short, script-defined ignore list; importantly, shared experts are not ignored to avoid smoothing errors.

MoE handling

  • Each Glm4MoeLiteMoE module is replaced at calibration time with a calibration wrapper that sets calibrate_all_experts=True, ensuring every expert is exercised while collecting activation statistics.
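
The idea behind all-expert calibration can be sketched with a toy mixture-of-experts layer (illustrative code only; class and parameter names here are made up, not the llmcompressor API — only the calibrate_all_experts flag mirrors the description above):

```python
# Toy sketch: during calibration, every expert sees the input so its
# activation statistics get observed, but the layer output still uses
# only the experts the router selected.

class ToyMoE:
    """A minimal mixture-of-experts layer: the router picks top-k experts."""
    def __init__(self, experts, top_k=2):
        self.experts = experts          # list of callables
        self.top_k = top_k

    def route(self, x):
        # Hypothetical router: deterministically pick the first top_k experts.
        return list(range(self.top_k))

    def forward(self, x, calibrate_all_experts=False):
        chosen = self.route(x)
        stats = {}
        for i, expert in enumerate(self.experts):
            if calibrate_all_experts or i in chosen:
                stats[i] = expert(x)    # activation recorded for this expert
        # Output is computed from the routed experts only.
        out = sum(stats[i] for i in chosen)
        return out, stats

moe = ToyMoE(experts=[lambda x, k=k: x * (k + 1) for k in range(4)], top_k=2)
out, stats = moe.forward(10, calibrate_all_experts=True)
assert len(stats) == 4   # all four experts exercised, not just the top-2
```

Without the flag, only the routed experts would contribute statistics, leaving rarely-selected experts with poorly fitted scales.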

Datasets & sampling

  • Total calibration samples: 512
  • Max sequence length: 2048 tokens
  • Data mix (60/40):
    • Neural Magic: neuralmagic/LLM_compression_calibration (chat-style messages rendered with apply_chat_template)
    • Rombo: Rombo-Org/Optimized_Reasoning (instructions + optional inputs/outputs stitched into plain text)
      Both are tokenized without padding, truncated to 2048, with add_special_tokens=False.
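
The sampling arithmetic above can be sketched as follows (a minimal stand-in for the quant script's data pipeline, not the script itself; the exact rounding of the 60/40 split is an assumption):

```python
# Hedged sketch: 512 calibration samples split 60/40 across the two
# datasets, each sequence clipped (never padded) to 2048 tokens.

NUM_SAMPLES = 512
MAX_SEQ_LEN = 2048

n_neuralmagic = round(NUM_SAMPLES * 0.6)   # chat-style samples
n_rombo = NUM_SAMPLES - n_neuralmagic      # instruction-style samples

def truncate(token_ids, max_len=MAX_SEQ_LEN):
    # No padding is applied; over-long sequences are only clipped.
    return token_ids[:max_len]

assert n_neuralmagic + n_rombo == NUM_SAMPLES
assert len(truncate(list(range(5000)))) == MAX_SEQ_LEN
```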

Export

  • Saved with save_compressed=True to embed compressed-tensors metadata for vLLM.
  • Minor post-save cleanup (e.g., remove auto_map from config.json) to avoid loader issues.
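
The auto_map cleanup amounts to a small JSON edit; a minimal in-memory sketch (the example config contents are illustrative, not the model's actual config.json):

```python
# Sketch of the post-save cleanup: drop `auto_map` from config.json so
# the loader does not try to resolve remote-code class mappings.
import json

config = {
    "model_type": "glm4_moe",                              # illustrative
    "auto_map": {"AutoModel": "modeling_glm.GLMModel"},    # illustrative
}
config.pop("auto_map", None)   # no-op if the key is already absent
cleaned = json.dumps(config, indent=2)
assert "auto_map" not in json.loads(cleaned)
```

In practice the same pop-and-rewrite would be applied to the saved config.json on disk.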

Why Group Size 32 (W4A16 and W8A16 - GS32)

  • Group size controls how many consecutive weights share one set of quantization scales.
  • GS32 (this branch) provides finer-grained scaling than GS64/128 → typically better fidelity (perplexity / task metrics) at a small cost in metadata/bandwidth.
  • This is especially helpful for MoE where experts can exhibit diverse activation statistics: smaller groups better preserve expert-specific nuances.
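
The fidelity effect of group size is easy to see in a toy round-trip: with one outlier among small weights, a smaller group confines the outlier's large scale to fewer neighbours. This is a self-contained illustration of group-wise symmetric quantization, not the AWQ algorithm itself:

```python
# Toy group-wise symmetric quantization (INT4 range: scale fitted to the
# group's max magnitude). Smaller groups -> each block gets its own scale,
# so an outlier inflates the scale of fewer neighbouring weights.

def quantize_groupwise(weights, group_size, qmax=7):
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # guard all-zero group
        # Quantize then dequantize to measure round-trip error.
        out.extend(round(w / scale) * scale for w in group)
    return out

weights = [0.01] * 63 + [1.0]   # one outlier among 64 small weights
err = lambda deq: sum(abs(a - b) for a, b in zip(weights, deq))

# Finer groups reconstruct the small weights more faithfully:
assert err(quantize_groupwise(weights, 32)) < err(quantize_groupwise(weights, 64))
```

The same trade-off applies at GS32 vs GS64/128: more scale entries to store and fetch, but less quantization error per group.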

Quickstart — vLLM (nightly) + Transformers 5.0

Environment requirements

  • vLLM nightly build
  • Transformers 5.0
  • trust_remote_code=True

Recommended runtime flags (GLM-4.7-Flash MoE path):

  • --enable-expert-parallel to distribute experts across devices
  • --tool-call-parser glm47 and --reasoning-parser glm45 for GLM-style tool calls & reasoning output
  • FlashInfer toggles as below (per script guidance)

Example command (provided by author):

export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

CUDA_VISIBLE_DEVICES=4,5 vllm serve \
    /media/fmodels/TheHouseOfTheDude/GLM-4.7-Flash_AWQ/W8A16_GS32 \
    --served-model-name GLM-4.7-Flash_AWQ-W8A16_GS32 \
    --swap-space 4 \
    --max-model-len 80896 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 2 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000 \
    --api-key REDACTED
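
Once the server is up, any OpenAI-compatible client can talk to it. A sketch of the request body (only constructed here, not sent; the model name mirrors --served-model-name from the command above, and /v1/chat/completions is vLLM's standard OpenAI-compatible route):

```python
# Hedged example: build the chat-completions request body for the server
# started above. POST it to http://<host>:8000/v1/chat/completions with
# header "Authorization: Bearer <api-key>".
import json

payload = {
    "model": "GLM-4.7-Flash_AWQ-W8A16_GS32",   # matches --served-model-name
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
assert json.loads(body)["model"] == "GLM-4.7-Flash_AWQ-W8A16_GS32"
```
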