# GLM-4.7-Flash (W8A8 FP8 with 2D-block quantization)

This repo contains GLM-4.7-Flash quantized to mixed FP8/BF16 precision, following state-of-the-art Mixture-of-Experts quantization practices.

The model requires an Ada (RTX 4000 series), Hopper (H100), or Blackwell (RTX 5000 series) GPU for hardware FP8 support.
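Hardware FP8 support can be gated on the GPU's compute capability. A minimal sketch (the capability tuple would normally come from `torch.cuda.get_device_capability()`; the 8.9 threshold is the Ada generation where FP8 tensor cores first appeared):

```python
# Native FP8 (E4M3/E5M2) tensor cores arrived with compute capability 8.9 (Ada)
# and are present on 9.0 (Hopper) and 12.0 (Blackwell).
def supports_hw_fp8(compute_capability: tuple[int, int]) -> bool:
    """Return True if the GPU generation has native FP8 tensor cores."""
    return tuple(compute_capability) >= (8, 9)

# In a real script the tuple comes from torch.cuda.get_device_capability();
# here we spot-check the generations named above.
assert supports_hw_fp8((8, 9))      # Ada (RTX 4000 series)
assert supports_hw_fp8((9, 0))      # Hopper (H100)
assert supports_hw_fp8((12, 0))     # Blackwell (RTX 5000 series / RTX Pro 6000)
assert not supports_hw_fp8((8, 6))  # Ampere: no hardware FP8
```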

## 📥 Usage & Running Instructions

The model was tested with vLLM on a single RTX Pro 6000; the script below is tuned for that configuration with a 202752-token context length.

### Building vLLM with transformers v5

GLM-4.7-Flash (and GLM-4.6V) need a vLLM built from HEAD together with transformers v5. Here is the Dockerfile I use, specialized for RTX 5090 / RTX Pro 6000:

Dockerfile
# vLLM + LMCache Multi-Stage Dockerfile
# + Docker cache
# + Ccache
# + uv cache
# Version: 2026-01-26
# Build:   TMPDIR=vllm-dockercache podman build -v ./vllm-ccache:/root/.ccache -t vllm-202601-cu129 -f vllm-lmcache-Dockerfile

#################### ARGUMENTS ####################
ARG CUDA_VERSION=12.9.1
ARG LMCACHE_GIT_REF=dev
ARG VLLM_GIT_REF=main
ARG FLASHINFER_VERSION=0.6.1
ARG WHEELS_DIR=/tmp/wheels

#################### BUILD STAGE ####################
# Full build environment with all development tools
FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu24.04 AS build

ARG CUDA_VERSION
ARG LMCACHE_GIT_REF
ARG VLLM_GIT_REF
ARG WHEELS_DIR

# Build environment
ENV CUDA_VERSION=${CUDA_VERSION}
ENV WHEELS_DIR=${WHEELS_DIR}

# Build config
ENV UV_LINK_MODE=copy
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
ENV MAX_JOBS=128
ENV NVCC_THREADS=8
ENV CMAKE_BUILD_TYPE=Release
ENV USE_CUDA=1
ENV CCACHE_DIR=/root/.ccache
ENV CUDA_HOME=/usr/local/cuda
ENV NVCC_GENCODE="-gencode=arch=compute_120,code=sm_120"
ENV TORCH_CUDA_ARCH_LIST='12.0'
ENV FLASH_ATTN_CUDA_ARCHS=120
ENV VLLM_FLASH_ATTN_VERSION=2
# Note: flashinfer is now installed as pre-compiled wheel in runtime
# ENV FLASHINFER_ENABLE_AOT=1
ENV VLLM_TARGET_DEVICE=cuda
ENV LMCACHE_NVCC_THREADS=8
ENV LMCACHE_MAX_JOBS=32
ENV LMCACHE_CUDA_VERSION=${CUDA_VERSION}
ENV LMCACHE_CUDA_ARCHS=12.0
ENV LMCACHE_TORCH_CUDA_ARCH_LIST=12.0
ENV LMCACHE_VLLM_FA_CMAKE_GPU_ARCHES=120
ENV VLLM_DOCKER_BUILD_CONTEXT=1
ENV PATH="/opt/venv/bin:$PATH"

# System packages
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    ca-certificates \
    python3.12 \
    python3.12-venv \
    python3.12-dev \
    python3-pip \
    git \
    ccache \
    && rm -rf /var/lib/apt/lists/*

# Create venv
RUN python3 -m venv /opt/venv
RUN /opt/venv/bin/pip install --no-cache-dir --upgrade pip
RUN /opt/venv/bin/pip install --no-cache-dir uv

# Create wheel output directory
RUN mkdir -p ${WHEELS_DIR}

# Build tools
RUN --mount=type=cache,target=/root/.cache/uv \
    /opt/venv/bin/uv pip install ninja setuptools setuptools_scm

# App
# ---------------------------------------------------------------
# PyTorch
RUN --mount=type=cache,target=/root/.cache/uv \
    /opt/venv/bin/uv pip install --pre "torch>=2.9.0" torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/cu$(echo ${CUDA_VERSION} | cut -d. -f1,2 | tr -d '.')

# Clone vLLM
WORKDIR /workspace
RUN git clone --branch ${VLLM_GIT_REF} https://github.com/vllm-project/vllm

# vLLM: Specialize for SM120 (RTX 5090, RTX Pro 6000) to save hours of compilation time
WORKDIR /workspace/vllm
RUN sed -i \
    -e 's/ALLSPARK_ARCHS "8.0;8.6;8.7;8.9"/ALLSPARK_ARCHS "12.0"/g' \
    -e 's/MARLIN_ARCHS "8.0+PTX"/MARLIN_ARCHS "12.0"/g' \
    -e 's/MARLIN_FP8_ARCHS "8.9;12.0"/MARLIN_FP8_ARCHS "12.0"/g' \
    -e 's/MARLIN_OTHER_ARCHS "7.5;8.0+PTX"/MARLIN_OTHER_ARCHS "12.0"/g' \
    -e 's/MARLIN_MOE_ARCHS "8.0+PTX"/MARLIN_MOE_ARCHS "12.0"/g' \
    -e 's/MARLIN_MOE_FP8_ARCHS "8.9;12.0"/MARLIN_MOE_FP8_ARCHS "12.0"/g' \
    -e 's/MARLIN_MOE_OTHER_ARCHS "7.5;8.0+PTX"/MARLIN_MOE_OTHER_ARCHS "12.0"/g' \
    -e 's/HADACORE_ARCHS "8.0+PTX;9.0+PTX" "${CUDA_ARCHS}"/HADACORE_ARCHS "12.0" "${CUDA_ARCHS}"/g' \
    -e 's/"7.5;8.0;8.7;8.9+PTX" "${CUDA_ARCHS}"/"12.0" "${CUDA_ARCHS}"/g' \
    CMakeLists.txt

# vLLM build requirements
RUN --mount=type=cache,target=/root/.cache/uv \
    /opt/venv/bin/uv pip install -r requirements/build.txt \
    --extra-index-url https://download.pytorch.org/whl/cu$(echo ${CUDA_VERSION} | cut -d. -f1,2 | tr -d '.')

# Build vLLM wheel
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=cache,target=/root/.ccache \
    CCACHE_NOHASHDIR="true" \
    /opt/venv/bin/python3 setup.py bdist_wheel --dist-dir ${WHEELS_DIR} \
    | grep -vE "^copying|^creating|^writing|^adding"

# Clone LMCache
WORKDIR /workspace
RUN git clone --branch ${LMCACHE_GIT_REF} https://github.com/LMCache/LMCache

# Build LMCache wheel
WORKDIR /workspace/LMCache
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=cache,target=/root/.ccache \
    CCACHE_NOHASHDIR="true" \
    /opt/venv/bin/python3 setup.py bdist_wheel --dist-dir ${WHEELS_DIR} \
    | grep -vE "^copying|^creating|^writing|^adding"

# ccache stats
WORKDIR /workspace
RUN --mount=type=cache,target=/root/.ccache,sharing=locked \
    ccache -s

#################### RUNTIME STAGE ####################
# Lean production image without build tools
FROM nvcr.io/nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu24.04 AS runtime

ARG CUDA_VERSION
ARG FLASHINFER_VERSION
ARG WHEELS_DIR

ENV UV_LINK_MODE=copy
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
ENV CUDA_VERSION=${CUDA_VERSION}
ENV FLASHINFER_VERSION=${FLASHINFER_VERSION}
ENV FLASHINFER_CUDA_ARCH_LIST="12.0"
ENV WHEELS_DIR=${WHEELS_DIR}
ENV DEBIAN_FRONTEND=noninteractive
ENV VLLM_TARGET_DEVICE=cuda
ENV PATH="/opt/venv/bin:$PATH"

# Distro setup
RUN CUDA_VERSION_DASH=$(echo ${CUDA_VERSION} | cut -d. -f1,2 | tr '.' '-') && \
    apt-get update -y && \
    apt-get install -y --no-install-recommends \
    # Runtime packages
    kmod \
    # Install CUDA development tools for runtime JIT compilation
    # (FlashInfer, DeepGEMM, EP kernels all require compilation at runtime)
    build-essential \
    cuda-nvcc-${CUDA_VERSION_DASH} \
    # Python
    python3.12 \
    python3.12-venv \
    python3.12-dev \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Create venv
RUN python3 -m venv /opt/venv
RUN /opt/venv/bin/pip install --no-cache-dir --upgrade pip
RUN /opt/venv/bin/pip install --no-cache-dir uv

# Install packages in venv
WORKDIR /tmp

# PyTorch (use uv cache for fast install)
RUN --mount=type=cache,target=/root/.cache/uv \
    /opt/venv/bin/uv pip install --pre "torch>=2.9.0" torchvision torchaudio \
    --extra-index-url https://download.pytorch.org/whl/cu$(echo ${CUDA_VERSION} | cut -d. -f1,2 | tr -d '.')

RUN --mount=type=cache,target=/root/.cache/uv \
    /opt/venv/bin/uv pip install torch-c-dlpack-ext \
    --extra-index-url https://download.pytorch.org/whl/cu$(echo ${CUDA_VERSION} | cut -d. -f1,2 | tr -d '.')

# Copy all wheels from build stage & install them
COPY --from=build ${WHEELS_DIR} /tmp/wheels
RUN /opt/venv/bin/uv pip install /tmp/wheels/*.whl

# Clean up
RUN rm -rf /tmp/wheels

# Install FlashInfer pre-compiled kernel cache and binaries
# https://docs.flashinfer.ai/installation.html
RUN --mount=type=cache,target=/root/.cache/uv \
    /opt/venv/bin/uv pip install flashinfer-python flashinfer-cubin==${FLASHINFER_VERSION} \
    && /opt/venv/bin/uv pip install flashinfer-jit-cache==${FLASHINFER_VERSION} \
    --extra-index-url https://flashinfer.ai/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.') \
    && /opt/venv/bin/flashinfer show-config

# Allow z.ai GLM-4.6V and GLM-4.7-Flash models
RUN --mount=type=cache,target=/root/.cache/uv \
    apt-get update -y && \
    apt-get install -y --no-install-recommends git && \
    /opt/venv/bin/uv pip install git+https://github.com/huggingface/transformers.git && \
    apt-get purge -y --auto-remove git && rm -rf /var/lib/apt/lists/*

# TODO: Unsure why this is needed - remove it ASAP. Pulled by OpenCV for vLLM image processing
RUN apt-get update -y && \
    apt-get install -y --no-install-recommends libxcb1 \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace

CMD ["bash"]
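The Dockerfile derives wheel-index suffixes from `CUDA_VERSION` with shell string operations. A quick sketch of the two forms: `${CUDA_VERSION%.*}` keeps the dotted major.minor, while the PyTorch and FlashInfer wheel indexes expect a dot-free `cuXYZ` tag:

```shell
#!/bin/sh
CUDA_VERSION=12.9.1

# ${CUDA_VERSION%.*} strips the shortest trailing ".*": 12.9.1 -> 12.9
MAJOR_MINOR=${CUDA_VERSION%.*}
echo "major.minor: ${MAJOR_MINOR}"    # -> 12.9

# The wheel indexes use a dot-free suffix, e.g. .../whl/cu129
CU_TAG=cu$(echo "${CUDA_VERSION}" | cut -d. -f1,2 | tr -d '.')
echo "wheel index suffix: ${CU_TAG}"  # -> cu129
```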

### Running script

```bash
# Model configuration (Mandatory)
MODEL="mratsim/GLM-4.7-Flash-FP8"
MODELNAME="GLM-4.7-Flash"
GPU_UTIL=0.90
CONTEXT_SIZE=202752

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (strongly recommended)
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve "${MODEL}" \
  --served-model-name "${MODELNAME}" \
  --gpu-memory-utilization ${GPU_UTIL} \
  --max-model-len "${CONTEXT_SIZE}" \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice
```
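Once serving, vLLM exposes an OpenAI-compatible endpoint. A stdlib-only sketch of a request against it (the default port 8000 and the prompt are assumptions; the model name must match `--served-model-name` above):

```python
import json
from urllib import request

# vLLM serves an OpenAI-compatible chat completions API.
payload = {
    "model": "GLM-4.7-Flash",  # must match --served-model-name
    "messages": [
        {"role": "user", "content": "Explain FP8 block quantization in one sentence."}
    ],
    "max_tokens": 256,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",  # default vLLM port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # uncomment with a running server
```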

## 🔬 Quantization method

My LLM quantization scripts are available at https://github.com/mratsim/quantizers

For this quant specifically, we use the following recipe, skipping the MoE routers and the MLA-specific projections:

```python
import os

from llmcompressor import model_free_ptq

os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("PYTORCH_ALLOC_CONF", "expandable_segments:True,max_split_size_mb:512")

MODEL_ID = "zai-org/GLM-4.7-Flash"
MODEL_OUT = MODEL_ID.split("/")[1] + "-FP8"

model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=MODEL_OUT,
    scheme="FP8_BLOCK",
    ignore=[
        "lm_head",
        "re:.*mlp\\.gate$",  # MoE router
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=16,
    device="cuda:0",
)

print(f"SUCCESS: files saved in {MODEL_OUT}")
```
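The `re:`-prefixed `ignore` entries are Python regular expressions matched against fully-qualified module names. A sketch of that matching (the module names are illustrative, and whether llmcompressor uses `match` or `fullmatch` internally is an implementation detail):

```python
import re

# "re:"-prefixed ignore entries are regexes over module names;
# plain entries are exact names.
ignore = [
    "lm_head",
    r"re:.*mlp\.gate$",  # MoE router, NOT gate_proj
    r"re:.*kv_a_proj_with_mqa$",
    r"re:.*q_a_proj$",
    "model.embed_tokens",
]

def is_ignored(name: str) -> bool:
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], name):
                return True
        elif entry == name:
            return True
    return False

# The `$` anchor keeps the router in BF16 while still quantizing gate_proj
assert is_ignored("model.layers.3.mlp.gate")
assert not is_ignored("model.layers.3.mlp.gate_proj")
assert not is_ignored("model.layers.3.mlp.experts.7.down_proj")
```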

FP8 quantization does not require calibration: weight scales are computed directly from the weights, and activation scales are computed dynamically at runtime.
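A toy illustration of why: with block quantization, each 2D weight block gets a scale derived purely from its own absmax, mapped onto the FP8 E4M3 maximum representable value (448). This sketch uses tiny 2x2 blocks instead of the 128x128 blocks typical of real FP8 block kernels:

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def block_scales(weight, block=2):
    """Per-block scales for a 2D weight given as a list of lists."""
    rows, cols = len(weight), len(weight[0])
    scales = {}
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            absmax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            scales[(r0, c0)] = absmax / E4M3_MAX
    return scales

w = [[0.1, -0.4, 2.0, 0.3],
     [0.2,  0.9, -1.5, 0.7]]
s = block_scales(w, block=2)
# The block containing the 2.0 outlier gets a proportionally larger scale,
# so the outlier doesn't destroy precision in the other blocks.
assert s[(0, 2)] == 2.0 / E4M3_MAX
assert s[(0, 0)] == 0.9 / E4M3_MAX
```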

### Quantization theory and heuristics for manual tuning

#### Layers to quantize

Quantization should focus on Linear layers (also called Dense or Fully-Connected layers, i.e. MatMul+Bias). In particular, quantizing LayerNorm/RMSNorm layers is strongly discouraged, see [1]:

> **LayerNorm in Quantization.** Kovaleva et al. (2021); Wei et al. (2022) find that outliers in the LayerNorm parameters of BERT (Devlin et al., 2019) cause difficulties in model compression. Given the importance of LayerNorm, all the quantization methods we discuss above leave LayerNorm unquantized.

This is also reported in Intel's and NVIDIA's quantization repositories.

#### Tensors to up-quantize

If the bit budget allows, the down projections (`down_proj`) should be prioritized for higher precision.

According to [4]:

> Fig. 3: Maximum absolute value over layers for a LLaMA3-8B. Each color represents a different projection and we clearly see that down_proj has the biggest spikes in input and output. We also observe that RMSNorm propagates spikes through the entire model.

And according to [5]:

> Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting that weight outliers are concentrated in the down-projection matrices W_down of the second layer and the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last two layers.

#### Mixture-of-Experts (MoE) quantization

Mixture-of-Experts models require specific quantization techniques.

##### Mixed-precision quantization

Some layers have a higher impact on LLM performance than others. According to [2], spending more bits on attention layers yields large gains compared to spending them on FFN layers. And according to [3], MoE expert layers are notably robust even to 2-bit quantization.

##### Layers with high impact

According to [2], giving more bits to the first k transformer blocks has a significantly higher impact on model quality than giving them to the last k blocks.
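Sketched as a trivial bit-allocation rule (the 8/4-bit values and the choice of k are illustrative, not the recipe used by this repo):

```python
# Toy bit allocation following the heuristic from [2]: the first k
# transformer blocks get a higher-precision format than the rest.
def allocate_bits(n_layers: int, k: int, high: int = 8, low: int = 4) -> list[int]:
    return [high if i < k else low for i in range(n_layers)]

bits = allocate_bits(n_layers=12, k=3)
assert bits[:3] == [8, 8, 8]   # first k blocks kept at higher precision
assert bits[3:] == [4] * 9     # remaining blocks quantized harder
```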

##### Expert quantization

When quantizing MoE models, quantizing activations is tricky because only a subset of experts is activated per token. You have to make sure the calibration set activates, and thus calibrates, every expert.
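A sketch of such a coverage check, using a stand-in top-k router over random logits (a real check would count router selections while the calibration set runs through the model):

```python
import random

random.seed(0)  # deterministic for the example
N_EXPERTS, TOP_K = 8, 2

def route(token_logits, top_k=TOP_K):
    """Stand-in top-k router: indices of the k largest logits."""
    return sorted(range(len(token_logits)), key=lambda i: -token_logits[i])[:top_k]

# Count how often each expert is selected across a calibration batch
hits = [0] * N_EXPERTS
for _ in range(1000):
    logits = [random.random() for _ in range(N_EXPERTS)]
    for e in route(logits):
        hits[e] += 1

# Every expert must be activated at least once, otherwise its activation
# statistics (and hence its scales) are meaningless.
uncalibrated = [e for e, h in enumerate(hits) if h == 0]
assert uncalibrated == []
```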

Visual showcase of why calibrating all MoE experts matters (source: https://avtc.github.io/aquarium-side-by-side/, context: https://github.com/ModelCloud/GPTQModel/pull/2235):

![image](https://cdn-uploads.huggingface.co/production/uploads/67f26fd2c7b14380431d1f5a/BDc3-0m3_WLl3ZmbBMhmd.png)
## References

1. Why Do Some Inputs Break Low-Bit LLM Quantization? (2025)
   Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia
   https://arxiv.org/pdf/2506.12044
2. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)
   Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen
   https://arxiv.org/pdf/2406.08155v1
3. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)
   Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla
   https://arxiv.org/pdf/2310.02410
4. Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)
   Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello
   https://arxiv.org/pdf/2504.21553
5. Systematic Outliers in Large Language Models (2025)
   Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
   https://arxiv.org/pdf/2502.06415v2
Safetensors · Model size: 31B params · Tensor types: F32, BF16, F8_E4M3