---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- fp8
- quantized
- glm4
- moe
library_name: transformers
---
# GLM-4.7-Flash FP8

FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).
**NOTE:** Recommended generation parameters are still being worked out; Unsloth is currently tuning them for practical use. Their current suggestions target llama.cpp, but they should translate to vLLM as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
## Quantization Details

- Method: FP8 E4M3 per-tensor quantization with embedded scales
- Original size: ~62 GB (BF16)
- Quantized size: ~30 GB (FP8)
- Preserved in BF16: `lm_head`, `embed_tokens`, layernorms, router weights
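To illustrate what "per-tensor FP8 E4M3 with embedded scales" means in practice, here is a minimal sketch of the idea (not the actual conversion script; function names are illustrative only). The modules listed above stay in BF16 and skip this step.

```python
import torch

def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Quantize a BF16/FP32 weight to FP8 E4M3 with a single per-tensor scale."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max                    # 448.0 for E4M3
    scale = (weight.abs().max().float() / fp8_max).clamp(min=1e-12)   # one scale per tensor
    qweight = (weight.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return qweight, scale  # the scale is stored alongside the weight ("embedded")

def dequantize_fp8(qweight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate high-precision weight at load/compute time."""
    return qweight.float() * scale
```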
## Performance

Tested on 2x RTX 3090 (24 GB each) with vLLM 0.13.0:
| Setting | Value |
|---|---|
| Tensor Parallel | 2 |
| Context Length | 8192 |
| VRAM per GPU | 14.7 GB |
| Throughput | 19.4 tokens/sec |
Note: The RTX 3090 lacks native FP8 support, so vLLM falls back to the Marlin kernel for weight-only FP8 dequantization. GPUs with native FP8 support (Ada Lovelace / RTX 40xx and newer) will achieve higher throughput.
## Usage with vLLM

Requires vLLM 0.13.0+ and transformers 5.0+ for `glm4_moe_lite` architecture support.
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="marksverdhei/GLM-4.7-Flash-fp8",
    tensor_parallel_size=2,
    max_model_len=8192,
    enforce_eager=True,  # Optional: disable CUDA graphs to save VRAM
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
```
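To serve the model behind an OpenAI-compatible API instead of using the offline `LLM` class, an equivalent launch with the same settings would look something like this:

```bash
vllm serve marksverdhei/GLM-4.7-Flash-fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --enforce-eager
```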
### vLLM Fork Required

Until upstream vLLM adds MLA detection for `glm4_moe_lite`, use our fork:
```bash
pip install git+https://github.com/marksverdhei/vllm.git@fix/glm4-moe-mla-detection
```
Or install from source:
```bash
git clone https://github.com/marksverdhei/vllm.git
cd vllm
git checkout fix/glm4-moe-mla-detection
pip install -e .
```
Fork: [marksverdhei/vllm](https://github.com/marksverdhei/vllm)
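To double-check which vLLM build actually got installed (a generic check, not specific to this fork):

```bash
python -c "import vllm; print(vllm.__version__, vllm.__file__)"
```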
## License

MIT (same as the base model)