---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- fp8
- quantized
- glm4
- moe
library_name: transformers
---
# GLM-4.7-Flash FP8

FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).
**NOTE:** Recommended generation parameters are still being worked out; Unsloth is currently tuning them for practical use. Their current suggestions target llama.cpp, but they should translate to vLLM as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
## Quantization Details

- Method: FP8 E4M3 per-tensor quantization with embedded scales
- Original size: ~62 GB (BF16)
- Quantized size: ~30 GB (FP8)
- Preserved in BF16: `lm_head`, `embed_tokens`, layernorms, router weights
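To illustrate what "per-tensor FP8 E4M3 with embedded scales" means in practice, here is a minimal sketch of the idea (not the actual conversion script; function names are illustrative only). The modules listed above stay in BF16 and skip this step.

```python
import torch

def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Quantize a BF16/FP32 weight to FP8 E4M3 with a single per-tensor scale."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max                    # 448.0 for E4M3
    scale = (weight.abs().max().float() / fp8_max).clamp(min=1e-12)   # one scale per tensor
    qweight = (weight.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return qweight, scale  # the scale is stored alongside the weight ("embedded")

def dequantize_fp8(qweight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate high-precision weight at load/compute time."""
    return qweight.float() * scale
```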
## Performance

Tested on 2x RTX 3090 (24 GB each) with vLLM 0.13.0:
| Setting | Value |
|---|---|
| Tensor Parallel | 2 |
| Context Length | 8192 |
| VRAM per GPU | 14.7 GB |
| Throughput | 19.4 tokens/sec |
Note: The RTX 3090 lacks native FP8 support, so vLLM falls back to the Marlin kernel for weight-only FP8 dequantization. GPUs with native FP8 support (Ada Lovelace / RTX 40xx and newer) will achieve higher throughput.
## Usage with vLLM

Requires vLLM 0.13.0+ and transformers 5.0+ for `glm4_moe_lite` architecture support.
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="marksverdhei/GLM-4.7-Flash-fp8",
    tensor_parallel_size=2,
    max_model_len=8192,
    enforce_eager=True,  # Optional: disable CUDA graphs to save VRAM
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
```
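To serve the model behind an OpenAI-compatible API instead of using the offline `LLM` class, an equivalent launch with the same settings would look something like this:

```bash
vllm serve marksverdhei/GLM-4.7-Flash-fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --enforce-eager
```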
### vLLM Fork Required

Until upstream vLLM adds MLA detection for `glm4_moe_lite`, use our fork:
```bash
pip install git+https://github.com/marksverdhei/vllm.git@fix/glm4-moe-mla-detection
```
Or install from source:
```bash
git clone https://github.com/marksverdhei/vllm.git
cd vllm
git checkout fix/glm4-moe-mla-detection
pip install -e .
```
Fork: [marksverdhei/vllm](https://github.com/marksverdhei/vllm)
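To double-check which vLLM build actually got installed (a generic check, not specific to this fork):

```bash
python -c "import vllm; print(vllm.__version__, vllm.__file__)"
```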
## License

MIT (same as the base model)