---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- fp8
- quantized
- glm4
- moe
library_name: transformers
---

# GLM-4.7-Flash FP8

FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).

**NOTE**: Unsloth is still working out optimal generation parameters for practical use. Their recommendations target llama.cpp, but they should translate to vLLM as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF

## Quantization Details

- **Method**: FP8 E4M3 per-tensor quantization with embedded scales (sketched below)
- **Original size**: ~62GB (BF16)
- **Quantized size**: ~30GB (FP8)
- **Preserved in BF16**: lm_head, embed_tokens, layernorms, router weights
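
The rough idea of per-tensor FP8 E4M3 quantization with an embedded scale, as a minimal PyTorch sketch (illustrative only, not the exact conversion script used to produce this checkpoint):

```python
import torch

def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Quantize a weight tensor to FP8 E4M3 with a single per-tensor scale."""
    fp8_max = 448.0  # largest finite value representable in torch.float8_e4m3fn
    scale = weight.abs().max().float() / fp8_max
    w_fp8 = (weight.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_fp8, scale  # the scale is stored ("embedded") alongside the weight

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a BF16 approximation of the original weight."""
    return (w_fp8.to(torch.float32) * scale).to(torch.bfloat16)

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, scale = quantize_fp8_per_tensor(w)
print(w_fp8.dtype, float(scale), dequantize(w_fp8, scale).dtype)
```

The tensors listed above (lm_head, embeddings, layernorms, router weights) are the kind typically kept in higher precision: they are small relative to the expert weights but tend to be sensitive to quantization.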

## Performance

Tested on 2x RTX 3090 (24 GB each) with vLLM 0.13.0:

| Setting | Value |
|---------|-------|
| Tensor Parallel | 2 |
| Context Length | 8192 |
| VRAM per GPU | 14.7 GB |
| Throughput | **19.4 tokens/sec** |

Note: The RTX 3090 lacks native FP8 support, so vLLM falls back to the Marlin kernel for weight-only FP8 decompression. GPUs with native FP8 support (Ada Lovelace / RTX 40xx and newer) will achieve higher throughput.

## Usage with vLLM

Requires vLLM 0.13.0+ and transformers 5.0+ for `glm4_moe_lite` architecture support.
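
A quick way to confirm your environment meets those requirements (a minimal sketch; it only prints the installed versions):

```python
import transformers
import vllm

print("vllm:", vllm.__version__)                  # needs >= 0.13.0
print("transformers:", transformers.__version__)  # needs >= 5.0
```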

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="marksverdhei/GLM-4.7-Flash-fp8",
    tensor_parallel_size=2,
    max_model_len=8192,
    enforce_eager=True,  # Optional: disable CUDA graphs to save VRAM
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
```
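
For chat-style prompts, vLLM can apply the model's chat template for you. A minimal sketch reusing the `llm` object from above (the sampling values are illustrative placeholders, not tuned recommendations):

```python
from vllm import SamplingParams

messages = [
    {"role": "user", "content": "Explain FP8 quantization in one paragraph."},
]

# llm.chat() formats the messages with the model's chat template before generating.
chat_outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=256))
print(chat_outputs[0].outputs[0].text)
```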

### vLLM Fork Required

Until upstream vLLM adds MLA detection for `glm4_moe_lite`, use our fork:

```bash
pip install git+https://github.com/marksverdhei/vllm.git@fix/glm4-moe-mla-detection
```

Or install from source:

```bash
git clone https://github.com/marksverdhei/vllm.git
cd vllm
git checkout fix/glm4-moe-mla-detection
pip install -e .
```

**Fork**: [marksverdhei/vllm](https://github.com/marksverdhei/vllm/tree/fix/glm4-moe-mla-detection)

## License

MIT (same as base model)