---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
  - fp8
  - quantized
  - glm4
  - moe
library_name: transformers
---

# GLM-4.7-Flash FP8

FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).

**NOTE**: Unsloth is currently working out recommended generation parameters for practical use; see https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF. Their settings target llama.cpp, but they should translate to vLLM as well.

## Quantization Details

- **Method**: FP8 E4M3 per-tensor quantization with embedded scales (sketched below)
- **Original size**: ~62GB (BF16)
- **Quantized size**: ~30GB (FP8)
- **Preserved in BF16**: lm_head, embed_tokens, layernorms, router weights
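
As a rough illustration of the scheme above, here is a minimal PyTorch sketch of per-tensor E4M3 quantization with embedded scales. This is not the exact conversion script; the skip patterns and the `weight_scale` key name are assumptions.

```python
# Minimal sketch of per-tensor FP8 E4M3 quantization with embedded scales.
# Assumes PyTorch >= 2.1 for the torch.float8_e4m3fn dtype.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

# Tensors kept in BF16, per the list above (name patterns are illustrative)
SKIP_PATTERNS = ("lm_head", "embed_tokens", "norm", "gate")

def quantize_tensor(w: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (q, scale) such that w ~= q.float() * scale."""
    scale = (w.abs().max().float() / FP8_E4M3_MAX).clamp(min=1e-12)
    q = (w.float() / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.to(torch.float8_e4m3fn), scale

def quantize_state_dict(sd: dict) -> dict:
    out = {}
    for name, w in sd.items():
        if w.dim() < 2 or any(p in name for p in SKIP_PATTERNS):
            out[name] = w  # preserved in BF16
        else:
            q, scale = quantize_tensor(w)
            out[name] = q
            # scale stored alongside the weight ("embedded scales");
            # the key name here is an assumption for illustration
            out[name.replace(".weight", ".weight_scale")] = scale
    return out
```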



## Performance

Tested on 2x RTX 3090 (24GB each) with vLLM 0.13.0:

| Setting | Value |
|---------|-------|
| Tensor Parallel | 2 |
| Context Length | 8192 |
| VRAM per GPU | 14.7 GB |
| Throughput | **19.4 tokens/sec** |

Note: The RTX 3090 lacks native FP8 support, so vLLM falls back to the Marlin kernel for weight-only FP8 dequantization. GPUs with native FP8 support (Ada Lovelace / RTX 40xx and newer) should achieve higher throughput.
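
If you're unsure whether your GPU has native FP8, a quick capability check works: Ada Lovelace is compute capability 8.9, while the RTX 3090 (Ampere) is 8.6.

```python
# Heuristic check: compute capability 8.9 (Ada Lovelace) is the first
# consumer generation with FP8 tensor cores; Hopper (9.0) also has them.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor}:",
      "native FP8" if (major, minor) >= (8, 9) else "Marlin weight-only fallback")
```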

## Usage with vLLM

Requires vLLM 0.13.0+ and transformers 5.0+ for `glm4_moe_lite` architecture support.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="marksverdhei/GLM-4.7-Flash-fp8",
    tensor_parallel_size=2,
    max_model_len=8192,
    enforce_eager=True,  # Optional: disable CUDA graphs to save VRAM
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
```
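
For chat-style prompts, `llm.chat` applies the model's chat template. The sampling values below are placeholders, not tuned recommendations (see the note above).

```python
# Chat-style generation; assumes the tokenizer ships a chat template.
messages = [{"role": "user", "content": "Summarize FP8 quantization in two sentences."}]
outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=128))
print(outputs[0].outputs[0].text)
```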

### vLLM Fork Required

Until upstream vLLM adds MLA detection for `glm4_moe_lite`, use our fork:

```bash
pip install git+https://github.com/marksverdhei/vllm.git@fix/glm4-moe-mla-detection
```

Or install from source:
```bash
git clone https://github.com/marksverdhei/vllm.git
cd vllm
git checkout fix/glm4-moe-mla-detection
pip install -e .
```

**Fork**: [marksverdhei/vllm](https://github.com/marksverdhei/vllm/tree/fix/glm4-moe-mla-detection)

## License

MIT (same as base model)