---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- fp8
- quantized
- glm4
- moe
library_name: transformers
---

# GLM-4.7-Flash FP8

FP8 quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).

**NOTE**: Unsloth is still working out optimal generation parameters for practical use. Their recommendations target llama.cpp, but they should translate to vLLM as well: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF

## Quantization Details

- **Method**: FP8 E4M3 per-tensor quantization with embedded scales (sketched below)
- **Original size**: ~62GB (BF16)
- **Quantized size**: ~30GB (FP8)
- **Preserved in BF16**: lm_head, embed_tokens, layernorms, router weights
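
The rough idea of per-tensor FP8 E4M3 quantization with an embedded scale, as a minimal PyTorch sketch (illustrative only, not the exact conversion script used to produce this checkpoint):

```python
import torch

def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Quantize a weight tensor to FP8 E4M3 with a single per-tensor scale."""
    fp8_max = 448.0  # largest finite value representable in torch.float8_e4m3fn
    scale = weight.abs().max().float() / fp8_max
    w_fp8 = (weight.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_fp8, scale  # the scale is stored ("embedded") alongside the weight

def dequantize(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a BF16 approximation of the original weight."""
    return (w_fp8.to(torch.float32) * scale).to(torch.bfloat16)

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, scale = quantize_fp8_per_tensor(w)
print(w_fp8.dtype, float(scale), dequantize(w_fp8, scale).dtype)
```

The tensors listed above (lm_head, embeddings, layernorms, router weights) are the kind typically kept in higher precision: they are small relative to the expert weights but tend to be sensitive to quantization.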

## Performance

Tested on 2x RTX 3090 (24 GB each) with vLLM 0.13.0:

| Setting | Value |
|---------|-------|
| Tensor Parallel | 2 |
| Context Length | 8192 |
| VRAM per GPU | 14.7 GB |
| Throughput | **19.4 tokens/sec** |

Note: The RTX 3090 lacks native FP8 support, so vLLM falls back to the Marlin kernel for weight-only FP8 decompression. GPUs with native FP8 support (Ada Lovelace / RTX 40xx and newer) will achieve higher throughput.

## Usage with vLLM

Requires vLLM 0.13.0+ and transformers 5.0+ for `glm4_moe_lite` architecture support.
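
A quick way to confirm your environment meets those requirements (a minimal sketch; it only prints the installed versions):

```python
import transformers
import vllm

print("vllm:", vllm.__version__)                  # needs >= 0.13.0
print("transformers:", transformers.__version__)  # needs >= 5.0
```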

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="marksverdhei/GLM-4.7-Flash-fp8",
    tensor_parallel_size=2,
    max_model_len=8192,
    enforce_eager=True,  # Optional: disable CUDA graphs to save VRAM
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
```
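
For chat-style prompts, vLLM can apply the model's chat template for you. A minimal sketch reusing the `llm` object from above (the sampling values are illustrative placeholders, not tuned recommendations):

```python
from vllm import SamplingParams

messages = [
    {"role": "user", "content": "Explain FP8 quantization in one paragraph."},
]

# llm.chat() formats the messages with the model's chat template before generating.
chat_outputs = llm.chat(messages, SamplingParams(temperature=0.7, max_tokens=256))
print(chat_outputs[0].outputs[0].text)
```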

### vLLM Fork Required

Until upstream vLLM adds MLA detection for `glm4_moe_lite`, use our fork:

```bash
pip install git+https://github.com/marksverdhei/vllm.git@fix/glm4-moe-mla-detection
```

Or install from source:

```bash
git clone https://github.com/marksverdhei/vllm.git
cd vllm
git checkout fix/glm4-moe-mla-detection
pip install -e .
```

**Fork**: [marksverdhei/vllm](https://github.com/marksverdhei/vllm/tree/fix/glm4-moe-mla-detection)

## License

MIT (same as base model)