---
license: apache-2.0
base_model: zai-org/GLM-4.7-Flash
base_model_relation: quantized
library_name: gguf
pipeline_tag: text-generation
tags:
- gguf
- quantized
- llama-cpp
- llama.cpp
- moe
- glm4
- 4-bit
- Q4_K_M
model_type: glm4-moe-lite
quantized_by: solarkyle
---
# GLM-4.7-Flash-GGUF-Q4_K_M
GGUF quantization of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) for use with [llama.cpp](https://github.com/ggml-org/llama.cpp) and compatible inference engines.
**This is my first quant! Let me know if something seems off or if you run into issues.**
## Model Details
| Property | Value |
|----------|-------|
| Architecture | `Glm4MoeLiteForCausalLM` (MoE with MLA) |
| Total Parameters | ~30B |
| Active Parameters | ~3B per forward pass |
| Experts | 64 routed + 1 shared, 4 active per token |
| Attention | Multi-Head Latent Attention (MLA) |
| Context Length | 32,768 tokens |
| Original Format | BF16 safetensors (62.5 GB) |
## Available Quantizations
| Filename | Quant Type | Size | Description |
|----------|------------|------|-------------|
| `GLM-4.7-Flash-Q4_K_M.gguf` | Q4_K_M | 16.89 GB | Medium quality 4-bit, good balance of size/quality |
## Quantization Process
### The Challenge
GLM-4.7-Flash combines two architectural features that llama.cpp did not support together at the time of quantization:
1. **MoE (Mixture of Experts)** - 64 routed experts + 1 shared expert, with 4 experts active per token
2. **MLA (Multi-Head Latent Attention)** - A compressed attention mechanism similar to DeepSeek-V2 that uses low-rank projections
The `Glm4MoeLiteForCausalLM` architecture combines both features, which llama.cpp's existing GLM4 conversion path does not handle.
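To make the routing concrete, here is a minimal, runnable sketch of top-4-of-64 expert selection. This illustrates the general MoE routing pattern, not the model's actual code; the shapes, names, and softmax-then-normalize scoring are assumptions:
```python
import torch

def route_tokens(hidden, router_weight, num_active=4):
    """Toy top-k MoE routing: pick 4 of 64 routed experts per token.

    hidden:        (num_tokens, hidden_dim) token representations
    router_weight: (num_experts, hidden_dim) router projection
    """
    logits = hidden @ router_weight.T                        # (num_tokens, 64)
    scores = torch.softmax(logits, dim=-1)
    topk_scores, topk_idx = scores.topk(num_active, dim=-1)  # 4 experts per token
    # Normalize the selected scores so each token's expert weights sum to 1.
    topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    # (The shared expert runs for every token and is not part of this routing.)
    return topk_idx, topk_scores

hidden = torch.randn(2, 2048)   # 2 tokens; hidden size is an assumption
router = torch.randn(64, 2048)  # 64 routed experts
idx, w = route_tokens(hidden, router)
print(idx.shape, w.shape)       # torch.Size([2, 4]) torch.Size([2, 4])
```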
### The Solution
We modified `convert_hf_to_gguf.py` to register `Glm4MoeLiteForCausalLM` under the `DeepseekV2Model` class, which already has proper tensor mappings for MLA attention:
```python
# Added to the DeepseekV2Model registration, which already carries the
# MLA tensor mappings this architecture needs
@ModelBase.register(
    "DeepseekV2ForCausalLM",
    "DeepseekV3ForCausalLM",
    "KimiVLForConditionalGeneration",
    "YoutuForCausalLM",
    "YoutuVLForConditionalGeneration",
    "Glm4MoeLiteForCausalLM",  # <-- Added this
)
class DeepseekV2Model(TextModel):
    ...  # existing DeepseekV2 conversion logic is reused unchanged
```
We also registered the tokenizer's pre-tokenizer hash so the converter selects the correct pre-tokenization rules:
```python
if chkhsh == "cdf5f35325780597efd76153d4d1c16778f766173908894c04afc20108536267":
res = "glm4"
```
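For anyone repeating this with a new tokenizer: the converter derives `chkhsh` by encoding a fixed check string and hashing the resulting token IDs. A rough sketch of the idea (the actual check string `chktxt` lives in `convert_hf_to_gguf.py` and is elided here):
```python
from hashlib import sha256
from transformers import AutoTokenizer

# The real script encodes a fixed multilingual check string (chktxt);
# the exact string is defined in convert_hf_to_gguf.py.
chktxt = "..."  # placeholder - substitute the chktxt from the script

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # compare against the value registered in the converter
```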
### Conversion Pipeline
```
HuggingFace safetensors (62.5 GB BF16)
↓ convert_hf_to_gguf.py (modified)
BF16 GGUF (55.79 GB)
↓ llama-quantize
Q4_K_M GGUF (16.89 GB)
```
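In command form, the pipeline looks roughly like this (paths are illustrative; `--outtype bf16` and the `Q4_K_M` target are standard llama.cpp options):
```bash
# 1. Convert HF safetensors to a BF16 GGUF (using the modified converter)
python convert_hf_to_gguf.py /path/to/GLM-4.7-Flash \
    --outfile GLM-4.7-Flash-BF16.gguf --outtype bf16

# 2. Quantize the BF16 GGUF down to Q4_K_M
./llama-quantize GLM-4.7-Flash-BF16.gguf GLM-4.7-Flash-Q4_K_M.gguf Q4_K_M
```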
### Quantization Notes
- 47 of 563 tensors fell back from q4_K to q5_0 because their row sizes are not multiples of the 256-element k-quant super-block (see the sketch after this list)
- MoE expert weights compressed from 384 MB each to ~108-157 MB
- Total size reduction: ~70% from BF16
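k-quants like q4_K pack weights into 256-element super-blocks, so any tensor whose row size is not a multiple of 256 must use a different format. A minimal sketch of that divisibility check (assuming llama.cpp's `QK_K = 256`; the helper name is hypothetical):
```python
QK_K = 256  # k-quant super-block size in llama.cpp

def q4k_compatible(shape):
    """A tensor can use q4_K only if its row size is a multiple of QK_K."""
    row_size = shape[-1]
    return row_size % QK_K == 0

# Example: a typical weight matrix vs. a tensor with an odd row size
print(q4k_compatible((768, 2048)))  # True  -> q4_K
print(q4k_compatible((64, 96)))     # False -> falls back (here, to q5_0)
```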
## Testing
Quick inference test with llama-cli (CPU):
- **Prompt processing:** 93.8 tokens/second
- **Generation:** 19.8 tokens/second
The model loads and generates coherent text. GPU acceleration with CUDA should provide significantly better performance.
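To reproduce a quick throughput check on your own hardware, the `llama-bench` tool that ships with llama.cpp works well; for example:
```bash
# Prompt-processing (pp) and generation (tg) throughput on CPU
./llama-bench -m GLM-4.7-Flash-Q4_K_M.gguf -p 512 -n 128

# Same benchmark with 35 layers offloaded to the GPU
./llama-bench -m GLM-4.7-Flash-Q4_K_M.gguf -p 512 -n 128 -ngl 35
```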
## Usage with llama.cpp
```bash
# Download the model
huggingface-cli download solarkyle/GLM-4.7-Flash-GGUF GLM-4.7-Flash-Q4_K_M.gguf --local-dir .
# Run inference (CPU)
./llama-cli -m GLM-4.7-Flash-Q4_K_M.gguf -p "Hello, I am GLM-4.7-Flash" -n 256
# Run inference (GPU - adjust layers based on VRAM)
./llama-cli -m GLM-4.7-Flash-Q4_K_M.gguf -p "Hello, I am GLM-4.7-Flash" -n 256 -ngl 35
```
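The model can also be served over an OpenAI-compatible HTTP API with `llama-server` from the same build, for example:
```bash
# Serve with the full 32k context on port 8080 (drop -ngl for CPU-only)
./llama-server -m GLM-4.7-Flash-Q4_K_M.gguf -c 32768 --port 8080 -ngl 35
```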
## Source Code
The conversion scripts and process documentation are available at: **[github.com/solarkyle/GLM-4.7-Flash-GGUF](https://github.com/solarkyle/GLM-4.7-Flash-GGUF)**
## Credits
- **Original model:** [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization:** solarkyle
- **Tools:** [llama.cpp](https://github.com/ggml-org/llama.cpp) (b7772)
## Feedback
This is my first quant - feedback welcome! If you notice any issues with quality, compatibility, or have suggestions for improvement, please open an issue on the GitHub repo or leave a comment here.