---
license: apache-2.0
base_model: zai-org/GLM-4.7-Flash
base_model_relation: quantized
library_name: gguf
pipeline_tag: text-generation
tags:
  - gguf
  - quantized
  - llama-cpp
  - llama.cpp
  - moe
  - glm4
  - 4-bit
  - Q4_K_M
model_type: glm4-moe-lite
quantized_by: solarkyle
---

# GLM-4.7-Flash-GGUF-Q4_K_M

GGUF quantization of zai-org/GLM-4.7-Flash for use with llama.cpp and compatible inference engines.

This is my first quant! Let me know if something seems off or if you run into issues.

## Model Details

| Property | Value |
|---|---|
| Architecture | Glm4MoeLiteForCausalLM (MoE with MLA) |
| Total Parameters | ~30B |
| Active Parameters | ~3B per forward pass |
| Experts | 64 routed + 1 shared, 4 active per token |
| Attention | Multi-Head Latent Attention (MLA) |
| Context Length | 32,768 tokens |
| Original Format | BF16 safetensors (62.5 GB) |

## Available Quantizations

| Filename | Quant Type | Size | Description |
|---|---|---|---|
| GLM-4.7-Flash-Q4_K_M.gguf | Q4_K_M | 16.89 GB | Medium-quality 4-bit; good balance of size and quality |

## Quantization Process

### The Challenge

GLM-4.7-Flash uses a unique architecture that wasn't directly supported by llama.cpp at the time of quantization:

1. **MoE (Mixture of Experts)** - 64 routed experts + 1 shared expert, with 4 experts active per token
2. **MLA (Multi-Head Latent Attention)** - a compressed attention mechanism similar to DeepSeek-V2 that uses low-rank projections

The Glm4MoeLiteForCausalLM architecture combines both features in a way that standard GLM4 support doesn't handle.
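
To make the routing side concrete, here is a minimal sketch of top-k gating with an always-on shared expert. The hidden size, gating function, and expert shapes are illustrative assumptions, not the model's actual implementation:

```python
# Illustrative top-k MoE routing with a shared expert (not the real model code).
import torch

n_experts, top_k, d = 64, 4, 512  # d is an arbitrary hidden size for the demo

gate = torch.nn.Linear(d, n_experts, bias=False)        # router
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
shared_expert = torch.nn.Linear(d, d)                   # always active

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """Route one token's hidden state through 4 of 64 experts plus the shared one."""
    scores = torch.softmax(gate(x), dim=-1)             # (64,) routing probabilities
    weights, idx = torch.topk(scores, top_k)            # keep the top 4 experts
    out = shared_expert(x)                              # shared path, every token
    for w, i in zip(weights.tolist(), idx.tolist()):
        out = out + w * experts[i](x)                   # weighted routed paths
    return out

print(moe_forward(torch.randn(d)).shape)                # torch.Size([512])
```

Only the 4 selected experts (plus the shared one) run for each token, which is how ~30B total parameters yield only ~3B active per forward pass.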

### The Solution

We modified convert_hf_to_gguf.py to register Glm4MoeLiteForCausalLM under the DeepseekV2Model class, which already has proper tensor mappings for MLA attention:

```python
# Added to the DeepseekV2Model registration in convert_hf_to_gguf.py
@ModelBase.register(
    "DeepseekV2ForCausalLM",
    "DeepseekV3ForCausalLM",
    "KimiVLForConditionalGeneration",
    "YoutuForCausalLM",
    "YoutuVLForConditionalGeneration",
    "Glm4MoeLiteForCausalLM",  # <-- added this
)
class DeepseekV2Model(TextModel):
    ...
```

We also registered the tokenizer's pre-tokenizer hash so the converter selects the correct tokenization:

```python
if chkhsh == "cdf5f35325780597efd76153d4d1c16778f766173908894c04afc20108536267":
    res = "glm4"
```
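
For context, the converter derives that fingerprint by tokenizing a fixed check string and hashing the resulting token IDs. Roughly, paraphrased from convert_hf_to_gguf.py (the actual check text is a long string kept in the script, elided here):

```python
# How the pre-tokenizer hash is (roughly) computed; the real check text
# lives in convert_hf_to_gguf.py, so this prints a different hash as written.
from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")
chktxt = "..."  # placeholder for the script's fixed check string
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
print(chkhsh)
```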

### Conversion Pipeline

```text
HuggingFace safetensors (62.5 GB BF16)
    ↓ convert_hf_to_gguf.py (modified)
BF16 GGUF (55.79 GB)
    ↓ llama-quantize
Q4_K_M GGUF (16.89 GB)
```
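
Concretely, the two steps correspond to commands along these lines, wrapped here in a small Python driver for illustration; directory and file names are placeholders:

```python
# Sketch of the two-step conversion pipeline; paths/filenames are placeholders.
import subprocess

# Step 1: HF safetensors -> BF16 GGUF (using the modified converter)
subprocess.run([
    "python", "convert_hf_to_gguf.py", "GLM-4.7-Flash/",
    "--outtype", "bf16", "--outfile", "GLM-4.7-Flash-BF16.gguf",
], check=True)

# Step 2: BF16 GGUF -> Q4_K_M via llama.cpp's quantizer
subprocess.run([
    "./llama-quantize", "GLM-4.7-Flash-BF16.gguf",
    "GLM-4.7-Flash-Q4_K_M.gguf", "Q4_K_M",
], check=True)
```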

### Quantization Notes

- 47 of 563 tensors required fallback quantization (q5_0 instead of q4_K) due to dimension requirements; see the sketch below
- MoE expert weights compressed from 384 MB each to ~108-157 MB
- Total size reduction: ~70% from BF16
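
The fallback happens because, as I understand it, llama.cpp's k-quants pack weights into super-blocks of 256 elements, so a tensor whose row size isn't a multiple of 256 can't use q4_K. A quick check under that assumption:

```python
# Why some tensors fall back: k-quants need row sizes divisible by the
# super-block size (QK_K = 256 in llama.cpp, assumed here).
QK_K = 256

def can_use_k_quant(row_elems: int) -> bool:
    return row_elems % QK_K == 0

print(can_use_k_quant(4096))  # True  -> q4_K is fine
print(can_use_k_quant(1000))  # False -> falls back (e.g. q5_0)
```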

## Testing

Quick inference test with llama-cli (CPU):

- Prompt processing: 93.8 tokens/second
- Generation: 19.8 tokens/second

The model loads and generates coherent text. GPU acceleration with CUDA should provide significantly better performance.

## Usage with llama.cpp

```bash
# Download the model
huggingface-cli download solarkyle/GLM-4.7-Flash-GGUF GLM-4.7-Flash-Q4_K_M.gguf --local-dir .

# Run inference (CPU)
./llama-cli -m GLM-4.7-Flash-Q4_K_M.gguf -p "Hello, I am GLM-4.7-Flash" -n 256

# Run inference (GPU - adjust layers based on VRAM)
./llama-cli -m GLM-4.7-Flash-Q4_K_M.gguf -p "Hello, I am GLM-4.7-Flash" -n 256 -ngl 35
```
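
If you'd rather drive it from Python, here is a minimal sketch using the llama-cpp-python bindings (assuming `pip install llama-cpp-python`; adjust `n_gpu_layers` to your VRAM):

```python
# Minimal llama-cpp-python usage sketch; parameters are starting points.
from llama_cpp import Llama

llm = Llama(
    model_path="GLM-4.7-Flash-Q4_K_M.gguf",
    n_ctx=32768,      # the model's full context length
    n_gpu_layers=35,  # set to 0 for CPU-only
)
out = llm("Hello, I am GLM-4.7-Flash", max_tokens=256)
print(out["choices"][0]["text"])
```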

## Source Code

The conversion scripts and process documentation are available at [github.com/solarkyle/GLM-4.7-Flash-GGUF](https://github.com/solarkyle/GLM-4.7-Flash-GGUF).

## Credits

- zai-org for the original GLM-4.7-Flash model
- The llama.cpp project for the conversion and quantization tooling

## Feedback

This is my first quant - feedback welcome! If you notice any issues with quality, compatibility, or have suggestions for improvement, please open an issue on the GitHub repo or leave a comment here.