---
license: apache-2.0
base_model: zai-org/GLM-4.7-Flash
base_model_relation: quantized
library_name: gguf
pipeline_tag: text-generation
tags:
  - gguf
  - quantized
  - llama-cpp
  - llama.cpp
  - moe
  - glm4
  - 4-bit
  - Q4_K_M
model_type: glm4-moe-lite
quantized_by: solarkyle
---

# GLM-4.7-Flash-GGUF-Q4_K_M

GGUF quantization of zai-org/GLM-4.7-Flash for use with llama.cpp and compatible inference engines.

This is my first quant! Let me know if something seems off or if you run into issues.

## Model Details

| Property | Value |
|---|---|
| Architecture | Glm4MoeLiteForCausalLM (MoE with MLA) |
| Total Parameters | ~30B |
| Active Parameters | ~3B per forward pass |
| Experts | 64 routed + 1 shared, 4 active per token |
| Attention | Multi-Head Latent Attention (MLA) |
| Context Length | 32,768 tokens |
| Original Format | BF16 safetensors (62.5 GB) |

## Available Quantizations

| Filename | Quant Type | Size | Description |
|---|---|---|---|
| GLM-4.7-Flash-Q4_K_M.gguf | Q4_K_M | 16.89 GB | Medium-quality 4-bit; good balance of size and quality |

## Quantization Process

### The Challenge

GLM-4.7-Flash uses a unique architecture that wasn't directly supported by llama.cpp at the time of quantization:

1. **MoE (Mixture of Experts)** - 64 routed experts + 1 shared expert, with 4 experts active per token
2. **MLA (Multi-Head Latent Attention)** - a compressed attention mechanism similar to DeepSeek-V2 that uses low-rank projections

The Glm4MoeLiteForCausalLM architecture combines both features in a way that standard GLM4 support doesn't handle.
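
To make the routing side concrete, here is a minimal sketch of top-k gating with an always-on shared expert. The hidden size, gating function, and expert shapes are illustrative assumptions, not the model's actual implementation:

```python
# Illustrative top-k MoE routing with a shared expert (not the real model code).
import torch

n_experts, top_k, d = 64, 4, 512  # d is an arbitrary hidden size for the demo

gate = torch.nn.Linear(d, n_experts, bias=False)        # router
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
shared_expert = torch.nn.Linear(d, d)                   # always active

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """Route one token's hidden state through 4 of 64 experts plus the shared one."""
    scores = torch.softmax(gate(x), dim=-1)             # (64,) routing probabilities
    weights, idx = torch.topk(scores, top_k)            # keep the top 4 experts
    out = shared_expert(x)                              # shared path, every token
    for w, i in zip(weights.tolist(), idx.tolist()):
        out = out + w * experts[i](x)                   # weighted routed paths
    return out

print(moe_forward(torch.randn(d)).shape)                # torch.Size([512])
```

Only the 4 selected experts (plus the shared one) run for each token, which is how ~30B total parameters yield only ~3B active per forward pass.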

### The Solution

We modified convert_hf_to_gguf.py to register Glm4MoeLiteForCausalLM under the DeepseekV2Model class, which already has proper tensor mappings for MLA attention:

```python
# Added to the DeepseekV2Model registration in convert_hf_to_gguf.py
@ModelBase.register(
    "DeepseekV2ForCausalLM",
    "DeepseekV3ForCausalLM",
    "KimiVLForConditionalGeneration",
    "YoutuForCausalLM",
    "YoutuVLForConditionalGeneration",
    "Glm4MoeLiteForCausalLM",  # <-- added this
)
class DeepseekV2Model(TextModel):
    ...
```

We also registered the tokenizer's pre-tokenizer hash so the converter selects the correct tokenization:

```python
if chkhsh == "cdf5f35325780597efd76153d4d1c16778f766173908894c04afc20108536267":
    res = "glm4"
```
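
For context, the converter derives that fingerprint by tokenizing a fixed check string and hashing the resulting token IDs. Roughly, paraphrased from convert_hf_to_gguf.py (the actual check text is a long string kept in the script, elided here):

```python
# How the pre-tokenizer hash is (roughly) computed; the real check text
# lives in convert_hf_to_gguf.py, so this prints a different hash as written.
from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")
chktxt = "..."  # placeholder for the script's fixed check string
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
print(chkhsh)
```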

### Conversion Pipeline

```text
HuggingFace safetensors (62.5 GB BF16)
    ↓ convert_hf_to_gguf.py (modified)
BF16 GGUF (55.79 GB)
    ↓ llama-quantize
Q4_K_M GGUF (16.89 GB)
```
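
Concretely, the two steps correspond to commands along these lines, wrapped here in a small Python driver for illustration; directory and file names are placeholders:

```python
# Sketch of the two-step conversion pipeline; paths/filenames are placeholders.
import subprocess

# Step 1: HF safetensors -> BF16 GGUF (using the modified converter)
subprocess.run([
    "python", "convert_hf_to_gguf.py", "GLM-4.7-Flash/",
    "--outtype", "bf16", "--outfile", "GLM-4.7-Flash-BF16.gguf",
], check=True)

# Step 2: BF16 GGUF -> Q4_K_M via llama.cpp's quantizer
subprocess.run([
    "./llama-quantize", "GLM-4.7-Flash-BF16.gguf",
    "GLM-4.7-Flash-Q4_K_M.gguf", "Q4_K_M",
], check=True)
```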

### Quantization Notes

- 47 of 563 tensors required fallback quantization (q5_0 instead of q4_K) due to dimension requirements; see the sketch below
- MoE expert weights compressed from 384 MB each to ~108-157 MB
- Total size reduction: ~70% from BF16
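
The fallback happens because, as I understand it, llama.cpp's k-quants pack weights into super-blocks of 256 elements, so a tensor whose row size isn't a multiple of 256 can't use q4_K. A quick check under that assumption:

```python
# Why some tensors fall back: k-quants need row sizes divisible by the
# super-block size (QK_K = 256 in llama.cpp, assumed here).
QK_K = 256

def can_use_k_quant(row_elems: int) -> bool:
    return row_elems % QK_K == 0

print(can_use_k_quant(4096))  # True  -> q4_K is fine
print(can_use_k_quant(1000))  # False -> falls back (e.g. q5_0)
```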

## Testing

Quick inference test with llama-cli (CPU):

- Prompt processing: 93.8 tokens/second
- Generation: 19.8 tokens/second

The model loads and generates coherent text. GPU acceleration with CUDA should provide significantly better performance.

## Usage with llama.cpp

```bash
# Download the model
huggingface-cli download solarkyle/GLM-4.7-Flash-GGUF GLM-4.7-Flash-Q4_K_M.gguf --local-dir .

# Run inference (CPU)
./llama-cli -m GLM-4.7-Flash-Q4_K_M.gguf -p "Hello, I am GLM-4.7-Flash" -n 256

# Run inference (GPU - adjust layers based on VRAM)
./llama-cli -m GLM-4.7-Flash-Q4_K_M.gguf -p "Hello, I am GLM-4.7-Flash" -n 256 -ngl 35
```
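
If you'd rather drive it from Python, here is a minimal sketch using the llama-cpp-python bindings (assuming `pip install llama-cpp-python`; adjust `n_gpu_layers` to your VRAM):

```python
# Minimal llama-cpp-python usage sketch; parameters are starting points.
from llama_cpp import Llama

llm = Llama(
    model_path="GLM-4.7-Flash-Q4_K_M.gguf",
    n_ctx=32768,      # the model's full context length
    n_gpu_layers=35,  # set to 0 for CPU-only
)
out = llm("Hello, I am GLM-4.7-Flash", max_tokens=256)
print(out["choices"][0]["text"])
```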

## Source Code

The conversion scripts and process documentation are available at [github.com/solarkyle/GLM-4.7-Flash-GGUF](https://github.com/solarkyle/GLM-4.7-Flash-GGUF).

## Credits

- zai-org for the original GLM-4.7-Flash model
- The llama.cpp project for the conversion and quantization tooling

## Feedback

This is my first quant - feedback welcome! If you notice any issues with quality, compatibility, or have suggestions for improvement, please open an issue on the GitHub repo or leave a comment here.