---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
tags:
- trellis
- quantized
- moe
- 3-bit
- mixed-precision
- cuda
- glm
- metal-marlin
---
# GLM-4.7-Flash-Trellis-3.8bpw

Trellis-quantized GLM-4.7-Flash: a 30B-A3B MoE model compressed to 3.78 bits per weight using sensitivity-aware mixed-precision quantization.
| Metric | Value |
|---|---|
| Effective bits | 3.78 bpw |
| Compression | 4.2× vs FP16 |
| Model size | ~14 GB (vs ~60 GB FP16) |
| Parameters | 29.3B |
| Format | HuggingFace sharded safetensors |
## Model Description

This is a quantized version of zai-org/GLM-4.7-Flash, the strongest model in the 30B class, balancing performance and efficiency.
GLM-4.7-Flash features:
- 30B-A3B MoE architecture (64 experts + shared expert, 2-4 active per token)
- Multi-head Latent Attention (MLA) for 8× KV cache compression
- State-of-the-art reasoning (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- Bilingual (English + Chinese)
## Quantization Details
Quantized using Trellis (EXL3-style) with Metal Marlin acceleration:
### Bit Allocation
| Bit Width | Tensors | Parameters | % of Model |
|---|---|---|---|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |
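As a sanity check, the effective bit width is the parameter-weighted average of this table: (6 × 9.4 + 3 × 8.6 + 2 × 8.6 + 4 × 2.1 + 5 × 0.59) / 29.3 ≈ 3.78 bpw, and 16 / 3.78 ≈ 4.2 gives the quoted compression ratio over FP16.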
### Sensitivity-Aware Allocation
- 8-bit: Router weights, embeddings, LM head, layer norms
- 6-bit: Gate layers, attention projections with high outlier ratios
- 4-5 bit: Standard attention layers (q/k/v/o projections)
- 2-3 bit: MoE expert layers (lowest sensitivity)
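The allocator itself is not published here, but the policy above amounts to picking a bit width per tensor from its role and measured sensitivity. A minimal name-pattern sketch of that outcome (the regexes and the `assign_bits` helper are hypothetical illustrations, not the toolkit's API; the real allocation is driven by per-tensor sensitivity measurements):

```python
import re

# Hypothetical mapping from tensor name to bit width, mirroring the
# sensitivity-aware policy above (most sensitive tensors first).
_ALLOCATION_RULES = [
    (r"router|embed_tokens|lm_head|norm", 8),  # routers, embeddings, LM head, norms
    (r"\.gate\b",                         6),  # gate layers / high-outlier projections
    (r"self_attn\.(q|k|v|o)_proj",        5),  # standard attention (4-5 bit in practice)
    (r"experts\.\d+\.",                   2),  # MoE experts: lowest sensitivity (2-3 bit)
]

def assign_bits(tensor_name: str, default: int = 4) -> int:
    """Pick a bit width for a tensor from its name; default covers the rest."""
    for pattern, bits in _ALLOCATION_RULES:
        if re.search(pattern, tensor_name):
            return bits
    return default

print(assign_bits("model.layers.3.mlp.experts.17.down_proj.weight"))  # -> 2
```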
### Quantization Statistics
- Average MSE: 0.000223
- Average RMSE: 0.0149
- Quantization time: ~110 seconds (RTX 3090 Ti)
- Method: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor, group-wise scales (g=128)
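The pipeline is not reproduced in this card, but two of the named preprocessing steps are easy to illustrate. A minimal NumPy sketch of Hadamard preprocessing plus group-wise scales, assuming a power-of-two column count (the su/sv factors and the Viterbi trellis search itself are omitted):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def preprocess(W: np.ndarray, group_size: int = 128):
    """Rotate weights to spread outliers, then compute per-group FP16 scales."""
    H = hadamard(W.shape[1])
    W_rot = W @ H                                    # rotation flattens outliers
    groups = W_rot.reshape(W.shape[0], -1, group_size)
    scales = np.abs(groups).max(axis=-1).astype(np.float16)  # one scale per 128 weights
    return W_rot, scales

W = np.random.randn(256, 512).astype(np.float32)
W_rot, scales = preprocess(W)
print(scales.shape)  # (256, 4): 512 / 128 = 4 groups per row
```

The rotation spreads weight outliers across each group, so a single FP16 scale per 128 weights loses less precision than it would on the raw weights.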
## Files
```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json       # Weight map
├── base_weights.safetensors           # Embeddings, norms (FP16)
├── config.json                        # Model config
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
└── quantization_index.json            # Quantization metadata
```
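The shards follow the standard HuggingFace sharded-safetensors convention: model.safetensors.index.json holds a weight_map from tensor name to shard file. A short sketch of resolving where a tensor lives:

```python
import json

# Standard HF sharding index: {"metadata": {...}, "weight_map": {name: shard}}
with open("GLM-4.7-Flash-Trellis-MM/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

name = next(iter(weight_map))          # first tensor name in the map
print(name, "->", weight_map[name])    # which of the seven shards stores it
```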
## Usage
### With Metal Marlin (Apple Silicon)
```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer

model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps",
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
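The prompt above hard-codes GLM's turn markers. If the tokenizer ships a chat template, the standard transformers `apply_chat_template` call builds the same prompt without them:

```python
# Equivalent prompt construction via the tokenizer's chat template, if one is defined.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("mps")
```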
### Tensor Format
Each quantized tensor has 4 components:

- `{name}__indices`: Packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors
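These components can be inspected directly with the safetensors library. A small sketch that groups the four parts of each quantized tensor by base name (the shard filename comes from the listing above; the suffix split is just for illustration):

```python
from collections import defaultdict
from safetensors import safe_open

components = defaultdict(dict)
with safe_open("model-00001-of-00007.safetensors", framework="pt") as f:
    for key in f.keys():
        if "__" in key:
            base, suffix = key.rsplit("__", 1)              # e.g. ("...weight", "scales")
            components[base][suffix] = f.get_slice(key).get_shape()

base, parts = next(iter(components.items()))
print(base, parts)   # expect indices / scales / su / sv entries with their shapes
```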
## Hardware Requirements
| Device | VRAM | Notes |
|---|---|---|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
## Benchmarks
### Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
### Quantized Model (Metal Marlin, M4 Max)
| Metric | Value |
|---|---|
| Decode | 5.4 tok/s |
| Prefill (2K) | 42 tok/s |
| Memory | 16.9 GB |
## Limitations
- Not compatible with standard transformers; requires Trellis-aware inference code
- No speculative decoding yet
- Quality loss: ~1-2% on benchmarks vs FP16 (typical for 3-4 bit quantization)
## Credits
- Original model: Z.AI / GLM Team
- Quantization method: Trellis/EXL3
- Quantization toolkit: Metal Marlin
## Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
## License
This quantized model inherits the MIT License from the original GLM-4.7-Flash model.