---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
tags:
- trellis
- quantized
- moe
- 3-bit
- mixed-precision
- cuda
- glm
- metal-marlin
---
# GLM-4.7-Flash-Trellis-3.8bpw

Trellis-quantized GLM-4.7-Flash: a 30B-A3B MoE model compressed to 3.78 bits per weight using sensitivity-aware mixed-precision quantization.
| Metric | Value |
|---|---|
| Effective bits | 3.78 bpw |
| Compression | 4.2× vs FP16 |
| Model size | ~14 GB (vs ~60 GB FP16) |
| Parameters | 29.3B |
| Format | HuggingFace sharded safetensors |
## Model Description

This is a quantized version of zai-org/GLM-4.7-Flash, the strongest model in the 30B class, balancing performance and efficiency.
GLM-4.7-Flash features:
- 30B-A3B MoE architecture (64 experts + shared expert, 2-4 active per token)
- Multi-head Latent Attention (MLA) for 8× KV cache compression
- State-of-the-art reasoning (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- Bilingual (English + Chinese)
## Quantization Details
Quantized using Trellis (EXL3-style) with Metal Marlin acceleration:
### Bit Allocation
| Bit Width | Tensors | Parameters | % of Model |
|---|---|---|---|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |
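As a sanity check, the effective bit width is the parameter-weighted average of this table: (6 × 9.4 + 3 × 8.6 + 2 × 8.6 + 4 × 2.1 + 5 × 0.59) / 29.3 ≈ 3.78 bpw, and 16 / 3.78 ≈ 4.2 gives the quoted compression ratio over FP16.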
### Sensitivity-Aware Allocation
- 8-bit: Router weights, embeddings, LM head, layer norms
- 6-bit: Gate layers, attention projections with high outlier ratios
- 4-5 bit: Standard attention layers (q/k/v/o projections)
- 2-3 bit: MoE expert layers (lowest sensitivity)
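The allocator itself is not published here, but the policy above amounts to picking a bit width per tensor from its role and measured sensitivity. A minimal name-pattern sketch of that outcome (the regexes and the `assign_bits` helper are hypothetical illustrations, not the toolkit's API; the real allocation is driven by per-tensor sensitivity measurements):

```python
import re

# Hypothetical mapping from tensor name to bit width, mirroring the
# sensitivity-aware policy above (most sensitive tensors first).
_ALLOCATION_RULES = [
    (r"router|embed_tokens|lm_head|norm", 8),  # routers, embeddings, LM head, norms
    (r"\.gate\b",                         6),  # gate layers / high-outlier projections
    (r"self_attn\.(q|k|v|o)_proj",        5),  # standard attention (4-5 bit in practice)
    (r"experts\.\d+\.",                   2),  # MoE experts: lowest sensitivity (2-3 bit)
]

def assign_bits(tensor_name: str, default: int = 4) -> int:
    """Pick a bit width for a tensor from its name; default covers the rest."""
    for pattern, bits in _ALLOCATION_RULES:
        if re.search(pattern, tensor_name):
            return bits
    return default

print(assign_bits("model.layers.3.mlp.experts.17.down_proj.weight"))  # -> 2
```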
### Quantization Statistics
- Average MSE: 0.000223
- Average RMSE: 0.0149
- Quantization time: ~110 seconds (RTX 3090 Ti)
- Method: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor, group-wise scales (g=128)
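The pipeline is not reproduced in this card, but two of the named preprocessing steps are easy to illustrate. A minimal NumPy sketch of Hadamard preprocessing plus group-wise scales, assuming a power-of-two column count (the su/sv factors and the Viterbi trellis search itself are omitted):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def preprocess(W: np.ndarray, group_size: int = 128):
    """Rotate weights to spread outliers, then compute per-group FP16 scales."""
    H = hadamard(W.shape[1])
    W_rot = W @ H                                    # rotation flattens outliers
    groups = W_rot.reshape(W.shape[0], -1, group_size)
    scales = np.abs(groups).max(axis=-1).astype(np.float16)  # one scale per 128 weights
    return W_rot, scales

W = np.random.randn(256, 512).astype(np.float32)
W_rot, scales = preprocess(W)
print(scales.shape)  # (256, 4): 512 / 128 = 4 groups per row
```

The rotation spreads weight outliers across each group, so a single FP16 scale per 128 weights loses less precision than it would on the raw weights.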
## Files
```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json       # Weight map
├── base_weights.safetensors           # Embeddings, norms (FP16)
├── config.json                        # Model config
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
└── quantization_index.json            # Quantization metadata
```
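The shards follow the standard HuggingFace sharded-safetensors convention: model.safetensors.index.json holds a weight_map from tensor name to shard file. A short sketch of resolving where a tensor lives:

```python
import json

# Standard HF sharding index: {"metadata": {...}, "weight_map": {name: shard}}
with open("GLM-4.7-Flash-Trellis-MM/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

name = next(iter(weight_map))          # first tensor name in the map
print(name, "->", weight_map[name])    # which of the seven shards stores it
```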
## Usage
### With Metal Marlin (Apple Silicon)
```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer

model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps",
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
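The prompt above hard-codes GLM's turn markers. If the tokenizer ships a chat template, the standard transformers `apply_chat_template` call builds the same prompt without them:

```python
# Equivalent prompt construction via the tokenizer's chat template, if one is defined.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("mps")
```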
### Tensor Format
Each quantized tensor has 4 components:

- `{name}__indices`: Packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors
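These components can be inspected directly with the safetensors library. A small sketch that groups the four parts of each quantized tensor by base name (the shard filename comes from the listing above; the suffix split is just for illustration):

```python
from collections import defaultdict
from safetensors import safe_open

components = defaultdict(dict)
with safe_open("model-00001-of-00007.safetensors", framework="pt") as f:
    for key in f.keys():
        if "__" in key:
            base, suffix = key.rsplit("__", 1)              # e.g. ("...weight", "scales")
            components[base][suffix] = f.get_slice(key).get_shape()

base, parts = next(iter(components.items()))
print(base, parts)   # expect indices / scales / su / sv entries with their shapes
```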
## Hardware Requirements
| Device | VRAM | Notes |
|---|---|---|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
## Benchmarks
### Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
### Quantized Model (Metal Marlin, M4 Max)
| Metric | Value |
|---|---|
| Decode | 5.4 tok/s |
| Prefill (2K) | 42 tok/s |
| Memory | 16.9 GB |
## Limitations
- Not compatible with standard transformers; requires Trellis-aware inference code
- No speculative decoding yet
- Quality loss: ~1-2% on benchmarks vs FP16 (typical for 3-4 bit quantization)
## Credits
- Original model: Z.AI / GLM Team
- Quantization method: Trellis/EXL3
- Quantization toolkit: Metal Marlin
## Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
## License
This quantized model inherits the MIT License from the original GLM-4.7-Flash model.