---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
tags:
- trellis
- quantized
- moe
- 3-bit
- mixed-precision
- cuda
- glm
- metal-marlin
---
# GLM-4.7-Flash-Trellis-3.8bpw
<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" width="15%"/>
</div>
**Trellis-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **3.78 bits per weight** using sensitivity-aware mixed-precision quantization.
| Metric | Value |
|--------|-------|
| **Effective bits** | 3.78 bpw |
| **Compression** | 4.2× vs FP16 |
| **Model size** | ~14 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |
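A quick back-of-the-envelope check ties these numbers together (values taken from the table above; sizes are approximate):
```python
# Sanity check: effective bpw * parameter count ~ on-disk size.
params = 29.3e9        # total parameters
bpw = 3.78             # effective bits per weight after mixed-precision quantization

quant_gb = params * bpw / 8 / 1e9   # ~13.8 GB -> the "~14 GB" figure above
fp16_gb = params * 16 / 8 / 1e9     # ~58.6 GB -> the "~60 GB" FP16 baseline
print(f"{quant_gb:.1f} GB quantized, {fp16_gb:.1f} GB FP16, {fp16_gb / quant_gb:.1f}x compression")
```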
## Model Description
This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), the strongest model in the 30B class, balancing performance and efficiency.
GLM-4.7-Flash features:
- **30B-A3B MoE architecture** (64 experts + shared expert, 2-4 active per token)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English + Chinese)
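The "A3B" (roughly 3B active parameters) behavior comes from routing each token to only a few of the 64 experts. A minimal, generic top-k routing sketch (illustrative only, not the GLM implementation):
```python
import torch

def route(hidden, router_weight, k=4):
    """Pick the top-k experts per token and renormalize their gate weights."""
    # hidden: [tokens, d_model], router_weight: [n_experts, d_model]
    logits = hidden @ router_weight.T                    # [tokens, n_experts]
    gates, experts = torch.topk(logits.softmax(-1), k, dim=-1)
    gates = gates / gates.sum(-1, keepdim=True)          # weights over the chosen experts
    return gates, experts                                # only these experts run for each token

gates, experts = route(torch.randn(8, 2048), torch.randn(64, 2048), k=4)
```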
## Quantization Details
Quantized using **Trellis** (EXL3-style) with Metal Marlin acceleration:
### Bit Allocation
| Bit Width | Tensors | Parameters | % of Model |
|-----------|---------|------------|------------|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |
### Sensitivity-Aware Allocation
- **8-bit**: Router weights, embeddings, LM head, layer norms
- **6-bit**: Gate layers, attention projections with high outlier ratios
- **4-5 bit**: Standard attention layers (q/k/v/o projections)
- **2-3 bit**: MoE expert layers (lowest sensitivity)
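A minimal sketch of how such a sensitivity-driven policy can be expressed (the tier rules and outlier threshold here are illustrative; the actual per-tensor decisions for this checkpoint are recorded in `quantization_index.json`):
```python
def assign_bits(name: str, outlier_ratio: float) -> int:
    """Hypothetical bit-allocation policy mirroring the tiers listed above."""
    if any(k in name for k in ("router", "embed", "lm_head", "norm")):
        return 8   # most sensitive tensors stay near-lossless
    if "gate" in name or outlier_ratio > 0.05:
        return 6   # gate layers and outlier-heavy projections
    if any(k in name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 4   # standard attention projections (4-5 bit tier)
    if "experts" in name:
        return 2   # MoE expert weights tolerate the lowest precision (2-3 bit tier)
    return 3
```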
### Quantization Statistics
- **Average MSE**: 0.000223
- **Average RMSE**: 0.0149
- **Quantization time**: ~110 seconds (RTX 3090 Ti)
- **Method**: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor, group-wise scales (g=128)
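For orientation, group-wise scales with g=128 mean that every run of 128 weights along the input dimension shares one FP16 scale. A toy version of just that scaling step (the Hadamard rotation and the trellis/Viterbi search themselves are omitted):
```python
import torch

def groupwise_scales(weight: torch.Tensor, group_size: int = 128, bits: int = 3):
    # weight: [out_features, in_features]; assumes in_features % group_size == 0.
    out_f, in_f = weight.shape
    groups = weight.reshape(out_f, in_f // group_size, group_size)
    max_abs = groups.abs().amax(dim=-1)        # one magnitude per group of 128 weights
    qmax = 2 ** (bits - 1) - 1                 # symmetric integer range, e.g. +/-3 at 3-bit
    return (max_abs / qmax).to(torch.float16)  # [out_features, in_features // group_size]
```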
## Files
```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json        # Weight map
├── base_weights.safetensors            # Embeddings, norms (FP16)
├── config.json                         # Model config
├── tokenizer.json                      # Tokenizer
├── tokenizer_config.json
└── quantization_index.json             # Quantization metadata
```
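To pull every shard plus the metadata files locally in one step (standard `huggingface_hub` usage; `local_dir` is an arbitrary choice):
```python
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    local_dir="GLM-4.7-Flash-Trellis-MM",   # downloads shards, index, and quantization metadata
)
print(local_path)
```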
## Usage
### With Metal Marlin (Apple Silicon)
```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer
model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps",
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")
prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
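The prompt above hard-codes GLM's chat markers. If the tokenizer ships a chat template, the same request can be built with `apply_chat_template`, which is less error-prone:
```python
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant turn marker
    return_tensors="pt",
).to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```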
### Tensor Format
Each quantized tensor has 4 components:
- `{name}__indices`: Packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors
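These components can be inspected with the `safetensors` library alone, without any Trellis-specific code:
```python
from safetensors import safe_open

# Print the first few stored tensors; quantized weights appear as
# __indices / __scales / __su / __sv groups.
with safe_open("model-00001-of-00007.safetensors", framework="pt") as f:
    for key in sorted(f.keys())[:8]:
        t = f.get_tensor(key)
        print(key, tuple(t.shape), t.dtype)
```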
## Hardware Requirements
| Device | Unified memory | Notes |
|--------|------|-------|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
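A quick pre-flight check that the Metal (MPS) backend is actually available (standard PyTorch calls):
```python
import torch

assert torch.backends.mps.is_available(), "MPS backend not available on this machine"
print(f"{torch.mps.current_allocated_memory() / 1e9:.1f} GB currently allocated on MPS")
```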
## Benchmarks
### Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | **91.6** | 85.0 | 91.7 |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |
### Quantized Model (Metal Marlin, M4 Max)
| Metric | Value |
|--------|-------|
| Decode | 5.4 tok/s |
| Prefill (2K) | 42 tok/s |
| Memory | 16.9 GB |
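Decode throughput can be reproduced with a simple wall-clock measurement (assumes the model, tokenizer, and `input_ids` from the usage example; the timing includes prefill, so it slightly understates pure decode speed):
```python
import time

start = time.perf_counter()
output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start
generated = output.shape[1] - input_ids.shape[1]
print(f"{generated / elapsed:.1f} tok/s")
```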
## Limitations
- **Not compatible with standard transformers**; requires Trellis-aware inference code
- **No speculative decoding** yet
- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 3-4 bit quantization)
## Credits
- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: [Trellis/EXL3](https://github.com/turboderp/exllamav3)
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)
## Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
author={GLM Team and Aohan Zeng and Xin Lv and others},
year={2025},
eprint={2508.06471},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.06471},
}
```
## License
This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.