---
language:
  - en
  - zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model: 
  - zai-org/GLM-4.7-Flash
tags:
  - trellis
  - quantized
  - moe
  - 3-bit
  - mixed-precision
  - cuda
  - glm
  - metal-marlin
---

# GLM-4.7-Flash-Trellis-3.8bpw

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" width="15%"/>
</div>

**Trellis-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **3.78 bits per weight** using sensitivity-aware mixed-precision quantization.

| Metric | Value |
|--------|-------|
| **Effective bits** | 3.78 bpw |
| **Compression** | 4.2× vs FP16 |
| **Model size** | ~14 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |

## Model Description

This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-class MoE model that balances strong benchmark performance with efficient inference.

GLM-4.7-Flash features:
- **30B-A3B MoE architecture** (64 experts + shared expert, 2-4 active per token)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English + Chinese)

## Quantization Details

Quantized using **Trellis** (EXL3-style) with Metal Marlin acceleration:

### Bit Allocation

| Bit Width | Tensors | Parameters | % of Model |
|-----------|---------|------------|------------|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |
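
The headline figure follows directly from this table as a parameter-weighted average: (6 × 9.4 + 3 × 8.6 + 2 × 8.6 + 4 × 2.1 + 5 × 0.59) / 29.3 ≈ 3.78 bits per weight.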

### Sensitivity-Aware Allocation

- **8-bit**: Router weights, embeddings, LM head, layer norms
- **6-bit**: Gate layers, attention projections with high outlier ratios
- **4-5 bit**: Standard attention layers (q/k/v/o projections)
- **2-3 bit**: MoE expert layers (lowest sensitivity)
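
A minimal sketch of how such a policy can be expressed, assuming hypothetical tensor-name patterns and a normalized sensitivity score in [0, 1]; the actual rules and thresholds used for this checkpoint live in the Metal Marlin toolkit:

```python
# Hypothetical bit-allocation policy. Name patterns and thresholds are
# illustrative, not the exact rules used to produce this checkpoint.
def assign_bits(tensor_name: str, sensitivity: float) -> int:
    """Map a weight tensor to a bit width, most sensitive first."""
    if any(k in tensor_name for k in ("router", "embed", "lm_head", "norm")):
        return 8  # critical tensors stay near full precision
    if "gate" in tensor_name or sensitivity > 0.9:
        return 6  # gate layers and high-outlier attention projections
    if any(k in tensor_name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 5 if sensitivity > 0.5 else 4  # standard attention layers
    if "experts" in tensor_name:
        return 3 if sensitivity > 0.5 else 2  # experts tolerate low precision
    return 4  # conservative default for anything unmatched
```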

### Quantization Statistics

- **Average MSE**: 0.000223
- **Average RMSE**: 0.0149
- **Quantization time**: ~110 seconds (RTX 3090 Ti)
- **Method**: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor, group-wise scales (g=128)
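
A rough sketch of the group-wise scaling step, with round-to-nearest standing in for the Viterbi trellis search (the real method encodes each group against a trellis codebook; this only illustrates the Hadamard rotation and g=128 scale mechanics):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_groupwise(W: np.ndarray, bits: int = 3, g: int = 128):
    """Hadamard-rotate, then quantize in groups of g with per-group FP16 scales.
    Round-to-nearest stands in for the Viterbi trellis search."""
    H = hadamard(W.shape[1])
    Wr = W @ H                                  # rotation spreads outliers
    groups = Wr.reshape(-1, g)                  # contiguous groups of g weights
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    deq = (q * scales).reshape(W.shape) @ H.T   # H is orthogonal: H^-1 = H^T
    return q.astype(np.int8), scales.astype(np.float16), deq

W = np.random.randn(256, 256).astype(np.float32)
q, s, deq = quantize_groupwise(W)
print("RMSE:", np.sqrt(np.mean((W - deq) ** 2)))
```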

## Files

```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json       # Weight map
├── base_weights.safetensors           # Embeddings, norms (FP16)
├── config.json                        # Model config
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
└── quantization_index.json            # Quantization metadata
```
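
Since the shards follow the standard HuggingFace sharded-safetensors layout, the shard holding any given tensor can be located from the index file without loading the whole model:

```python
import json

# model.safetensors.index.json maps each tensor name to its shard file
# (standard HuggingFace sharded-checkpoint layout).
with open("GLM-4.7-Flash-Trellis-MM/model.safetensors.index.json") as f:
    index = json.load(f)

weight_map = index["weight_map"]
print(len(weight_map), "tensors across", len(set(weight_map.values())), "shards")
```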

## Usage

### With Metal Marlin (Apple Silicon)

```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer

model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps"
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
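
The prompt above hand-writes GLM's special tokens; if the upstream tokenizer ships a chat template (recent GLM releases do), the equivalent input can be built with `apply_chat_template`:

```python
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("mps")
```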


### Tensor Format

Each quantized tensor has 4 components:
- `{name}__indices`: Packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors
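
A sketch of pulling the four components for one quantized tensor out of a shard with the `safetensors` library (the tensor name here is illustrative; actual names come from the weight map):

```python
from safetensors import safe_open

name = "model.layers.0.mlp.experts.0.down_proj"  # illustrative tensor name
with safe_open("model-00001-of-00007.safetensors", framework="pt") as f:
    indices = f.get_tensor(f"{name}__indices")   # packed uint8 Trellis indices
    scales  = f.get_tensor(f"{name}__scales")    # FP16 scale per group of 128
    su      = f.get_tensor(f"{name}__su")        # FP16 row scaling factors
    sv      = f.get_tensor(f"{name}__sv")        # FP16 column scaling factors
print(indices.dtype, scales.shape, su.shape, sv.shape)
```

Dequantization then expands the packed indices through the trellis codebook and applies `scales`, `su`, and `sv`; that logic is implemented by the Trellis-aware loader in Metal Marlin.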

## Hardware Requirements

| Device | VRAM | Notes |
|--------|------|-------|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |

## Benchmarks

### Original Model Performance (from Z.AI)

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | **91.7** |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |

### Quantized Model (Metal Marlin, M4 Max)

| Metric | Value |
|--------|-------|
| Decode | 5.4 tok/s |
| Prefill (2K) | 42 tok/s |
| Memory | 16.9 GB |

## Limitations

- **Not compatible with standard transformers**: requires Trellis-aware inference code
- **No speculative decoding** yet
- **Quality loss**: ~1-2% on benchmarks vs FP16 (typical for 3-4 bit quantization)

## Credits

- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: [Trellis/EXL3](https://github.com/turboderp/exllamav3)
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)

## Citation

If you use this model, please cite the original GLM-4.5 paper:

```bibtex
@misc{glm2025glm45,
      title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models}, 
      author={GLM Team and Aohan Zeng and Xin Lv and others},
      year={2025},
      eprint={2508.06471},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.06471}, 
}
```

## License

This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.