---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
tags:
- trellis
- quantized
- moe
- 3-bit
- mixed-precision
- cuda
- glm
- metal-marlin
---

# GLM-4.7-Flash-Trellis-3.8bpw

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" width="15%"/>
</div>

**Trellis-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **3.78 bits per weight** using sensitivity-aware mixed-precision quantization.

| Metric | Value |
|--------|-------|
| **Effective bits** | 3.78 bpw |
| **Compression** | 4.2× vs FP16 |
| **Model size** | ~14 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | Hugging Face sharded safetensors |

## Model Description

This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-class MoE model that balances performance and efficiency.

GLM-4.7-Flash features:

- **30B-A3B MoE architecture** (64 routed experts plus a shared expert, with 2-4 experts active per token)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **Strong reasoning performance** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English and Chinese)

## Quantization Details

Quantized using **Trellis** (EXL3-style) quantization with Metal Marlin acceleration.

### Bit Allocation

| Bit Width | Tensors | Parameters | % of Model |
|-----------|---------|------------|------------|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |
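
As a sanity check, the headline 3.78 bpw is approximately the parameter-weighted average of the allocation table; the small residual comes from rounding in the percentage column and from the FP16 base weights stored separately.

```python
# Sanity check: effective bits per weight as the parameter-weighted average
# of the allocation table (shares are the "% of Model" column as fractions).
allocation = {6: 0.322, 3: 0.293, 2: 0.293, 4: 0.072, 5: 0.020}

effective_bpw = sum(bits * share for bits, share in allocation.items())
print(f"{effective_bpw:.3f} bpw")  # close to the headline 3.78 bpw
```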

### Sensitivity-Aware Allocation

- **8-bit**: Router weights, embeddings, LM head, layer norms
- **6-bit**: Gate layers and attention projections with high outlier ratios
- **4-5 bit**: Standard attention layers (q/k/v/o projections)
- **2-3 bit**: MoE expert layers (lowest sensitivity)
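
The real toolkit scores each tensor's sensitivity from calibration error, but the resulting tiers above can be sketched as a name-pattern lookup. This is illustrative only; the patterns and the helper name are assumptions, not the toolkit's actual API.

```python
# Illustrative sketch of tier assignment by tensor-name pattern. The actual
# allocator measures per-tensor sensitivity; these patterns are assumptions.
def assign_bits(tensor_name: str) -> int:
    if any(k in tensor_name for k in ("router", "embed", "lm_head", "norm")):
        return 8  # most sensitive: keep near-lossless
    if "experts" in tensor_name:
        return 2  # MoE expert weights: least sensitive, most parameters
    if "gate" in tensor_name:
        return 6  # gating layers tolerate little error
    if any(k in tensor_name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 4  # standard attention projections
    return 4      # conservative default

assign_bits("model.layers.3.mlp.experts.17.down_proj")  # -> 2
```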

### Quantization Statistics

- **Average MSE**: 0.000223
- **Average RMSE**: 0.0149
- **Quantization time**: ~110 seconds (RTX 3090 Ti)
- **Method**: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor search, and group-wise scales (g=128)
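
The Trellis codebook search itself (Hadamard preprocessing plus Viterbi decoding) is not reproduced here, but the group-wise scaling with g=128 is easy to illustrate: each run of 128 weights shares one FP16 scale. The sketch below uses plain round-to-nearest in place of the trellis search, so its error is higher than the real method's.

```python
import numpy as np

# Illustrative group-wise quantization with g=128. Round-to-nearest stands in
# for the actual Trellis codebook search, so this overestimates the error.
def groupwise_quantize(w: np.ndarray, bits: int = 3, g: int = 128):
    groups = w.reshape(-1, g)                    # one row per group of 128
    qmax = 2 ** (bits - 1) - 1                   # symmetric range, e.g. +/-3
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales.astype(np.float16)          # FP16 per-group scales

def groupwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = groupwise_quantize(w)
rmse = float(np.sqrt(np.mean((w - groupwise_dequantize(q, s)) ** 2)))
```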

## Files

```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json       # Weight map
├── base_weights.safetensors           # Embeddings, norms (FP16)
├── config.json                        # Model config
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
└── quantization_index.json            # Quantization metadata
```
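
The sharded layout follows the standard Hugging Face convention: `model.safetensors.index.json` maps each tensor name to the shard that stores it, so a loader can group tensors by shard and open each file once. The tensor names in this sketch are illustrative, not copied from the real index.

```python
import collections

# Standard HF sharded-checkpoint index shape:
# {"metadata": {"total_size": ...}, "weight_map": {tensor_name: shard_file}}.
# Tensor names here are made up for illustration.
index = {
    "metadata": {"total_size": 14_000_000_000},
    "weight_map": {
        "layers.0.mlp.experts.0.down_proj__indices": "model-00001-of-00007.safetensors",
        "layers.0.mlp.experts.0.down_proj__scales": "model-00001-of-00007.safetensors",
        "layers.9.mlp.experts.63.up_proj__indices": "model-00007-of-00007.safetensors",
    },
}

# Group tensor names by shard so each shard file is opened only once.
by_shard = collections.defaultdict(list)
for name, shard in index["weight_map"].items():
    by_shard[shard].append(name)
```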

## Usage

### With Metal Marlin (Apple Silicon)

```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer

model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps",
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Tensor Format

Each quantized tensor is stored as four components:

- `{name}__indices`: packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors
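
Assuming the packed indices have already been decoded into a quantized value grid, reconstruction from the four components might be sketched as below. This is a guess at the composition order; the shapes, the rank-1 `su`/`sv` rescaling, and the `decoded` placeholder are all assumptions, and the real kernel fuses these steps on-device.

```python
import numpy as np

# Illustrative reconstruction from the four stored components. `decoded`
# stands in for the output of the Trellis index decoder, which is not
# reproduced here; shapes and composition order are assumptions.
def reconstruct(decoded: np.ndarray, scales: np.ndarray,
                su: np.ndarray, sv: np.ndarray, g: int = 128) -> np.ndarray:
    rows, cols = decoded.shape
    # Broadcast one per-group scale across each run of g=128 columns.
    per_col = np.repeat(scales, g, axis=1)[:, :cols]
    # Apply group scales, then the row (su) and column (sv) factors.
    return su[:, None] * (decoded * per_col) * sv[None, :]

w = reconstruct(
    decoded=np.ones((4, 256), dtype=np.float32),
    scales=np.full((4, 2), 0.5, dtype=np.float32),
    su=np.ones(4, dtype=np.float32),
    sv=np.ones(256, dtype=np.float32),
)
```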

## Hardware Requirements

| Device | Unified Memory | Notes |
|--------|----------------|-------|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |

## Benchmarks

### Original Model Performance (from Z.AI)

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | **91.7** |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |

### Quantized Model (Metal Marlin, M4 Max)

| Metric | Value |
|--------|-------|
| Decode | 5.4 tok/s |
| Prefill (2K tokens) | 42 tok/s |
| Memory | 16.9 GB |

## Limitations

- **Not loadable with standard `transformers`**: requires Trellis-aware inference code
- **No speculative decoding** support yet
- **Quality loss**: roughly 1-2% on benchmarks vs FP16, typical for 3-4 bit quantization

## Credits

- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: [Trellis/EXL3](https://github.com/turboderp/exllamav3)
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)

## Citation

If you use this model, please cite the original GLM-4.5 paper:

```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```

## License

This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.
|