---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
tags:
- trellis
- quantized
- moe
- 3-bit
- mixed-precision
- cuda
- glm
- metal-marlin
---

# GLM-4.7-Flash-Trellis-3.8bpw
**Trellis-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **3.78 bits per weight** using sensitivity-aware mixed-precision quantization.

| Metric | Value |
|--------|-------|
| **Effective bits** | 3.78 bpw |
| **Compression** | 4.2× vs FP16 |
| **Model size** | ~14 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |

## Model Description

This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a strong 30B-class model that balances performance and efficiency. GLM-4.7-Flash features:

- **30B-A3B MoE architecture** (64 experts plus a shared expert, 2-4 active per token)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English and Chinese)

## Quantization Details

Quantized using **Trellis** (EXL3-style) with Metal Marlin acceleration.

### Bit Allocation

| Bit Width | Tensors | Parameters | % of Model |
|-----------|---------|------------|------------|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |

### Sensitivity-Aware Allocation

- **8-bit**: Router weights, embeddings, LM head, layer norms
- **6-bit**: Gate layers and attention projections with high outlier ratios
- **4-5-bit**: Standard attention layers (q/k/v/o projections)
- **2-3-bit**: MoE expert layers (lowest sensitivity)

### Quantization Statistics

- **Average MSE**: 0.000223
- **Average RMSE**: 0.0149
- **Quantization time**: ~110 seconds (RTX 3090 Ti)
- **Method**: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor search, and group-wise scales (g=128)

## Files

```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json       # Weight map
├── base_weights.safetensors           # Embeddings, norms (FP16)
├── config.json                        # Model config
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
└── quantization_index.json            # Quantization metadata
```

## Usage

### With Metal Marlin (Apple Silicon)

```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer

model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps",
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Tensor Format

Each quantized tensor is stored as four components:

- `{name}__indices`: Packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors

## Hardware Requirements

| Device | VRAM | Notes |
|--------|------|-------|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |

## Benchmarks

### Original Model Performance (from Z.AI)

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | **91.7** |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |

### Quantized Model (Metal Marlin, M4 Max)

| Metric | Value |
|--------|-------|
| Decode | 5.4 tok/s |
| Prefill (2K) | 42 tok/s |
| Memory | 16.9 GB |

## Limitations

- **Not compatible with standard `transformers`**: loading requires Trellis-aware inference code
- **No speculative decoding** yet
- **Quality loss**: roughly 1-2% on benchmarks vs FP16 (typical for 3-4-bit quantization)

## Credits

- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: [Trellis/EXL3](https://github.com/turboderp/exllamav3)
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)

## Citation

If you use this model, please cite the original GLM-4.5 paper:

```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471}
}
```

## License

This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.
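
## Appendix: Dequantization Sketch

To illustrate how the four components listed under Tensor Format might recombine at load time, here is a minimal NumPy sketch. It is an assumption-laden simplification, not the Metal Marlin implementation: the real Trellis decoder reconstructs values from a Viterbi-coded index stream, whereas this sketch stands in a plain `codebook` lookup over already-unpacked `indices`; `dequantize` and `codebook` are hypothetical names.

```python
import numpy as np

GROUP = 128  # group_size used for the per-group scales


def dequantize(indices, scales, su, sv, codebook):
    """Reconstruct an (out_features, in_features) FP16 weight matrix.

    indices  : (out, in) uint8 codes (already unpacked from the stored stream)
    scales   : (out, in // GROUP) FP16 per-group scales
    su       : (out,) FP16 row scaling factors
    sv       : (in,)  FP16 column scaling factors
    codebook : (256,) FP16 reconstruction value for each code
    """
    w = codebook[indices].astype(np.float32)   # code -> reconstruction value
    w *= np.repeat(scales, GROUP, axis=1)      # rescale each group of 128 columns
    w *= su[:, None] * sv[None, :]             # apply row and column factors
    return w.astype(np.float16)
```

The row/column factors `su`/`sv` correspond to the Hadamard-style preprocessing applied before quantization; they must be multiplied back in (as above) or folded into the kernel epilogue during inference.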