---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
tags:
- trellis
- quantized
- moe
- 3-bit
- mixed-precision
- cuda
- glm
- metal-marlin
---

# GLM-4.7-Flash-Trellis-3.8bpw
**Trellis-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **3.78 bits per weight** using sensitivity-aware mixed-precision quantization.

| Metric | Value |
|--------|-------|
| **Effective bits** | 3.78 bpw |
| **Compression** | 4.2× vs FP16 |
| **Model size** | ~14 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | HuggingFace sharded safetensors |

## Model Description

This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a strong 30B-class model that balances performance and efficiency. GLM-4.7-Flash features:

- **30B-A3B MoE architecture** (64 experts plus a shared expert, 2-4 active per token)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **State-of-the-art reasoning** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English and Chinese)

## Quantization Details

Quantized using **Trellis** (EXL3-style) with Metal Marlin acceleration.

### Bit Allocation

| Bit Width | Tensors | Parameters | % of Model |
|-----------|---------|------------|------------|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |

### Sensitivity-Aware Allocation

- **8-bit**: Router weights, embeddings, LM head, layer norms
- **6-bit**: Gate layers and attention projections with high outlier ratios
- **4-5-bit**: Standard attention layers (q/k/v/o projections)
- **2-3-bit**: MoE expert layers (lowest sensitivity)

### Quantization Statistics

- **Average MSE**: 0.000223
- **Average RMSE**: 0.0149
- **Quantization time**: ~110 seconds (RTX 3090 Ti)
- **Method**: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor search, and group-wise scales (g=128)

## Files

```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json       # Weight map
├── base_weights.safetensors           # Embeddings, norms (FP16)
├── config.json                        # Model config
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
└── quantization_index.json            # Quantization metadata
```

## Usage

### With Metal Marlin (Apple Silicon)

```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer

model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps",
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Tensor Format

Each quantized tensor is stored as four components:

- `{name}__indices`: Packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors

## Hardware Requirements

| Device | VRAM | Notes |
|--------|------|-------|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |

## Benchmarks

### Original Model Performance (from Z.AI)

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | **91.7** |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |

### Quantized Model (Metal Marlin, M4 Max)

| Metric | Value |
|--------|-------|
| Decode | 5.4 tok/s |
| Prefill (2K) | 42 tok/s |
| Memory | 16.9 GB |

## Limitations

- **Not compatible with standard `transformers`**: loading requires Trellis-aware inference code
- **No speculative decoding** yet
- **Quality loss**: roughly 1-2% on benchmarks vs FP16 (typical for 3-4-bit quantization)

## Credits

- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: [Trellis/EXL3](https://github.com/turboderp/exllamav3)
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)

## Citation

If you use this model, please cite the original GLM-4.5 paper:

```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471}
}
```

## License

This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.
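
## Appendix: Dequantization Sketch

To illustrate how the four components listed under Tensor Format might recombine at load time, here is a minimal NumPy sketch. It is an assumption-laden simplification, not the Metal Marlin implementation: the real Trellis decoder reconstructs values from a Viterbi-coded index stream, whereas this sketch stands in a plain `codebook` lookup over already-unpacked `indices`; `dequantize` and `codebook` are hypothetical names.

```python
import numpy as np

GROUP = 128  # group_size used for the per-group scales


def dequantize(indices, scales, su, sv, codebook):
    """Reconstruct an (out_features, in_features) FP16 weight matrix.

    indices  : (out, in) uint8 codes (already unpacked from the stored stream)
    scales   : (out, in // GROUP) FP16 per-group scales
    su       : (out,) FP16 row scaling factors
    sv       : (in,)  FP16 column scaling factors
    codebook : (256,) FP16 reconstruction value for each code
    """
    w = codebook[indices].astype(np.float32)   # code -> reconstruction value
    w *= np.repeat(scales, GROUP, axis=1)      # rescale each group of 128 columns
    w *= su[:, None] * sv[None, :]             # apply row and column factors
    return w.astype(np.float16)
```

The row/column factors `su`/`sv` correspond to the Hadamard-style preprocessing applied before quantization; they must be multiplied back in (as above) or folded into the kernel epilogue during inference.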