---
language:
- en
- zh
library_name: transformers
license: mit
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.7-Flash
tags:
- trellis
- quantized
- moe
- 3-bit
- mixed-precision
- cuda
- glm
- metal-marlin
---

# GLM-4.7-Flash-Trellis-3.8bpw

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-4.5/refs/heads/main/resources/logo.svg" width="15%"/>
</div>

**Trellis-quantized GLM-4.7-Flash**: a 30B-A3B MoE model compressed to **3.78 bits per weight** using sensitivity-aware mixed-precision quantization.

| Metric | Value |
|--------|-------|
| **Effective bits** | 3.78 bpw |
| **Compression** | 4.2× vs FP16 |
| **Model size** | ~14 GB (vs ~60 GB FP16) |
| **Parameters** | 29.3B |
| **Format** | Hugging Face sharded safetensors |

## Model Description

This is a quantized version of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), a 30B-class MoE model that balances performance and efficiency.

GLM-4.7-Flash features:

- **30B-A3B MoE architecture** (64 routed experts plus a shared expert, with 2-4 experts active per token)
- **Multi-head Latent Attention (MLA)** for 8× KV cache compression
- **Strong reasoning performance** (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- **Bilingual** (English and Chinese)

## Quantization Details

Quantized using **Trellis** (EXL3-style) quantization with Metal Marlin acceleration.

### Bit Allocation

| Bit Width | Tensors | Parameters | % of Model |
|-----------|---------|------------|------------|
| 6-bit | 3,037 | 9.4B | 32.2% |
| 3-bit | 2,710 | 8.6B | 29.3% |
| 2-bit | 2,736 | 8.6B | 29.3% |
| 4-bit | 575 | 2.1B | 7.2% |
| 5-bit | 196 | 591M | 2.0% |
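
As a sanity check, the headline 3.78 bpw is approximately the parameter-weighted average of the allocation table; the small residual comes from rounding in the percentage column and from the FP16 base weights stored separately.

```python
# Sanity check: effective bits per weight as the parameter-weighted average
# of the allocation table (shares are the "% of Model" column as fractions).
allocation = {6: 0.322, 3: 0.293, 2: 0.293, 4: 0.072, 5: 0.020}

effective_bpw = sum(bits * share for bits, share in allocation.items())
print(f"{effective_bpw:.3f} bpw")  # close to the headline 3.78 bpw
```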

### Sensitivity-Aware Allocation

- **8-bit**: Router weights, embeddings, LM head, layer norms
- **6-bit**: Gate layers and attention projections with high outlier ratios
- **4-5 bit**: Standard attention layers (q/k/v/o projections)
- **2-3 bit**: MoE expert layers (lowest sensitivity)
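
The real toolkit scores each tensor's sensitivity from calibration error, but the resulting tiers above can be sketched as a name-pattern lookup. This is illustrative only; the patterns and the helper name are assumptions, not the toolkit's actual API.

```python
# Illustrative sketch of tier assignment by tensor-name pattern. The actual
# allocator measures per-tensor sensitivity; these patterns are assumptions.
def assign_bits(tensor_name: str) -> int:
    if any(k in tensor_name for k in ("router", "embed", "lm_head", "norm")):
        return 8  # most sensitive: keep near-lossless
    if "experts" in tensor_name:
        return 2  # MoE expert weights: least sensitive, most parameters
    if "gate" in tensor_name:
        return 6  # gating layers tolerate little error
    if any(k in tensor_name for k in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return 4  # standard attention projections
    return 4      # conservative default

assign_bits("model.layers.3.mlp.experts.17.down_proj")  # -> 2
```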

### Quantization Statistics

- **Average MSE**: 0.000223
- **Average RMSE**: 0.0149
- **Quantization time**: ~110 seconds (RTX 3090 Ti)
- **Method**: Trellis with Hadamard preprocessing, Viterbi nearest-neighbor search, and group-wise scales (g=128)
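
The Trellis codebook search itself (Hadamard preprocessing plus Viterbi decoding) is not reproduced here, but the group-wise scaling with g=128 is easy to illustrate: each run of 128 weights shares one FP16 scale. The sketch below uses plain round-to-nearest in place of the trellis search, so its error is higher than the real method's.

```python
import numpy as np

# Illustrative group-wise quantization with g=128. Round-to-nearest stands in
# for the actual Trellis codebook search, so this overestimates the error.
def groupwise_quantize(w: np.ndarray, bits: int = 3, g: int = 128):
    groups = w.reshape(-1, g)                    # one row per group of 128
    qmax = 2 ** (bits - 1) - 1                   # symmetric range, e.g. +/-3
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales.astype(np.float16)          # FP16 per-group scales

def groupwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = groupwise_quantize(w)
rmse = float(np.sqrt(np.mean((w - groupwise_dequantize(q, s)) ** 2)))
```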

## Files

```
GLM-4.7-Flash-Trellis-MM/
├── model-00001-of-00007.safetensors   # ~2 GB each
├── model-00002-of-00007.safetensors
├── model-00003-of-00007.safetensors
├── model-00004-of-00007.safetensors
├── model-00005-of-00007.safetensors
├── model-00006-of-00007.safetensors
├── model-00007-of-00007.safetensors
├── model.safetensors.index.json       # Weight map
├── base_weights.safetensors           # Embeddings, norms (FP16)
├── config.json                        # Model config
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
└── quantization_index.json            # Quantization metadata
```
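
The sharded layout follows the standard Hugging Face convention: `model.safetensors.index.json` maps each tensor name to the shard that stores it, so a loader can group tensors by shard and open each file once. The tensor names in this sketch are illustrative, not copied from the real index.

```python
import collections

# Standard HF sharded-checkpoint index shape:
# {"metadata": {"total_size": ...}, "weight_map": {tensor_name: shard_file}}.
# Tensor names here are made up for illustration.
index = {
    "metadata": {"total_size": 14_000_000_000},
    "weight_map": {
        "layers.0.mlp.experts.0.down_proj__indices": "model-00001-of-00007.safetensors",
        "layers.0.mlp.experts.0.down_proj__scales": "model-00001-of-00007.safetensors",
        "layers.9.mlp.experts.63.up_proj__indices": "model-00007-of-00007.safetensors",
    },
}

# Group tensor names by shard so each shard file is opened only once.
by_shard = collections.defaultdict(list)
for name, shard in index["weight_map"].items():
    by_shard[shard].append(name)
```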

## Usage

### With Metal Marlin (Apple Silicon)

```python
from metal_marlin.trellis import TrellisForCausalLM
from transformers import AutoTokenizer

model = TrellisForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Trellis-3.8bpw",
    device="mps",
)
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Tensor Format

Each quantized tensor is stored as four components:

- `{name}__indices`: packed uint8 Trellis indices
- `{name}__scales`: FP16 per-group scales (group_size=128)
- `{name}__su`: FP16 row scaling factors
- `{name}__sv`: FP16 column scaling factors
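
Assuming the packed indices have already been decoded into a quantized value grid, reconstruction from the four components might be sketched as below. This is a guess at the composition order; the shapes, the rank-1 `su`/`sv` rescaling, and the `decoded` placeholder are all assumptions, and the real kernel fuses these steps on-device.

```python
import numpy as np

# Illustrative reconstruction from the four stored components. `decoded`
# stands in for the output of the Trellis index decoder, which is not
# reproduced here; shapes and composition order are assumptions.
def reconstruct(decoded: np.ndarray, scales: np.ndarray,
                su: np.ndarray, sv: np.ndarray, g: int = 128) -> np.ndarray:
    rows, cols = decoded.shape
    # Broadcast one per-group scale across each run of g=128 columns.
    per_col = np.repeat(scales, g, axis=1)[:, :cols]
    # Apply group scales, then the row (su) and column (sv) factors.
    return su[:, None] * (decoded * per_col) * sv[None, :]

w = reconstruct(
    decoded=np.ones((4, 256), dtype=np.float32),
    scales=np.full((4, 2), 0.5, dtype=np.float32),
    su=np.ones(4, dtype=np.float32),
    sv=np.ones(256, dtype=np.float32),
)
```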

## Hardware Requirements

| Device | Unified Memory | Notes |
|--------|----------------|-------|
| Apple M2 Ultra | 64 GB+ | Via Metal Marlin |
| Apple M4 Max | 36 GB+ | Via Metal Marlin |

## Benchmarks

### Original Model Performance (from Z.AI)

| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|-----------|---------------|---------------|-------------|
| AIME 2025 | 91.6 | 85.0 | **91.7** |
| GPQA | **75.2** | 73.4 | 71.5 |
| SWE-bench Verified | **59.2** | 22.0 | 34.0 |
| τ²-Bench | **79.5** | 49.0 | 47.7 |
| BrowseComp | **42.8** | 2.29 | 28.3 |

### Quantized Model (Metal Marlin, M4 Max)

| Metric | Value |
|--------|-------|
| Decode | 5.4 tok/s |
| Prefill (2K tokens) | 42 tok/s |
| Memory | 16.9 GB |

## Limitations

- **Not loadable with standard `transformers`**: requires Trellis-aware inference code
- **No speculative decoding** support yet
- **Quality loss**: roughly 1-2% on benchmarks vs FP16, typical for 3-4 bit quantization

## Credits

- **Original model**: [Z.AI / GLM Team](https://huggingface.co/zai-org/GLM-4.7-Flash)
- **Quantization method**: [Trellis/EXL3](https://github.com/turboderp/exllamav3)
- **Quantization toolkit**: [Metal Marlin](https://github.com/RESMP-DEV/metal-marlin)

## Citation

If you use this model, please cite the original GLM-4.5 paper:

```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```

## License

This quantized model inherits the **MIT License** from the original GLM-4.7-Flash model.
|