varjoranta committed · Commit 2a19f00 · verified · Parent: 3b77a21

Add model card

Files changed (1): README.md (+80 lines, new file)
---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- turboquant
- quantization
- moe
- vllm
- weight-compression
library_name: turboquant-vllm
---

# GLM-4.7-Flash TQ3 (3-bit TurboQuant)

A TurboQuant 3-bit weight-compressed checkpoint of [GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).

## Key Numbers

| Metric | Value |
|--------|-------|
| Original size (BF16) | ~62 GB |
| Compressed size (TQ3) | ~14.7 GB |
| Compression ratio | ~4.2x |
| Architecture | GLM-4 MoE (355B total, 32B active, 64 experts) |
| Attention | MLA (Multi-head Latent Attention) |
| Group size | 128 |
| Quantization | TurboQuant polar (WHT rotation + Lloyd-Max codebook) |

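As a back-of-envelope check on the ratio (my arithmetic, not from the repo): each group of 128 weights stores one 3-bit index per weight plus one float32 norm, so compressed tensors cost about 3.25 bits/weight against 16 for BF16. The overall ~4.2x lands below that per-tensor figure because embeddings, norms, routers, and `lm_head` stay uncompressed.

```python
# Back-of-envelope bits/weight for TQ3 (group size 128, one fp32 norm per
# group). Illustrative only; the exact per-checkpoint overhead may differ.
GROUP = 128
bits_per_weight = 3 + 32 / GROUP          # 3-bit index + amortized fp32 norm
ratio_vs_bf16 = 16 / bits_per_weight
print(f"{bits_per_weight:.2f} bits/weight, {ratio_vs_bf16:.2f}x vs BF16")
# → 3.25 bits/weight, 4.92x vs BF16
```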
## Usage with vLLM

Requires the [turboquant-plus-vllm](https://github.com/varjoranta/turboquant-vllm) plugin:

```bash
pip install turboquant-plus-vllm
vllm serve varjosoft/GLM-4.7-Flash-TQ3 --max-model-len 4096 --enforce-eager
```

The plugin auto-registers the `turboquant` quantization method via vLLM's plugin system. No `--quantization` flag is needed; the method is detected from the checkpoint's `config.json`.

## Standalone Usage

```python
from turboquant_vllm.checkpoint import load_tq3_model

model, tokenizer = load_tq3_model("varjosoft/GLM-4.7-Flash-TQ3")
```

## How It Works

TurboQuant uses data-oblivious compression (no calibration data needed):

1. Weight rows are split into groups of 128.
2. Each group is normalized, rotated (Walsh-Hadamard transform with random diagonal signs), and quantized against a Lloyd-Max codebook for N(0, 1/d).
3. The 3-bit indices are sub-byte packed (8 indices → 3 bytes).
4. At inference time, indices are unpacked, then dequantized via codebook lookup and the inverse rotation.

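The four steps above can be sketched end to end for a single group. This is a minimal illustration, not the library's code: the 8-level codebook is a uniform placeholder (the actual Lloyd-Max levels are not published here), and `hadamard`, `pack3`, and `unpack3` are hypothetical helpers.

```python
# Minimal sketch of the TQ3 pipeline for one group of 128 weights.
import numpy as np

GROUP = 128

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix of size n (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
w = rng.normal(size=GROUP)

# Steps 1-2: normalize, rotate with random diagonal signs, quantize to 3 bits.
norm = np.linalg.norm(w)
signs = rng.choice([-1.0, 1.0], size=GROUP)
H = hadamard(GROUP)
rotated = H @ (signs * (w / norm))               # coords ~ N(0, 1/d)

codebook = np.linspace(-3, 3, 8) / np.sqrt(GROUP)  # placeholder, not Lloyd-Max
idx = np.abs(rotated[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

# Step 3: pack 3-bit indices, 8 indices -> 3 bytes.
def pack3(idx):
    bits = np.unpackbits(idx[:, None], axis=1)[:, -3:]  # keep low 3 bits each
    return np.packbits(bits.reshape(-1))

packed = pack3(idx)
assert packed.nbytes == GROUP * 3 // 8           # 48 bytes for 128 indices

# Step 4: unpack, codebook lookup, inverse rotation, rescale.
def unpack3(packed, n):
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3)
    return (bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]).astype(np.uint8)

deq = norm * (signs * (H.T @ codebook[unpack3(packed, GROUP)]))
print("relative reconstruction error:", np.linalg.norm(deq - w) / norm)
```

The pack/unpack round trip is exact; the reconstruction error comes entirely from the 3-bit codebook and would be smaller with the real Lloyd-Max levels.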
## Compression Details

- **Compressed layers**: All linear weights (attention projections, MLP projections, MoE expert weights)
- **Uncompressed**: Embeddings, layer norms, MoE router weights, lm_head
- **Checkpoint format**: Standard safetensors with `.tq_packed` (uint8) and `.tq_norms` (float32) tensors

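The exact in-tensor layout of `.tq_packed`/`.tq_norms` is not specified above; assuming row-major grouping with one float32 norm per group of 128, the byte sizes for a sample matrix work out as below (`tq3_tensor_sizes` is a hypothetical helper, not part of the library):

```python
def tq3_tensor_sizes(out_features, in_features, group=128):
    """Byte sizes for an assumed TQ3 layout of an (out, in) weight matrix."""
    n = out_features * in_features
    packed_bytes = n * 3 // 8      # .tq_packed: 3 bits per weight, uint8
    norm_bytes = (n // group) * 4  # .tq_norms: one float32 per group
    return packed_bytes, norm_bytes

packed_b, norms_b = tq3_tensor_sizes(4096, 4096)
print(f".tq_packed: {packed_b / 2**20:.1f} MiB, .tq_norms: {norms_b / 2**20:.2f} MiB "
      f"(BF16 original: {4096 * 4096 * 2 / 2**20:.0f} MiB)")
# → .tq_packed: 6.0 MiB, .tq_norms: 0.50 MiB (BF16 original: 32 MiB)
```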
## Architecture Notes

GLM-4.7-Flash uses the same MoE architecture as GLM-5.1 (769B) at roughly one-tenth the scale:

- 47 transformer layers (3 dense + 44 MoE)
- 64 routed experts + 1 shared expert per MoE layer, top-4 routing
- MLA attention (kv_lora_rank based)
- This checkpoint validates TQ3 loading for the GLM MoE family

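The top-4 routing with an always-on shared expert described above can be sketched as follows. This is an illustration of the routing idea only, not GLM's implementation (real routers also handle capacity limits, auxiliary losses, etc.):

```python
# Illustrative top-k MoE routing with one shared expert.
import numpy as np

def moe_forward(x, router_w, experts, shared_expert, top_k=4):
    logits = x @ router_w                      # one logit per routed expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-4 experts
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                # softmax over selected experts
    routed = sum(g * experts[i](x) for g, i in zip(gates, top))
    return routed + shared_expert(x)           # shared expert always active
```

Because the gates sum to 1, routing a token through identical experts reproduces the expert output exactly; the shared expert's contribution is simply added on top.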
## Status

- **GPU tested**: Pending (checkpoint loading verified; inference testing in progress)
- **Quality**: Expected to match BF16, based on TQ3 results on other models (Gemma 4 scored 4.76/5 on a 20-scenario benchmark)

## Related

- [turboquant-vllm](https://github.com/varjoranta/turboquant-vllm): TurboQuant compression library for vLLM
- [varjosoft/GLM-5.1-Open-TQ3](https://huggingface.co/varjosoft/GLM-5.1-Open-TQ3): the same compression applied to GLM-5.1 (769B)
- [varjosoft/gemma-4-26B-A4B-it-TQ3-native](https://huggingface.co/varjosoft/gemma-4-26B-A4B-it-TQ3-native): TQ3 on Gemma 4

Created by [Varjosoft](https://varjosoft.com).