varjoranta committed · Commit 2a19f00 · verified · Parent: 3b77a21

Add model card

Files changed (1): README.md (+80 lines, new file)
---
license: mit
base_model: zai-org/GLM-4.7-Flash
tags:
- turboquant
- quantization
- moe
- vllm
- weight-compression
library_name: turboquant-vllm
---

# GLM-4.7-Flash TQ3 (3-bit TurboQuant)

A TurboQuant 3-bit weight-compressed checkpoint of [GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash).

## Key Numbers

| Metric | Value |
|--------|-------|
| Original size (BF16) | ~62 GB |
| Compressed size (TQ3) | ~14.7 GB |
| Compression ratio | ~4.2x |
| Architecture | GLM-4 MoE (355B total, 32B active, 64 experts) |
| Attention | MLA (Multi-head Latent Attention) |
| Group size | 128 |
| Quantization | TurboQuant polar (WHT rotation + Lloyd-Max codebook) |

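As a back-of-envelope check on the ratio (my arithmetic, not from the repo): each group of 128 weights stores one 3-bit index per weight plus one float32 norm, so compressed tensors cost about 3.25 bits/weight against 16 for BF16. The overall ~4.2x lands below that per-tensor figure because embeddings, norms, routers, and `lm_head` stay uncompressed.

```python
# Back-of-envelope bits/weight for TQ3 (group size 128, one fp32 norm per
# group). Illustrative only; the exact per-checkpoint overhead may differ.
GROUP = 128
bits_per_weight = 3 + 32 / GROUP          # 3-bit index + amortized fp32 norm
ratio_vs_bf16 = 16 / bits_per_weight
print(f"{bits_per_weight:.2f} bits/weight, {ratio_vs_bf16:.2f}x vs BF16")
# → 3.25 bits/weight, 4.92x vs BF16
```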
## Usage with vLLM

Requires the [turboquant-plus-vllm](https://github.com/varjoranta/turboquant-vllm) plugin:

```bash
pip install turboquant-plus-vllm
vllm serve varjosoft/GLM-4.7-Flash-TQ3 --max-model-len 4096 --enforce-eager
```

The plugin auto-registers the `turboquant` quantization method via vLLM's plugin system. No `--quantization` flag is needed; the method is detected from the checkpoint's `config.json`.

## Standalone Usage

```python
from turboquant_vllm.checkpoint import load_tq3_model

model, tokenizer = load_tq3_model("varjosoft/GLM-4.7-Flash-TQ3")
```

## How It Works

TurboQuant uses data-oblivious compression (no calibration data needed):

1. Weight rows are split into groups of 128.
2. Each group is normalized, rotated (Walsh-Hadamard transform with random diagonal signs), and quantized against a Lloyd-Max codebook for N(0, 1/d).
3. The 3-bit indices are sub-byte packed (8 indices → 3 bytes).
4. At inference time, indices are unpacked, then dequantized via codebook lookup and the inverse rotation.

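The four steps above can be sketched end to end for a single group. This is a minimal illustration, not the library's code: the 8-level codebook is a uniform placeholder (the actual Lloyd-Max levels are not published here), and `hadamard`, `pack3`, and `unpack3` are hypothetical helpers.

```python
# Minimal sketch of the TQ3 pipeline for one group of 128 weights.
import numpy as np

GROUP = 128

def hadamard(n):
    """Orthonormal Walsh-Hadamard matrix of size n (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
w = rng.normal(size=GROUP)

# Steps 1-2: normalize, rotate with random diagonal signs, quantize to 3 bits.
norm = np.linalg.norm(w)
signs = rng.choice([-1.0, 1.0], size=GROUP)
H = hadamard(GROUP)
rotated = H @ (signs * (w / norm))               # coords ~ N(0, 1/d)

codebook = np.linspace(-3, 3, 8) / np.sqrt(GROUP)  # placeholder, not Lloyd-Max
idx = np.abs(rotated[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

# Step 3: pack 3-bit indices, 8 indices -> 3 bytes.
def pack3(idx):
    bits = np.unpackbits(idx[:, None], axis=1)[:, -3:]  # keep low 3 bits each
    return np.packbits(bits.reshape(-1))

packed = pack3(idx)
assert packed.nbytes == GROUP * 3 // 8           # 48 bytes for 128 indices

# Step 4: unpack, codebook lookup, inverse rotation, rescale.
def unpack3(packed, n):
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3)
    return (bits[:, 0] * 4 + bits[:, 1] * 2 + bits[:, 2]).astype(np.uint8)

deq = norm * (signs * (H.T @ codebook[unpack3(packed, GROUP)]))
print("relative reconstruction error:", np.linalg.norm(deq - w) / norm)
```

The pack/unpack round trip is exact; the reconstruction error comes entirely from the 3-bit codebook and would be smaller with the real Lloyd-Max levels.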
## Compression Details

- **Compressed layers**: All linear weights (attention projections, MLP projections, MoE expert weights)
- **Uncompressed**: Embeddings, layer norms, MoE router weights, lm_head
- **Checkpoint format**: Standard safetensors with `.tq_packed` (uint8) and `.tq_norms` (float32) tensors

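The exact in-tensor layout of `.tq_packed`/`.tq_norms` is not specified above; assuming row-major grouping with one float32 norm per group of 128, the byte sizes for a sample matrix work out as below (`tq3_tensor_sizes` is a hypothetical helper, not part of the library):

```python
def tq3_tensor_sizes(out_features, in_features, group=128):
    """Byte sizes for an assumed TQ3 layout of an (out, in) weight matrix."""
    n = out_features * in_features
    packed_bytes = n * 3 // 8      # .tq_packed: 3 bits per weight, uint8
    norm_bytes = (n // group) * 4  # .tq_norms: one float32 per group
    return packed_bytes, norm_bytes

packed_b, norms_b = tq3_tensor_sizes(4096, 4096)
print(f".tq_packed: {packed_b / 2**20:.1f} MiB, .tq_norms: {norms_b / 2**20:.2f} MiB "
      f"(BF16 original: {4096 * 4096 * 2 / 2**20:.0f} MiB)")
# → .tq_packed: 6.0 MiB, .tq_norms: 0.50 MiB (BF16 original: 32 MiB)
```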
## Architecture Notes

GLM-4.7-Flash uses the same MoE architecture as GLM-5.1 (769B) at roughly one-tenth the scale:

- 47 transformer layers (3 dense + 44 MoE)
- 64 routed experts + 1 shared expert per MoE layer, top-4 routing
- MLA attention (kv_lora_rank based)
- This checkpoint validates TQ3 loading for the GLM MoE family

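The top-4 routing with an always-on shared expert described above can be sketched as follows. This is an illustration of the routing idea only, not GLM's implementation (real routers also handle capacity limits, auxiliary losses, etc.):

```python
# Illustrative top-k MoE routing with one shared expert.
import numpy as np

def moe_forward(x, router_w, experts, shared_expert, top_k=4):
    logits = x @ router_w                      # one logit per routed expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-4 experts
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                # softmax over selected experts
    routed = sum(g * experts[i](x) for g, i in zip(gates, top))
    return routed + shared_expert(x)           # shared expert always active
```

Because the gates sum to 1, routing a token through identical experts reproduces the expert output exactly; the shared expert's contribution is simply added on top.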
## Status

- **GPU tested**: Pending (checkpoint loading verified; inference testing in progress)
- **Quality**: Expected to match BF16, based on TQ3 results on other models (Gemma 4 scored 4.76/5 on a 20-scenario benchmark)

## Related

- [turboquant-vllm](https://github.com/varjoranta/turboquant-vllm): TurboQuant compression library for vLLM
- [varjosoft/GLM-5.1-Open-TQ3](https://huggingface.co/varjosoft/GLM-5.1-Open-TQ3): the same compression applied to GLM-5.1 (769B)
- [varjosoft/gemma-4-26B-A4B-it-TQ3-native](https://huggingface.co/varjosoft/gemma-4-26B-A4B-it-TQ3-native): TQ3 on Gemma 4

Created by [Varjosoft](https://varjosoft.com).