# GLM-4.7-Flash TQ3 (3-bit TurboQuant)
TurboQuant 3-bit weight-compressed checkpoint of GLM-4.7-Flash.
## Key Numbers
| Metric | Value |
|---|---|
| Original size (BF16) | ~62 GB |
| Compressed size (TQ3) | ~14.7 GB |
| Compression ratio | ~4.2x |
| Architecture | GLM-4 MoE (355B total, 32B active, 64 experts) |
| Attention | MLA (Multi-head Latent Attention) |
| Group size | 128 |
| Quantization | TurboQuant polar (WHT rotation + Lloyd-Max codebook) |
## Usage with vLLM

Requires the `turboquant-plus-vllm` plugin:

```bash
pip install turboquant-plus-vllm
vllm serve varjosoft/GLM-4.7-Flash-TQ3 --max-model-len 4096 --enforce-eager
```
The plugin auto-registers the `turboquant` quantization method via vLLM's plugin system. No `--quantization` flag is needed; the method is detected from the checkpoint's `config.json`.
## Standalone Usage

```python
from turboquant_vllm.checkpoint import load_tq3_model

model, tokenizer = load_tq3_model("varjosoft/GLM-4.7-Flash-TQ3")
```
## How It Works
TurboQuant uses data-oblivious compression (no calibration data needed):

- Weight rows are split into groups of 128
- Each group is normalized, rotated (Walsh-Hadamard Transform with random diagonal signs), and quantized against a Lloyd-Max codebook designed for N(0, 1/d) coordinates
- 3-bit indices are sub-byte packed (8 indices → 3 bytes)
- At inference time, indices are unpacked and dequantized via codebook lookup followed by the inverse rotation
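The normalize → rotate → quantize → dequantize round trip above can be sketched in NumPy. This is an illustrative toy, not the library's implementation: the 8-level codebook below is a hypothetical uniform grid standing in for the actual Lloyd-Max points for N(0, 1/d).

```python
import numpy as np

def hadamard(d):
    # Sylvester construction of a normalized Walsh-Hadamard matrix (d must be a power of two)
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def quantize_group(w, codebook, signs):
    # normalize the group, apply the random-sign WHT rotation, then nearest-codeword search
    norm = np.linalg.norm(w)
    rotated = hadamard(w.size) @ (signs * (w / norm))
    idx = np.abs(rotated[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), norm

def dequantize_group(idx, norm, signs, codebook):
    # codebook lookup, inverse rotation (H is orthogonal, so H.T undoes it), un-sign, rescale
    return norm * signs * (hadamard(idx.size).T @ codebook[idx])

rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)          # random diagonal signs
codebook = np.linspace(-3.0, 3.0, 8) / np.sqrt(d)  # hypothetical stand-in codebook
w = rng.standard_normal(d)

idx, norm = quantize_group(w, codebook, signs)   # 3-bit indices (0..7) + one float norm
w_hat = dequantize_group(idx, norm, signs, codebook)
```

Because the rotation makes the group's coordinates approximately Gaussian regardless of the input distribution, a single fixed codebook works for every layer, which is what makes the scheme calibration-free.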
## Compression Details
- Compressed layers: All linear weights (attention projections, MLP projections, MoE expert weights)
- Uncompressed: Embeddings, layer norms, MoE router weights, lm_head
- Checkpoint format: Standard safetensors with `.tq_packed` (uint8) and `.tq_norms` (float32) tensors
## Architecture Notes
GLM-4.7-Flash uses the same MoE architecture as GLM-5.1 (769B) but 10x smaller:
- 47 transformer layers (3 dense + 44 MoE)
- 64 routed experts + 1 shared expert per MoE layer, top-4 routing
- MLA attention (kv_lora_rank based)
- This checkpoint validates TQ3 loading for the GLM MoE family
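Top-4 routing over 64 experts means each token's hidden state is processed by only 4 routed experts (plus the always-on shared expert), which is why the active parameter count is far below the total. A minimal gating sketch, assuming the common select-then-softmax formulation (the model's exact gating function is not specified here):

```python
import numpy as np

def route_top4(router_logits, k=4):
    # hypothetical top-k gating: pick the k highest-scoring experts,
    # then softmax over just their logits to get mixture weights
    top = np.argsort(router_logits)[::-1][:k]
    w = np.exp(router_logits[top] - router_logits[top].max())
    return top, w / w.sum()

rng = np.random.default_rng(0)
logits = rng.standard_normal(64)        # one token's router logits over 64 routed experts
experts, weights = route_top4(logits)   # 4 expert ids + normalized mixture weights
# the shared expert's output is added unconditionally on top of the routed mixture
```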
## Status
- GPU tested: Pending (checkpoint loading verified; inference testing in progress)
- Quality: Expected to match BF16 based on TQ3 results on other models (Gemma 4 scored 4.76/5 on 20-scenario benchmark)
## Related
- `turboquant-vllm`: TurboQuant compression library for vLLM
- `varjosoft/GLM-5.1-Open-TQ3`: same compression applied to GLM-5.1 (769B)
- `varjosoft/gemma-4-26B-A4B-it-TQ3-native`: TQ3 on Gemma 4
Created by Varjosoft.
Base model: `zai-org/GLM-4.7-Flash`