GLM-4.7-Flash TQ3 (3-bit TurboQuant)

TurboQuant 3-bit weight-compressed checkpoint of GLM-4.7-Flash.

Key Numbers

| Metric | Value |
|---|---|
| Original size (BF16) | ~62 GB |
| Compressed size (TQ3) | ~14.7 GB |
| Compression ratio | ~4.2x |
| Architecture | GLM-4 MoE (355B total, 32B active, 64 experts) |
| Attention | MLA (Multi-Latent Attention) |
| Group size | 128 |
| Quantization | TurboQuant polar (WHT rotation + Lloyd-Max codebook) |
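A quick back-of-the-envelope check of the compression ratio, assuming each weight costs 3 index bits plus one float32 group norm amortized over a group of 128 (the per-group overhead is an assumption based on the format described below):

```python
# Illustrative arithmetic only; exact storage overheads may differ.
bits_per_weight = 3 + 32 / 128          # 3.25 effective bits per weight
ratio_linear = 16 / bits_per_weight     # vs. 16-bit BF16, quantized layers only
print(f"{ratio_linear:.2f}x on quantized layers")  # ~4.92x
# Embeddings, layer norms, router weights, and lm_head stay in full
# precision, pulling the end-to-end ratio down toward the reported ~4.2x.
```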

Usage with vLLM

Requires the turboquant-plus-vllm plugin:

```shell
pip install turboquant-plus-vllm
vllm serve varjosoft/GLM-4.7-Flash-TQ3 --max-model-len 4096 --enforce-eager
```

The plugin auto-registers the `turboquant` quantization method via vLLM's plugin system. No `--quantization` flag is needed — it is detected from the checkpoint's `config.json`.

Standalone Usage

```python
from turboquant_vllm.checkpoint import load_tq3_model

model, tokenizer = load_tq3_model("varjosoft/GLM-4.7-Flash-TQ3")
```

How It Works

TurboQuant uses data-oblivious compression (no calibration data needed):

  1. Weight rows are split into groups of 128
  2. Each group is normalized, rotated (Walsh-Hadamard Transform with random diagonal signs), and quantized against a Lloyd-Max codebook for N(0, 1/d)
  3. 3-bit indices are sub-byte packed (8 indices → 3 bytes)
  4. At inference time, indices are unpacked and dequantized via codebook lookup and inverse rotation
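The steps above can be sketched in pure Python at a toy group size of 8 (TQ3 uses 128; 8 keeps the demo readable). The codebook below is the standard 8-level Lloyd-Max quantizer for a unit Gaussian; the actual TurboQuant codebook, bit order, and tensor layout are not documented here and may differ:

```python
import math
import random

# Approximate 3-bit (8-level) Lloyd-Max levels for N(0, 1).
LEVELS = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]

def fwht(v):
    """In-place orthonormal fast Walsh-Hadamard transform (self-inverse)."""
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    s = 1 / math.sqrt(n)
    for i in range(n):
        v[i] *= s
    return v

def pack3(idxs):
    """Sub-byte pack 3-bit indices: 8 indices -> 3 bytes."""
    out = bytearray()
    for i in range(0, len(idxs), 8):
        bits = 0
        for k, v in enumerate(idxs[i:i + 8]):
            bits |= v << (3 * k)
        out += bits.to_bytes(3, "little")
    return bytes(out)

def unpack3(data, n):
    idxs = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "little")
        idxs += [(bits >> (3 * k)) & 0b111 for k in range(8)]
    return idxs[:n]

def quantize_group(x, signs):
    """Normalize, sign-flip + WHT rotate, then quantize against the codebook."""
    d = len(x)
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    rot = fwht([v * s / norm for v, s in zip(x, signs)])
    # After normalization + rotation, entries are ~N(0, 1/d); rescale to N(0, 1).
    idxs = [min(range(8), key=lambda j: abs(v * math.sqrt(d) - LEVELS[j]))
            for v in rot]
    return pack3(idxs), norm

def dequantize_group(packed, norm, d, signs):
    """Codebook lookup, inverse rotation (WHT is self-inverse), rescale."""
    vals = [LEVELS[i] / math.sqrt(d) for i in unpack3(packed, d)]
    return [v * s * norm for v, s in zip(fwht(vals), signs)]

random.seed(0)
d = 8
signs = [random.choice((-1, 1)) for _ in range(d)]
x = [random.gauss(0, 1) for _ in range(d)]
packed, norm = quantize_group(x, signs)
x_hat = dequantize_group(packed, norm, d, signs)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))) / norm
print(f"relative reconstruction error: {rel_err:.3f}")
```

Note how the random diagonal signs must be replayed at dequantization, which is why the scheme is data-oblivious: the only per-group state stored is the packed indices and the norm.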

Compression Details

  • Compressed layers: All linear weights (attention projections, MLP projections, MoE expert weights)
  • Uncompressed: Embeddings, layer norms, MoE router weights, lm_head
  • Checkpoint format: Standard safetensors with .tq_packed (uint8) and .tq_norms (float32) tensors

Architecture Notes

GLM-4.7-Flash uses the same MoE architecture as GLM-5.1 (769B) at roughly a tenth of the scale:

  • 47 transformer layers (3 dense + 44 MoE)
  • 64 routed experts + 1 shared expert per MoE layer, top-4 routing
  • MLA attention (kv_lora_rank based)
  • This checkpoint validates TQ3 loading for the GLM MoE family

Status

  • GPU tested: pending (checkpoint loading verified; inference testing in progress)
  • Quality: expected to match BF16, based on TQ3 results on other models (Gemma 4 scored 4.76/5 on a 20-scenario benchmark)

Related

Created by Varjosoft.
