GLM-4.7-Flash TQ3 (3-bit TurboQuant)

TurboQuant 3-bit weight-compressed checkpoint of GLM-4.7-Flash.

Key Numbers

| Metric | Value |
|---|---|
| Original size (BF16) | ~62 GB |
| Compressed size (TQ3) | ~14.7 GB |
| Compression ratio | ~4.2x |
| Architecture | GLM-4 MoE (355B total, 32B active, 64 experts) |
| Attention | MLA (Multi-Latent Attention) |
| Group size | 128 |
| Quantization | TurboQuant polar (WHT rotation + Lloyd-Max codebook) |
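A quick back-of-the-envelope check of the compression ratio, assuming each weight costs 3 index bits plus one float32 group norm amortized over a group of 128 (the per-group overhead is an assumption based on the format described below):

```python
# Illustrative arithmetic only; exact storage overheads may differ.
bits_per_weight = 3 + 32 / 128          # 3.25 effective bits per weight
ratio_linear = 16 / bits_per_weight     # vs. 16-bit BF16, quantized layers only
print(f"{ratio_linear:.2f}x on quantized layers")  # ~4.92x
# Embeddings, layer norms, router weights, and lm_head stay in full
# precision, pulling the end-to-end ratio down toward the reported ~4.2x.
```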

Usage with vLLM

Requires the turboquant-plus-vllm plugin:

```shell
pip install turboquant-plus-vllm
vllm serve varjosoft/GLM-4.7-Flash-TQ3 --max-model-len 4096 --enforce-eager
```

The plugin auto-registers the `turboquant` quantization method via vLLM's plugin system. No `--quantization` flag is needed — it is detected from the checkpoint's `config.json`.

Standalone Usage

```python
from turboquant_vllm.checkpoint import load_tq3_model

model, tokenizer = load_tq3_model("varjosoft/GLM-4.7-Flash-TQ3")
```

How It Works

TurboQuant uses data-oblivious compression (no calibration data needed):

  1. Weight rows are split into groups of 128
  2. Each group is normalized, rotated (Walsh-Hadamard Transform with random diagonal signs), and quantized against a Lloyd-Max codebook for N(0, 1/d)
  3. 3-bit indices are sub-byte packed (8 indices → 3 bytes)
  4. At inference time, indices are unpacked and dequantized via codebook lookup and inverse rotation
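The steps above can be sketched in pure Python at a toy group size of 8 (TQ3 uses 128; 8 keeps the demo readable). The codebook below is the standard 8-level Lloyd-Max quantizer for a unit Gaussian; the actual TurboQuant codebook, bit order, and tensor layout are not documented here and may differ:

```python
import math
import random

# Approximate 3-bit (8-level) Lloyd-Max levels for N(0, 1).
LEVELS = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]

def fwht(v):
    """In-place orthonormal fast Walsh-Hadamard transform (self-inverse)."""
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    s = 1 / math.sqrt(n)
    for i in range(n):
        v[i] *= s
    return v

def pack3(idxs):
    """Sub-byte pack 3-bit indices: 8 indices -> 3 bytes."""
    out = bytearray()
    for i in range(0, len(idxs), 8):
        bits = 0
        for k, v in enumerate(idxs[i:i + 8]):
            bits |= v << (3 * k)
        out += bits.to_bytes(3, "little")
    return bytes(out)

def unpack3(data, n):
    idxs = []
    for i in range(0, len(data), 3):
        bits = int.from_bytes(data[i:i + 3], "little")
        idxs += [(bits >> (3 * k)) & 0b111 for k in range(8)]
    return idxs[:n]

def quantize_group(x, signs):
    """Normalize, sign-flip + WHT rotate, then quantize against the codebook."""
    d = len(x)
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    rot = fwht([v * s / norm for v, s in zip(x, signs)])
    # After normalization + rotation, entries are ~N(0, 1/d); rescale to N(0, 1).
    idxs = [min(range(8), key=lambda j: abs(v * math.sqrt(d) - LEVELS[j]))
            for v in rot]
    return pack3(idxs), norm

def dequantize_group(packed, norm, d, signs):
    """Codebook lookup, inverse rotation (WHT is self-inverse), rescale."""
    vals = [LEVELS[i] / math.sqrt(d) for i in unpack3(packed, d)]
    return [v * s * norm for v, s in zip(fwht(vals), signs)]

random.seed(0)
d = 8
signs = [random.choice((-1, 1)) for _ in range(d)]
x = [random.gauss(0, 1) for _ in range(d)]
packed, norm = quantize_group(x, signs)
x_hat = dequantize_group(packed, norm, d, signs)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))) / norm
print(f"relative reconstruction error: {rel_err:.3f}")
```

Note how the random diagonal signs must be replayed at dequantization, which is why the scheme is data-oblivious: the only per-group state stored is the packed indices and the norm.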

Compression Details

  • Compressed layers: All linear weights (attention projections, MLP projections, MoE expert weights)
  • Uncompressed: Embeddings, layer norms, MoE router weights, lm_head
  • Checkpoint format: Standard safetensors with .tq_packed (uint8) and .tq_norms (float32) tensors

Architecture Notes

GLM-4.7-Flash uses the same MoE architecture as GLM-5.1 (769B) at roughly a tenth of the scale:

  • 47 transformer layers (3 dense + 44 MoE)
  • 64 routed experts + 1 shared expert per MoE layer, top-4 routing
  • MLA attention (kv_lora_rank based)
  • This checkpoint validates TQ3 loading for the GLM MoE family

Status

  • GPU tested: pending (checkpoint loading verified; inference testing in progress)
  • Quality: expected to match BF16, based on TQ3 results on other models (Gemma 4 scored 4.76/5 on a 20-scenario benchmark)

Related

Created by Varjosoft.
