Canada Quant Labs

Canada Quant Labs — Canada's open-weight model lab. We train, quantize, and ship sovereign reference models for regulated industries (legal, medical, defence, finance) on a DGX B300 at Equinix Vancouver. Upstream contributors to vLLM and llm-compressor. Recipes: W4A16, NVFP4, MXFP4. Built in Victoria, BC. partnerships@cql.ca · cql.ca

Canada's open-weight model lab.

We train, quantize, and deploy sovereign AI models on Canadian Blackwell silicon — for the regulated industries that can't run on someone else's API.

What we do

Post-training on open base models (SFT, DPO, GRPO, RLAIF)
Production quantization recipes (W4A16, NVFP4, MXFP4)
Audited, air-gapped deployment with eval evidence and MRM docs

Where we work

Legal · Medical · Defence · Finance
Headquarters: Victoria, BC
Compute: NVIDIA DGX B300 at Equinix Vancouver

Upstream

Contributors to vLLM, llm-compressor, compressed-tensors

Partnerships · partnerships@cql.ca
Press · press@cql.ca
Web · cql.ca

Latest release — GLM-5.2 W4A16-MTP

A 4-bit weight quantization of GLM-5.2 (744B-parameter MoE) that keeps the multi-token-prediction (MTP) draft head in BF16. It matches the FP8 release on quality within run-to-run noise, fits on four H200s instead of eight (~1.49 TB BF16 → ~405 GB), and is the fastest of the popular 4-bit GLM-5.2 quants in interactive (low-concurrency) serving. MIT-licensed.

→ canada-quant/GLM-5.2-W4A16-MTP
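The size figures above can be checked with a few lines of arithmetic. This is a rough sketch, not the lab's sizing method: it counts weights only (no KV cache or activations), treats FP8 as exactly 8 bits per parameter, takes the 141 GB per H200 from the hardware shorthand on this card, and assumes GPU counts grow in powers of two. The helper names `size_gb` and `min_gpus` are hypothetical.

```python
PARAMS = 744e9   # GLM-5.2 parameter count quoted above
H200_GB = 141    # memory per H200 in decimal GB (from the hardware shorthand)

def size_gb(params, bits_per_param):
    """Weight footprint in decimal GB for a given average bit width."""
    return params * bits_per_param / 8 / 1e9

def min_gpus(weights_gb, gpu_gb=H200_GB):
    """Smallest power-of-two GPU count whose pooled memory holds the weights.

    Assumption: weights only; real deployments also need room for KV cache.
    """
    n = 1
    while n * gpu_gb < weights_gb:
        n *= 2
    return n

bf16_gb = size_gb(PARAMS, 16)   # BF16 baseline, about 1.49 TB
fp8_gb = size_gb(PARAMS, 8)     # assumption: FP8 scales ignored
w4a16_gb = 405.0                # on-disk size quoted above

# Average bits per parameter implied by the 405 GB figure.
effective_bits = w4a16_gb * 1e9 * 8 / PARAMS

print(f"BF16 {bf16_gb:.0f} GB, FP8 {fp8_gb:.0f} GB, W4A16 {w4a16_gb:.0f} GB")
print(f"effective bits/param: {effective_bits:.2f}")
print(f"H200s needed: FP8 {min_gpus(fp8_gb)}, W4A16 {min_gpus(w4a16_gb)}")
```

The implied average sits a little above 4 bits per parameter because group scales and the layers left in BF16 add to the INT4 payload.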

Recipe — routed-expert weights to INT4 (group-size 128, GPTQ, via llm-compressor); attention, dense prefix layers, shared experts, router, embeddings, and LM head left in BF16. The MTP draft head is re-injected at BF16 after quantization, so speculative decoding survives end-to-end — a lossless speedup that changes latency, not answers.
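To make the "INT4, group-size 128" storage format concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It uses plain round-to-nearest rather than GPTQ, and it assumes symmetric scaling with one scale per group; neither choice is confirmed by the recipe above, and llm-compressor is not involved. The function names are hypothetical.

```python
import math

def quantize_group(weights):
    """Symmetric INT4: one scale per group, integers clamped to [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def quantize_row(row, group_size=128):
    """Split a weight row into groups and quantize each independently."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g) for g in groups]

def dequantize_row(packed):
    """Rebuild approximate weights from (integers, scale) pairs."""
    return [q * scale for qs, scale in packed for q in qs]

# Toy row of 256 weights, so two groups of 128.
row = [math.sin(i) * (1 + i / 256) for i in range(256)]
packed = quantize_row(row)
restored = dequantize_row(packed)
max_err = max(abs(a - b) for a, b in zip(row, restored))

# Assumption: one 16-bit scale stored per group of 128 four-bit weights.
bits_per_weight = (4 * 128 + 16) / 128

print(f"groups: {len(packed)}, max abs error: {max_err:.4f}")
print(f"storage: {bits_per_weight} bits per weight")
```

Per-group scales keep the rounding error bounded by half a scale step within each group, which is why smaller groups trade a little extra storage for accuracy. GPTQ goes further by adjusting the remaining weights to compensate for each rounding decision, which this sketch does not attempt.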

Quality — within run-to-run noise of zai-org/GLM-5.2-FP8 on the same harness (8×H200):

Task	W4A16+MTP	FP8
GSM8K (strict)	0.960	0.955
IFEval (prompt / inst, strict)	0.909 / 0.911	0.891 / 0.903
MATH-500	0.954	0.958
RULER @ 32K / 64K	0.832 / 0.841	0.831 / 0.813
SWE-bench Verified	82.0%	82.2%

Speed — 132 output tok/s at concurrency 1, +69% over the next-fastest 4-bit GLM-5.2 quant; +48% vs FP8 at c=1 and +32% at c=8, where MTP helps most. At full saturation the no-MTP quants pull ~13–15% ahead — an honest trade-off in the high-throughput regime.
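The MTP numbers above rest on speculative decoding being lossless under greedy sampling: the draft proposes tokens, the full model verifies them, and only tokens the full model would have produced are kept. The toy sketch below illustrates that property with stand-in functions; `target_next` and `draft_next` are hypothetical and have nothing to do with GLM-5.2 or vLLM internals. It counts full-model passes only and does not model draft cost or batching.

```python
def target_next(ctx):
    """Toy 'large model': deterministic next token for a context."""
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_next(ctx):
    """Toy 'draft head': wrong whenever the context length is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50

def greedy_decode(prompt, n_tokens):
    """Baseline: one full-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):], n_tokens

def speculative_decode(prompt, n_tokens, k=2):
    """Draft k tokens, verify them in one full-model pass, keep the valid prefix."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        passes += 1  # one verification pass scores every proposed position
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: take the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(out + accepted))  # all matched: bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens], passes

prompt = [1, 2, 3]
baseline_tokens, baseline_passes = greedy_decode(prompt, 20)
spec_tokens, spec_passes = speculative_decode(prompt, 20)
print(spec_tokens == baseline_tokens, baseline_passes, spec_passes)
```

The token sequences are identical by construction, while the number of full-model passes drops. The saving depends on how often the draft is right, and the verification work grows with batch size, which is consistent with the no-MTP quants pulling ahead at saturation.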

Serving — 4×H200 covers up to ~128K context; the full 1M-token context needs 8×H200. Validated on Hopper (H200); Blackwell serving needs additional kernel flags. Built on GLM-5.2, quantized with llm-compressor, served with vLLM. Full recipe, evaluation methodology, and engineering log in the repo.
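As a rough illustration of the sizing rule above, the sketch below builds a `vllm serve` command line from a target context length. `--tensor-parallel-size` and `--max-model-len` are standard vLLM options; everything else is an assumption: "~128K" is read as 131,072 tokens, any longer context is mapped to the 8-GPU layout, pure tensor parallelism is assumed, and the MTP speculative-decoding and Blackwell kernel flags are deliberately left out because this card does not list them. The repo remains the reference for the real launch command.

```python
import shlex

MODEL = "canada-quant/GLM-5.2-W4A16-MTP"

def serve_command(max_context_tokens):
    """Return an argv list for `vllm serve` sized to the requested context.

    Assumption: up to 131072 tokens fits the 4-GPU layout, anything
    longer takes 8 GPUs. Speculative-decoding flags are omitted.
    """
    gpus = 4 if max_context_tokens <= 131_072 else 8
    return [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(gpus),
        "--max-model-len", str(max_context_tokens),
    ]

print(shlex.join(serve_command(131_072)))
print(shlex.join(serve_command(1_000_000)))
```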

Writeup: Running GLM-5.2 on half the GPUs: a W4A16 + MTP quantization.

Open releases — DeepSeek-V4 quantization family

Four artifacts in the same lineage. One base model in two sizes (V4-Flash, V4-Pro); two routed-expert formats (W4A16, NVFP4); Multi-Token Prediction (MTP) draft head retained on three of four. Attention is FP8 block 128×128 across all four.

Model	Base	Routed experts	MTP	On-disk	Min hardware (TP=2 unless noted)	When to pick
DeepSeek-V4-Flash-W4A16-FP8	V4-Flash	W4A16 INT4 g=128	no	~143 GB	H200 / DGX Spark / RTX PRO 6000	maximum compatibility, no MTP needed
DeepSeek-V4-Flash-W4A16-FP8-MTP	V4-Flash	W4A16 INT4 g=128	yes (BF16)	159 GB	H200 / RTX PRO 6000	best $/token interactive on V4-Flash
DeepSeek-V4-Flash-NVFP4-FP8-MTP	V4-Flash	NVFP4 g=16	yes (BF16)	172 GB	RTX PRO 6000 / B300	best Blackwell-native interactive on V4-Flash
DeepSeek-V4-Pro-NVFP4-FP8-MTP	V4-Pro	NVFP4 g=16	yes (byte-identical)	913 GiB	8× B300 (TP=8 + EP)	only choice for V4-Pro deployment; +25–37% throughput vs upstream MXFP4

Upstream reference recipes: RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 (Flash NVFP4 topology) and nvidia/DeepSeek-V3.2-NVFP4 (Pro NVFP4, MTP-exclusion topology).

Hardware shorthand

H200 — 8× NVIDIA H200 SXM5 (Hopper SM 9.0a, 141 GB HBM3e/GPU)
DGX Spark — 2× NVIDIA DGX Spark (GB10, Blackwell SM 12.1a)
RTX PRO 6000 — NVIDIA RTX PRO 6000 Blackwell Server Edition (SM 12.0, sm_120, 96 GB GDDR7)
B300 — NVIDIA B300 SXM6 AC (Blackwell SM 10.3, sm_103a, 288 GB HBM3e/GPU)
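The minimum-hardware column can be sanity-checked against the memory sizes in this shorthand. The sketch below is an approximation under stated assumptions: loaded weights are taken to equal the on-disk size, each Flash artifact is checked on two RTX PRO 6000 cards (the smallest listed option with a memory figure on this card), and the DGX Spark is skipped because its memory is not given here. Whatever headroom remains has to cover KV cache and activations.

```python
GPU_MEMORY_GB = {"H200": 141, "RTX PRO 6000": 96, "B300": 288}

GB_PER_GIB = 2**30 / 1e9  # the V4-Pro size is quoted in GiB

ARTIFACTS = [
    # (name, on-disk size in decimal GB, GPU, GPU count)
    ("DeepSeek-V4-Flash-W4A16-FP8", 143, "RTX PRO 6000", 2),
    ("DeepSeek-V4-Flash-W4A16-FP8-MTP", 159, "RTX PRO 6000", 2),
    ("DeepSeek-V4-Flash-NVFP4-FP8-MTP", 172, "RTX PRO 6000", 2),
    ("DeepSeek-V4-Pro-NVFP4-FP8-MTP", 913 * GB_PER_GIB, "B300", 8),
]

def headroom_gb(size_gb, gpu, count):
    """Pooled GPU memory minus weight size (assumes weights load at on-disk size)."""
    return GPU_MEMORY_GB[gpu] * count - size_gb

headrooms = {name: headroom_gb(size, gpu, n) for name, size, gpu, n in ARTIFACTS}
for name, spare in headrooms.items():
    print(f"{name}: {spare:.0f} GB left after weights")
```

On these assumptions the NVFP4 Flash artifact is the tightest fit on a two-card RTX PRO 6000 setup, so usable context length there is likely to be limited by the remaining memory.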

Reproduction repos

Every artifact has a public reproduction repo with calibration scripts, vLLM patches, bench harnesses, and findings docs.

Upstream contributions filed during this work

vLLM: PRs #42209 (merged — NVFP4 MoE for DSV4), #43248, #43288, #43290, #43319, #43467, #41511, #41700 (landed via jasl/vllm@1d6f5c4)
llm-compressor: #2745
compressed-tensors: #711