# tritllm-codec
Reference implementation of the balanced ternary post-training quantization codec from "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).
Quantizes FP16 LLM weights to balanced ternary at configurable depth d ∈ {1, 2, 3, 4} (3, 9, 27, or 81 levels per weight) with no calibration data and no per-model tuning. Output is dequantized FP16 safetensors that load into stock `transformers` and `lm-eval` without a custom loader.
## What gets quantized
The codec quantizes all 2D linear weight matrices in a model. The following are kept in FP16 and not counted in the BPW total:
- `lm_head` (output projection)
- Token embeddings (`embed_tokens`)
- All `*_norm` layers (RMSNorm, LayerNorm; these are 1D anyway)
This is the standard convention in quantization papers (see GPTQ, AWQ, NF4) and reflects the fact that embedding lookups and the final classifier are not GEMV-bound at inference time. Throughout the paper, "BPW" refers to the average bits-per-weight of the quantized matrices only.
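For concreteness, here is a minimal sketch of that selection rule, assuming standard Hugging Face state-dict naming (`lm_head`, `embed_tokens`); `should_quantize` is an illustrative helper, not part of the codec's API:

```python
import torch

def should_quantize(name: str, tensor: torch.Tensor) -> bool:
    """Return True for the 2D linear weights the codec quantizes."""
    if tensor.ndim != 2:
        return False                      # norms and biases are 1D
    if "lm_head" in name or "embed_tokens" in name:
        return False                      # kept FP16, excluded from BPW
    return name.endswith(".weight")
```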
## Install
```bash
pip install torch transformers safetensors numpy huggingface_hub
git clone https://huggingface.co/Entrit/tritllm-codec
cd tritllm-codec
```
## Quick start
```bash
# Quantize Qwen2.5-7B at uniform depth d=2 (3.47 bpw)
python quantize_model_v2.py \
    --model Qwen/Qwen2.5-7B \
    --configs uniform-d2 \
    --out ./out
```

```bash
# Multi-config single pass (computes scales once, derives 6 configs)
python quantize_model_v2.py \
    --model Qwen/Qwen2.5-7B \
    --configs uniform-d1,uniform-d2,uniform-d3,uniform-d4,d3scale-sens002,d3scale-sens003 \
    --out ./out
```
The output directory contains one HF-loadable model per config:
```
out/
  uniform-d2/
    model/
      config.json
      model.safetensors   # dequantized FP16
      tokenizer.json
      ...
```
Load like any HF model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained("./out/uniform-d2/model")
t = AutoTokenizer.from_pretrained("./out/uniform-d2/model")
```
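To sanity-check a dequantized checkpoint, a quick generation round trip with the stock API works (the prompt is arbitrary):

```python
inputs = t("Balanced ternary quantization", return_tensors="pt")
out = m.generate(**inputs, max_new_tokens=32)
print(t.decode(out[0], skip_special_tokens=True))
```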
## Settled design (don't change unless reproducing an ablation)
| Parameter | Value | Notes |
|---|---|---|
| Group size `G` | 16 | Per Section 6.1 of the paper, `gs=64` is also viable; `gs=16` gives the best PPL |
| Scale depth `d_s` | 3 | 27-entry log-spaced codebook per matrix |
| Power mapping | d1=1.0, d2=1.5, d3=1.2, d4=1.0 | Tuned once on Qwen2.5-7B, held fixed for all subsequent models |
| Scale candidates | indices [G-6, G-4, G-2, G-1] of sorted \|w\| | The MSE minimum over the 4 candidates is selected per group |
| Scale codebook range | log_min = 0.1th percentile of group \|w\|-maxes, log_max = max | Fixed in commit 0c16d24 (was the 99.9th percentile, which clipped) |
| `lm_head`, embeddings, norms | kept FP16 | See "What gets quantized" above |
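The table's choices compose into a per-group procedure roughly like the following. This is a simplified sketch, not the reference implementation: it assumes balanced ternary at depth d covers the symmetric integer levels ±(3^d − 1)/2, and it omits the per-depth power mapping and the snapping of each chosen scale onto the codebook; `scale_codebook` and `quantize_group` are illustrative names.

```python
import numpy as np

def scale_codebook(group_maxes: np.ndarray, d_s: int = 3) -> np.ndarray:
    # 3**d_s = 27 log-spaced entries spanning the 0.1th percentile of the
    # per-group |w|-maxes up to their max (the post-0c16d24 upper bound).
    lo = max(np.percentile(group_maxes, 0.1), np.finfo(np.float32).tiny)
    hi = group_maxes.max()
    return np.geomspace(lo, hi, num=3 ** d_s)

def quantize_group(w: np.ndarray, d: int, G: int = 16) -> np.ndarray:
    # Balanced ternary with d trits covers the symmetric integer levels
    # -(3**d - 1)//2 .. +(3**d - 1)//2 (3, 9, 27, 81 levels for d=1..4).
    qmax = (3 ** d - 1) // 2
    mags = np.sort(np.abs(w))
    # Four order-statistic scale candidates from the sorted magnitudes.
    candidates = mags[[G - 6, G - 4, G - 2, G - 1]] / qmax

    best_err, best_deq = np.inf, np.zeros_like(w)
    for s in candidates:
        if s == 0:
            continue
        q = np.clip(np.round(w / s), -qmax, qmax)
        deq = q * s
        err = float(np.sum((w - deq) ** 2))   # MSE minimum over the 4 candidates
        if err < best_err:
            best_err, best_deq = err, deq
    return best_deq
```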
## BPW calculation
```
bpw = d * log2(3) + d_s * log2(3) / G   # weights + per-group scales only
    = d * 1.585 + 0.297                 # for G=16, d_s=3
```
Resulting BPW: d1=1.88, d2=3.47, d3=5.05, d4=6.64.
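Those round numbers can be checked directly:

```python
import math

G, d_s = 16, 3
for d in (1, 2, 3, 4):
    bpw = d * math.log2(3) + d_s * math.log2(3) / G
    print(f"d={d}: {bpw:.2f} bpw")   # 1.88, 3.47, 5.05, 6.64
```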
## Reproducibility tips
- Pass `--revision <git-sha>` to pin the source model; without it, the upstream HF repo can move under you between runs.
- Each checkpoint stores a fingerprint of `(model, revision, codec version, group size, depth-power mapping)` and the matrix shape. On resume, mismatched checkpoints are discarded and re-quantized rather than silently mixed.
- The assembled `config.json` records the full fingerprint, so you can verify which source model and codec version produced any given output (see the sketch below).
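A minimal way to inspect the recorded fingerprint; the key name `tritllm_fingerprint` is a guess, since the README does not spell out the field, so check an actual `config.json` for the real key:

```python
import json

with open("./out/uniform-d2/model/config.json") as f:
    cfg = json.load(f)

# Hypothetical key: the exact fingerprint field name is an assumption.
print(cfg.get("tritllm_fingerprint"))
```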
## Known limitations
Two design tradeoffs (not bugs) are documented in `KNOWN_ISSUES.md`: the 4-candidate scale search and the `log_max = max(...)` codebook upper bound. Both are intentional choices; the file explains the reasoning and what to look for in new model families.
## Citation
```bibtex
@article{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = {2026},
  note   = {Entrit Systems}
}
```
## Models quantized with this codec
See the Entrit organization page for prequantized model checkpoints across Qwen2.5 (0.5B–72B), Llama-3.1-8B, and Mistral-7B at depths d=1 through d=4.