---
license: apache-2.0
tags:
- quantization
- ternary
- llm
- post-training-quantization
library_name: transformers
---
# tritllm-codec
Reference implementation of the balanced ternary post-training quantization codec from
**"Balanced Ternary Post-Training Quantization for Large Language Models"** (Stentzel, 2026).
The codec quantizes FP16 LLM weights to balanced ternary at configurable depth `d ∈ {1, 2, 3, 4}` (3, 9, 27, or 81 levels per weight), with no calibration data and no per-model tuning. Output is dequantized FP16 safetensors that load into stock `transformers` and `lm-eval` without a custom loader.
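The level counts above follow directly from the balanced-ternary representation: depth `d` gives `3^d` integer levels symmetric around zero. A minimal sketch (illustrative only — the codec's actual per-group scaling is more involved, see "Settled design" below):

```python
# Balanced-ternary level grid and nearest-level rounding (sketch).

def bt_levels(d: int) -> list[int]:
    """Integer levels for balanced ternary at depth d: 3**d symmetric values."""
    half = (3 ** d - 1) // 2
    return list(range(-half, half + 1))

def quantize(w: float, scale: float, d: int) -> float:
    """Round w/scale to the nearest level, clamp, and dequantize."""
    half = (3 ** d - 1) // 2
    q = max(-half, min(half, round(w / scale)))
    return q * scale

print(bt_levels(1))             # [-1, 0, 1]
print(len(bt_levels(2)))        # 9
print(quantize(0.37, 0.1, 2))   # 0.4
```

So d=1 is classic ternary `{-1, 0, 1}`; each extra depth multiplies the level count by 3.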
## What gets quantized
The codec quantizes all 2D linear weight matrices in a model. **The following are kept in FP16 and not counted in the BPW total:**
- `lm_head` (output projection)
- Token embeddings (`embed_tokens`)
- All `*_norm` layers (RMSNorm, LayerNorm — these are 1D anyway)
This is the standard convention in quantization papers (see GPTQ, AWQ, NF4) and reflects the fact that embedding lookups and the final classifier are not GEMV-bound at inference time. Throughout the paper, "BPW" refers to the average bits-per-weight of the quantized matrices only.
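The selection rule above can be sketched as a simple name-and-shape filter (the helper name and exclusion substrings here are illustrative, not the codec's actual API):

```python
# Sketch of the quantization selection rule: quantize 2D linear weights,
# keep lm_head, embeddings, and norms in FP16.

EXCLUDE_SUBSTRINGS = ("lm_head", "embed_tokens", "norm")

def should_quantize(name: str, shape: tuple[int, ...]) -> bool:
    if len(shape) != 2:            # norms are 1D and skipped anyway
        return False
    return not any(s in name for s in EXCLUDE_SUBSTRINGS)

print(should_quantize("model.layers.0.mlp.gate_proj.weight", (11008, 4096)))  # True
print(should_quantize("lm_head.weight", (151936, 4096)))                      # False
print(should_quantize("model.layers.0.input_layernorm.weight", (4096,)))      # False
```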
## Install
```bash
pip install torch transformers safetensors numpy huggingface_hub
git clone https://huggingface.co/Entrit/tritllm-codec
cd tritllm-codec
```
## Quick start
```bash
# Quantize Qwen2.5-7B at uniform depth d=2 (3.47 bpw)
python quantize_model_v2.py \
--model Qwen/Qwen2.5-7B \
--configs uniform-d2 \
--out ./out
# Multi-config single pass (computes scales once, derives 6 configs)
python quantize_model_v2.py \
--model Qwen/Qwen2.5-7B \
--configs uniform-d1,uniform-d2,uniform-d3,uniform-d4,d3scale-sens002,d3scale-sens003 \
--out ./out
```
The output directory contains one HF-loadable model per config:
```
out/
uniform-d2/
model/
config.json
model.safetensors # dequantized FP16
tokenizer.json
...
```
Load like any HF model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained("./out/uniform-d2/model")
t = AutoTokenizer.from_pretrained("./out/uniform-d2/model")
```
## Settled design (don't change unless reproducing an ablation)
| Parameter | Value | Notes |
|---|---|---|
| Group size `G` | 16 | Per Section 6.1 of the paper, `G`=64 is also viable; `G`=16 gives the best perplexity |
| Scale depth `d_s` | 3 | 27-entry log-spaced codebook per matrix |
| Power mapping | d1=1.0, d2=1.5, d3=1.2, d4=1.0 | Tuned once on Qwen2.5-7B, held fixed for all subsequent models |
| Scale candidates | indices `[G-6, G-4, G-2, G-1]` of sorted `\|w\|` | MSE-minimum over the 4 candidates is selected per group |
| Scale codebook range | `log_min` = 0.1th percentile of group `\|w\|`-maxes, `log_max` = max | Fixed in commit `0c16d24` (was 99.9th percentile, which clipped) |
| `lm_head`, embeddings, norms | kept FP16 | See "What gets quantized" above |
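The 4-candidate scale search in the table can be sketched as follows. This is a simplified illustration under two assumptions not spelled out above: it maps each candidate magnitude to the largest ternary level, and it omits the power mapping and the `d_s`=3 scale codebook entirely.

```python
import numpy as np

G = 16  # group size

def best_scale(group: np.ndarray, d: int) -> float:
    """Pick the MSE-minimizing scale among 4 order-statistic candidates."""
    half = (3 ** d - 1) // 2
    mags = np.sort(np.abs(group))
    best, best_err = mags[-1] / half, np.inf
    for idx in (G - 6, G - 4, G - 2, G - 1):   # candidate order statistics
        scale = mags[idx] / half
        if scale == 0:
            continue
        q = np.clip(np.round(group / scale), -half, half) * scale
        err = np.mean((group - q) ** 2)
        if err < best_err:
            best, best_err = scale, err
    return float(best)

rng = np.random.default_rng(0)
g = rng.normal(size=G).astype(np.float32)
s = best_scale(g, d=2)
print(f"selected scale: {s:.4f}")
```

Because the group maximum (`G-1`) is always among the candidates, the search can never do worse than naive max-scaling; the smaller order statistics let outlier weights clip when that reduces overall group error.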
## BPW calculation
```
bpw = d * log2(3) + d_s * log2(3) / G   # weights + amortized scale metadata
    = d * 1.585 + 0.297                 # for G = 16, d_s = 3
```
Resulting BPW: d1=1.88, d2=3.47, d3=5.05, d4=6.64.
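The figures above can be reproduced directly: each weight costs `d` trits, plus `d_s` trits of scale metadata amortized over each group of `G` weights.

```python
from math import log2

# Reproduce the BPW table: d trits per weight + d_s trits per group of G.
G, d_s = 16, 3
for d in (1, 2, 3, 4):
    bpw = d * log2(3) + d_s * log2(3) / G
    print(f"d={d}: {bpw:.2f} bpw")
```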
## Reproducibility tips
- Pass `--revision <git-sha>` to pin the source model — without it the upstream HF repo can move under you between runs.
- Each checkpoint stores a fingerprint of `(model, revision, codec version, group size, depth-power mapping)` and the matrix shape. On resume, mismatched checkpoints are discarded and re-quantized rather than silently mixed.
- The assembled `config.json` records the full fingerprint, so you can verify which source model and codec version produced any given output.
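A sketch of what such a fingerprint check might look like — the field names and hashing scheme here are illustrative, not the codec's actual checkpoint schema:

```python
import hashlib
import json

def fingerprint(meta: dict) -> str:
    """Stable digest of run metadata (illustrative schema)."""
    blob = json.dumps(meta, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

ckpt_meta = {"model": "Qwen/Qwen2.5-7B", "revision": "abc123",
             "codec": "v2", "group_size": 16, "powers": {"d2": 1.5}}
run_meta = {**ckpt_meta, "group_size": 64}

# On resume, a mismatch means the checkpoint is discarded and re-quantized
# rather than silently mixed with the new run's settings.
stale = fingerprint(ckpt_meta) != fingerprint(run_meta)
print(stale)  # True
```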
## Known limitations
Two design tradeoffs (not bugs) are documented in [KNOWN_ISSUES.md](KNOWN_ISSUES.md): the 4-candidate scale search and the `log_max = max(...)` codebook upper bound. Both are intentional choices; the file explains the reasoning and what to look for in new model families.
## Citation
```
@article{stentzel2026ternaryptq,
title = {Balanced Ternary Post-Training Quantization for Large Language Models},
author = {Stentzel, Eric},
year = 2026,
note = {Entrit Systems}
}
```
## Models quantized with this codec
See the [Entrit organization page](https://huggingface.co/Entrit) for prequantized model checkpoints across Qwen2.5 (0.5B–72B), Llama-3.1-8B, and Mistral-7B at depths d=1 through d=4.