---
license: apache-2.0
tags:
- quantization
- ternary
- llm
- post-training-quantization
library_name: transformers
---

# tritllm-codec

Reference implementation of the balanced ternary post-training quantization codec from **"Balanced Ternary Post-Training Quantization for Large Language Models"** (Stentzel, 2026).

Quantizes FP16 LLM weights to balanced ternary at configurable depth `d ∈ {1, 2, 3, 4}` (3, 9, 27, 81 levels per weight) with no calibration data and no per-model tuning. Output is dequantized FP16 safetensors that load into stock `transformers` and `lm-eval` without a custom loader.

## What gets quantized

The codec quantizes all 2D linear weight matrices in a model. **The following are kept in FP16 and not counted in the BPW total:**

- `lm_head` (output projection)
- Token embeddings (`embed_tokens`)
- All `*_norm` layers (RMSNorm, LayerNorm — these are 1D anyway)

This is the standard convention in quantization papers (see GPTQ, AWQ, NF4) and reflects the fact that embedding lookups and the final classifier are not GEMV-bound at inference time. Throughout the paper, "BPW" refers to the average bits-per-weight of the quantized matrices only.

## Install

```bash
pip install torch transformers safetensors numpy huggingface_hub
git clone https://huggingface.co/Entrit/tritllm-codec
cd tritllm-codec
```

## Quick start

```bash
# Quantize Qwen2.5-7B at uniform depth d=2 (3.47 bpw)
python quantize_model_v2.py \
  --model Qwen/Qwen2.5-7B \
  --configs uniform-d2 \
  --out ./out

# Multi-config single pass (computes scales once, derives 6 configs)
python quantize_model_v2.py \
  --model Qwen/Qwen2.5-7B \
  --configs uniform-d1,uniform-d2,uniform-d3,uniform-d4,d3scale-sens002,d3scale-sens003 \
  --out ./out
```

The output directory contains one HF-loadable model per config:

```
out/
  uniform-d2/
    model/
      config.json
      model.safetensors    # dequantized FP16
      tokenizer.json
      ...
```

Load like any HF model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained("./out/uniform-d2/model")
t = AutoTokenizer.from_pretrained("./out/uniform-d2/model")
```

## Settled design (don't change unless reproducing an ablation)

| Parameter | Value | Notes |
|---|---|---|
| Group size `G` | 16 | Per Section 6.1 of the paper, G=64 is also viable; G=16 gives the best PPL |
| Scale depth `d_s` | 3 | 27-entry log-spaced codebook per matrix |
| Power mapping | d1=1.0, d2=1.5, d3=1.2, d4=1.0 | Tuned once on Qwen2.5-7B, held fixed for all subsequent models |
| Scale candidates | indices `[G-6, G-4, G-2, G-1]` of sorted `\|w\|` | MSE-minimum over the 4 candidates is selected per group |
| Scale codebook range | `log_min` = 0.1th percentile of group `\|w\|`-maxes, `log_max` = max | Fixed in commit `0c16d24` (was 99.9th percentile, which clipped) |
| `lm_head`, embeddings, norms | kept FP16 | See "What gets quantized" above |

## BPW calculation

```
bpw = d * log2(3) + d_s * log2(3) / G    # weight trits + amortized scale trits
    = d * 1.585 + 0.297                  # for G=16, d_s=3
```

Resulting BPW: d1=1.88, d2=3.47, d3=5.05, d4=6.64.
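A quick arithmetic check of the figures above (pure `math`, nothing codec-specific):

```python
# Reproduce the BPW table: d weight trits plus d_s scale trits
# amortized over each group of G weights.
from math import log2

G, d_s = 16, 3
for d in (1, 2, 3, 4):
    print(f"d={d}: {d * log2(3) + d_s * log2(3) / G:.2f} bpw")
# -> d=1: 1.88, d=2: 3.47, d=3: 5.05, d=4: 6.64
```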
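To make the settled design concrete, here is a minimal sketch of the per-group scale search. It is an illustration, not the code in `quantize_model_v2.py`: the integer level grid `-(3^d-1)/2 … (3^d-1)/2`, the placement of the power mapping as a plain exponent on the normalized magnitude, and the omission of the `d_s=3` scale codebook (scales stay floating point here) are all simplifying assumptions.

```python
# Minimal sketch of the per-group scale search and ternary rounding.
# Assumptions (not taken from the released code): integer level grid
# -(3^d-1)/2 .. (3^d-1)/2, power mapping applied as a plain exponent on
# the normalized magnitude, scale codebook step omitted.
import numpy as np

G = 16                                     # group size (settled design)
POWER = {1: 1.0, 2: 1.5, 3: 1.2, 4: 1.0}   # per-depth power mapping

def quantize_group(w: np.ndarray, d: int) -> np.ndarray:
    """Dequantized approximation of one group of G weights at depth d."""
    qmax = (3 ** d - 1) // 2               # d=2 -> levels -4..4 (9 levels)
    p = POWER[d]
    mags = np.sort(np.abs(w))
    # 4 scale candidates: indices [G-6, G-4, G-2, G-1] of sorted |w|
    best, best_err = w, np.inf
    for s in mags[[G - 6, G - 4, G - 2, G - 1]]:
        if s == 0.0:
            continue
        t = np.clip(np.abs(w) / s, 0.0, 1.0) ** p    # warp magnitudes
        q = np.sign(w) * np.round(t * qmax)          # integer ternary levels
        deq = np.sign(q) * (np.abs(q) / qmax) ** (1 / p) * s
        err = np.sum((w - deq) ** 2)
        if err < best_err:                           # MSE-minimum candidate wins
            best, best_err = deq, err
    return best

rng = np.random.default_rng(0)
print(quantize_group(rng.normal(size=G).astype(np.float32), d=2))
```

Only the candidate indices and the MSE-minimum selection come from the table above; treat the rest as scaffolding.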
## Reproducibility tips

- Pass `--revision` to pin the source model — without it the upstream HF repo can move under you between runs.
- Each checkpoint stores a fingerprint of `(model, revision, codec version, group size, depth-power mapping)` and the matrix shape. On resume, mismatched checkpoints are discarded and re-quantized rather than silently mixed.
- The assembled `config.json` records the full fingerprint, so you can verify which source model and codec version produced any given output.

## Known limitations

Two design tradeoffs (not bugs) are documented in [KNOWN_ISSUES.md](KNOWN_ISSUES.md): the 4-candidate scale search and the `log_max = max(...)` codebook upper bound. Both are intentional choices; the file explains the reasoning and what to look for in new model families.

## Citation

```
@article{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = 2026,
  note   = {Entrit Systems}
}
```

## Models quantized with this codec

See the [Entrit organization page](https://huggingface.co/Entrit) for prequantized model checkpoints across Qwen2.5 (0.5B–72B), Llama-3.1-8B, and Mistral-7B at depths d=1 through d=4.
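Prequantized checkpoints load exactly like the local output shown in the quick start. The repo id below is illustrative, not an actual checkpoint name; see the org page for the real ones.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- substitute a real checkpoint from the Entrit org page.
repo = "Entrit/Qwen2.5-7B-tritllm-uniform-d2"
m = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")
t = AutoTokenizer.from_pretrained(repo)
```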