---
license: apache-2.0
tags:
- quantization
- ternary
- llm
- post-training-quantization
library_name: transformers
---

# tritllm-codec
|
|
Reference implementation of the balanced ternary post-training quantization codec from
**"Balanced Ternary Post-Training Quantization for Large Language Models"** (Stentzel, 2026).
|
|
Quantizes FP16 LLM weights to balanced ternary at configurable depth `d ∈ {1, 2, 3, 4}` (3, 9, 27, or 81 levels per weight) with no calibration data and no per-model tuning. Output is dequantized FP16 safetensors that load into stock `transformers` and `lm-eval` without a custom loader.
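For intuition on where the 3, 9, 27, 81 level counts come from: a depth-`d` balanced ternary code is `d` digits drawn from `{-1, 0, +1}`, giving `3**d` distinct levels. The toy sketch below enumerates those levels as plain integers; it is purely illustrative and ignores the per-depth power mapping and per-group scales described later in this card.

```python
# Toy illustration only -- not the codec's code path. A depth-d balanced
# ternary code has d digits in {-1, 0, +1}, i.e. 3**d representable levels.
from itertools import product

def bt_levels(d: int) -> list[int]:
    # Interpret d balanced-ternary digits as the integer sum(t_i * 3**i),
    # which covers [-(3**d - 1) // 2, +(3**d - 1) // 2] exactly once each.
    return sorted(sum(t * 3**i for i, t in enumerate(digits))
                  for digits in product((-1, 0, 1), repeat=d))

print(len(bt_levels(2)), bt_levels(2))  # 9 [-4, -3, -2, -1, 0, 1, 2, 3, 4]
```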
|
|
## What gets quantized
|
|
The codec quantizes all 2D linear weight matrices in a model. **The following are kept in FP16 and not counted in the BPW total:**
|
|
- `lm_head` (output projection)
- Token embeddings (`embed_tokens`)
- All `*_norm` layers (RMSNorm, LayerNorm; these are 1D anyway)
|
|
This is the standard convention in quantization papers (see GPTQ, AWQ, NF4) and reflects the fact that embedding lookups and the final classifier are not GEMV-bound at inference time. Throughout the paper, "BPW" refers to the average bits-per-weight of the quantized matrices only.
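In code terms, the skip rule amounts to a simple predicate over the module tree. A minimal sketch, assuming a `transformers`-style model; the substring checks and the helper name are illustrative, not the codec's exact matching logic:

```python
import torch.nn as nn

SKIP_SUBSTRINGS = ("lm_head", "embed_tokens", "norm")  # kept in FP16

def quantizable_weights(model: nn.Module):
    """Yield (name, weight) for every 2D linear weight the codec would
    quantize, skipping the FP16-kept modules listed above."""
    for name, module in model.named_modules():
        if any(s in name for s in SKIP_SUBSTRINGS):
            continue
        w = getattr(module, "weight", None)
        if isinstance(module, nn.Linear) and w is not None and w.dim() == 2:
            yield name, w
```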
|
|
## Install
|
|
```bash
pip install torch transformers safetensors numpy huggingface_hub
git clone https://huggingface.co/Entrit/tritllm-codec
cd tritllm-codec
```
|
|
## Quick start
|
|
```bash
# Quantize Qwen2.5-7B at uniform depth d=2 (3.47 bpw)
python quantize_model_v2.py \
  --model Qwen/Qwen2.5-7B \
  --configs uniform-d2 \
  --out ./out

# Multi-config single pass (computes scales once, derives 6 configs)
python quantize_model_v2.py \
  --model Qwen/Qwen2.5-7B \
  --configs uniform-d1,uniform-d2,uniform-d3,uniform-d4,d3scale-sens002,d3scale-sens003 \
  --out ./out
```
|
|
The output directory contains one HF-loadable model per config:
|
|
```
out/
  uniform-d2/
    model/
      config.json
      model.safetensors   # dequantized FP16
      tokenizer.json
      ...
```
|
|
Load like any HF model:
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained("./out/uniform-d2/model")
t = AutoTokenizer.from_pretrained("./out/uniform-d2/model")
```
|
|
## Settled design (don't change unless reproducing an ablation)
|
|
| Parameter | Value | Notes |
|---|---|---|
| Group size `G` | 16 | Per Section 6.1 of the paper, gs=64 is also viable; gs=16 gives best PPL |
| Scale depth `d_s` | 3 | 27-entry log-spaced codebook per matrix |
| Power mapping | d1=1.0, d2=1.5, d3=1.2, d4=1.0 | Tuned once on Qwen2.5-7B, held fixed for all subsequent models |
| Scale candidates | indices `[G-6, G-4, G-2, G-1]` of sorted `\|w\|` | MSE-minimum over the 4 candidates is selected per group |
| Scale codebook range | `log_min` = 0.1th percentile of group `\|w\|`-maxes, `log_max` = max | Fixed in commit `0c16d24` (was 99.9th percentile, which clipped) |
| `lm_head`, embeddings, norms | kept FP16 | See "What gets quantized" above |
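To make the scale-candidate row concrete, here is a minimal d=1 (plain ternary) sketch of the per-group 4-candidate scale search. It is a sketch under stated assumptions, not the codec's implementation: it uses round-to-nearest ternary levels, and it omits the depth-power mapping and the `d_s=3` scale codebook; the function name is ours.

```python
import torch

def quantize_group_d1(w: torch.Tensor, G: int = 16) -> torch.Tensor:
    """Illustrative d=1 per-group quantize/dequantize with the 4-candidate
    scale search described in the table above. Sketch only."""
    assert w.numel() % G == 0
    shape = w.shape
    w = w.reshape(-1, G)
    absw_sorted, _ = w.abs().sort(dim=1)                  # ascending per group
    cands = absw_sorted[:, [G - 6, G - 4, G - 2, G - 1]]  # (n_groups, 4) scale candidates
    best_err = torch.full((w.shape[0],), float("inf"), dtype=w.dtype, device=w.device)
    best_q = torch.zeros_like(w)
    for k in range(cands.shape[1]):
        s = cands[:, k : k + 1].clamp_min(1e-12)
        q = (w / s).round().clamp_(-1, 1) * s             # round-to-nearest ternary
        err = ((q - w) ** 2).sum(dim=1)                   # per-group MSE
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_q = torch.where(better.unsqueeze(1), q, best_q)
    return best_q.reshape(shape)
```

Intuitively, drawing candidates from high order statistics of `|w|` rather than only the group max lets the MSE criterion trade a little clipping of the largest weights for finer resolution on the rest of the group.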
|
|
## BPW calculation
|
|
```
bpw = d * log2(3) + d_s * log2(3) / G   # weights + per-group scales only
    = d * 1.585 + 0.297                 # for G=16, d_s=3
```
|
|
Resulting BPW: d1=1.88, d2=3.47, d3=5.05, d4=6.64.
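A quick way to reproduce those numbers:

```python
from math import log2

G, d_s = 16, 3
for d in (1, 2, 3, 4):
    print(f"d{d}: {d * log2(3) + d_s * log2(3) / G:.2f} bpw")
# d1: 1.88, d2: 3.47, d3: 5.05, d4: 6.64
```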
|
|
## Reproducibility tips
|
|
- Pass `--revision <git-sha>` to pin the source model; without it, the upstream HF repo can move under you between runs.
- Each checkpoint stores a fingerprint of `(model, revision, codec version, group size, depth-power mapping)` and the matrix shape. On resume, mismatched checkpoints are discarded and re-quantized rather than silently mixed (see the sketch below).
- The assembled `config.json` records the full fingerprint, so you can verify which source model and codec version produced any given output.
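A hypothetical illustration of the resume-time check; the field names and the helper are ours, not the codec's actual schema:

```python
# Hypothetical fingerprint comparison, illustrating the resume behavior
# described above. Field names are illustrative; the real schema may differ.
def checkpoint_is_reusable(saved: dict, current: dict) -> bool:
    keys = ("model", "revision", "codec_version", "group_size",
            "depth_power_mapping", "matrix_shape")
    return all(saved.get(k) == current.get(k) for k in keys)
```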
|
|
## Known limitations
|
|
Two design tradeoffs (not bugs) are documented in [KNOWN_ISSUES.md](KNOWN_ISSUES.md): the 4-candidate scale search and the `log_max = max(...)` codebook upper bound. Both are intentional choices; the file explains the reasoning and what to look for in new model families.
|
|
## Citation
|
|
```
@article{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = {2026},
  note   = {Entrit Systems}
}
```
|
|
## Models quantized with this codec
|
|
See the [Entrit organization page](https://huggingface.co/Entrit) for prequantized model checkpoints across Qwen2.5 (0.5B–72B), Llama-3.1-8B, and Mistral-7B at depths d=1 through d=4.
|
|