---
license: apache-2.0
tags:
  - quantization
  - ternary
  - llm
  - post-training-quantization
library_name: transformers
---

# tritllm-codec

Reference implementation of the balanced ternary post-training quantization codec from
**"Balanced Ternary Post-Training Quantization for Large Language Models"** (Stentzel, 2026).

Quantizes FP16 LLM weights to balanced ternary at configurable depth `d ∈ {1, 2, 3, 4}` (3, 9, 27, 81 levels per weight) with no calibration data and no per-model tuning. Output is dequantized FP16 safetensors that load into stock `transformers` and `lm-eval` without a custom loader.
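
For intuition, here is a minimal sketch of what depth-`d` balanced-ternary rounding looks like. The per-group scale search, power mapping, and scale codebook described below are omitted, so treat this as illustrative rather than the released codec:

```python
import torch

def dequantize_group_bt(w: torch.Tensor, d: int, scale: float) -> torch.Tensor:
    """Illustrative only: snap a group of FP16 weights onto a d-digit
    balanced-ternary grid (3**d symmetric levels) and return the
    dequantized FP16 values, which is the form the codec ships."""
    max_level = (3 ** d - 1) // 2          # d=1 -> 1, d=2 -> 4, d=3 -> 13, d=4 -> 40
    q = torch.round(w / scale).clamp_(-max_level, max_level)
    return (q * scale).to(w.dtype)
```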

## What gets quantized

The codec quantizes all 2D linear weight matrices in a model. **The following are kept in FP16 and not counted in the BPW total:**

- `lm_head` (output projection)
- Token embeddings (`embed_tokens`)
- All `*_norm` layers (RMSNorm, LayerNorm — these are 1D anyway)

This is the standard convention in quantization papers (see GPTQ, AWQ, NF4) and reflects the fact that embedding lookups and the final classifier are not GEMV-bound at inference time. Throughout the paper, "BPW" refers to the average bits-per-weight of the quantized matrices only.
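
A rough sketch of that selection rule, assuming a simple name-and-shape filter over `named_parameters` (the actual filter in `quantize_model_v2.py` may differ):

```python
import torch

FP16_KEEP = ("lm_head", "embed_tokens", "norm")    # left unquantized, per the list above

def quantizable_weights(model: torch.nn.Module):
    """Yield (name, tensor) for the 2D linear weight matrices that would be
    quantized; embeddings, the output head, and 1D norm/bias tensors stay FP16."""
    for name, param in model.named_parameters():
        if param.ndim != 2:                        # norms and biases are 1D
            continue
        if any(key in name for key in FP16_KEEP):  # lm_head / embed_tokens are 2D but kept
            continue
        yield name, param
```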

## Install

```bash
pip install torch transformers safetensors numpy huggingface_hub
git clone https://huggingface.co/Entrit/tritllm-codec
cd tritllm-codec
```

## Quick start

```bash
# Quantize Qwen2.5-7B at uniform depth d=2 (3.47 bpw)
python quantize_model_v2.py \
    --model Qwen/Qwen2.5-7B \
    --configs uniform-d2 \
    --out ./out

# Multi-config single pass (computes scales once, derives 6 configs)
python quantize_model_v2.py \
    --model Qwen/Qwen2.5-7B \
    --configs uniform-d1,uniform-d2,uniform-d3,uniform-d4,d3scale-sens002,d3scale-sens003 \
    --out ./out
```

The output directory contains one HF-loadable model per config:

```
out/
  uniform-d2/
    model/
      config.json
      model.safetensors          # dequantized FP16
      tokenizer.json
      ...
```

Load like any HF model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained("./out/uniform-d2/model")
t = AutoTokenizer.from_pretrained("./out/uniform-d2/model")
```

## Settled design (don't change unless reproducing an ablation)

| Parameter | Value | Notes |
|---|---|---|
| Group size `G` | 16 | Per Section 6.1 of the paper, G=64 is also viable; G=16 gives the best PPL |
| Scale depth `d_s` | 3 | 27-entry log-spaced codebook per matrix |
| Power mapping | d1=1.0, d2=1.5, d3=1.2, d4=1.0 | Tuned once on Qwen2.5-7B, held fixed for all subsequent models |
| Scale candidates | indices `[G-6, G-4, G-2, G-1]` of sorted `\|w\|` | MSE-minimum over the 4 candidates is selected per group (sketched below) |
| Scale codebook range | `log_min` = 0.1th percentile of group `\|w\|`-maxes, `log_max` = max | Fixed in commit `0c16d24` (was 99.9th percentile, which clipped) |
| `lm_head`, embeddings, norms | kept FP16 | See "What gets quantized" above |
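
A sketch of the 4-candidate scale search from the table above. How each candidate `|w|` becomes a step size (dividing by the top quantization level) is an assumption here, and the snap to the 27-entry log-spaced codebook is omitted:

```python
import torch

def pick_group_scale(group: torch.Tensor, d: int, G: int = 16) -> float:
    """Try the (G-6)th, (G-4)th, (G-2)th and (G-1)th entries of sorted |w|
    as the group maximum, quantize the group against each candidate, and
    keep the one with the lowest reconstruction MSE."""
    max_level = (3 ** d - 1) // 2
    sorted_abs, _ = torch.sort(group.abs())
    best_scale, best_err = 1.0, float("inf")
    for idx in (G - 6, G - 4, G - 2, G - 1):
        scale = sorted_abs[idx].item() / max_level
        if scale == 0.0:
            continue                               # degenerate candidate, skip
        q = torch.round(group / scale).clamp_(-max_level, max_level)
        err = torch.mean((group - q * scale) ** 2).item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```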

## BPW calculation

```
bpw = d * log2(3) + d_s * log2(3) / G                          # weights + per-group scale index
    = d * 1.585 + 0.297                                        # for G=16, d_s=3
```

Resulting BPW: d1=1.88, d2=3.47, d3=5.05, d4=6.64.
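
The same arithmetic in code, for checking against the numbers above:

```python
from math import log2

G, d_s = 16, 3                                 # group size and scale-codebook depth
for d in (1, 2, 3, 4):
    bpw = d * log2(3) + d_s * log2(3) / G      # ternary digits + amortized scale index
    print(f"d={d}: {bpw:.2f} bpw")             # 1.88, 3.47, 5.05, 6.64
```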

## Reproducibility tips

- Pass `--revision <git-sha>` to pin the source model — without it the upstream HF repo can move under you between runs.
- Each checkpoint stores a fingerprint of `(model, revision, codec version, group size, depth-power mapping)` and the matrix shape. On resume, mismatched checkpoints are discarded and re-quantized rather than silently mixed.
- The assembled `config.json` records the full fingerprint so you can verify which source model and codec version produced any given output (a minimal sketch of such a fingerprint is below).
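
A hypothetical sketch of the kind of fingerprint described above; the actual fields and hashing in the codec may differ:

```python
import hashlib
import json

def checkpoint_fingerprint(model_id: str, revision: str, codec_version: str,
                           group_size: int, power_map: dict, shape: tuple) -> str:
    """Hash the run configuration plus matrix shape so a resumed run can
    detect and discard checkpoints produced under different settings."""
    payload = json.dumps({
        "model": model_id,
        "revision": revision,
        "codec": codec_version,
        "group_size": group_size,
        "power_map": power_map,
        "shape": list(shape),
    }, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```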

## Known limitations

Two design tradeoffs (not bugs) are documented in [KNOWN_ISSUES.md](KNOWN_ISSUES.md): the 4-candidate scale search and the `log_max = max(...)` codebook upper bound. Both are intentional choices; the file explains the reasoning and what to look for in new model families.

## Citation

```
@article{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = 2026,
  note   = {Entrit Systems}
}
```

## Models quantized with this codec

See the [Entrit organization page](https://huggingface.co/Entrit) for prequantized model checkpoints across Qwen2.5 (0.5B–72B), Llama-3.1-8B, and Mistral-7B at depths d=1 through d=4.