---
license: apache-2.0
tags:
- quantization
- ternary
- llm
- post-training-quantization
library_name: transformers
---

# tritllm-codec
|
|
Reference implementation of the balanced ternary post-training quantization codec from
**"Balanced Ternary Post-Training Quantization for Large Language Models"** (Stentzel, 2026).
|
|
Quantizes FP16 LLM weights to balanced ternary at configurable depth `d ∈ {1, 2, 3, 4}` (3, 9, 27, or 81 levels per weight) with no calibration data and no per-model tuning. Output is dequantized FP16 safetensors that load into stock `transformers` and `lm-eval` without a custom loader.
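For intuition on where the 3, 9, 27, 81 level counts come from: a depth-`d` balanced ternary code is `d` digits drawn from `{-1, 0, +1}`, giving `3**d` distinct levels. The toy sketch below enumerates those levels as plain integers; it is purely illustrative and ignores the per-depth power mapping and per-group scales described later in this card.

```python
# Toy illustration only -- not the codec's code path. A depth-d balanced
# ternary code has d digits in {-1, 0, +1}, i.e. 3**d representable levels.
from itertools import product

def bt_levels(d: int) -> list[int]:
    # Interpret d balanced-ternary digits as the integer sum(t_i * 3**i),
    # which covers [-(3**d - 1) // 2, +(3**d - 1) // 2] exactly once each.
    return sorted(sum(t * 3**i for i, t in enumerate(digits))
                  for digits in product((-1, 0, 1), repeat=d))

print(len(bt_levels(2)), bt_levels(2))  # 9 [-4, -3, -2, -1, 0, 1, 2, 3, 4]
```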
|
|
## What gets quantized
|
|
The codec quantizes all 2D linear weight matrices in a model. **The following are kept in FP16 and not counted in the BPW total:**
|
|
- `lm_head` (output projection)
- Token embeddings (`embed_tokens`)
- All `*_norm` layers (RMSNorm, LayerNorm; these are 1D anyway)
|
|
This is the standard convention in quantization papers (see GPTQ, AWQ, NF4) and reflects the fact that embedding lookups and the final classifier are not GEMV-bound at inference time. Throughout the paper, "BPW" refers to the average bits-per-weight of the quantized matrices only.
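In code terms, the skip rule amounts to a simple predicate over the module tree. A minimal sketch, assuming a `transformers`-style model; the substring checks and the helper name are illustrative, not the codec's exact matching logic:

```python
import torch.nn as nn

SKIP_SUBSTRINGS = ("lm_head", "embed_tokens", "norm")  # kept in FP16

def quantizable_weights(model: nn.Module):
    """Yield (name, weight) for every 2D linear weight the codec would
    quantize, skipping the FP16-kept modules listed above."""
    for name, module in model.named_modules():
        if any(s in name for s in SKIP_SUBSTRINGS):
            continue
        w = getattr(module, "weight", None)
        if isinstance(module, nn.Linear) and w is not None and w.dim() == 2:
            yield name, w
```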
|
|
## Install
|
|
```bash
pip install torch transformers safetensors numpy huggingface_hub
git clone https://huggingface.co/Entrit/tritllm-codec
cd tritllm-codec
```
|
|
## Quick start
|
|
```bash
# Quantize Qwen2.5-7B at uniform depth d=2 (3.47 bpw)
python quantize_model_v2.py \
  --model Qwen/Qwen2.5-7B \
  --configs uniform-d2 \
  --out ./out

# Multi-config single pass (computes scales once, derives 6 configs)
python quantize_model_v2.py \
  --model Qwen/Qwen2.5-7B \
  --configs uniform-d1,uniform-d2,uniform-d3,uniform-d4,d3scale-sens002,d3scale-sens003 \
  --out ./out
```
|
|
The output directory contains one HF-loadable model per config:
|
|
```
out/
  uniform-d2/
    model/
      config.json
      model.safetensors   # dequantized FP16
      tokenizer.json
      ...
```
|
|
Load like any HF model:
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained("./out/uniform-d2/model")
t = AutoTokenizer.from_pretrained("./out/uniform-d2/model")
```
|
|
## Settled design (don't change unless reproducing an ablation)
|
|
| Parameter | Value | Notes |
|---|---|---|
| Group size `G` | 16 | Per Section 6.1 of the paper, gs=64 is also viable; gs=16 gives best PPL |
| Scale depth `d_s` | 3 | 27-entry log-spaced codebook per matrix |
| Power mapping | d1=1.0, d2=1.5, d3=1.2, d4=1.0 | Tuned once on Qwen2.5-7B, held fixed for all subsequent models |
| Scale candidates | indices `[G-6, G-4, G-2, G-1]` of sorted `\|w\|` | MSE-minimum over the 4 candidates is selected per group |
| Scale codebook range | `log_min` = 0.1th percentile of group `\|w\|`-maxes, `log_max` = max | Fixed in commit `0c16d24` (was 99.9th percentile, which clipped) |
| `lm_head`, embeddings, norms | kept FP16 | See "What gets quantized" above |
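To make the scale-candidate row concrete, here is a minimal d=1 (plain ternary) sketch of the per-group 4-candidate scale search. It is a sketch under stated assumptions, not the codec's implementation: it uses round-to-nearest ternary levels, and it omits the depth-power mapping and the `d_s=3` scale codebook; the function name is ours.

```python
import torch

def quantize_group_d1(w: torch.Tensor, G: int = 16) -> torch.Tensor:
    """Illustrative d=1 per-group quantize/dequantize with the 4-candidate
    scale search described in the table above. Sketch only."""
    assert w.numel() % G == 0
    shape = w.shape
    w = w.reshape(-1, G)
    absw_sorted, _ = w.abs().sort(dim=1)                  # ascending per group
    cands = absw_sorted[:, [G - 6, G - 4, G - 2, G - 1]]  # (n_groups, 4) scale candidates
    best_err = torch.full((w.shape[0],), float("inf"), dtype=w.dtype, device=w.device)
    best_q = torch.zeros_like(w)
    for k in range(cands.shape[1]):
        s = cands[:, k : k + 1].clamp_min(1e-12)
        q = (w / s).round().clamp_(-1, 1) * s             # round-to-nearest ternary
        err = ((q - w) ** 2).sum(dim=1)                   # per-group MSE
        better = err < best_err
        best_err = torch.where(better, err, best_err)
        best_q = torch.where(better.unsqueeze(1), q, best_q)
    return best_q.reshape(shape)
```

Intuitively, drawing candidates from high order statistics of `|w|` rather than only the group max lets the MSE criterion trade a little clipping of the largest weights for finer resolution on the rest of the group.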
|
|
## BPW calculation
|
|
```
bpw = d * log2(3) + d_s * log2(3) / G   # weights + per-group scales only
    = d * 1.585 + 0.297                 # for G=16, d_s=3
```
|
|
Resulting BPW: d1=1.88, d2=3.47, d3=5.05, d4=6.64.
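A quick way to reproduce those numbers:

```python
from math import log2

G, d_s = 16, 3
for d in (1, 2, 3, 4):
    print(f"d{d}: {d * log2(3) + d_s * log2(3) / G:.2f} bpw")
# d1: 1.88, d2: 3.47, d3: 5.05, d4: 6.64
```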
|
|
## Reproducibility tips
|
|
- Pass `--revision <git-sha>` to pin the source model; without it, the upstream HF repo can move under you between runs.
- Each checkpoint stores a fingerprint of `(model, revision, codec version, group size, depth-power mapping)` and the matrix shape. On resume, mismatched checkpoints are discarded and re-quantized rather than silently mixed (see the sketch below).
- The assembled `config.json` records the full fingerprint, so you can verify which source model and codec version produced any given output.
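A hypothetical illustration of the resume-time check; the field names and the helper are ours, not the codec's actual schema:

```python
# Hypothetical fingerprint comparison, illustrating the resume behavior
# described above. Field names are illustrative; the real schema may differ.
def checkpoint_is_reusable(saved: dict, current: dict) -> bool:
    keys = ("model", "revision", "codec_version", "group_size",
            "depth_power_mapping", "matrix_shape")
    return all(saved.get(k) == current.get(k) for k in keys)
```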
|
|
## Known limitations
|
|
Two design tradeoffs (not bugs) are documented in [KNOWN_ISSUES.md](KNOWN_ISSUES.md): the 4-candidate scale search and the `log_max = max(...)` codebook upper bound. Both are intentional choices; the file explains the reasoning and what to look for in new model families.
|
|
## Citation
|
|
```
@article{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = {2026},
  note   = {Entrit Systems}
}
```
|
|
## Models quantized with this codec
|
|
See the [Entrit organization page](https://huggingface.co/Entrit) for prequantized model checkpoints across Qwen2.5 (0.5B–72B), Llama-3.1-8B, and Mistral-7B at depths d=1 through d=4.
|
|