tritllm-codec

Reference implementation of the balanced ternary post-training quantization codec from "Balanced Ternary Post-Training Quantization for Large Language Models" (Stentzel, 2026).

Quantizes FP16 LLM weights to balanced ternary at configurable depth d ∈ {1, 2, 3, 4} (3, 9, 27, 81 levels per weight) with no calibration data and no per-model tuning. Output is dequantized FP16 safetensors that load into stock transformers and lm-eval without a custom loader.
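As a concrete illustration of what "depth d" means, the sketch below (not the repo's code; names are hypothetical) enumerates the 3^d balanced-ternary integer levels, which are symmetric about zero, and rounds a group of weights onto them at a given scale.

```python
import numpy as np

def ternary_levels(d):
    """Integer code levels for balanced-ternary depth d: 3**d values
    centered on zero, e.g. d=1 -> [-1, 0, 1], d=2 -> [-4, ..., 4]."""
    half = (3 ** d - 1) // 2
    return np.arange(-half, half + 1)

def quantize_group(w, scale, d):
    """Round each weight to the nearest scaled level, clipping to the
    representable range; returns (integer codes, dequantized values)."""
    levels = ternary_levels(d)
    codes = np.clip(np.round(w / scale), levels[0], levels[-1]).astype(int)
    return codes, codes * scale
```

At d=1 this reduces to plain ternary {-1, 0, 1}; each extra depth multiplies the level count by 3.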

What gets quantized

The codec quantizes all 2D linear weight matrices in a model. The following are kept in FP16 and not counted in the BPW total:

  • lm_head (output projection)
  • Token embeddings (embed_tokens)
  • All *_norm layers (RMSNorm, LayerNorm; these are 1D anyway)

This is the standard convention in quantization papers (see GPTQ, AWQ, NF4) and reflects the fact that embedding lookups and the final classifier are not GEMV-bound at inference time. Throughout the paper, "BPW" refers to the average bits-per-weight of the quantized matrices only.
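The selection rule above amounts to a simple name-and-shape filter. A minimal sketch, assuming Llama/Qwen-style parameter names (the repo's actual predicate may differ):

```python
def should_quantize(name, ndim):
    """Quantize only 2D linear weight matrices; keep lm_head, token
    embeddings, and (1D) norm parameters in FP16."""
    if ndim != 2:
        return False  # norms and biases are 1D
    skip = ("lm_head", "embed_tokens")
    return not any(s in name for s in skip)
```

Everything the filter rejects stays in FP16 and is excluded from the BPW average.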

Install

pip install torch transformers safetensors numpy huggingface_hub
git clone https://huggingface.co/Entrit/tritllm-codec
cd tritllm-codec

Quick start

# Quantize Qwen2.5-7B at uniform depth d=2 (3.47 bpw)
python quantize_model_v2.py \
    --model Qwen/Qwen2.5-7B \
    --configs uniform-d2 \
    --out ./out

# Multi-config single pass (computes scales once, derives 6 configs)
python quantize_model_v2.py \
    --model Qwen/Qwen2.5-7B \
    --configs uniform-d1,uniform-d2,uniform-d3,uniform-d4,d3scale-sens002,d3scale-sens003 \
    --out ./out

The output directory contains one HF-loadable model per config:

out/
  uniform-d2/
    model/
      config.json
      model.safetensors          # dequantized FP16
      tokenizer.json
      ...

Load like any HF model:

from transformers import AutoModelForCausalLM, AutoTokenizer
m = AutoModelForCausalLM.from_pretrained("./out/uniform-d2/model")
t = AutoTokenizer.from_pretrained("./out/uniform-d2/model")

Settled design (don't change unless reproducing an ablation)

  • Group size G = 16. Per Section 6.1 of the paper, gs=64 is also viable; gs=16 gives the best PPL.
  • Scale depth d_s = 3. A 27-entry log-spaced codebook per matrix.
  • Power mapping: d1=1.0, d2=1.5, d3=1.2, d4=1.0. Tuned once on Qwen2.5-7B, held fixed for all subsequent models.
  • Scale candidates: indices [G-6, G-4, G-2, G-1] of sorted |w|. The MSE-minimum over the 4 candidates is selected per group.
  • Scale codebook range: log_min = 0.1th percentile of the group |w|-maxes, log_max = max. Fixed in commit 0c16d24 (previously the 99.9th percentile, which clipped).
  • lm_head, embeddings, norms: kept in FP16. See "What gets quantized" above.
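The 4-candidate scale search can be sketched as follows. This is an illustrative implementation, not the repo's: each candidate scale maps one of four fixed order statistics of |w| onto the top ternary level, and the candidate with the lowest dequantization MSE wins.

```python
import numpy as np

def pick_group_scale(w, d, G=16):
    """Choose a per-group scale from 4 candidates taken at fixed order
    statistics of sorted |w|, keeping the MSE-minimum (sketch only)."""
    half = (3 ** d - 1) // 2            # top integer level at depth d
    aw = np.sort(np.abs(w))
    best_scale, best_err = None, np.inf
    for idx in (G - 6, G - 4, G - 2, G - 1):
        scale = aw[idx] / half          # map that |w| onto the top level
        if scale == 0:
            continue
        q = np.clip(np.round(w / scale), -half, half) * scale
        err = np.mean((w - q) ** 2)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```

In the real codec the chosen scale is then snapped to the nearest entry of the 27-entry log-spaced codebook rather than stored exactly.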

BPW calculation

bpw = d * log2(3) + d_s * log2(3) / G                          # weights + per-group scale index
    = d * 1.585 + 0.297                                        # for G=16, d_s=3

Resulting BPW: d1=1.88, d2=3.47, d3=5.05, d4=6.64.
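The formula is easy to check numerically. A one-line helper (illustrative, not part of the repo):

```python
import math

def bpw(d, d_s=3, G=16):
    """Average bits per quantized weight: d ternary digits per weight,
    plus a d_s-digit scale index amortized over a group of G weights."""
    return d * math.log2(3) + d_s * math.log2(3) / G

# With the defaults, bpw(1..4) reproduces 1.88, 3.47, 5.05, 6.64.
```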

Reproducibility tips

  • Pass --revision <git-sha> to pin the source model; without it, the upstream HF repo can move under you between runs.
  • Each checkpoint stores a fingerprint of (model, revision, codec version, group size, depth-power mapping) and the matrix shape. On resume, mismatched checkpoints are discarded and re-quantized rather than silently mixed.
  • The assembled config.json records the full fingerprint so you can verify which source model and codec version produced any given output.
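A checkpoint fingerprint of this kind can be as simple as a stable hash over the run settings. The sketch below is hypothetical (the field names and hashing scheme are assumptions, not the repo's schema):

```python
import hashlib
import json

def fingerprint(model, revision, codec_version, group_size, depth_powers, shape):
    """Stable hash over everything that must match for a cached checkpoint
    to be reused on resume; any mismatch triggers re-quantization."""
    payload = json.dumps({
        "model": model, "revision": revision, "codec": codec_version,
        "G": group_size, "powers": depth_powers, "shape": list(shape),
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Serializing with sort_keys=True keeps the hash independent of dict ordering, so the same settings always produce the same fingerprint.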

Known limitations

Two design tradeoffs (not bugs) are documented in KNOWN_ISSUES.md: the 4-candidate scale search and the log_max = max(...) codebook upper bound. Both are intentional choices; the file explains the reasoning and what to look for in new model families.

Citation

@article{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = 2026,
  note   = {Entrit Systems}
}

Models quantized with this codec

See the Entrit organization page for prequantized model checkpoints across Qwen2.5 (0.5Bโ€“72B), Llama-3.1-8B, and Mistral-7B at depths d=1 through d=4.
