Balanced Ternary Post-Training Quantization for Large Language Models
Eric Stentzel · Entrit Systems · 2026
Balanced-Ternary-PTQ.pdf – full paper (17 pages) · source.tar – LaTeX source for reproducibility
Abstract
We present a post-training quantization codec that encodes LLM weights as balanced ternary values at configurable depth. Global codec parameters were selected once on Qwen2.5-7B and held fixed across all subsequent models; PTQ runs require no per-model tuning and no calibration data. Each weight is represented as a scaled sum of d balanced ternary digits t_i ∈ {−1, 0, +1} weighted by powers of three, yielding 3^d quantization levels with an MSE-optimal per-group scale.
We evaluate the codec across three model families (Qwen2.5 0.5B–72B, Llama-3.1-8B, Mistral-7B) at four depths on eight standard benchmarks. At depth d = 2 (9 levels, 3.47 bits per weight), the codec retains 99.4% of FP16 MMLU at 32B and 72B scale, and is competitive with NF4 while using 16% fewer bits under the same bit-accounting. At smaller scales, lightweight block-wise quantization-aware training recovers most of the gap: on Qwen2.5-7B, QAT brings d = 2 within 1.2 percentage points of FP16 on MMLU and recovers GSM8K from 53.2 to 78.5 (vs. FP16's 81.0). Quality saturates at d = 3 (27 levels, 5.05 bpw); adding a fourth trit provides no measurable improvement.
The balanced ternary representation replaces multiplication with conditional addition, enabling multiply-free inference kernels. Per-layer GEMV benchmarks projected to full-model token generation show a 7.8× speedup over FP16 cuBLAS on an RTX 4090; these are kernel-only projections, not end-to-end serving throughput, and exclude attention, sampling, and tokenizer overhead. The same property maps naturally to processing-in-memory architectures, where eliminating multiplier circuits is a primary design constraint.
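To make the encoding concrete, here is a minimal NumPy sketch of the steps the abstract describes: a per-group scale chosen by a coarse MSE line search, followed by expansion of each quantized integer into d balanced ternary digits. The function name, the search granularity, and the canonical 3^i depth-power mapping are assumptions for illustration; Entrit/tritllm-codec is the reference implementation.

```python
import numpy as np

def encode_group(w: np.ndarray, d: int = 2, n_scales: int = 64):
    """Sketch: encode one weight group as d balanced-ternary digits per weight.

    Assumes the canonical depth-power mapping q = sum_i 3**i * t_i, so the
    representable integers are -(3**d - 1)//2 .. +(3**d - 1)//2 (3**d levels).
    The per-group scale is picked by a coarse line search minimizing MSE.
    """
    qmax = (3 ** d - 1) // 2                    # d = 2 -> integers -4..+4, 9 levels
    amax = float(np.abs(w).max()) + 1e-12
    best_s, best_q, best_mse = amax / qmax, None, np.inf
    for s in np.linspace(amax / qmax / 4, amax / qmax, n_scales):
        q = np.clip(np.round(w / s), -qmax, qmax)
        mse = float(np.mean((w - s * q) ** 2))
        if mse < best_mse:
            best_s, best_q, best_mse = s, q.astype(np.int32), mse
    # Expand each integer into balanced-ternary digits t_i in {-1, 0, +1}.
    trits = np.empty(w.shape + (d,), dtype=np.int8)
    r = best_q.copy()
    for i in range(d):
        t = ((r + 1) % 3) - 1                   # balanced remainder mod 3
        trits[..., i] = t.astype(np.int8)
        r = (r - t) // 3
    return best_s, trits

# Reconstruction: w_hat = best_s * (trits * 3 ** np.arange(d)).sum(axis=-1)
```

The conditional-addition claim can likewise be sketched in a few lines: each trit plane selects input entries by sign and sums them, so no per-weight multiply occurs; the plane weight 3^i and the final rescale are scalar constants that a real kernel folds into precomputed factors. This is a readability sketch, not the released CUDA kernel (Entrit/tritllm-kernel).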
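```python
def gemv_multiply_free(trits: np.ndarray, scale: np.ndarray, x: np.ndarray):
    """Sketch: y = W @ x with W balanced-ternary coded, via conditional addition.

    trits: (rows, cols, d) int8 in {-1, 0, +1}; scale: (rows,) -- per-group
    scales collapsed to per-row for brevity. Each trit adds x[j], subtracts
    x[j], or contributes nothing.
    """
    rows, cols, d = trits.shape
    y = np.zeros(rows, dtype=x.dtype)
    for i in range(d):
        plane = trits[..., i]                   # (rows, cols) trit plane
        # Conditional addition: gather-and-sum by sign, no per-weight multiply.
        contrib = np.where(plane == +1, x, 0.0).sum(axis=1) \
                - np.where(plane == -1, x, 0.0).sum(axis=1)
        y += (3 ** i) * contrib                 # scalar constant; folds into scale
    return scale * y
```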
Reproducibility – all artifacts public on HuggingFace
| What | Where | License |
|---|---|---|
| Codec source | Entrit/tritllm-codec | Apache 2.0 |
| Multiply-free CUDA kernel | Entrit/tritllm-kernel | Apache 2.0 |
| Pre-quantized Qwen2.5 (0.5B–72B) at d = 1, 2, 3, 4 | Entrit org | Apache 2.0 |
| Pre-quantized Llama-3.1-8B at d = 1, 2, 3, 4 | Entrit org | Llama 3.1 Community |
| Pre-quantized Mistral-7B-v0.3 at d = 1, 2, 3, 4 | Entrit org | Apache 2.0 |
| Block-wise QAT checkpoint (Qwen2.5-7B d = 2) | Entrit/Qwen2.5-7B-qat-d2 | Apache 2.0 |
Each checkpoint records a codec fingerprint (model id + revision + codec version + depth-power mapping) so reruns are bit-stable.
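A hedged sketch of how such a fingerprint could be computed; the field names, the canonical-JSON serialization, and the SHA-256 choice are assumptions, not the released tooling's actual format.

```python
import hashlib
import json

def codec_fingerprint(model_id: str, revision: str,
                      codec_version: str, depth_powers: dict) -> str:
    """Hash the fields the paper lists into one stable digest (illustrative).

    Canonical JSON (sorted keys, fixed separators) makes the digest
    deterministic, so two reruns can be checked for bit-stability by
    comparing fingerprints.
    """
    payload = json.dumps(
        {"model_id": model_id, "revision": revision,
         "codec_version": codec_version, "depth_powers": depth_powers},
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical usage:
# codec_fingerprint("Qwen/Qwen2.5-7B", "main", "0.3.1", {"d2": [1, 3]})
```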
Citation
@misc{stentzel2026ternaryptq,
  title        = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author       = {Stentzel, Eric},
  year         = {2026},
  howpublished = {Entrit Systems},
  url          = {https://huggingface.co/Entrit/tritllm-paper}
}
License and patent notice
This paper is distributed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The codec and kernel source are released under Apache License 2.0. Quantized model checkpoints inherit their source-model licenses (Apache 2.0 for Qwen2.5 and Mistral derivatives; Llama 3.1 Community License for Llama-3.1-8B derivatives, distributed with required "Built with Llama" attribution).
The methods described in this paper are the subject of pending patent applications by Entrit Systems. The released software and weights may be used under their respective licenses for research, evaluation, and academic purposes. Commercial deployment of the underlying methods may require a separate license; contact eric@entrit.io.
Contact
Eric Stentzel · eric@entrit.io · Entrit Systems