Balanced Ternary Post-Training Quantization for Large Language Models

Eric Stentzel · Entrit Systems · 2026

📄 Balanced-Ternary-PTQ.pdf — full paper (17 pages)
🗜️ source.tar — LaTeX source for reproducibility

Abstract

We present a post-training quantization codec that encodes LLM weights as balanced ternary values at configurable depth. Global codec parameters were selected once on Qwen2.5-7B and held fixed across all subsequent models; PTQ runs require no per-model tuning and no calibration data. Each weight is represented as a sum of d balanced ternary digits weighted by powers of three (t_i ∈ {−1, 0, +1}), yielding 3^d quantization levels with MSE-optimal per-group scaling.
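The encoding above can be sketched in a few lines. This is a minimal illustration, not the released codec: the digit extraction is standard balanced-ternary arithmetic, while the grid search over scales is a stand-in for the paper's MSE-optimal per-group scaling, whose exact procedure the card does not specify.

```python
import numpy as np

def to_trits(k, d):
    """Balanced-ternary digits of integer k, least significant first.
    Requires |k| <= (3**d - 1) // 2."""
    digits = []
    for _ in range(d):
        r = ((k + 1) % 3) - 1        # r in {-1, 0, +1}
        digits.append(r)
        k = (k - r) // 3
    return digits

def quantize_group(w, d=2, n_scales=64):
    """Quantize a 1-D weight group to d trits per weight.
    The 3**d integer levels span [-(3**d - 1)//2, +(3**d - 1)//2];
    the per-group scale is chosen by grid search to minimize MSE."""
    half = (3 ** d - 1) // 2         # d=2 -> levels -4..+4 (9 levels)
    wmax = np.abs(w).max()
    best_err, best = np.inf, None
    for s in np.linspace(wmax / half * 0.5, wmax / half, n_scales):
        q = np.clip(np.round(w / s), -half, half)
        err = np.sum((w - q * s) ** 2)
        if err < best_err:
            best_err, best = err, (s, q.astype(int))
    s, q = best
    trits = np.array([to_trits(int(k), d) for k in q])  # shape (len(w), d)
    return s, trits
```

Dequantization is the dot product of each weight's trit vector with (3^0, …, 3^(d−1)), times the group scale.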

We evaluate the codec across three model families (Qwen2.5 0.5B–72B, Llama-3.1-8B, Mistral-7B) at four depths on eight standard benchmarks. At depth d = 2 (9 levels, 3.47 bits per weight), the codec retains 99.4% of FP16 MMLU at 32B and 72B scale, and is competitive with NF4 while using 16% fewer bits under equivalent accounting. At smaller scales, lightweight block-wise quantization-aware training recovers most of the gap: on Qwen2.5-7B, QAT brings d = 2 within 1.2 percentage points of FP16 on MMLU and recovers GSM8K from 53.2 to 78.5 (vs. FP16's 81.0). Quality saturates at d = 3 (27 levels, 5.05 bpw); adding a fourth trit provides no measurable improvement.

The balanced ternary representation replaces multiplication with conditional addition, enabling multiply-free inference kernels. Per-layer GEMV benchmarks projected to full-model token generation show a 7.8× speedup over FP16 cuBLAS on an RTX 4090; these are kernel-only projections, not end-to-end serving throughput, and exclude attention, sampling, and tokenizer overhead. This maps naturally onto processing-in-memory architectures, where eliminating multiplier circuits is the primary design constraint.
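The multiply-free structure can be made concrete with a reference GEMV, sketched here in NumPy rather than CUDA: each trit plane contributes only conditional additions of the input (add where the trit is +1, subtract where it is −1), leaving the power-of-three plane weight and the per-group scale as the only true multiplies.

```python
import numpy as np

def ternary_gemv(trit_planes, x, scale):
    """Reference multiply-free GEMV.
    trit_planes: array of shape (d, n_out, n_in) with entries in {-1, 0, +1}.
    x: input vector of shape (n_in,). scale: per-group dequantization scale."""
    d, n_out, n_in = trit_planes.shape
    y = np.zeros(n_out)
    for p in range(d):
        plane = trit_planes[p]
        # Conditional addition: gather x where trit is +1, subtract where -1.
        acc = np.where(plane == 1, x, 0.0).sum(axis=1) \
            - np.where(plane == -1, x, 0.0).sum(axis=1)
        y += (3 ** p) * acc              # weight the plane by its power of three
    return scale * y
```

The result matches a dense matmul against the dequantized weight matrix; a real kernel would fuse the plane loop and pack trits, but the arithmetic structure is the same.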

Reproducibility — all artifacts public on Hugging Face

| What | Where | License |
|---|---|---|
| Codec source | Entrit/tritllm-codec | Apache 2.0 |
| Multiply-free CUDA kernel | Entrit/tritllm-kernel | Apache 2.0 |
| Pre-quantized Qwen2.5 (0.5B–72B) at d = 1, 2, 3, 4 | Entrit org | Apache 2.0 |
| Pre-quantized Llama-3.1-8B at d = 1, 2, 3, 4 | Entrit org | Llama 3.1 Community |
| Pre-quantized Mistral-7B-v0.3 at d = 1, 2, 3, 4 | Entrit org | Apache 2.0 |
| Block-wise QAT checkpoint (Qwen2.5-7B d = 2) | Entrit/Qwen2.5-7B-qat-d2 | Apache 2.0 |

Each checkpoint records a codec fingerprint (model id + revision + codec version + depth-power mapping) so reruns are bit-stable.
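One plausible shape for such a fingerprint is a stable hash over the listed fields. The field names, canonicalization, and hash choice below are assumptions for illustration; the card only states which components the fingerprint covers.

```python
import hashlib
import json

def codec_fingerprint(model_id, revision, codec_version, depth, powers):
    """Hypothetical fingerprint: a deterministic hash over the components the
    card lists (model id + revision + codec version + depth-power mapping).
    Sorted keys and compact separators keep the serialization canonical."""
    payload = json.dumps(
        {"model": model_id, "rev": revision, "codec": codec_version,
         "depth": depth, "powers": list(powers)},
        sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

Because the serialization is canonical, reruns with identical inputs reproduce the same fingerprint, which is the property needed for bit-stable checks.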

Citation

@misc{stentzel2026ternaryptq,
  title  = {Balanced Ternary Post-Training Quantization for Large Language Models},
  author = {Stentzel, Eric},
  year   = {2026},
  howpublished = {Entrit Systems},
  url    = {https://huggingface.co/Entrit/tritllm-paper}
}

License and patent notice

This paper is distributed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). The codec and kernel source are released under Apache License 2.0. Quantized model checkpoints inherit their source-model licenses (Apache 2.0 for Qwen2.5 and Mistral derivatives; Llama 3.1 Community License for Llama-3.1-8B derivatives, distributed with required "Built with Llama" attribution).

The methods described in this paper are the subject of pending patent applications by Entrit Systems. The released software and weights may be used under their respective licenses for research, evaluation, and academic purposes. Commercial deployment of the underlying methods may require a separate license; contact eric@entrit.io.

Contact

Eric Stentzel · eric@entrit.io · Entrit Systems
