arxiv:2604.08118

Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

Published on Apr 9 · Submitted by Ian Kennedy on Apr 13

Abstract

Additive quantization for LLM compression fails at 2-bit precision largely because of poor codebook initialisation; OA-EM addresses this with an output-aware EM initialisation based on a Hessian-weighted Mahalanobis distance.

AI-generated summary

Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
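To make the representational ratio concrete, here is a minimal back-of-envelope sketch. The interpretation of the symbols (N weight groups, M codebooks of K entries each) follows standard additive-quantization notation and is an assumption; the paper's exact definitions may differ.

```python
# Hypothetical sketch of the representational ratio rho = N / (K * M):
# N weight groups competing for K * M codebook entries. Symbol meanings
# are assumed from standard additive-quantization notation, not taken
# verbatim from the paper.

def representational_ratio(n_weights: int, group_size: int,
                           K: int, M: int) -> float:
    """rho = N / (K * M), with N the number of weight groups."""
    N = n_weights // group_size
    return N / (K * M)

# Example: a 4096 x 4096 linear layer, groups of 8 weights, and two
# 256-entry codebooks (2 bytes per 8-weight group -> 2 bits per weight).
rho = representational_ratio(4096 * 4096, group_size=8, K=256, M=2)
print(rho)  # 4096.0 -> thousands of groups per codebook entry
```

Under these (assumed) settings, adding a third codebook (3 bpp) cuts ρ by a third, which is consistent with the abstract's observation that the bottleneck is moderate at 3 bpp but extreme at 2 bpp.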

Community

Paper author · Paper submitter

Fixing catastrophic degradation in 2-bit LLMs (Llama 3.1, 3.2 & Qwen 2.5 2-bit weights inside)

Catastrophic degradation in 2-bit LLMs isn't a compute problem; it’s an initialisation problem. At 2 bits per parameter, additive quantization (like AQLM) hits an "undercomplete regime." Weight groups compete for starved codebook capacity, and standard greedy initialisation traps the model in a bad optimisation basin.

We introduce Output-Aware Expectation-Maximisation (OA-EM). On Llama 3.2 3B at 2 bpp, OA-EM reaches a post-PV-tuning perplexity of 11.53 in just 6.1 h, beating the greedy wide-beam baseline (12.01 perplexity in 16.9 h).

The best part: because we keep free-form codebooks, you get O(1) LUT dequantization with EXACTLY ZERO MAC operations. Pure memory reads, which makes the scheme well suited to edge deployment with a 4096-token context window.
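To illustrate why dequantization needs no MACs, here is a minimal NumPy sketch of additive LUT dequantization. The shapes (M codebooks of K free-form g-dimensional codewords) are assumed from AQLM-style schemes, not taken from the released code; with these numbers each 8-weight group costs 2 bytes, i.e. 2 bits per weight.

```python
import numpy as np

# Hypothetical shapes: M additive codebooks, each with K learned
# g-dimensional free-form codewords (AQLM-style; assumed, not the
# paper's exact configuration). M*log2(K)/g = 2*8/8 = 2 bits per weight.
M, K, g = 2, 256, 8
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((M, K, g)).astype(np.float32)

def dequantize_group(codes):
    """Reconstruct one weight group from its M stored codes.

    Pure table lookups plus adds: no multiply-accumulate occurs,
    which is why dequantization is memory-bound, not compute-bound.
    """
    out = np.zeros(g, dtype=np.float32)
    for m, b in enumerate(codes):
        out += codebooks[m, b]   # LUT read + add, no MAC
    return out

codes = np.array([17, 203])      # one hypothetical 2-byte encoding
w_hat = dequantize_group(codes)
```

The inner loop is just indexed reads and additions, so on-device throughput is set by memory bandwidth rather than arithmetic units.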

We've open-sourced the code and the 2-bit weights on our Hugging Face profile. Happy to answer any questions about the Hessian-weighting or implementation!
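Since the comment invites questions about the Hessian-weighting, here is a minimal sketch of what a Hessian-weighted Mahalanobis distance between a weight group and a codeword could look like. The proxy Hessian used below (a Gauss-Newton-style E[x xᵀ] from calibration activations) is an assumption; the paper's exact construction may differ.

```python
import numpy as np

def hessian_mahalanobis(w, c, H):
    """Squared Hessian-weighted Mahalanobis distance (w - c)^T H (w - c).

    H is a PSD proxy for the output Hessian; it weights reconstruction
    error by its effect on the layer *output*, rather than treating all
    weight directions equally as plain Euclidean distance would.
    """
    d = w - c
    return float(d @ H @ d)

rng = np.random.default_rng(0)
g = 8
# Hypothetical calibration activations; H = E[x x^T] is a common
# Gauss-Newton-style proxy Hessian for a linear layer.
X = rng.standard_normal((64, g))
H = X.T @ X / len(X)

w = rng.standard_normal(g)
c = rng.standard_normal(g)
d_H = hessian_mahalanobis(w, c, H)       # output-aware distance
d_euclid = hessian_mahalanobis(w, c, np.eye(g))  # reduces to ||w - c||^2
```

In an EM initialisation, the E-step would assign each weight group to codewords under this distance and the M-step would refit the codewords, so that codebook capacity is spent where output error matters most.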


Get this paper in your agent:

hf papers read 2604.08118
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 3

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 0
