arxiv:2603.17891

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

Published on Mar 18 · Submitted by Arpit Singh Gautam on Mar 19

Abstract

A reinforcement learning-based mixed-precision quantization method achieves superior compression efficiency and model performance for large language models through adaptive bit-width assignment and a novel scale-folding technique.

AI-generated summary

Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama-2-7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama-2-7B generalizes zero-shot to Llama-2-13B and Mistral-7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense-reasoning performance.
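As a minimal sketch of what a quality-prioritized reward with asymmetric penalties and a budget cliff could look like: the constants, thresholds, and functional form below are illustrative assumptions, not the paper's actual reward.

```python
def ramp_reward(ppl, ppl_fp16, bits_used, bits_budget,
                over_penalty=10.0, cliff_penalty=100.0, slack=0.02):
    """Hypothetical quality-prioritized reward: perplexity degradation
    dominates, overshooting the bit budget is penalized asymmetrically,
    and a hard 'cliff' fires once the overshoot exceeds a small slack."""
    quality = -(ppl - ppl_fp16) / ppl_fp16            # relative PPL degradation
    overshoot = max(0.0, bits_used - bits_budget) / bits_budget
    if overshoot > slack:                             # budget cliff
        return quality - cliff_penalty
    return quality - over_penalty * overshoot         # no bonus for undershooting
```

Under this shape, staying below budget earns no extra reward, so the policy is pushed to spend the whole budget on quality, while large overshoots are effectively unrecoverable.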

Community

We introduce RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework for learning per-layer bit-width allocations in LLM quantization.

RAMP treats quantization as a sequential decision problem: a policy assigns bit-widths under a global memory budget using an 11-dimensional state capturing activation statistics, weight properties, and structural features. This enables a single learned policy to generalize across models.
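For concreteness, a per-layer state of this kind might be built as below; the specific feature choices are our illustrative guess at plausible activation, weight, and structural descriptors, not the paper's exact 11 dimensions.

```python
import numpy as np

def layer_state(W, acts, layer_idx, n_layers):
    """Hypothetical 11-dimensional per-layer state vector combining
    activation statistics, weight properties, and structural descriptors.
    Feature list is illustrative; the paper's embedding may differ."""
    w, a = W.ravel(), acts.ravel()
    return np.array([
        a.mean(), a.std(), np.abs(a).max(),      # activation statistics
        np.percentile(np.abs(a), 99),            # activation outlier scale
        w.mean(), w.std(), np.abs(w).max(),      # weight statistics
        np.abs(w).max() / (w.std() + 1e-8),      # weight outlier ratio
        layer_idx / n_layers,                    # relative depth in the model
        np.log(w.size),                          # layer size
        float(W.shape[0] == W.shape[1]),         # square-projection flag
    ], dtype=np.float32)
```

A policy would consume one such vector per layer, emitting a discrete bit-width action (e.g. 2/3/4/8 bits) while a running total tracks the global memory budget.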

Key results:
• Llama-2-7B: 5.54 perplexity at 3.68 GB (3.65 effective bits)
• Outperforms uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ in both size (6%) and quality (1–3%)
• Zero-shot transfer from Llama-2-7B to Llama-2-13B and Mistral-7B

We also introduce Scale Folding, a preconditioning method that improves stability in sub-4-bit regimes by redistributing activation outliers into weights.
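The core idea can be sketched as follows, assuming a linear layer fed by an RMSNorm-style gain `g`; the scale formula and `alpha` are illustrative assumptions, and the paper's exact procedure may differ.

```python
import numpy as np

def scale_fold(W, g, act_absmax, alpha=0.5):
    """Fold per-channel activation scales into the weights and compensate
    in the preceding normalization gain, leaving the output unchanged:
        s_j   = act_absmax_j ** alpha    # per input channel (illustrative)
        W'_ij = W_ij * s_j               # weights absorb the outlier scale
        g'_j  = g_j / s_j                # norm gain compensates exactly
    The rescaled activations now have smaller outliers, which makes
    sub-4-bit quantization of both tensors more stable."""
    s = np.power(np.maximum(act_absmax, 1e-8), alpha)
    return W * s[None, :], g / s

# Exact-equivalence check on random data with outlier channels:
rng = np.random.default_rng(0)
W, g = rng.normal(size=(8, 16)), rng.normal(size=16)
x = rng.normal(size=(4, 16)) * np.linspace(1.0, 50.0, 16)  # normalized activations
W2, g2 = scale_fold(W, g, np.abs(x * g).max(axis=0))
assert np.allclose((x * g) @ W.T, (x * g2) @ W2.T)
```

This is in the spirit of activation-smoothing approaches such as SmoothQuant, except that the per-channel scale is absorbed by the normalization gain, so no extra runtime rescaling is needed.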

Finally, the HALO pipeline exports learned allocations directly to GGUF, enabling kernel-free deployment across CPUs, GPUs, and edge devices while retaining ~99.5% of FP16 commonsense performance.

Overall, results suggest that quantization sensitivity can be learned and transferred, rather than tuned per model.

Would be interested in feedback, especially on RL-based approaches to model compression and cross-model generalization.
