# RotorQuant Weights + Runtime Quantization for Qwen2.5-0.5B-Instruct

This repository implements and benchmarks multiple quantization/deployment paths for `Qwen/Qwen2.5-0.5B-Instruct`, including:

1. RotorQuant-style 3-bit weight quantization (custom codec)
2. Quantized model loading + text generation with proper Qwen chat template
3. Validation against FP32 baseline
4. Runtime acceleration paths:
   1. RotorQuant fused runtime (packed-weight linear)
   2. Dynamic INT8 runtime baseline (for speed comparison)

The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way.

## Model Reference

- Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- This repo uses the official chat template via `tokenizer.apply_chat_template(...)`.

## Repository Layout

- `rotorquant.py`: Original RotorQuant core classes (`RotorQuantMSE`, `RotorQuantProd`, `RotorQuantKVCache`) using Clifford algebra + Lloyd-Max.
- `clifford.py`: Cl(3,0) multivector algebra and rotor operations.
- `lloyd_max.py`: Lloyd-Max codebook solver (SciPy integration) and utilities.
- `rotorquant_weights.py`: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports).
- `quantize_qwen.py`: Quantize a Hugging Face model checkpoint into custom package format.
- `run_inference.py`: Load quantized package, reconstruct model, run generation.
- `validate_quantization.py`: Baseline vs quantized quality checks (logit cosine + token match).
- `benchmark_scenarios.py`: Benchmark baseline and all quantized artifacts in `artifacts/`.
- `runtime_int8.py`: Build/load dynamic INT8 runtime models (full or selective modules).
- `runtime_rotor_fused.py`: Fused RotorQuant runtime path using packed-weight linear module.
- `benchmark_runtime_vs_rotor.py`: Unified benchmark: FP32 vs RotorQuant package vs Rotor fused vs INT8 runtime.
- `artifacts/*.json`: Saved reports/metrics from experiments.
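
To illustrate the 3-bit packing that `rotorquant_weights.py` is described as handling, here is a minimal sketch of how eight-level codebook indices could be packed into bytes and recovered. The function names `pack_3bit` and `unpack_3bit` and the LSB-first bit order are assumptions made for illustration; the repository's actual layout may differ.

```python
def pack_3bit(indices):
    """Pack integers in [0, 7] into bytes, 3 bits each, LSB-first (assumed layout)."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for idx in indices:
        if not 0 <= idx <= 7:
            raise ValueError(f"index {idx} does not fit in 3 bits")
        acc |= idx << nbits
        nbits += 3
        # Flush every complete byte from the low end of the accumulator.
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        # Final partial byte; unused high bits stay zero.
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_3bit(data, count):
    """Recover `count` 3-bit indices from bytes produced by pack_3bit."""
    acc = 0
    nbits = 0
    pos = 0
    out = []
    for _ in range(count):
        # Refill the accumulator until at least 3 bits are available.
        while nbits < 3:
            if pos >= len(data):
                raise ValueError("packed data is too short for the requested count")
            acc |= data[pos] << nbits
            pos += 1
            nbits += 8
        out.append(acc & 0b111)
        acc >>= 3
        nbits -= 3
    return out


# Hypothetical example: one index per codebook level.
indices = [0, 1, 2, 3, 4, 5, 6, 7]
packed = pack_3bit(indices)
restored = unpack_3bit(packed, len(indices))
```

Because the index count is not recoverable from the byte length alone when it is not a multiple of 8, a codec along these lines would need to store the count (or the tensor shape) as metadata next to the packed bytes.
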
## Environment Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil
```

Or:

```bash
pip install -r requirements.txt
```

## How Quantization Is Done

### 1) RotorQuant-style 3-bit weight codec (`rotorquant_weights.py`)

For selected float tensors (typically `Linear.weight`):

1. Optional rotor transform over triples of values (deterministic per tensor)
2. Block-wise normalization
   1. mean-center block
   2. scale by max-abs per block
3. 3-bit scalar quantization to 8-level codebook
4. Packed 3-bit index serialization
5. Stored metadata for dequantization (scales, centers, codebook, shape info)

Supported options include:

- `--block-size` (example: 128 or 64)
- `--rowwise`
- `--include-name-contains` / `--skip-name` selection
- `--lowrank-rank` residual correction
- `--outlier-frac` residual outlier preservation

### 2) Rotor fused runtime (`runtime_rotor_fused.py`)

- Replaces selected `nn.Linear` modules with `FusedRotorLinear`.
- Reads packed 3-bit weights directly.
- Decodes on demand and caches decoded weight for repeated use.
- Avoids full eager dequantization at load for quantized layers.

### 3) Dynamic INT8 runtime (`runtime_int8.py`)

- Uses PyTorch dynamic quantization (`nn.Linear -> qint8`) as a runtime speed baseline.
- Supports full-model or selective-module quantization (`--include-name-contains mlp.`).

## Reproducible Commands

### A) Quantize (RotorQuant package)

Example: MLP-only, 3-bit, block size 64:

```bash
python quantize_qwen.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \
  --bits 3 \
  --block-size 64 \
  --dtype float32 \
  --include-name-contains mlp.
```

### B) Inference from quantized package

```bash
python run_inference.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64
```

### C) Quality validation

```bash
python validate_quantization.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --max-new-tokens 48 \
  --dtype float32
```

### D) Build dynamic INT8 runtime models

Full dynamic INT8:

```bash
python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json
```

Selective dynamic INT8 (MLP-only):

```bash
python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \
  --include-name-contains mlp.
```

### E) Run Rotor fused runtime

```bash
python runtime_rotor_fused.py run \
  --pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64
```

### F) Benchmark all quantized artifacts in `artifacts/`

```bash
python benchmark_scenarios.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --artifacts-dir artifacts \
  --max-new-tokens 64 \
  --dtype float32 \
  --out artifacts/benchmark_results.json
```

### G) Unified runtime benchmark (FP32 vs Rotor pkg vs Rotor fused vs INT8)

```bash
python benchmark_runtime_vs_rotor.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \
  --fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --max-new-tokens 64 \
  --out artifacts/runtime_benchmark_with_fused.json
```

## Reported Metrics

### Quality metric example (RotorQuant package)

For `qwen2.5-0.5b-rotorq3-mlp-only-b64` (`~3.50 bits/weight` on quantized tensors):

- Mean cosine similarity (last-token logits): `0.868771`
- Mean greedy token-match ratio: `0.0781`

(From `validate_quantization.py` run on 4 prompts.)

### Runtime benchmark summary

From `artifacts/runtime_benchmark_with_fused.json` (CPU, `max_new_tokens=64`, 4 prompts):

| Scenario | Load (s) | First Token (s) | Generate (s) | Decode tok/s | RSS after load (GB) | Token Match vs FP32 |
|---|---:|---:|---:|---:|---:|---:|
| FP32 baseline | 6.495 | 0.159 | 6.547 | 9.78 | 2.28 | 1.0000 |
| RotorQuant package (dequantized load) | 6.533 | 0.152 | 6.707 | 9.55 | 2.71 | 0.0820 |
| RotorQuant fused runtime | 3.895 | 0.162 | 6.190 | 10.34 | 2.89 | 0.0039 |
| Dynamic INT8 runtime (MLP-only) | 4.646 | 0.097 | 4.059 | 15.77 | 2.65 | 0.0156 |
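
The two quality metrics reported in this section, cosine similarity of last-token logits and greedy token-match ratio, can be sketched in plain Python as follows. This is an illustrative reading of the metric names, not the code in `validate_quantization.py`: the helper names are hypothetical, and counting positions missing from the shorter sequence as mismatches is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    if len(a) != len(b):
        raise ValueError("logit vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate all-zero vectors
    return dot / (norm_a * norm_b)


def token_match_ratio(ref_tokens, test_tokens):
    """Fraction of positions where both greedy generations emit the same token id.

    Assumption: the denominator is the longer sequence length, so positions
    that exist in only one sequence count as mismatches.
    """
    total = max(len(ref_tokens), len(test_tokens))
    if total == 0:
        return 1.0  # two empty generations agree trivially
    matches = sum(1 for r, t in zip(ref_tokens, test_tokens) if r == t)
    return matches / total


# Hypothetical toy values standing in for real model outputs.
baseline_logits = [2.0, -1.0, 0.5, 3.0]
quantized_logits = [1.8, -0.7, 0.9, 2.5]
logit_cosine = cosine_similarity(baseline_logits, quantized_logits)

baseline_ids = [101, 7, 42, 9]
quantized_ids = [101, 7, 13, 9]
match_ratio = token_match_ratio(baseline_ids, quantized_ids)
```

Under this reading, the token-match ratio is much stricter than the logit cosine: once a greedy generation diverges at one position, later positions are conditioned on different prefixes and rarely realign, which is one plausible reason a run can show a moderately high cosine alongside a very low match ratio.
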