| # RotorQuant Weights + Runtime Quantization for Qwen2.5-0.5B-Instruct |
|
|
| This repository implements and benchmarks multiple quantization/deployment paths for `Qwen/Qwen2.5-0.5B-Instruct`, including: |
|
|
| 1. RotorQuant-style 3-bit weight quantization (custom codec) |
| 2. Quantized model loading + text generation with proper Qwen chat template |
| 3. Validation against FP32 baseline |
| 4. Runtime acceleration paths: |
| 1. RotorQuant fused runtime (packed-weight linear) |
| 2. Dynamic INT8 runtime baseline (for speed comparison) |
|
|
| The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way. |
|
|
| ## Model Reference |
|
|
| - Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct |
| - This repo uses the official chat template via `tokenizer.apply_chat_template(...)`. |
|
|
| ## Repository Layout |
|
|
| - `rotorquant.py`: Original RotorQuant core classes (`RotorQuantMSE`, `RotorQuantProd`, `RotorQuantKVCache`) using Clifford algebra + Lloyd-Max. |
| - `clifford.py`: Cl(3,0) multivector algebra and rotor operations. |
| - `lloyd_max.py`: Lloyd-Max codebook solver (SciPy integration) and utilities. |
| - `rotorquant_weights.py`: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports). |
| - `quantize_qwen.py`: Quantize a Hugging Face model checkpoint into custom package format. |
| - `run_inference.py`: Load quantized package, reconstruct model, run generation. |
| - `validate_quantization.py`: Baseline vs quantized quality checks (logit cosine + token match). |
| - `benchmark_scenarios.py`: Benchmark baseline and all quantized artifacts in `artifacts/`. |
| - `runtime_int8.py`: Build/load dynamic INT8 runtime models (full or selective modules). |
| - `runtime_rotor_fused.py`: Fused RotorQuant runtime path using packed-weight linear module. |
| - `benchmark_runtime_vs_rotor.py`: Unified benchmark: FP32 vs RotorQuant package vs Rotor fused vs INT8 runtime. |
| - `artifacts/*.json`: Saved reports/metrics from experiments. |
|
|
| ## Environment Setup |
|
|
| ```bash |
| python3 -m venv .venv |
| source .venv/bin/activate |
| pip install --index-url https://download.pytorch.org/whl/cpu torch |
| pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil |
| ``` |
|
|
| Or: |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ## How Quantization Is Done |
|
|
| ### 1) RotorQuant-style 3-bit weight codec (`rotorquant_weights.py`) |
| |
| For selected float tensors (typically `Linear.weight`): |
| |
| 1. Optional rotor transform over triples of values (deterministic per tensor) |
| 2. Block-wise normalization |
| 1. mean-center block |
| 2. scale by max-abs per block |
| 3. 3-bit scalar quantization to 8-level codebook |
| 4. Packed 3-bit index serialization |
| 5. Stored metadata for dequantization (scales, centers, codebook, shape info) |
| |
| Supported options include: |
| |
| - `--block-size` (example: 128 or 64) |
| - `--rowwise` |
| - `--include-name-contains` / `--skip-name` selection |
| - `--lowrank-rank` residual correction |
| - `--outlier-frac` residual outlier preservation |
| |
| ### 2) Rotor fused runtime (`runtime_rotor_fused.py`) |
| |
| - Replaces selected `nn.Linear` modules with `FusedRotorLinear`. |
| - Reads packed 3-bit weights directly. |
| - Decodes on demand and caches decoded weight for repeated use. |
| - Avoids full eager dequantization at load for quantized layers. |
| |
| ### 3) Dynamic INT8 runtime (`runtime_int8.py`) |
|
|
| - Uses PyTorch dynamic quantization (`nn.Linear -> qint8`) as a runtime speed baseline. |
| - Supports full-model or selective-module quantization (`--include-name-contains mlp.`). |
|
|
| ## Reproducible Commands |
|
|
| ### A) Quantize (RotorQuant package) |
|
|
| Example: MLP-only, 3-bit, block size 64: |
|
|
| ```bash |
| python quantize_qwen.py \ |
| --model-id Qwen/Qwen2.5-0.5B-Instruct \ |
| --output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \ |
| --report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \ |
| --bits 3 \ |
| --block-size 64 \ |
| --dtype float32 \ |
| --include-name-contains mlp. |
| ``` |
|
|
| ### B) Inference from quantized package |
|
|
| ```bash |
| python run_inference.py \ |
| --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \ |
| --prompt "Explain quantization in one paragraph." \ |
| --max-new-tokens 64 |
| ``` |
|
|
| ### C) Quality validation |
|
|
| ```bash |
| python validate_quantization.py \ |
| --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \ |
| --max-new-tokens 48 \ |
| --dtype float32 |
| ``` |
|
|
| ### D) Build dynamic INT8 runtime models |
|
|
| Full dynamic INT8: |
|
|
| ```bash |
| python runtime_int8.py build \ |
| --model-id Qwen/Qwen2.5-0.5B-Instruct \ |
| --out artifacts/qwen2.5-0.5b-dynamic-int8.pt \ |
| --meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json |
| ``` |
|
|
| Selective dynamic INT8 (MLP-only): |
|
|
| ```bash |
| python runtime_int8.py build \ |
| --model-id Qwen/Qwen2.5-0.5B-Instruct \ |
| --out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \ |
| --meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \ |
| --include-name-contains mlp. |
| ``` |
|
|
| ### E) Run Rotor fused runtime |
|
|
| ```bash |
| python runtime_rotor_fused.py run \ |
| --pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \ |
| --prompt "Explain quantization in one paragraph." \ |
| --max-new-tokens 64 |
| ``` |
|
|
| ### F) Benchmark all quantized artifacts in `artifacts/` |
|
|
| ```bash |
| python benchmark_scenarios.py \ |
| --model-id Qwen/Qwen2.5-0.5B-Instruct \ |
| --artifacts-dir artifacts \ |
| --max-new-tokens 64 \ |
| --dtype float32 \ |
| --out artifacts/benchmark_results.json |
| ``` |
|
|
| ### G) Unified runtime benchmark (FP32 vs Rotor pkg vs Rotor fused vs INT8) |
|
|
| ```bash |
| python benchmark_runtime_vs_rotor.py \ |
| --model-id Qwen/Qwen2.5-0.5B-Instruct \ |
| --rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \ |
| --fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \ |
| --int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \ |
| --max-new-tokens 64 \ |
| --out artifacts/runtime_benchmark_with_fused.json |
| ``` |
|
|
| ## Reported Metrics |
|
|
| ### Quality metric example (RotorQuant package) |
|
|
| For `qwen2.5-0.5b-rotorq3-mlp-only-b64` (`~3.50 bits/weight` on quantized tensors): |
|
|
| - Mean cosine similarity (last-token logits): `0.868771` |
| - Mean greedy token-match ratio: `0.0781` |
|
|
| (From `validate_quantization.py` run on 4 prompts.) |
|
|
| ### Runtime benchmark summary |
|
|
| From `artifacts/runtime_benchmark_with_fused.json` (CPU, `max_new_tokens=64`, 4 prompts): |
|
|
| | Scenario | Load (s) | First Token (s) | Generate (s) | Decode tok/s | RSS after load (GB) | Token Match vs FP32 | |
| |---|---:|---:|---:|---:|---:|---:| |
| | FP32 baseline | 6.495 | 0.159 | 6.547 | 9.78 | 2.28 | 1.0000 | |
| | RotorQuant package (dequantized load) | 6.533 | 0.152 | 6.707 | 9.55 | 2.71 | 0.0820 | |
| | RotorQuant fused runtime | 3.895 | 0.162 | 6.190 | 10.34 | 2.89 | 0.0039 | |
| | Dynamic INT8 runtime (MLP-only) | 4.646 | 0.097 | 4.059 | 15.77 | 2.65 | 0.0156 | |
|
|