# RotorQuant Weights + Runtime Quantization for Qwen2.5-0.5B-Instruct
This repository implements and benchmarks multiple quantization/deployment paths for `Qwen/Qwen2.5-0.5B-Instruct`, including:
1. RotorQuant-style 3-bit weight quantization (custom codec)
2. Quantized model loading + text generation with the official Qwen chat template
3. Validation against FP32 baseline
4. Runtime acceleration paths:
   1. RotorQuant fused runtime (packed-weight linear)
   2. Dynamic INT8 runtime baseline (for speed comparison)
The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way.
## Model Reference
- Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- This repo uses the official chat template via `tokenizer.apply_chat_template(...)`.
## Repository Layout
- `rotorquant.py`: Original RotorQuant core classes (`RotorQuantMSE`, `RotorQuantProd`, `RotorQuantKVCache`) using Clifford algebra + Lloyd-Max.
- `clifford.py`: Cl(3,0) multivector algebra and rotor operations.
- `lloyd_max.py`: Lloyd-Max codebook solver (SciPy integration) and utilities.
- `rotorquant_weights.py`: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports).
- `quantize_qwen.py`: Quantize a Hugging Face model checkpoint into custom package format.
- `run_inference.py`: Load quantized package, reconstruct model, run generation.
- `validate_quantization.py`: Baseline vs quantized quality checks (logit cosine + token match).
- `benchmark_scenarios.py`: Benchmark baseline and all quantized artifacts in `artifacts/`.
- `runtime_int8.py`: Build/load dynamic INT8 runtime models (full or selective modules).
- `runtime_rotor_fused.py`: Fused RotorQuant runtime path using packed-weight linear module.
- `benchmark_runtime_vs_rotor.py`: Unified benchmark: FP32 vs RotorQuant package vs Rotor fused vs INT8 runtime.
- `artifacts/*.json`: Saved reports/metrics from experiments.
## Environment Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil
```
Or:
```bash
pip install -r requirements.txt
```
## How Quantization Is Done
### 1) RotorQuant-style 3-bit weight codec (`rotorquant_weights.py`)
For selected float tensors (typically `Linear.weight`):
1. Optional rotor transform over triples of values (deterministic per tensor)
2. Block-wise normalization
   1. mean-center block
   2. scale by max-abs per block
3. 3-bit scalar quantization to 8-level codebook
4. Packed 3-bit index serialization
5. Stored metadata for dequantization (scales, centers, codebook, shape info)
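The block-wise steps above can be sketched in NumPy as follows. This is a minimal illustration, not the code in `rotorquant_weights.py`: the rotor transform is omitted, the uniform 8-level `CODEBOOK` is an assumption (the repository's codebook may differ), and every function name here is hypothetical. The `effective_bits` helper shows one possible accounting, assuming a 16-bit scale and a 16-bit center per block, under which block size 64 costs 3.5 bits per weight; the actual metadata layout is not described in this README.

```python
import numpy as np

# Assumed uniform 8-level codebook on [-1, 1]; the real codebook may differ.
CODEBOOK = np.linspace(-1.0, 1.0, 8, dtype=np.float32)

def quantize_blocks(w, block_size=64):
    """Mean-center and max-abs scale each block, then pick the nearest codebook index."""
    flat = w.astype(np.float32).ravel()
    pad = (-flat.size) % block_size            # zero-pad so the tensor splits evenly
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    centers = blocks.mean(axis=1, keepdims=True)
    centered = blocks - centers
    scales = np.abs(centered).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0                  # avoid division by zero on constant blocks
    normed = centered / scales                 # values now lie in [-1, 1]
    idx = np.abs(normed[..., None] - CODEBOOK).argmin(axis=-1).astype(np.uint8)
    return idx, scales, centers

def pack_3bit(idx):
    """Pack 3-bit indices into bytes, least significant bit first."""
    bits = ((idx.ravel()[:, None] >> np.arange(3)) & 1).astype(np.uint8)
    return np.packbits(bits.ravel(), bitorder="little")

def unpack_3bit(packed, count):
    """Inverse of pack_3bit for `count` indices."""
    bits = np.unpackbits(packed, bitorder="little")[: count * 3].reshape(-1, 3)
    return (bits << np.arange(3)).sum(axis=1).astype(np.uint8)

def dequantize_blocks(idx, scales, centers, shape):
    """Look up codebook values, then undo the scaling and centering."""
    blocks = CODEBOOK[idx] * scales + centers
    return blocks.ravel()[: int(np.prod(shape))].reshape(shape)

def effective_bits(bits, block_size, meta_bits_per_block):
    """Bits per weight including per-block metadata overhead."""
    return bits + meta_bits_per_block / block_size

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)
idx, scales, centers = quantize_blocks(w, block_size=64)
packed = pack_3bit(idx)
restored = unpack_3bit(packed, idx.size).reshape(idx.shape)
w_hat = dequantize_blocks(restored, scales, centers, w.shape)
```

With a uniform codebook the per-element error is bounded by one seventh of the block scale, which is why smaller blocks (tighter scales) tend to reconstruct better at the cost of more metadata.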
Supported options include:
- `--block-size` (example: 128 or 64)
- `--rowwise`
- `--include-name-contains` / `--skip-name` selection
- `--lowrank-rank` residual correction
- `--outlier-frac` residual outlier preservation
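As a rough illustration of the last two options, the sketch below shows one common way to correct a quantized tensor: approximate the residual `W - W_hat` with a truncated SVD, then store the largest remaining errors exactly. This is an assumed interpretation of `--lowrank-rank` and `--outlier-frac`, not a description of the repository's code, and the function names are hypothetical.

```python
import numpy as np

def lowrank_residual(w, w_hat, rank):
    """Best rank-`rank` approximation of the residual W - W_hat via truncated SVD."""
    u, s, vt = np.linalg.svd(w - w_hat, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out_features, rank)
    b = vt[:rank]                # shape (rank, in_features)
    return a, b

def top_outliers(residual, frac):
    """Return flat indices and values of the largest-magnitude residual entries."""
    k = max(1, int(round(frac * residual.size)))
    flat = residual.ravel()
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    return keep, flat[keep]

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 48)).astype(np.float32)
w_hat = np.round(w * 2) / 2          # stand-in for a dequantized tensor

# Step 1: add a rank-4 correction for the quantization residual.
a, b = lowrank_residual(w, w_hat, rank=4)
corrected = w_hat + a @ b

# Step 2: patch the 1% of entries that are still furthest off.
keep, vals = top_outliers(w - corrected, frac=0.01)
flat_view = corrected.reshape(-1)    # view into `corrected`, so edits write through
flat_view[keep] += vals
```

Both corrections add storage on top of the packed indices, so they trade compression ratio for reconstruction accuracy.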
### 2) Rotor fused runtime (`runtime_rotor_fused.py`)
- Replaces selected `nn.Linear` modules with `FusedRotorLinear`.
- Reads packed 3-bit weights directly.
- Decodes on demand and caches decoded weight for repeated use.
- Avoids full eager dequantization at load for quantized layers.
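The decode-on-demand pattern described above can be sketched as a small PyTorch module. `LazyDecodedLinear`, `replace_linears`, and `make_decoder` are hypothetical names used only for illustration; the real `FusedRotorLinear` may be structured differently, and the decoder here simply returns a copy of the dense weight where a real one would unpack 3-bit indices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LazyDecodedLinear(nn.Module):
    """Illustrative stand-in: decode the weight on first use, then reuse the cache."""
    def __init__(self, decode_fn, bias=None):
        super().__init__()
        self.decode_fn = decode_fn   # callable returning the dense weight tensor
        self.bias = bias
        self._cached = None
        self.decode_calls = 0

    def forward(self, x):
        if self._cached is None:     # decode only when the layer is first hit
            self._cached = self.decode_fn()
            self.decode_calls += 1
        return F.linear(x, self._cached, self.bias)

def replace_linears(model, name_contains, make_decoder):
    """Swap every nn.Linear whose qualified name contains `name_contains`."""
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear) and name_contains in name:
            parent_name, _, child = name.rpartition(".")
            parent = model.get_submodule(parent_name) if parent_name else model
            setattr(parent, child, LazyDecodedLinear(make_decoder(module), module.bias))
    return model

def make_decoder(linear):
    """Placeholder decoder: a real one would unpack and dequantize stored indices."""
    weight = linear.weight.detach().clone()
    return lambda: weight

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(8, 8)
        self.mlp = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))

    def forward(self, x):
        return self.mlp(self.attn(x))

torch.manual_seed(0)
model = Tiny().eval()
x = torch.randn(2, 8)
with torch.no_grad():
    reference = model(x)
    replace_linears(model, "mlp.", make_decoder)
    out_first = model(x)
    out_second = model(x)            # served from the cached weight
```

Because the decoded tensor stays resident after first use, this pattern mainly defers the decode cost from load time to first call; it does not by itself lower steady-state memory.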
### 3) Dynamic INT8 runtime (`runtime_int8.py`)
- Uses PyTorch dynamic quantization (`nn.Linear -> qint8`) as a runtime speed baseline.
- Supports full-model or selective-module quantization (`--include-name-contains mlp.`).
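For reference, full and selective dynamic quantization might look like the sketch below. It assumes a PyTorch build that still ships `torch.ao.quantization.quantize_dynamic`, and it assumes that `--include-name-contains` is a substring filter on module names; `Tiny` is a toy model standing in for Qwen, and none of this is taken from `runtime_int8.py`.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

class Tiny(nn.Module):
    """Toy model with one attention-like and two MLP linears."""
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(8, 8)
        self.mlp = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))

    def forward(self, x):
        return self.mlp(self.attn(x))

model = Tiny().eval()

# Full-model path: every nn.Linear becomes a dynamically quantized linear.
full = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# Selective path: pass qualified module names instead of a module type.
# The substring filter mirrors the assumed meaning of `--include-name-contains mlp.`.
selected = {
    name for name, module in model.named_modules()
    if isinstance(module, nn.Linear) and "mlp." in name
}
mlp_only = quantize_dynamic(model, selected, dtype=torch.qint8)
```

`quantize_dynamic` returns a converted copy by default, so the FP32 `model` remains available as the comparison baseline.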
## Reproducible Commands
### A) Quantize (RotorQuant package)
Example: MLP-only, 3-bit, block size 64:
```bash
python quantize_qwen.py \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
--report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \
--bits 3 \
--block-size 64 \
--dtype float32 \
--include-name-contains mlp.
```
### B) Inference from quantized package
```bash
python run_inference.py \
--quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
--prompt "Explain quantization in one paragraph." \
--max-new-tokens 64
```
### C) Quality validation
```bash
python validate_quantization.py \
--quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
--max-new-tokens 48 \
--dtype float32
```
### D) Build dynamic INT8 runtime models
Full dynamic INT8:
```bash
python runtime_int8.py build \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--out artifacts/qwen2.5-0.5b-dynamic-int8.pt \
--meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json
```
Selective dynamic INT8 (MLP-only):
```bash
python runtime_int8.py build \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
--meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \
--include-name-contains mlp.
```
### E) Run Rotor fused runtime
```bash
python runtime_rotor_fused.py run \
--pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
--prompt "Explain quantization in one paragraph." \
--max-new-tokens 64
```
### F) Benchmark all quantized artifacts in `artifacts/`
```bash
python benchmark_scenarios.py \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--artifacts-dir artifacts \
--max-new-tokens 64 \
--dtype float32 \
--out artifacts/benchmark_results.json
```
### G) Unified runtime benchmark (FP32 vs Rotor pkg vs Rotor fused vs INT8)
```bash
python benchmark_runtime_vs_rotor.py \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \
--fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
--int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
--max-new-tokens 64 \
--out artifacts/runtime_benchmark_with_fused.json
```
## Reported Metrics
### Quality metric example (RotorQuant package)
For `qwen2.5-0.5b-rotorq3-mlp-only-b64` (`~3.50 bits/weight` on quantized tensors):
- Mean cosine similarity (last-token logits): `0.868771`
- Mean greedy token-match ratio: `0.0781`
(From `validate_quantization.py` run on 4 prompts.)
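The two metrics can be read as follows. This is a minimal sketch under assumptions: the cosine is taken over the full vocabulary logit vector at the last position, and token match is counted position by position over the greedy continuation. The exact definitions in `validate_quantization.py` may differ, and the function names are hypothetical.

```python
import math

def logit_cosine(ref_logits, test_logits):
    """Cosine similarity between two logit vectors of equal length."""
    dot = sum(r * t for r, t in zip(ref_logits, test_logits))
    norm_ref = math.sqrt(sum(r * r for r in ref_logits))
    norm_test = math.sqrt(sum(t * t for t in test_logits))
    return dot / (norm_ref * norm_test)

def token_match_ratio(ref_ids, test_ids):
    """Fraction of positions where both greedy continuations emit the same token id."""
    n = min(len(ref_ids), len(test_ids))
    if n == 0:
        return 0.0
    return sum(r == t for r, t in zip(ref_ids[:n], test_ids[:n])) / n

# Toy values, not measurements from this repository.
cosine = logit_cosine([2.0, 0.5, -1.0], [1.8, 0.7, -0.9])
match = token_match_ratio([1, 2, 3, 4], [1, 2, 9, 4])   # 3 of 4 positions agree
```

Under the position-wise definition, a single early divergence usually drags the match ratio down sharply, because every later token is conditioned on a different prefix.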
### Runtime benchmark summary
From `artifacts/runtime_benchmark_with_fused.json` (CPU, `max_new_tokens=64`, 4 prompts):
| Scenario | Load (s) | First Token (s) | Generate (s) | Decode tok/s | RSS after load (GB) | Token Match vs FP32 |
|---|---:|---:|---:|---:|---:|---:|
| FP32 baseline | 6.495 | 0.159 | 6.547 | 9.78 | 2.28 | 1.0000 |
| RotorQuant package (dequantized load) | 6.533 | 0.152 | 6.707 | 9.55 | 2.71 | 0.0820 |
| RotorQuant fused runtime | 3.895 | 0.162 | 6.190 | 10.34 | 2.89 | 0.0039 |
| Dynamic INT8 runtime (MLP-only) | 4.646 | 0.097 | 4.059 | 15.77 | 2.65 | 0.0156 |