cnmoro's picture
Upload 29 files
18f4d80 verified
# RotorQuant Weights + Runtime Quantization for Qwen2.5-0.5B-Instruct
This repository implements and benchmarks multiple quantization/deployment paths for `Qwen/Qwen2.5-0.5B-Instruct`, including:
1. RotorQuant-style 3-bit weight quantization (custom codec)
2. Quantized model loading + text generation with proper Qwen chat template
3. Validation against FP32 baseline
4. Runtime acceleration paths:
1. RotorQuant fused runtime (packed-weight linear)
2. Dynamic INT8 runtime baseline (for speed comparison)
The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way.
## Model Reference
- Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- This repo uses the official chat template via `tokenizer.apply_chat_template(...)`.
## Repository Layout
- `rotorquant.py`: Original RotorQuant core classes (`RotorQuantMSE`, `RotorQuantProd`, `RotorQuantKVCache`) using Clifford algebra + Lloyd-Max.
- `clifford.py`: Cl(3,0) multivector algebra and rotor operations.
- `lloyd_max.py`: Lloyd-Max codebook solver (SciPy integration) and utilities.
- `rotorquant_weights.py`: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports).
- `quantize_qwen.py`: Quantize a Hugging Face model checkpoint into custom package format.
- `run_inference.py`: Load quantized package, reconstruct model, run generation.
- `validate_quantization.py`: Baseline vs quantized quality checks (logit cosine + token match).
- `benchmark_scenarios.py`: Benchmark baseline and all quantized artifacts in `artifacts/`.
- `runtime_int8.py`: Build/load dynamic INT8 runtime models (full or selective modules).
- `runtime_rotor_fused.py`: Fused RotorQuant runtime path using packed-weight linear module.
- `benchmark_runtime_vs_rotor.py`: Unified benchmark: FP32 vs RotorQuant package vs Rotor fused vs INT8 runtime.
- `artifacts/*.json`: Saved reports/metrics from experiments.
## Environment Setup
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil
```
Or:
```bash
pip install -r requirements.txt
```
## How Quantization Is Done
### 1) RotorQuant-style 3-bit weight codec (`rotorquant_weights.py`)
For selected float tensors (typically `Linear.weight`):
1. Optional rotor transform over triples of values (deterministic per tensor)
2. Block-wise normalization
1. mean-center block
2. scale by max-abs per block
3. 3-bit scalar quantization to 8-level codebook
4. Packed 3-bit index serialization
5. Stored metadata for dequantization (scales, centers, codebook, shape info)
Supported options include:
- `--block-size` (example: 128 or 64)
- `--rowwise`
- `--include-name-contains` / `--skip-name` selection
- `--lowrank-rank` residual correction
- `--outlier-frac` residual outlier preservation
### 2) Rotor fused runtime (`runtime_rotor_fused.py`)
- Replaces selected `nn.Linear` modules with `FusedRotorLinear`.
- Reads packed 3-bit weights directly.
- Decodes on demand and caches decoded weight for repeated use.
- Avoids full eager dequantization at load for quantized layers.
### 3) Dynamic INT8 runtime (`runtime_int8.py`)
- Uses PyTorch dynamic quantization (`nn.Linear -> qint8`) as a runtime speed baseline.
- Supports full-model or selective-module quantization (`--include-name-contains mlp.`).
## Reproducible Commands
### A) Quantize (RotorQuant package)
Example: MLP-only, 3-bit, block size 64:
```bash
python quantize_qwen.py \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
--report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \
--bits 3 \
--block-size 64 \
--dtype float32 \
--include-name-contains mlp.
```
### B) Inference from quantized package
```bash
python run_inference.py \
--quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
--prompt "Explain quantization in one paragraph." \
--max-new-tokens 64
```
### C) Quality validation
```bash
python validate_quantization.py \
--quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
--max-new-tokens 48 \
--dtype float32
```
### D) Build dynamic INT8 runtime models
Full dynamic INT8:
```bash
python runtime_int8.py build \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--out artifacts/qwen2.5-0.5b-dynamic-int8.pt \
--meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json
```
Selective dynamic INT8 (MLP-only):
```bash
python runtime_int8.py build \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
--meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \
--include-name-contains mlp.
```
### E) Run Rotor fused runtime
```bash
python runtime_rotor_fused.py run \
--pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
--prompt "Explain quantization in one paragraph." \
--max-new-tokens 64
```
### F) Benchmark all quantized artifacts in `artifacts/`
```bash
python benchmark_scenarios.py \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--artifacts-dir artifacts \
--max-new-tokens 64 \
--dtype float32 \
--out artifacts/benchmark_results.json
```
### G) Unified runtime benchmark (FP32 vs Rotor pkg vs Rotor fused vs INT8)
```bash
python benchmark_runtime_vs_rotor.py \
--model-id Qwen/Qwen2.5-0.5B-Instruct \
--rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \
--fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
--int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
--max-new-tokens 64 \
--out artifacts/runtime_benchmark_with_fused.json
```
## Reported Metrics
### Quality metric example (RotorQuant package)
For `qwen2.5-0.5b-rotorq3-mlp-only-b64` (`~3.50 bits/weight` on quantized tensors):
- Mean cosine similarity (last-token logits): `0.868771`
- Mean greedy token-match ratio: `0.0781`
(From `validate_quantization.py` run on 4 prompts.)
### Runtime benchmark summary
From `artifacts/runtime_benchmark_with_fused.json` (CPU, `max_new_tokens=64`, 4 prompts):
| Scenario | Load (s) | First Token (s) | Generate (s) | Decode tok/s | RSS after load (GB) | Token Match vs FP32 |
|---|---:|---:|---:|---:|---:|---:|
| FP32 baseline | 6.495 | 0.159 | 6.547 | 9.78 | 2.28 | 1.0000 |
| RotorQuant package (dequantized load) | 6.533 | 0.152 | 6.707 | 9.55 | 2.71 | 0.0820 |
| RotorQuant fused runtime | 3.895 | 0.162 | 6.190 | 10.34 | 2.89 | 0.0039 |
| Dynamic INT8 runtime (MLP-only) | 4.646 | 0.097 | 4.059 | 15.77 | 2.65 | 0.0156 |