# RotorQuant Weights + Runtime Quantization for Qwen2.5-0.5B-Instruct

This repository implements and benchmarks multiple quantization/deployment paths for `Qwen/Qwen2.5-0.5B-Instruct`, including:

1. RotorQuant-style 3-bit weight quantization (custom codec)
2. Quantized model loading + text generation with proper Qwen chat template
3. Validation against FP32 baseline
4. Runtime acceleration paths:
1. RotorQuant fused runtime (packed-weight linear)
2. Dynamic INT8 runtime baseline (for speed comparison)

The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way.

## Model Reference

- Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- This repo uses the official chat template via `tokenizer.apply_chat_template(...)`.

## Repository Layout

- `rotorquant.py`: Original RotorQuant core classes (`RotorQuantMSE`, `RotorQuantProd`, `RotorQuantKVCache`) using Clifford algebra + Lloyd-Max.
- `clifford.py`: Cl(3,0) multivector algebra and rotor operations.
- `lloyd_max.py`: Lloyd-Max codebook solver (SciPy integration) and utilities.
- `rotorquant_weights.py`: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports).
- `quantize_qwen.py`: Quantize a Hugging Face model checkpoint into custom package format.
- `run_inference.py`: Load quantized package, reconstruct model, run generation.
- `validate_quantization.py`: Baseline vs quantized quality checks (logit cosine + token match).
- `benchmark_scenarios.py`: Benchmark baseline and all quantized artifacts in `artifacts/`.
- `runtime_int8.py`: Build/load dynamic INT8 runtime models (full or selective modules).
- `runtime_rotor_fused.py`: Fused RotorQuant runtime path using packed-weight linear module.
- `benchmark_runtime_vs_rotor.py`: Unified benchmark: FP32 vs RotorQuant package vs Rotor fused vs INT8 runtime.
- `artifacts/*.json`: Saved reports/metrics from experiments.

## Environment Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil
```

Or:

```bash
pip install -r requirements.txt
```

## How Quantization Is Done

### 1) RotorQuant-style 3-bit weight codec (`rotorquant_weights.py`)

For selected float tensors (typically `Linear.weight`):

1. Optional rotor transform over triples of values (deterministic per tensor)
2. Block-wise normalization
1. mean-center block
2. scale by max-abs per block
3. 3-bit scalar quantization to 8-level codebook
4. Packed 3-bit index serialization
5. Stored metadata for dequantization (scales, centers, codebook, shape info)

Supported options include:

- `--block-size` (example: 128 or 64)
- `--rowwise`
- `--include-name-contains` / `--skip-name` selection
- `--lowrank-rank` residual correction
- `--outlier-frac` residual outlier preservation

### 2) Rotor fused runtime (`runtime_rotor_fused.py`)

- Replaces selected `nn.Linear` modules with `FusedRotorLinear`.
- Reads packed 3-bit weights directly.
- Decodes on demand and caches decoded weight for repeated use.
- Avoids full eager dequantization at load for quantized layers.

### 3) Dynamic INT8 runtime (`runtime_int8.py`)

- Uses PyTorch dynamic quantization (`nn.Linear -> qint8`) as a runtime speed baseline.
- Supports full-model or selective-module quantization (`--include-name-contains mlp.`).

## Reproducible Commands

### A) Quantize (RotorQuant package)

Example: MLP-only, 3-bit, block size 64:

```bash
python quantize_qwen.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \
  --bits 3 \
  --block-size 64 \
  --dtype float32 \
  --include-name-contains mlp.
```

### B) Inference from quantized package

```bash
python run_inference.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64
```

### C) Quality validation

```bash
python validate_quantization.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --max-new-tokens 48 \
  --dtype float32
```

### D) Build dynamic INT8 runtime models

Full dynamic INT8:

```bash
python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json
```

Selective dynamic INT8 (MLP-only):

```bash
python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \
  --include-name-contains mlp.
```

### E) Run Rotor fused runtime

```bash
python runtime_rotor_fused.py run \
  --pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64
```

### F) Benchmark all quantized artifacts in `artifacts/`

```bash
python benchmark_scenarios.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --artifacts-dir artifacts \
  --max-new-tokens 64 \
  --dtype float32 \
  --out artifacts/benchmark_results.json
```

### G) Unified runtime benchmark (FP32 vs Rotor pkg vs Rotor fused vs INT8)

```bash
python benchmark_runtime_vs_rotor.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \
  --fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --max-new-tokens 64 \
  --out artifacts/runtime_benchmark_with_fused.json
```

## Reported Metrics

### Quality metric example (RotorQuant package)

For `qwen2.5-0.5b-rotorq3-mlp-only-b64` (`~3.50 bits/weight` on quantized tensors):

- Mean cosine similarity (last-token logits): `0.868771`
- Mean greedy token-match ratio: `0.0781`

(From `validate_quantization.py` run on 4 prompts.)

### Runtime benchmark summary

From `artifacts/runtime_benchmark_with_fused.json` (CPU, `max_new_tokens=64`, 4 prompts):

| Scenario | Load (s) | First Token (s) | Generate (s) | Decode tok/s | RSS after load (GB) | Token Match vs FP32 |
|---|---:|---:|---:|---:|---:|---:|
| FP32 baseline | 6.495 | 0.159 | 6.547 | 9.78 | 2.28 | 1.0000 |
| RotorQuant package (dequantized load) | 6.533 | 0.152 | 6.707 | 9.55 | 2.71 | 0.0820 |
| RotorQuant fused runtime | 3.895 | 0.162 | 6.190 | 10.34 | 2.89 | 0.0039 |
| Dynamic INT8 runtime (MLP-only) | 4.646 | 0.097 | 4.059 | 15.77 | 2.65 | 0.0156 |