Upload 29 files

18f4d80 verified about 2 months ago

6.58 kB

	# RotorQuant Weights + Runtime Quantization for Qwen2.5-0.5B-Instruct

	This repository implements and benchmarks multiple quantization/deployment paths for `Qwen/Qwen2.5-0.5B-Instruct`, including:

	1. RotorQuant-style 3-bit weight quantization (custom codec)
	2. Quantized model loading + text generation with proper Qwen chat template
	3. Validation against FP32 baseline
	4. Runtime acceleration paths:
	1. RotorQuant fused runtime (packed-weight linear)
	2. Dynamic INT8 runtime baseline (for speed comparison)

	The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way.

	## Model Reference

	- Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
	- This repo uses the official chat template via `tokenizer.apply_chat_template(...)`.

	## Repository Layout

	- `rotorquant.py`: Original RotorQuant core classes (`RotorQuantMSE`, `RotorQuantProd`, `RotorQuantKVCache`) using Clifford algebra + Lloyd-Max.
	- `clifford.py`: Cl(3,0) multivector algebra and rotor operations.
	- `lloyd_max.py`: Lloyd-Max codebook solver (SciPy integration) and utilities.
	- `rotorquant_weights.py`: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports).
	- `quantize_qwen.py`: Quantize a Hugging Face model checkpoint into custom package format.
	- `run_inference.py`: Load quantized package, reconstruct model, run generation.
	- `validate_quantization.py`: Baseline vs quantized quality checks (logit cosine + token match).
	- `benchmark_scenarios.py`: Benchmark baseline and all quantized artifacts in `artifacts/`.
	- `runtime_int8.py`: Build/load dynamic INT8 runtime models (full or selective modules).
	- `runtime_rotor_fused.py`: Fused RotorQuant runtime path using packed-weight linear module.
	- `benchmark_runtime_vs_rotor.py`: Unified benchmark: FP32 vs RotorQuant package vs Rotor fused vs INT8 runtime.
	- `artifacts/*.json`: Saved reports/metrics from experiments.

	## Environment Setup

	```bash
	python3 -m venv .venv
	source .venv/bin/activate
	pip install --index-url https://download.pytorch.org/whl/cpu torch
	pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil
	```

	Or:

	```bash
	pip install -r requirements.txt
	```

	## How Quantization Is Done

	### 1) RotorQuant-style 3-bit weight codec (`rotorquant_weights.py`)

	For selected float tensors (typically `Linear.weight`):

	1. Optional rotor transform over triples of values (deterministic per tensor)
	2. Block-wise normalization
	1. mean-center block
	2. scale by max-abs per block
	3. 3-bit scalar quantization to 8-level codebook
	4. Packed 3-bit index serialization
	5. Stored metadata for dequantization (scales, centers, codebook, shape info)

	Supported options include:

	- `--block-size` (example: 128 or 64)
	- `--rowwise`
	- `--include-name-contains` / `--skip-name` selection
	- `--lowrank-rank` residual correction
	- `--outlier-frac` residual outlier preservation

	### 2) Rotor fused runtime (`runtime_rotor_fused.py`)

	- Replaces selected `nn.Linear` modules with `FusedRotorLinear`.
	- Reads packed 3-bit weights directly.
	- Decodes on demand and caches decoded weight for repeated use.
	- Avoids full eager dequantization at load for quantized layers.

	### 3) Dynamic INT8 runtime (`runtime_int8.py`)

	- Uses PyTorch dynamic quantization (`nn.Linear -> qint8`) as a runtime speed baseline.
	- Supports full-model or selective-module quantization (`--include-name-contains mlp.`).

	## Reproducible Commands

	### A) Quantize (RotorQuant package)

	Example: MLP-only, 3-bit, block size 64:

	```bash
	python quantize_qwen.py \
	--model-id Qwen/Qwen2.5-0.5B-Instruct \
	--output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
	--report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \
	--bits 3 \
	--block-size 64 \
	--dtype float32 \
	--include-name-contains mlp.
	```

	### B) Inference from quantized package

	```bash
	python run_inference.py \
	--quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
	--prompt "Explain quantization in one paragraph." \
	--max-new-tokens 64
	```

	### C) Quality validation

	```bash
	python validate_quantization.py \
	--quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
	--max-new-tokens 48 \
	--dtype float32
	```

	### D) Build dynamic INT8 runtime models

	Full dynamic INT8:

	```bash
	python runtime_int8.py build \
	--model-id Qwen/Qwen2.5-0.5B-Instruct \
	--out artifacts/qwen2.5-0.5b-dynamic-int8.pt \
	--meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json
	```

	Selective dynamic INT8 (MLP-only):

	```bash
	python runtime_int8.py build \
	--model-id Qwen/Qwen2.5-0.5B-Instruct \
	--out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
	--meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \
	--include-name-contains mlp.
	```

	### E) Run Rotor fused runtime

	```bash
	python runtime_rotor_fused.py run \
	--pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
	--prompt "Explain quantization in one paragraph." \
	--max-new-tokens 64
	```

	### F) Benchmark all quantized artifacts in `artifacts/`

	```bash
	python benchmark_scenarios.py \
	--model-id Qwen/Qwen2.5-0.5B-Instruct \
	--artifacts-dir artifacts \
	--max-new-tokens 64 \
	--dtype float32 \
	--out artifacts/benchmark_results.json
	```

	### G) Unified runtime benchmark (FP32 vs Rotor pkg vs Rotor fused vs INT8)

	```bash
	python benchmark_runtime_vs_rotor.py \
	--model-id Qwen/Qwen2.5-0.5B-Instruct \
	--rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \
	--fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
	--int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
	--max-new-tokens 64 \
	--out artifacts/runtime_benchmark_with_fused.json
	```

	## Reported Metrics

	### Quality metric example (RotorQuant package)

	For `qwen2.5-0.5b-rotorq3-mlp-only-b64` (`~3.50 bits/weight` on quantized tensors):

	- Mean cosine similarity (last-token logits): `0.868771`
	- Mean greedy token-match ratio: `0.0781`

	(From `validate_quantization.py` run on 4 prompts.)

	### Runtime benchmark summary

	From `artifacts/runtime_benchmark_with_fused.json` (CPU, `max_new_tokens=64`, 4 prompts):

	\| Scenario \| Load (s) \| First Token (s) \| Generate (s) \| Decode tok/s \| RSS after load (GB) \| Token Match vs FP32 \|
	\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| FP32 baseline \| 6.495 \| 0.159 \| 6.547 \| 9.78 \| 2.28 \| 1.0000 \|
	\| RotorQuant package (dequantized load) \| 6.533 \| 0.152 \| 6.707 \| 9.55 \| 2.71 \| 0.0820 \|
	\| RotorQuant fused runtime \| 3.895 \| 0.162 \| 6.190 \| 10.34 \| 2.89 \| 0.0039 \|
	\| Dynamic INT8 runtime (MLP-only) \| 4.646 \| 0.097 \| 4.059 \| 15.77 \| 2.65 \| 0.0156 \|