# RotorQuant Weights + Runtime Quantization for Qwen2.5-0.5B-Instruct

This repository implements and benchmarks multiple quantization/deployment paths for `Qwen/Qwen2.5-0.5B-Instruct`, including:

1. RotorQuant-style 3-bit weight quantization (custom codec)
2. Quantized model loading + text generation with the official Qwen chat template
3. Validation against FP32 baseline
4. Runtime acceleration paths:
    1. RotorQuant fused runtime (packed-weight linear)
    2. Dynamic INT8 runtime baseline (for speed comparison)

The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way.

## Model Reference

- Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- This repo uses the official chat template via `tokenizer.apply_chat_template(...)`.

## Repository Layout

- `rotorquant.py`: Original RotorQuant core classes (`RotorQuantMSE`, `RotorQuantProd`, `RotorQuantKVCache`) using Clifford algebra + Lloyd-Max.
- `clifford.py`: Cl(3,0) multivector algebra and rotor operations.
- `lloyd_max.py`: Lloyd-Max codebook solver (SciPy integration) and utilities.
- `rotorquant_weights.py`: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports).
- `quantize_qwen.py`: Quantize a Hugging Face model checkpoint into a custom package format.
- `run_inference.py`: Load quantized package, reconstruct model, run generation.
- `validate_quantization.py`: Baseline vs quantized quality checks (logit cosine + token match).
- `benchmark_scenarios.py`: Benchmark baseline and all quantized artifacts in `artifacts/`.
- `runtime_int8.py`: Build/load dynamic INT8 runtime models (full or selective modules).
- `runtime_rotor_fused.py`: Fused RotorQuant runtime path using packed-weight linear module.
- `benchmark_runtime_vs_rotor.py`: Unified benchmark: FP32 vs RotorQuant package vs Rotor fused vs INT8 runtime.
- `artifacts/*.json`: Saved reports/metrics from experiments.

## Environment Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil
```

Or:

```bash
pip install -r requirements.txt
```

## How Quantization Is Done

### 1) RotorQuant-style 3-bit weight codec (`rotorquant_weights.py`)

For selected float tensors (typically `Linear.weight`):

1. Optional rotor transform over triples of values (deterministic per tensor)
2. Block-wise normalization
    1. mean-center block
    2. scale by max-abs per block
3. 3-bit scalar quantization to 8-level codebook
4. Packed 3-bit index serialization
5. Stored metadata for dequantization (scales, centers, codebook, shape info)
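
Steps 2 to 5 can be sketched in plain Python as follows. This is an illustration, not the code in `rotorquant_weights.py`: it assumes a uniform 8-level codebook on `[-1, 1]`, skips the optional rotor transform, and every function name is hypothetical.

```python
# Illustrative sketch only; not the code in rotorquant_weights.py.
# Assumption: a uniform 8-level codebook on [-1, 1]; the rotor transform is skipped.

CODEBOOK = [-1.0 + 2.0 * i / 7.0 for i in range(8)]  # 8 levels -> 3-bit indices


def quantize_block(block):
    """Mean-center, scale by max-abs, and map each value to its nearest level."""
    center = sum(block) / len(block)
    centered = [v - center for v in block]
    scale = max(abs(v) for v in centered) or 1.0  # avoid dividing by zero
    indices = [
        min(range(8), key=lambda k: abs(c / scale - CODEBOOK[k])) for c in centered
    ]
    return indices, scale, center


def dequantize_block(indices, scale, center):
    """Invert the normalization using the stored scale and center."""
    return [CODEBOOK[i] * scale + center for i in indices]


def pack_3bit(indices):
    """Pack 3-bit indices into bytes, least-significant bits first."""
    acc = 0
    for pos, idx in enumerate(indices):
        acc |= (idx & 0b111) << (3 * pos)
    n_bytes = (3 * len(indices) + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def unpack_3bit(packed, count):
    """Recover `count` 3-bit indices from the packed bytes."""
    acc = int.from_bytes(packed, "little")
    return [(acc >> (3 * pos)) & 0b111 for pos in range(count)]


weights = [0.12, -0.40, 0.33, 0.05, -0.18, 0.27, -0.02, 0.41]
indices, scale, center = quantize_block(weights)
packed = pack_3bit(indices)
restored = dequantize_block(unpack_3bit(packed, len(weights)), scale, center)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(packed), "bytes for", len(weights), "weights; max abs error:", max_error)
```

With a uniform codebook, the reconstruction error per value is bounded by half a codebook step times the block scale. A Lloyd-Max codebook, as in `lloyd_max.py`, would place the levels non-uniformly instead.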

Supported options include:

- `--block-size` (example: 128 or 64)
- `--rowwise`
- `--include-name-contains` / `--skip-name` selection
- `--lowrank-rank` residual correction
- `--outlier-frac` residual outlier preservation
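
Block size trades metadata overhead against quantization error, because each block carries its own scale and center. The estimate below is a back-of-envelope sketch: it assumes one 16-bit scale and one 16-bit center per block, which is an assumption rather than something read from the codec, and it ignores the codebook, low-rank, and outlier storage.

```python
# Back-of-envelope estimate; the per-block metadata sizes are assumptions,
# not values read from rotorquant_weights.py.

def effective_bits_per_weight(index_bits, block_size, metadata_bits_per_block):
    """Index bits plus per-block metadata amortized over the block."""
    return index_bits + metadata_bits_per_block / block_size


# Assumed: one 16-bit scale and one 16-bit center stored per block.
ASSUMED_METADATA_BITS = 16 + 16

for block_size in (64, 128):
    bits = effective_bits_per_weight(3, block_size, ASSUMED_METADATA_BITS)
    print(f"block_size={block_size}: {bits:.2f} bits/weight")
```

Under these assumptions, a block size of 64 gives 3.50 bits/weight and 128 gives 3.25. The first value is consistent with the `~3.50 bits/weight` quoted for the `b64` artifact, though the actual storage layout may differ.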

### 2) Rotor fused runtime (`runtime_rotor_fused.py`)

- Replaces selected `nn.Linear` modules with `FusedRotorLinear`.
- Reads packed 3-bit weights directly.
- Decodes on demand and caches the decoded weight for repeated use.
- Avoids full eager dequantization at load for quantized layers.
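
The decode-on-demand and caching behavior can be pictured with a small pure-Python sketch. The real `FusedRotorLinear` is a PyTorch module whose internals are not shown here; `LazyPackedLinear` and `toy_decode` below are hypothetical stand-ins.

```python
# Conceptual sketch; the real FusedRotorLinear is an nn.Module and its
# internals are not shown in this README. All names below are hypothetical.

class LazyPackedLinear:
    """Keeps weights packed and decodes them only when first needed."""

    def __init__(self, packed_rows, decode_row):
        self._packed_rows = packed_rows  # compact storage, one entry per output row
        self._decode_row = decode_row    # callable: packed row -> list of floats
        self._cache = None               # decoded weight, filled on first use
        self.decode_calls = 0

    def _weight(self):
        if self._cache is None:
            self.decode_calls += 1
            self._cache = [self._decode_row(row) for row in self._packed_rows]
        return self._cache

    def forward(self, x):
        """Compute y = W x for a single input vector x."""
        return [sum(w * v for w, v in zip(row, x)) for row in self._weight()]


def toy_decode(row):
    """Toy decoder: codes in 0..7 map to (code - 4) * scale."""
    scale, codes = row
    return [(c - 4) * scale for c in codes]


layer = LazyPackedLinear(
    [(0.5, bytes([5, 3, 6])), (0.25, bytes([4, 7, 0]))], toy_decode
)
first = layer.forward([1.0, 2.0, 3.0])
second = layer.forward([1.0, 2.0, 3.0])
print(first, layer.decode_calls)
```

Nothing is decoded at construction time, and the second `forward` call reuses the cached weight. The cost of this design is that the cache holds a full-precision copy once a layer has been used.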

### 3) Dynamic INT8 runtime (`runtime_int8.py`)

- Uses PyTorch dynamic quantization (`nn.Linear -> qint8`) as a runtime speed baseline.
- Supports full-model or selective-module quantization (for example, `--include-name-contains mlp.` restricts it to MLP layers).
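
As a rough picture of what dynamic quantization does for one linear layer: weights are converted to INT8 once, while activations are quantized at call time. The sketch below is simplified and assumes symmetric per-tensor scales; PyTorch's kernels differ in detail, and the function names are hypothetical.

```python
# Simplified illustration of dynamic quantization for one linear layer.
# Assumption: symmetric per-tensor scales; PyTorch's kernels differ in detail.

def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers with a single max-abs scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    scale = (max(abs(v) for v in values) or 1.0) / qmax
    return [round(v / scale) for v in values], scale


def dynamic_int8_linear(weight_q, weight_scale, x):
    """Quantize activations at call time, accumulate in integers, then rescale."""
    x_q, x_scale = quantize_symmetric(x)
    out = []
    for row in weight_q:
        acc = sum(w * a for w, a in zip(row, x_q))  # integer accumulation
        out.append(acc * weight_scale * x_scale)
    return out


weight = [[0.20, -0.50, 0.10], [0.40, 0.30, -0.25]]
flat_q, w_scale = quantize_symmetric([w for row in weight for w in row])
weight_q = [flat_q[i:i + 3] for i in range(0, len(flat_q), 3)]  # done once, offline

x = [1.0, -2.0, 0.5]
approx = dynamic_int8_linear(weight_q, w_scale, x)
exact = [sum(w * a for w, a in zip(row, x)) for row in weight]
print(approx, exact)
```

The integer result tracks the float result closely here because 8 bits leave a small rounding error per value.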

## Reproducible Commands

### A) Quantize (RotorQuant package)

Example: MLP-only, 3-bit, block size 64:

```bash
python quantize_qwen.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \
  --bits 3 \
  --block-size 64 \
  --dtype float32 \
  --include-name-contains mlp.
```

### B) Inference from quantized package

```bash
python run_inference.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64
```

### C) Quality validation

```bash
python validate_quantization.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --max-new-tokens 48 \
  --dtype float32
```

### D) Build dynamic INT8 runtime models

Full dynamic INT8:

```bash
python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json
```

Selective dynamic INT8 (MLP-only):

```bash
python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \
  --include-name-contains mlp.
```

### E) Run Rotor fused runtime

```bash
python runtime_rotor_fused.py run \
  --pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64
```

### F) Benchmark all quantized artifacts in `artifacts/`

```bash
python benchmark_scenarios.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --artifacts-dir artifacts \
  --max-new-tokens 64 \
  --dtype float32 \
  --out artifacts/benchmark_results.json
```

### G) Unified runtime benchmark (FP32 vs Rotor pkg vs Rotor fused vs INT8)

```bash
python benchmark_runtime_vs_rotor.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \
  --fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --max-new-tokens 64 \
  --out artifacts/runtime_benchmark_with_fused.json
```

## Reported Metrics

### Quality metric example (RotorQuant package)

For `qwen2.5-0.5b-rotorq3-mlp-only-b64` (`~3.50 bits/weight` on quantized tensors):

- Mean cosine similarity (last-token logits): `0.868771`
- Mean greedy token-match ratio: `0.0781`

(From `validate_quantization.py` run on 4 prompts.)
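
For orientation, the two metrics can be computed along these lines. This is a minimal sketch that assumes a position-wise token comparison over the common length; the exact definitions in `validate_quantization.py` may differ, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def token_match_ratio(reference_ids, candidate_ids):
    """Fraction of positions where the generated token IDs agree."""
    length = min(len(reference_ids), len(candidate_ids))
    if length == 0:
        return 0.0
    matches = sum(1 for r, c in zip(reference_ids, candidate_ids) if r == c)
    return matches / length


# Toy inputs, not taken from the reported runs.
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
print(token_match_ratio([10, 11, 12, 13], [10, 11, 99, 98]))
```

Because greedy decoding conditions on earlier tokens, one early mismatch can change every token after it, so a position-wise match ratio can be low even when the last-token logits are fairly similar.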

### Runtime benchmark summary

From `artifacts/runtime_benchmark_with_fused.json` (CPU, `max_new_tokens=64`, 4 prompts):

| Scenario | Load (s) | First Token (s) | Generate (s) | Decode tok/s | RSS after load (GB) | Token Match vs FP32 |
|---|---:|---:|---:|---:|---:|---:|
| FP32 baseline | 6.495 | 0.159 | 6.547 | 9.78 | 2.28 | 1.0000 |
| RotorQuant package (dequantized load) | 6.533 | 0.152 | 6.707 | 9.55 | 2.71 | 0.0820 |
| RotorQuant fused runtime | 3.895 | 0.162 | 6.190 | 10.34 | 2.89 | 0.0039 |
| Dynamic INT8 runtime (MLP-only) | 4.646 | 0.097 | 4.059 | 15.77 | 2.65 | 0.0156 |
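
To read the decode throughput relative to the baseline, the table values can be turned into speedup ratios. The snippet hardcodes the `Decode tok/s` column rather than parsing the JSON report, whose schema is not described here.

```python
# Values copied from the "Decode tok/s" column of the table above.
decode_tok_per_s = {
    "FP32 baseline": 9.78,
    "RotorQuant package (dequantized load)": 9.55,
    "RotorQuant fused runtime": 10.34,
    "Dynamic INT8 runtime (MLP-only)": 15.77,
}

baseline = decode_tok_per_s["FP32 baseline"]
speedup = {name: rate / baseline for name, rate in decode_tok_per_s.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

On these numbers, the dynamic INT8 path decodes about 1.61x as fast as FP32, the fused Rotor runtime about 1.06x, and the dequantized RotorQuant package about 0.98x.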