	# Q&C: When Quantization Meets Cache in Efficient Image Generation

	Unofficial implementation of the paper: [Q&C: When Quantization Meets Cache in Efficient Image Generation](https://arxiv.org/abs/2503.02508)

	> The official code was announced at `https://github.com/xinding-sys/Quant-Cache` but is not yet publicly available. This repo provides a working implementation based on the paper's methodology sections.

	## 📋 Overview

	This repo implements the Q&C method for accelerating Diffusion Transformers (DiTs) by combining post-training quantization with feature caching. The paper identifies two key challenges when combining these techniques and proposes solutions:

	1. TAP (Temporal-Aware Parallel Clustering) — Improves calibration dataset selection for PTQ when caching reduces sample diversity
	2. VC (Variance Compensation) — Corrects exposure bias amplified by the quantization+cache combination

	## 🏗️ Architecture

	```
	qandc/
	├── __init__.py # Package exports
	├── quantizer.py # Uniform PTQ (W8A8/W4A8, Eq 1-3)
	├── cache.py # FORA-style feature caching (Section 2.1)
	├── tap.py # TAP calibration selection (Section 3.1, Algorithm 1)
	└── variance_compensation.py # VC exposure bias correction (Section 3.2, Eq 9-12)

	run_experiment.py # Self-contained experiment runner
	results/
	└── experiment_summary.json # Our experimental results
	```

	## 🚀 Quick Start

	```bash
	pip install torch torchvision diffusers transformers accelerate scipy scikit-learn
	```

	```python
	from diffusers import DiTPipeline, DDPMScheduler
	from qandc import quantize_model, apply_cache_to_dit, reset_all_caches

	# Load DiT-XL/2
	pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256")
	pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
	pipe = pipe.to("cuda")

	# Apply W8A8 quantization (170 Linear layers)
	pipe.transformer = quantize_model(
	    pipe.transformer, w_bits=8, a_bits=8, skip_patterns=["pos_embed", "norm"]
	)

	# Apply feature caching (recompute every 5th step)
	apply_cache_to_dit(pipe.transformer, cache_interval=5)

	# Generate images
	reset_all_caches(pipe.transformer)
	output = pipe(class_labels=[207], num_inference_steps=50, guidance_scale=4.0)
	```

	## 📊 Experiment Results

	We ran six configurations on DiT-XL/2-256 with the DDPM scheduler to check the paper's claims at small scale. All runs used a CPU, 16 images, and 20 sampling steps, a reduction made to fit free compute; the paper uses 10K images and 50 steps on A100 GPUs.

	| Experiment | Inception Score ↑ | Time/Image (s) ↓ | Speedup | Description |
	|:-----------|------------------:|------------------:|--------:|:------------|
	| FP Baseline | 13.45 | 65.52 | 1.00x | Full-precision DiT-XL/2, DDPM 20 steps |
	| Quant Only (W8A8) | 7.53 | 56.54 | 1.16x | Uniform PTQ, no caching |
	| Cache Only (N=4) | 1.79 | 17.10 | 3.83x | FORA-style caching, no quantization |
	| Q&C Naive | 1.84 | 20.89 | 3.14x | Quant + Cache, no TAP/VC |
	| Q&C + TAP | 1.84 | 19.69 | 3.33x | + Temporal-Aware Parallel Clustering |
	| Q&C Full (TAP+VC) | 2.27 | 21.12 | 3.10x | Full method with Variance Compensation |

	### Key Observations

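	1. Caching alone gives the largest speedup (3.83x) but severely degrades quality at this cache interval (IS 13.45 → 1.79), which is consistent with the paper's Challenge 1.
	2. The naive Q+C combination is catastrophic (IS 13.45 → 1.84), which is consistent with Challenge 2. Note that caching alone (IS 1.79) already accounts for most of this drop.
	3. Q&C Full (TAP+VC) raises IS from 1.84 to 2.27 (+23%) over the naive combination, which suggests that VC helps correct exposure bias, although 16 images are too few for a firm conclusion.
	4. Q&C + TAP leaves IS unchanged at 1.84. Its slightly lower time per image (19.69 s vs 20.89 s) is probably run-to-run variation, since TAP changes calibration data selection rather than inference cost.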
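
The Speedup column and the +23% figure can be re-derived from the table. The short check below uses only the numbers reported above and assumes speedup is defined as baseline time per image divided by each method's time per image.

```python
# (Inception Score, time per image in seconds), copied from the results table.
results = {
    "FP Baseline": (13.45, 65.52),
    "Quant Only (W8A8)": (7.53, 56.54),
    "Cache Only (N=4)": (1.79, 17.10),
    "Q&C Naive": (1.84, 20.89),
    "Q&C + TAP": (1.84, 19.69),
    "Q&C Full (TAP+VC)": (2.27, 21.12),
}

baseline_time = results["FP Baseline"][1]

# Speedup = baseline time per image / method time per image.
speedups = {name: round(baseline_time / t, 2) for name, (_, t) in results.items()}

# Relative IS change of the full method over the naive combination.
is_naive = results["Q&C Naive"][0]
is_full = results["Q&C Full (TAP+VC)"][0]
is_gain_pct = 100 * (is_full - is_naive) / is_naive

for name, speedup in speedups.items():
    print(f"{name:<20} {speedup:.2f}x")
print(f"IS gain, full vs naive: {is_gain_pct:.1f}%")
```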

	### Paper Reference (Table 1, ImageNet 256×256, W8A8, 50 steps)

	| Method | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Speed |
	|:-------|------:|-------:|-----:|------------:|------:|
	| DDPM (FP) | 5.22 | 17.63 | 237.8 | 0.8056 | 5× |
	| PTQ4DiT | 5.45 | 19.50 | 250.68 | 0.7882 | 10× |
	| Q&C (paper) | 5.43 | 19.52 | 250.68 | 0.7895 | 12.7× |

	> Note: Our numbers are NOT directly comparable to the paper's because: (1) we use only 16 images (paper: 10K), (2) 20 steps (paper: 50), (3) CPU execution, and (4) aggressive cache interval of 4 (paper optimizes this). The purpose is to validate the relative trends between methods.

	## 🔧 Implementation Details

	### Quantization (quantizer.py)
	- Uniform symmetric quantization following Eq 1-3 from the paper
	- Channel-wise quantization for weights (per output channel)
	- Tensor-wise quantization for activations
	- Supports W8A8 (8-bit weights, 8-bit activations) and W4A8
	- Replaces all `nn.Linear` layers except those whose names match the normalization and positional-embedding skip patterns
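
To make the per-channel versus per-tensor distinction concrete, here is a minimal sketch of a generic symmetric uniform quantizer in plain Python (no PyTorch needed). The function names are hypothetical and the scale rule `max|x| / (2^(b-1) - 1)` is an assumption; the exact form of Eq 1-3 and the code in `quantizer.py` may differ.

```python
def fake_quantize(values, n_bits):
    """Symmetric uniform quantize-dequantize of a flat list of floats."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # clamp to the integer grid
        dequantized.append(q * scale)                # map back to float
    return dequantized, scale


def quantize_weight_per_channel(weight, n_bits=8):
    """One scale per output channel (one row of an [out, in] matrix)."""
    return [fake_quantize(row, n_bits)[0] for row in weight]


def quantize_activation_per_tensor(activation, n_bits=8):
    """One scale shared by every element of the tensor."""
    flat = [v for row in activation for v in row]
    dequantized, _ = fake_quantize(flat, n_bits)
    width = len(activation[0])
    return [dequantized[i:i + width] for i in range(0, len(dequantized), width)]


# Toy matrix whose two rows have very different ranges.
weight = [[0.50, -0.25, 0.10], [4.00, -2.00, 1.00]]
w_q = quantize_weight_per_channel(weight, n_bits=8)
a_q = quantize_activation_per_tensor(weight, n_bits=8)

# The small-range row is reconstructed more accurately with its own scale.
print(abs(w_q[0][2] - 0.10), abs(a_q[0][2] - 0.10))
```

In this toy matrix the per-channel scale keeps the small-range row from being dominated by the large-range row, which is the usual motivation for quantizing weights per output channel.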

	### Feature Caching (cache.py)
	- FORA-style static caching: wraps each transformer block
	- At every N-th step: full forward pass + cache the residual output
	- For N-1 following steps: reuse the cached residual (skip expensive MHSA + FFN)
	- `__getattr__` delegation ensures transparency for DiT's conditioning code
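
The caching logic above can be sketched with a small wrapper. This is a simplified illustration rather than the code in `cache.py`: `CachedBlock` and `ToyBlock` are hypothetical names, inputs are scalars instead of tensors, and a single residual is cached per block (a FORA-style implementation may cache attention and feed-forward outputs separately).

```python
class ToyBlock:
    """Stand-in for a transformer block: output = x + residual(x)."""
    hidden_size = 4

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x + 0.1 * x  # the "expensive" residual branch


class CachedBlock:
    """Recomputes the block every `cache_interval` steps, else reuses the residual."""

    def __init__(self, block, cache_interval):
        self.block = block
        self.cache_interval = cache_interval
        self.step = 0
        self.cached_residual = None

    def reset(self):
        # Call before each new generation so step 0 always recomputes.
        self.step = 0
        self.cached_residual = None

    def __call__(self, x):
        if self.step % self.cache_interval == 0 or self.cached_residual is None:
            out = self.block(x)                # full forward pass
            self.cached_residual = out - x     # store only the residual
        else:
            out = x + self.cached_residual     # skip the block entirely
        self.step += 1
        return out

    def __getattr__(self, name):
        # Reached only when normal lookup fails: forward to the wrapped block.
        block = self.__dict__.get("block")
        if block is None:
            raise AttributeError(name)
        return getattr(block, name)


block = ToyBlock()
cached = CachedBlock(block, cache_interval=4)
outputs = [cached(float(x)) for x in range(1, 9)]  # 8 denoising steps
print(block.calls, cached.hidden_size)
```

With an interval of 4 and 8 steps, the wrapped block runs only at steps 0 and 4, while attribute lookups such as `cached.hidden_size` still reach the original block.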

	### TAP (tap.py)
	- Spatial similarity: cosine similarity between flattened latent features (Eq 7)
	- Temporal similarity: Gaussian kernel on timestep distances (Eq 8)
	- Combined similarity: `A_final = α·A_spatial + (1-α)·A_temporal` (Eq 6)
	- Parallel subsampling: m=3 independent subsamples, each 1/20 of full dataset
	- Spectral clustering on each subsample → co-occurrence matrix → final KMeans
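
As a rough illustration of how the combined similarity is assembled, the sketch below builds `A_final` for three toy samples in plain Python. The kernel width `sigma`, the value of `alpha`, and the function names are assumptions made for this example; the exact normalization in Eq 6-8 and in `tap.py` may differ.

```python
import math


def cosine_similarity(u, v):
    """Spatial term: cosine similarity between two flattened feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_similarity(t_i, t_j, sigma):
    """Temporal term: Gaussian kernel on the timestep distance."""
    return math.exp(-((t_i - t_j) ** 2) / (2 * sigma ** 2))


def combined_affinity(features, timesteps, alpha=0.5, sigma=100.0):
    """A_final = alpha * A_spatial + (1 - alpha) * A_temporal."""
    n = len(features)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            spatial = cosine_similarity(features[i], features[j])
            temporal = temporal_similarity(timesteps[i], timesteps[j], sigma)
            affinity[i][j] = alpha * spatial + (1 - alpha) * temporal
    return affinity


# Samples 0 and 1 are close in both feature space and time; sample 2 is not.
features = [[1.0, 0.0, 1.0], [1.0, 0.1, 0.9], [0.0, 1.0, 0.0]]
timesteps = [900, 880, 100]
A = combined_affinity(features, timesteps, alpha=0.5, sigma=100.0)

for row in A:
    print([round(value, 3) for value in row])
```

A matrix of this kind is what the spectral clustering step would consume for each subsample, for instance as a precomputed affinity.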

	### Variance Compensation (variance_compensation.py)
	- Implements both the full analytical K_t (Eq 12) and a simplified version
	- Corrects variance shift in later denoising stages (t > T/2)
	- `x_corrected = μ + K_t · (x̂ - μ)` where K_t is the per-channel, per-timestep correction factor
	- Calibrated offline using a few samples through the quantized+cached pipeline
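
The correction `x_corrected = μ + K_t · (x̂ - μ)` rescales deviations from the channel mean. The sketch below shows this for a single channel in plain Python. It assumes a simplified `K_t` equal to the ratio of reference to degraded standard deviation, which is not the analytical Eq 12, and all function names are hypothetical.

```python
import statistics


def estimate_k(reference, degraded):
    """Simplified K_t for one channel: std(reference) / std(degraded)."""
    return statistics.pstdev(reference) / statistics.pstdev(degraded)


def compensate(x_hat, k_t):
    """x_corrected = mu + k_t * (x_hat - mu), with mu the channel mean."""
    mu = statistics.fmean(x_hat)
    return [mu + k_t * (v - mu) for v in x_hat]


def apply_vc(x_hat, t, k_by_timestep, total_steps):
    """Apply the correction only where t > T/2 and a factor was calibrated."""
    if t > total_steps / 2 and t in k_by_timestep:
        return compensate(x_hat, k_by_timestep[t])
    return list(x_hat)


# Toy calibration data for one channel at timestep t = 800 (T = 1000).
reference = [-2.0, -1.0, 0.0, 1.0, 2.0]   # full-precision values
degraded = [-3.0, -1.5, 0.0, 1.5, 3.0]    # quantized+cached values, shifted variance

k_by_timestep = {800: estimate_k(reference, degraded)}
corrected = apply_vc(degraded, t=800, k_by_timestep=k_by_timestep, total_steps=1000)
untouched = apply_vc(degraded, t=200, k_by_timestep=k_by_timestep, total_steps=1000)

print(k_by_timestep[800], corrected)
```

The mean of the channel is preserved while its spread is pulled back to the reference, and timesteps outside the `t > T/2` range pass through unchanged.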

	## 🔬 Running Full Experiments

	For GPU-scale experiments matching the paper:

	```python
	# Modify run_experiment.py settings:
	args = {
	    "num_steps": 50,           # Paper: 50/100/250
	    "num_images": 10000,       # Paper: 10,000
	    "batch_size": 16,          # GPU batch size
	    "cache_interval": 5,       # Tune for quality vs speed
	    "num_calib_samples": 800,  # Paper recommendation
	    "tap_clusters": 100,       # Paper setting
	}
	```

	## 📝 Citation

	```bibtex
	@article{qandc2025,
	  title={Q\&C: When Quantization Meets Cache in Efficient Image Generation},
	  author={Ding, Xin and others},
	  journal={arXiv preprint arXiv:2503.02508},
	  year={2025}
	}
	```

	## 📄 License

	This implementation is provided for research purposes. The DiT model (`facebook/DiT-XL-2-256`) is released under the CC-BY-NC-4.0 license.