# Q&C: When Quantization Meets Cache in Efficient Image Generation

**Unofficial implementation** of the paper [Q&C: When Quantization Meets Cache in Efficient Image Generation](https://arxiv.org/abs/2503.02508).

> The official code was announced at `https://github.com/xinding-sys/Quant-Cache` but is not yet publicly available. This repo provides a working implementation based on the paper's methodology sections.

## Overview

This repo implements the Q&C method for accelerating Diffusion Transformers (DiTs) by **combining post-training quantization (PTQ) with feature caching**. The paper identifies two key challenges that arise when these techniques are combined and proposes a solution for each:

1. **TAP** (Temporal-Aware Parallel Clustering): improves calibration dataset selection for PTQ when caching reduces sample diversity
2. **VC** (Variance Compensation): corrects the exposure bias amplified by combining quantization and caching

## Architecture

```
qandc/
├── __init__.py                  # Package exports
├── quantizer.py                 # Uniform PTQ (W8A8/W4A8, Eq. 1-3)
├── cache.py                     # FORA-style feature caching (Section 2.1)
├── tap.py                       # TAP calibration selection (Section 3.1, Algorithm 1)
└── variance_compensation.py     # VC exposure bias correction (Section 3.2, Eq. 9-12)

run_experiment.py                # Self-contained experiment runner
results/
└── experiment_summary.json      # Our experimental results
```

## Quick Start

```bash
pip install torch torchvision diffusers transformers accelerate scipy scikit-learn
```

```python
from diffusers import DiTPipeline, DDPMScheduler
from qandc import quantize_model, apply_cache_to_dit, reset_all_caches

# Load DiT-XL/2
pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256")
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Apply W8A8 quantization (170 Linear layers)
pipe.transformer = quantize_model(pipe.transformer, w_bits=8, a_bits=8,
                                  skip_patterns=["pos_embed", "norm"])

# Apply feature caching (recompute every 5th step)
apply_cache_to_dit(pipe.transformer, cache_interval=5)

# Reset the caches before each generation run, then generate images
reset_all_caches(pipe.transformer)
output = pipe(class_labels=[207], num_inference_steps=50, guidance_scale=4.0)
image = output.images[0]  # PIL image
```

## Experiment Results

We ran **6 ablation experiments** on DiT-XL/2-256 with the DDPM scheduler to check the paper's claims. All runs used a CPU, 16 images, and 20 steps. This is a reduced scale chosen to fit free compute; the paper uses 10K images and 50 steps on A100 GPUs.

| Experiment | Inception Score ↑ | Time/Image (s) ↓ | Speedup | Description |
|:-----------|------------------:|------------------:|--------:|:------------|
| **FP Baseline** | **13.45** | 65.52 | 1.00x | Full-precision DiT-XL/2, DDPM 20 steps |
| Quant Only (W8A8) | 7.53 | 56.54 | 1.16x | Uniform PTQ, no caching |
| Cache Only (N=4) | 1.79 | 17.10 | 3.83x | FORA-style caching, no quantization |
| Q&C Naive | 1.84 | 20.89 | 3.14x | Quant + Cache, no TAP/VC |
| Q&C + TAP | 1.84 | 19.69 | 3.33x | + Temporal-Aware Parallel Clustering |
| **Q&C Full (TAP+VC)** | **2.27** | 21.12 | **3.10x** | Full method with Variance Compensation |

### Key Observations

1. **Caching provides a dramatic speedup** (3.83x) but severely degrades quality (IS 13.45 → 1.79), confirming the paper's Challenge 1
2. **The naive Q+C combination is catastrophic** (IS drops from 13.45 → 1.84), confirming Challenge 2
3. **Q&C Full (TAP+VC) improves IS** (1.84 → 2.27, +23%) over the naive combination. TAP alone leaves IS unchanged at 1.84, so the gain comes from VC, which suggests that VC helps correct exposure bias
4. **Q&C + TAP runs slightly faster than the naive combination** (19.69 s vs 20.89 s per image). We attribute this to better calibration data selection, though the gap is small at this scale
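
The Speedup column and the +23% figure follow directly from the table. The snippet below is a small worked check, assuming speedup is defined as baseline time per image divided by method time per image; the numbers are copied from the results table.

```python
# Time per image in seconds, copied from the results table.
time_per_image = {
    "FP Baseline": 65.52,
    "Quant Only (W8A8)": 56.54,
    "Cache Only (N=4)": 17.10,
    "Q&C Naive": 20.89,
    "Q&C + TAP": 19.69,
    "Q&C Full (TAP+VC)": 21.12,
}

baseline = time_per_image["FP Baseline"]
# Assumed definition: speedup = baseline time / method time.
speedup = {name: round(baseline / t, 2) for name, t in time_per_image.items()}

# Relative Inception Score gain of the full method over the naive combination.
is_naive, is_full = 1.84, 2.27
relative_gain = (is_full - is_naive) / is_naive

for name, s in speedup.items():
    print(f"{name:<20} {s:.2f}x")
print(f"IS gain over naive: {relative_gain:+.0%}")
```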

### Paper Reference (Table 1, ImageNet 256×256, W8A8, 50 steps)

| Method | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Speed |
|:-------|------:|-------:|-----:|------------:|------:|
| DDPM (FP) | 5.22 | 17.63 | 237.8 | 0.8056 | 5× |
| PTQ4DiT | 5.45 | 19.50 | 250.68 | 0.7882 | 10× |
| **Q&C (paper)** | **5.43** | **19.52** | **250.68** | **0.7895** | **12.7×** |

> **Note:** Our numbers are NOT directly comparable to the paper's because we use (1) only 16 images (paper: 10K), (2) 20 steps (paper: 50), (3) CPU execution, and (4) an aggressive cache interval of 4 (the paper optimizes this). The purpose is to validate the *relative trends* between methods.

## Implementation Details

### Quantization (quantizer.py)
- **Uniform symmetric quantization** following Eq. 1-3 of the paper
- **Channel-wise** quantization for weights (one scale per output channel)
- **Tensor-wise** quantization for activations (one scale per tensor)
- Supports W8A8 (8-bit weights, 8-bit activations) and W4A8 (4-bit weights, 8-bit activations)
- Replaces all `nn.Linear` layers except those in normalization and positional-embedding modules
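
To make the channel-wise versus tensor-wise distinction concrete, here is a minimal sketch of uniform symmetric fake quantization. It uses plain Python lists so it runs without dependencies. The function names are hypothetical, and the exact rounding and clipping used in Eq. 1-3 and in `quantizer.py` may differ.

```python
def symmetric_quantize(values, n_bits=8):
    """Fake-quantize a list of floats with one shared symmetric scale."""
    q_max = 2 ** (n_bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the integer grid, clamp to [-q_max, q_max], map back to floats.
    q = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return [qi * scale for qi in q], scale


def quantize_weight_per_channel(weight, n_bits=8):
    """Channel-wise: each row (output channel) gets its own scale."""
    return [symmetric_quantize(row, n_bits)[0] for row in weight]


def quantize_activation_per_tensor(activation, n_bits=8):
    """Tensor-wise: one scale shared by the whole activation tensor."""
    flat = [v for row in activation for v in row]
    dequant, _ = symmetric_quantize(flat, n_bits)
    width = len(activation[0])
    return [dequant[i:i + width] for i in range(0, len(dequant), width)]


# The second row has much smaller values; its own scale keeps it precise.
weight = [[0.50, -1.00, 0.25], [0.01, 0.02, -0.03]]
w_q = quantize_weight_per_channel(weight, n_bits=8)
a_q = quantize_activation_per_tensor([[1.0, -2.0], [0.5, 4.0]], n_bits=8)
print(w_q)
print(a_q)
```

With a single scale for the whole weight matrix, the small second row would be rounded on a grid sized for the first row, which is the usual motivation for per-channel weight scales.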

### Feature Caching (cache.py)
- **FORA-style static caching**: wraps each transformer block
- At every N-th step: run the full forward pass and cache the residual output
- For the following N-1 steps: reuse the cached residual, skipping the expensive MHSA and FFN computation
- `__getattr__` delegation keeps the wrapper transparent to DiT's conditioning code
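
The wrapping pattern described above can be sketched as follows. This is an illustration only, not the code in `cache.py`: `CachedBlock` and `ToyBlock` are hypothetical names, inputs are plain floats rather than tensors, and the residual is assumed to be the block output minus its input.

```python
class CachedBlock:
    """Wraps a block and recomputes its residual only every `interval` steps."""

    def __init__(self, block, interval):
        self.block = block
        self.interval = interval
        self.step = 0
        self.cached_residual = None

    def __getattr__(self, name):
        # Called only when normal lookup fails, so the wrapper's own
        # attributes win and everything else is forwarded to the block.
        return getattr(self.__dict__["block"], name)

    def reset(self):
        """Clear the cache before starting a new sampling run."""
        self.step = 0
        self.cached_residual = None

    def __call__(self, x):
        if self.step % self.interval == 0 or self.cached_residual is None:
            out = self.block(x)               # full forward pass
            self.cached_residual = out - x    # store what the block added
        else:
            out = x + self.cached_residual    # skip the block, reuse residual
        self.step += 1
        return out


class ToyBlock:
    """Stand-in for a transformer block; counts how often it really runs."""
    hidden_size = 1152  # hypothetical attribute read through the wrapper

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return x + 0.5 * x


block = ToyBlock()
cached = CachedBlock(block, interval=4)
outputs = [cached(float(t)) for t in range(8)]
print(block.calls)         # full forward passes over 8 steps
print(cached.hidden_size)  # forwarded by __getattr__
```

Over 8 steps with an interval of 4, the wrapped block runs on steps 0 and 4 only; the other six steps reuse the stored residual.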

### TAP (tap.py)
- **Spatial similarity**: cosine similarity between flattened latent features (Eq. 7)
- **Temporal similarity**: Gaussian kernel on timestep distances (Eq. 8)
- **Combined similarity**: `A_final = α·A_spatial + (1-α)·A_temporal` (Eq. 6)
- **Parallel subsampling**: m=3 independent subsamples, each 1/20 of the full dataset
- **Spectral clustering** on each subsample → co-occurrence matrix → final KMeans
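
As a rough illustration of the combined similarity, the sketch below builds `A_final` for a handful of samples in plain Python. The Gaussian kernel form `exp(-(t_i - t_j)^2 / (2σ^2))` and the default values of `alpha` and `sigma` are assumptions made for this example; the paper's Eq. 8 and the settings in `tap.py` may be parameterized differently. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two flattened feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_similarity(t_i, t_j, sigma):
    """Gaussian kernel on the timestep distance (assumed form)."""
    return math.exp(-((t_i - t_j) ** 2) / (2.0 * sigma ** 2))


def combined_affinity(features, timesteps, alpha=0.5, sigma=10.0):
    """A_final = alpha * A_spatial + (1 - alpha) * A_temporal."""
    n = len(features)
    return [
        [
            alpha * cosine_similarity(features[i], features[j])
            + (1.0 - alpha) * temporal_similarity(timesteps[i], timesteps[j], sigma)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Samples 0 and 1 share features and timestep; sample 2 differs in both.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
timesteps = [0, 0, 100]
A = combined_affinity(features, timesteps)
print(A[0][1], A[0][2])
```

The resulting matrix is what a spectral clustering step would take as a precomputed affinity on each subsample.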

### Variance Compensation (variance_compensation.py)
- Implements both the **full analytical K_t** (Eq. 12) and a **simplified version**
- Corrects the variance shift in later denoising stages (t > T/2)
- `x_corrected = μ + K_t · (x̂ - μ)`, where K_t is the per-channel, per-timestep correction factor
- Calibrated offline by running a few samples through the quantized+cached pipeline
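
The correction formula rescales the spread of a sample around its mean. The sketch below applies it to one channel at one timestep, taking μ as the mean of `x_hat`. `estimate_k` is a hypothetical stand-in that uses the ratio of reference to degraded standard deviation; it is not the analytical K_t of Eq. 12 and may not match the repo's simplified version.

```python
import statistics


def estimate_k(reference, degraded):
    """Hypothetical simplified K_t: ratio of reference to degraded std."""
    return statistics.pstdev(reference) / statistics.pstdev(degraded)


def variance_compensate(x_hat, k_t):
    """x_corrected = mu + K_t * (x_hat - mu), with mu the mean of x_hat."""
    mu = statistics.fmean(x_hat)
    return [mu + k_t * (v - mu) for v in x_hat]


# Toy calibration data for a single channel at a single timestep.
reference = [-2.0, -1.0, 0.0, 1.0, 2.0]   # full-precision values
degraded = [2.0, 2.5, 3.0, 3.5, 4.0]      # quantized+cached values, spread halved

k_t = estimate_k(reference, degraded)
corrected = variance_compensate(degraded, k_t)
print(k_t)
print(corrected)
```

Because the formula scales deviations from μ, the mean of the sample is left unchanged and only its variance is adjusted.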

## Running Full Experiments

For GPU-scale experiments matching the paper:

```python
# Modify run_experiment.py settings:
args = {
    "num_steps": 50,            # Paper: 50/100/250
    "num_images": 10000,        # Paper: 10,000
    "batch_size": 16,           # GPU batch size
    "cache_interval": 5,        # Tune for quality vs speed
    "num_calib_samples": 800,   # Paper recommendation
    "tap_clusters": 100,        # Paper setting
}
```

## Citation

```bibtex
@article{qandc2025,
  title={Q\&C: When Quantization Meets Cache in Efficient Image Generation},
  author={Ding, Xin and others},
  journal={arXiv preprint arXiv:2503.02508},
  year={2025}
}
```

## License

This implementation is provided for research purposes. The DiT model (`facebook/DiT-XL-2-256`) is released under the CC-BY-NC-4.0 license.