| # Q&C: When Quantization Meets Cache in Efficient Image Generation |
|
|
| **Unofficial implementation** of the paper: [Q&C: When Quantization Meets Cache in Efficient Image Generation](https://arxiv.org/abs/2503.02508) |
|
|
| > The official code was announced at `https://github.com/xinding-sys/Quant-Cache` but is not yet publicly available. This repo provides a working implementation based on the paper's methodology sections. |
|
|
| ## Overview |
|
|
| This repo implements the Q&C method for accelerating Diffusion Transformers (DiTs) by **combining post-training quantization with feature caching**. The paper identifies two key challenges when combining these techniques and proposes solutions: |
|
|
| 1. **TAP** (Temporal-Aware Parallel Clustering): improves calibration dataset selection for PTQ when caching reduces sample diversity |
| 2. **VC** (Variance Compensation): corrects exposure bias amplified by the quantization+cache combination |
|
|
| ## Architecture |
|
|
| ``` |
| qandc/ |
| ├── __init__.py # Package exports |
| ├── quantizer.py # Uniform PTQ (W8A8/W4A8, Eq 1-3) |
| ├── cache.py # FORA-style feature caching (Section 2.1) |
| ├── tap.py # TAP calibration selection (Section 3.1, Algorithm 1) |
| └── variance_compensation.py # VC exposure bias correction (Section 3.2, Eq 9-12) |
| |
| run_experiment.py # Self-contained experiment runner |
| results/ |
| └── experiment_summary.json # Our experimental results |
| ``` |
|
|
| ## Quick Start |
|
|
| ```bash |
| pip install torch torchvision diffusers transformers accelerate scipy scikit-learn |
| ``` |
|
|
| ```python |
| from diffusers import DiTPipeline, DDPMScheduler |
| from qandc import quantize_model, apply_cache_to_dit, reset_all_caches |
| |
| # Load DiT-XL/2 |
| pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256") |
| pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config) |
| pipe = pipe.to("cuda") |
| |
| # Apply W8A8 quantization (170 Linear layers) |
| pipe.transformer = quantize_model(pipe.transformer, w_bits=8, a_bits=8, |
| skip_patterns=["pos_embed", "norm"]) |
| |
| # Apply feature caching (recompute every 5th step) |
| apply_cache_to_dit(pipe.transformer, cache_interval=5) |
| |
| # Generate images |
| reset_all_caches(pipe.transformer) |
| output = pipe(class_labels=[207], num_inference_steps=50, guidance_scale=4.0) |
| ``` |
|
|
| ## Experiment Results |
|
|
| We ran **6 ablation experiments** on DiT-XL/2-256 with the DDPM scheduler to check the paper's claims. All runs used CPU with 16 images and 20 steps, a reduced scale chosen to fit free compute (the paper uses 10K images and 50 steps on A100 GPUs). |
|
|
| | Experiment | Inception Score ↑ | Time/Image (s) ↓ | Speedup | Description | |
| |:-----------|------------------:|------------------:|--------:|:------------| |
| | **FP Baseline** | **13.45** | 65.52 | 1.00x | Full-precision DiT-XL/2, DDPM 20 steps | |
| | Quant Only (W8A8) | 7.53 | 56.54 | 1.16x | Uniform PTQ, no caching | |
| | Cache Only (N=4) | 1.79 | 17.10 | 3.83x | FORA-style caching, no quantization | |
| | Q&C Naive | 1.84 | 20.89 | 3.14x | Quant + Cache, no TAP/VC | |
| | Q&C + TAP | 1.84 | 19.69 | 3.33x | + Temporal-Aware Parallel Clustering | |
| | **Q&C Full (TAP+VC)** | **2.27** | 21.12 | **3.10x** | Full method with Variance Compensation | |
|
|
| ### Key Observations |
|
|
| 1. **Caching provides dramatic speedup** (3.8x) but severely degrades quality, confirming the paper's Challenge 1 |
| 2. **Naive Q+C combination is catastrophic** (IS drops from 13.45 → 1.84), confirming Challenge 2 |
| 3. **Q&C Full (TAP+VC) shows IS improvement** (1.84 → 2.27, +23%) over naive combination, demonstrating VC's effectiveness at correcting exposure bias |
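| 4. **Q&C + TAP runs slightly faster per image than the naive combination** (19.69 s vs 20.89 s) at the same IS (1.84). TAP only changes which calibration samples are selected, so at this scale the timing gap may simply be run-to-run noise |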
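
The Speedup column and the +23% figure follow from simple ratios over the table values. A minimal sketch that reproduces them (the dictionary simply copies the Time/Image column; the variable names are illustrative and not part of `run_experiment.py`):

```python
# Seconds per image, copied from the results table.
time_per_image = {
    "FP Baseline": 65.52,
    "Quant Only (W8A8)": 56.54,
    "Cache Only (N=4)": 17.10,
    "Q&C Naive": 20.89,
    "Q&C + TAP": 19.69,
    "Q&C Full (TAP+VC)": 21.12,
}

baseline = time_per_image["FP Baseline"]

# Speedup = baseline time / method time.
speedup = {name: baseline / t for name, t in time_per_image.items()}

# Relative IS change of the full method over the naive combination.
is_naive, is_full = 1.84, 2.27
is_gain_pct = 100.0 * (is_full - is_naive) / is_naive

for name, s in speedup.items():
    print(f"{name:<20} {s:.2f}x")
print(f"IS gain (full vs naive): {is_gain_pct:+.1f}%")
```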
|
|
| ### Paper Reference (Table 1, ImageNet 256×256, W8A8, 50 steps) |
|
|
| | Method | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Speed | |
| |:-------|------:|-------:|-----:|------------:|------:| |
| | DDPM (FP) | 5.22 | 17.63 | 237.8 | 0.8056 | 5× | |
| | PTQ4DiT | 5.45 | 19.50 | 250.68 | 0.7882 | 10× | |
| | **Q&C (paper)** | **5.43** | **19.52** | **250.68** | **0.7895** | **12.7×** | |
|
|
| > **Note:** Our numbers are NOT directly comparable to the paper's because: (1) we use only 16 images (paper: 10K), (2) 20 steps (paper: 50), (3) CPU execution, and (4) aggressive cache interval of 4 (paper optimizes this). The purpose is to validate the *relative trends* between methods. |
|
|
| ## Implementation Details |
|
|
| ### Quantization (quantizer.py) |
| - **Uniform symmetric quantization** following Eq 1-3 from the paper |
| - **Channel-wise** quantization for weights (per output channel) |
| - **Tensor-wise** quantization for activations |
| - Supports W8A8 (8-bit weights, 8-bit activations) and W4A8 |
| - Replaces all `nn.Linear` layers except normalization and positional embeddings |
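
To make the channel-wise versus tensor-wise distinction concrete, here is a minimal sketch of uniform symmetric fake-quantization in plain Python. It is not the code in `quantizer.py`: the function names are hypothetical, nested lists stand in for tensors, and the max-abs scale rule is an assumption about how Eq 1-3 are instantiated.

```python
def quantize_symmetric(values, n_bits=8):
    """Fake-quantize a list of floats with a single symmetric scale."""
    q_max = 2 ** (n_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0  # assumed max-abs rule
    out = []
    for v in values:
        q = round(v / scale)                 # snap to the integer grid
        q = max(-q_max, min(q_max, q))       # clamp to the symmetric range
        out.append(q * scale)                # dequantize back to float
    return out, scale


def quantize_weight_per_channel(weight, n_bits=8):
    """Each row (one output channel) gets its own scale."""
    return [quantize_symmetric(row, n_bits)[0] for row in weight]


def quantize_activation_per_tensor(activation, n_bits=8):
    """All rows share one scale computed over the whole tensor."""
    flat = [v for row in activation for v in row]
    flat_q, _ = quantize_symmetric(flat, n_bits)
    width = len(activation[0])
    return [flat_q[i:i + width] for i in range(0, len(flat_q), width)]


# Toy matrix: the second row is much smaller in magnitude than the first.
weight = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.04]]
w_q = quantize_weight_per_channel(weight)
a_q = quantize_activation_per_tensor(weight)
```

With a per-row scale, the small second row keeps its own fine-grained step size; with a single per-tensor scale, it shares the coarse step set by the largest value in the matrix.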
|
|
| ### Feature Caching (cache.py) |
| - **FORA-style static caching**: wraps each transformer block |
| - At every N-th step: full forward pass + cache the residual output |
| - For N-1 following steps: reuse the cached residual (skip expensive MHSA + FFN) |
| - `__getattr__` delegation ensures transparency for DiT's conditioning code |
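
The caching schedule and the `__getattr__` delegation described above can be sketched without any framework. This is an illustrative toy, not the contents of `cache.py`: `CachedBlock` and `ToyBlock` are hypothetical names, lists stand in for tensors, and a real `nn.Module` wrapper would need to cooperate with PyTorch's own attribute handling.

```python
class CachedBlock:
    """Wrap a block and recompute its residual only every `interval` steps."""

    def __init__(self, block, interval):
        self.block = block
        self.interval = interval
        self.step = 0
        self.cached_residual = None

    def __getattr__(self, name):
        # Called only when normal lookup fails, so the wrapper's own
        # attributes win and everything else is forwarded to the block.
        block = self.__dict__.get("block")
        if block is None:
            raise AttributeError(name)
        return getattr(block, name)

    def reset(self):
        self.step = 0
        self.cached_residual = None

    def __call__(self, x):
        if self.step % self.interval == 0 or self.cached_residual is None:
            out = self.block(x)                                   # full forward pass
            self.cached_residual = [o - i for o, i in zip(out, x)]
        else:
            out = [i + r for i, r in zip(x, self.cached_residual)]  # reuse residual
        self.step += 1
        return out


class ToyBlock:
    """Stand-in for a transformer block; counts how often it really runs."""
    hidden_size = 4

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return [v + 0.1 * v for v in x]


toy = ToyBlock()
block = CachedBlock(toy, interval=4)
x = [1.0, 2.0]
for _ in range(8):
    x = block(x)
```

Over 8 steps with `interval=4`, the wrapped block executes only at steps 0 and 4; the other six steps add the stored residual to the current input.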
|
|
| ### TAP (tap.py) |
| - **Spatial similarity**: cosine similarity between flattened latent features (Eq 7) |
| - **Temporal similarity**: Gaussian kernel on timestep distances (Eq 8) |
| - **Combined similarity**: `A_final = α·A_spatial + (1-α)·A_temporal` (Eq 6) |
| - **Parallel subsampling**: m=3 independent subsamples, each 1/20 of full dataset |
| - **Spectral clustering** on each subsample → co-occurrence matrix → final KMeans |
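
As an illustration of the similarity construction only (the subsampling, spectral clustering, and co-occurrence steps are not covered), the sketch below builds the combined matrix in plain Python. The Gaussian kernel form `exp(-Δt² / (2σ²))` and the default values of `alpha` and `sigma` are assumptions, not values taken from the paper or from `tap.py`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two flattened feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def combined_similarity(features, timesteps, alpha=0.5, sigma=10.0):
    """A_final = alpha * A_spatial + (1 - alpha) * A_temporal."""
    n = len(features)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a_spatial = cosine_similarity(features[i], features[j])
            dt = timesteps[i] - timesteps[j]
            # Assumed Gaussian kernel on the timestep distance.
            a_temporal = math.exp(-(dt * dt) / (2.0 * sigma ** 2))
            A[i][j] = alpha * a_spatial + (1.0 - alpha) * a_temporal
    return A


# Three toy samples: the first two are close in both feature space and time.
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
timesteps = [0, 1, 40]
A = combined_similarity(features, timesteps)
```

A matrix of this shape could be handed to a clustering routine that accepts a precomputed affinity, which is where the spectral clustering step would start.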
|
|
| ### Variance Compensation (variance_compensation.py) |
| - Implements both the **full analytical K_t** (Eq 12) and a **simplified version** |
| - Corrects variance shift in later denoising stages (t > T/2) |
| - `x_corrected = μ + K_t · (x̂ - μ)` where K_t is the per-channel, per-timestep correction factor |
| - Calibrated offline using a few samples through the quantized+cached pipeline |
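
The correction formula rescales deviations around the mean while leaving the mean itself untouched. Below is a minimal sketch for a single channel at a single timestep. It assumes a simplified factor, the ratio of reference to degraded standard deviation, which is not the analytical K_t of Eq 12; it also assumes μ is the mean of the degraded sample. The function names are hypothetical.

```python
import statistics


def calibrate_k(reference, degraded):
    """Assumed simplified K_t: reference std divided by degraded std."""
    return statistics.pstdev(reference) / (statistics.pstdev(degraded) + 1e-12)


def compensate(x_hat, k_t):
    """x_corrected = mu + K_t * (x_hat - mu), with mu the mean of x_hat."""
    mu = statistics.fmean(x_hat)
    return [mu + k_t * (v - mu) for v in x_hat]


# Toy calibration data: the degraded pipeline halves the spread.
reference = [-2.0, -1.0, 0.0, 1.0, 2.0]
degraded = [0.5 * v + 0.3 for v in reference]

k_t = calibrate_k(reference, degraded)
corrected = compensate(degraded, k_t)
```

In this toy case `k_t` comes out close to 2, so the corrected values recover the reference spread while keeping the degraded mean.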
|
|
| ## Running Full Experiments |
|
|
| For GPU-scale experiments matching the paper: |
|
|
| ```python |
| # Modify run_experiment.py settings: |
| args = { |
| "num_steps": 50, # Paper: 50/100/250 |
| "num_images": 10000, # Paper: 10,000 |
| "batch_size": 16, # GPU batch size |
| "cache_interval": 5, # Tune for quality vs speed |
| "num_calib_samples": 800, # Paper recommendation |
| "tap_clusters": 100, # Paper setting |
| } |
| ``` |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{qandc2025, |
| title={Q\&C: When Quantization Meets Cache in Efficient Image Generation}, |
| author={Xinding et al.}, |
| journal={arXiv preprint arXiv:2503.02508}, |
| year={2025} |
| } |
| ``` |
|
|
| ## License |
|
|
| This implementation is provided for research purposes. The DiT model (`facebook/DiT-XL-2-256`) is released under the CC-BY-NC-4.0 license. |
|
|