sanskar753 committed on
Commit 16ff30b · verified · 1 Parent(s): e3244e1

Upload README.md

Files changed (1)
  1. README.md +143 -0
README.md ADDED
@@ -0,0 +1,143 @@
1
+ # Q&C: When Quantization Meets Cache in Efficient Image Generation
2
+
3
+ **Unofficial implementation** of the paper: [Q&C: When Quantization Meets Cache in Efficient Image Generation](https://arxiv.org/abs/2503.02508)
4
+
5
+ > The official code was announced at `https://github.com/xinding-sys/Quant-Cache` but is not yet publicly available. This repo provides a working implementation based on the paper's methodology sections.
6
+
7
+ ## 📋 Overview
8
+
9
+ This repo implements the Q&C method for accelerating Diffusion Transformers (DiTs) by **combining post-training quantization with feature caching**. The paper identifies two key challenges when combining these techniques and proposes solutions:
10
+
11
+ 1. **TAP** (Temporal-Aware Parallel Clustering): improves calibration dataset selection for PTQ when caching reduces sample diversity
12
+ 2. **VC** (Variance Compensation): corrects exposure bias amplified by the quantization+cache combination
13
+
14
+ ## 🏗️ Architecture
15
+
16
+ ```
17
+ qandc/
18
+ ├── __init__.py # Package exports
19
+ ├── quantizer.py # Uniform PTQ (W8A8/W4A8, Eq 1-3)
20
+ ├── cache.py # FORA-style feature caching (Section 2.1)
21
+ ├── tap.py # TAP calibration selection (Section 3.1, Algorithm 1)
22
+ └── variance_compensation.py # VC exposure bias correction (Section 3.2, Eq 9-12)
23
+
24
+ run_experiment.py # Self-contained experiment runner
25
+ results/
26
+ └── experiment_summary.json # Our experimental results
27
+ ```
28
+
29
+ ## 🚀 Quick Start
30
+
31
+ ```bash
32
+ pip install torch torchvision diffusers transformers accelerate scipy scikit-learn
33
+ ```
34
+
35
+ ```python
36
+ from diffusers import DiTPipeline, DDPMScheduler
37
+ from qandc import quantize_model, apply_cache_to_dit, reset_all_caches
38
+
39
+ # Load DiT-XL/2
40
+ pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256")
41
+ pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
42
+ pipe = pipe.to("cuda")
43
+
44
+ # Apply W8A8 quantization (170 Linear layers)
45
+ pipe.transformer = quantize_model(pipe.transformer, w_bits=8, a_bits=8,
46
+ skip_patterns=["pos_embed", "norm"])
47
+
48
+ # Apply feature caching (recompute every 5th step)
49
+ apply_cache_to_dit(pipe.transformer, cache_interval=5)
50
+
51
+ # Generate images
52
+ reset_all_caches(pipe.transformer)
53
+ output = pipe(class_labels=[207], num_inference_steps=50, guidance_scale=4.0)
54
+ ```
55
+
56
+ ## 📊 Experiment Results
57
+
58
+ We ran **6 ablation experiments** on DiT-XL/2-256 with the DDPM scheduler to check the paper's claims. All runs used a CPU, 16 images, and 20 steps (a reduced scale to fit free compute; the paper uses 10K images and 50 steps on A100 GPUs).
59
+
60
+ | Experiment | Inception Score ↑ | Time/Image (s) ↓ | Speedup | Description |
61
+ |:-----------|------------------:|------------------:|--------:|:------------|
62
+ | **FP Baseline** | **13.45** | 65.52 | 1.00x | Full-precision DiT-XL/2, DDPM 20 steps |
63
+ | Quant Only (W8A8) | 7.53 | 56.54 | 1.16x | Uniform PTQ, no caching |
64
+ | Cache Only (N=4) | 1.79 | 17.10 | 3.83x | FORA-style caching, no quantization |
65
+ | Q&C Naive | 1.84 | 20.89 | 3.14x | Quant + Cache, no TAP/VC |
66
+ | Q&C + TAP | 1.84 | 19.69 | 3.33x | + Temporal-Aware Parallel Clustering |
67
+ | **Q&C Full (TAP+VC)** | **2.27** | 21.12 | **3.10x** | Full method with Variance Compensation |
68
+
69
+ ### Key Observations
70
+
71
+ 1. **Caching provides dramatic speedup** (3.83x) but severely degrades quality (IS 13.45 → 1.79), consistent with the paper's Challenge 1
72
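+ 2. **Naive Q+C combination is catastrophic** (IS drops from 13.45 → 1.84), consistent with Challenge 2, although Cache Only already falls to 1.79, so most of the loss at this cache interval comes from caching alone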
73
+ 3. **Q&C Full (TAP+VC) shows IS improvement** (1.84 → 2.27, +23%) over the naive combination, suggesting that VC helps correct exposure bias
74
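+ 4. **Q&C + TAP ran slightly faster per image than naive** (19.69 s vs 20.89 s), but TAP only changes calibration data selection and left IS unchanged at 1.84, so this gap is likely timing noise rather than a real efficiency gain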
75
+
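The speedup column and the +23% figure follow directly from the measurements in the table. As a quick sanity check, the sketch below recomputes them; the dictionary only restates the measured times, and the variable names are ours rather than anything in `run_experiment.py`.

```python
# Time per image in seconds, copied from the results table above.
times = {
    "FP Baseline": 65.52,
    "Quant Only (W8A8)": 56.54,
    "Cache Only (N=4)": 17.10,
    "Q&C Naive": 20.89,
    "Q&C + TAP": 19.69,
    "Q&C Full (TAP+VC)": 21.12,
}

baseline = times["FP Baseline"]
# Speedup is the baseline time divided by the method's time.
speedups = {name: baseline / t for name, t in times.items()}

# Relative Inception Score change from Q&C Naive (1.84) to Q&C Full (2.27).
is_gain = (2.27 - 1.84) / 1.84

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
print(f"IS gain over naive: {is_gain:+.0%}")
```

With 16 images per configuration, these ratios are best read as rough trends.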
76
+ ### Paper Reference (Table 1, ImageNet 256×256, W8A8, 50 steps)
77
+
78
+ | Method | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Speed |
79
+ |:-------|------:|-------:|-----:|------------:|------:|
80
+ | DDPM (FP) | 5.22 | 17.63 | 237.8 | 0.8056 | 5× |
81
+ | PTQ4DiT | 5.45 | 19.50 | 250.68 | 0.7882 | 10× |
82
+ | **Q&C (paper)** | **5.43** | **19.52** | **250.68** | **0.7895** | **12.7×** |
83
+
84
+ > **Note:** Our numbers are NOT directly comparable to the paper's because: (1) we use only 16 images (paper: 10K), (2) 20 steps (paper: 50), (3) CPU execution, and (4) aggressive cache interval of 4 (paper optimizes this). The purpose is to validate the *relative trends* between methods.
85
+
86
+ ## 🔧 Implementation Details
87
+
88
+ ### Quantization (quantizer.py)
89
+ - **Uniform symmetric quantization** following Eq 1-3 from the paper
90
+ - **Channel-wise** quantization for weights (per output channel)
91
+ - **Tensor-wise** quantization for activations
92
+ - Supports W8A8 (8-bit weights, 8-bit activations) and W4A8
93
+ - Replaces all `nn.Linear` layers except normalization and positional embeddings
94
+
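To make the channel-wise vs tensor-wise distinction concrete, here is a minimal pure-Python sketch of a generic uniform symmetric fake-quantizer. It is an illustration under our own assumptions, not a copy of `quantizer.py` or of the paper's equations: all function names are hypothetical, and nested lists stand in for tensors.

```python
def quantize_symmetric(values, bits=8):
    """Fake-quantize a list of floats with one symmetric scale (zero-point 0)."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the integer grid, clamp to the signed range, then dequantize.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [qi * scale for qi in q], scale


def quantize_weight_per_channel(weight, bits=8):
    """Channel-wise: each row (output channel) gets its own scale."""
    return [quantize_symmetric(row, bits)[0] for row in weight]


def quantize_activation_per_tensor(activation, bits=8):
    """Tensor-wise: one scale shared by every element of the 2-D input."""
    flat = [v for row in activation for v in row]
    deq, _ = quantize_symmetric(flat, bits)
    width = len(activation[0])
    return [deq[i:i + width] for i in range(0, len(deq), width)]


# Two output channels with very different magnitudes.
weight = [[0.5, -1.0, 0.25], [0.01, 0.02, -0.03]]
w_channel = quantize_weight_per_channel(weight, bits=8)
w_tensor = quantize_activation_per_tensor(weight, bits=8)  # same data, one scale
```

In this toy example the small-magnitude second row keeps far more precision with its own scale than with the shared one, which is the usual motivation for quantizing weights per output channel.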
95
+ ### Feature Caching (cache.py)
96
+ - **FORA-style static caching**: wraps each transformer block
97
+ - At every N-th step: full forward pass + cache the residual output
98
+ - For N-1 following steps: reuse the cached residual (skip expensive MHSA + FFN)
99
+ - `__getattr__` delegation ensures transparency for DiT's conditioning code
100
+
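The caching schedule described above can be sketched with a small wrapper. This is a simplified illustration, not the code in `cache.py`: the class name `CachedBlock` is hypothetical, the block is assumed to take a single input (a real DiT block also receives conditioning), and plain floats stand in for tensors.

```python
class CachedBlock:
    """Wraps a block; recomputes its residual every `cache_interval` steps."""

    def __init__(self, block, cache_interval=5):
        self.block = block
        self.cache_interval = cache_interval
        self.step = 0
        self.cached_residual = None

    def reset(self):
        # Call before each new image so stale residuals are never reused.
        self.step = 0
        self.cached_residual = None

    def __call__(self, x):
        if self.step % self.cache_interval == 0 or self.cached_residual is None:
            # Full forward pass; store what the block added to its input.
            self.cached_residual = self.block(x) - x
        self.step += 1
        # On cached steps the block is skipped and the stored residual is reused.
        return x + self.cached_residual

    def __getattr__(self, name):
        # Only invoked when normal lookup fails: forward to the wrapped block.
        return getattr(self.__dict__["block"], name)


calls = []

def toy_block(x):
    calls.append(x)          # record every real forward pass
    return x + 0.5 * x       # stand-in for attention + FFN with a residual path

cached = CachedBlock(toy_block, cache_interval=5)
outputs = [cached(1.0) for _ in range(10)]
```

Over 10 steps with an interval of 5, the wrapped block runs only at steps 0 and 5; the other eight steps reuse the stored residual.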
101
+ ### TAP (tap.py)
102
+ - **Spatial similarity**: cosine similarity between flattened latent features (Eq 7)
103
+ - **Temporal similarity**: Gaussian kernel on timestep distances (Eq 8)
104
+ - **Combined similarity**: `A_final = α·A_spatial + (1-α)·A_temporal` (Eq 6)
105
+ - **Parallel subsampling**: m=3 independent subsamples, each 1/20 of full dataset
106
+ - **Spectral clustering** on each subsample → co-occurrence matrix → final KMeans
107
+
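As a rough illustration of the similarity terms listed above, the sketch below builds the combined matrix for a handful of samples. The exact form of the Gaussian kernel, the bandwidth `sigma`, the default `alpha`, and all function names are assumptions made for this example; they are not taken from `tap.py`. The subsampling and clustering steps are omitted.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def temporal_similarity(t_i, t_j, sigma):
    # Gaussian kernel: close timesteps score near 1, distant ones near 0.
    return math.exp(-((t_i - t_j) ** 2) / (2.0 * sigma ** 2))


def combined_similarity(features, timesteps, alpha=0.5, sigma=10.0):
    """A_final = alpha * A_spatial + (1 - alpha) * A_temporal, as nested lists."""
    n = len(features)
    return [
        [
            alpha * cosine_similarity(features[i], features[j])
            + (1.0 - alpha) * temporal_similarity(timesteps[i], timesteps[j], sigma)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three flattened latents with their sampling timesteps (toy values).
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
timesteps = [10, 40, 12]
A = combined_similarity(features, timesteps, alpha=0.5, sigma=10.0)
```

Samples 0 and 1 have identical features but distant timesteps, while samples 0 and 2 are close in time but orthogonal in feature space, so `alpha` decides which pair ends up more similar. A matrix like `A` is the kind of precomputed affinity that a spectral clustering routine can consume.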
108
+ ### Variance Compensation (variance_compensation.py)
109
+ - Implements both the **full analytical K_t** (Eq 12) and a **simplified version**
110
+ - Corrects variance shift in later denoising stages (t > T/2)
111
+ - `x_corrected = μ + K_t · (x̂ - μ)` where K_t is the per-channel, per-timestep correction factor
112
+ - Calibrated offline using a few samples through the quantized+cached pipeline
113
+
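The correction formula rescales the spread of a channel around its mean while leaving the mean untouched. The sketch below shows this for a single channel at a single timestep. It assumes that μ is the mean of the degraded output and that a simplified K_t can be taken as the ratio of reference to degraded standard deviation; both are assumptions for illustration, not the analytical K_t of Eq 12 or the code in `variance_compensation.py`.

```python
import math


def mean_and_var(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var


def simplified_k(reference_channel, degraded_channel, eps=1e-12):
    # ASSUMPTION: K_t is the ratio of reference to degraded standard deviation.
    _, var_ref = mean_and_var(reference_channel)
    _, var_hat = mean_and_var(degraded_channel)
    return math.sqrt(var_ref / (var_hat + eps))


def compensate(degraded_channel, k_t):
    # x_corrected = mu + K_t * (x_hat - mu): mean kept, variance scaled by K_t**2.
    mu, _ = mean_and_var(degraded_channel)
    return [mu + k_t * (x - mu) for x in degraded_channel]


reference = [-2.0, -1.0, 0.0, 1.0, 2.0]          # full-precision values, one channel
degraded = [0.5 * x + 0.1 for x in reference]    # shrunken variance, shifted mean
k_t = simplified_k(reference, degraded)
corrected = compensate(degraded, k_t)
```

Here the degraded channel has a quarter of the reference variance, so K_t comes out close to 2 and the corrected channel recovers the reference variance while keeping the degraded mean. In the setup described above, K_t would be calibrated offline per channel and per timestep, and applied only for t > T/2.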
114
+ ## 🔬 Running Full Experiments
115
+
116
+ For GPU-scale experiments matching the paper:
117
+
118
+ ```python
119
+ # Modify run_experiment.py settings:
120
+ args = {
121
+ "num_steps": 50, # Paper: 50/100/250
122
+ "num_images": 10000, # Paper: 10,000
123
+ "batch_size": 16, # GPU batch size
124
+ "cache_interval": 5, # Tune for quality vs speed
125
+ "num_calib_samples": 800, # Paper recommendation
126
+ "tap_clusters": 100, # Paper setting
127
+ }
128
+ ```
129
+
130
+ ## 📝 Citation
131
+
132
+ ```bibtex
133
+ @article{qandc2025,
134
+ title={Q\&C: When Quantization Meets Cache in Efficient Image Generation},
135
+ author={Ding, Xin and others},
136
+ journal={arXiv preprint arXiv:2503.02508},
137
+ year={2025}
138
+ }
139
+ ```
140
+
141
+ ## 📄 License
142
+
143
+ This implementation is provided for research purposes. The DiT model (`facebook/DiT-XL-2-256`) is released under the CC-BY-NC-4.0 license.