aoxo committed · Commit 9a48f44 · verified · 1 Parent(s): b9a5707

Update README.md

  1. README.md +129 -97
README.md CHANGED
@@ -78,7 +78,7 @@ Use the code below to get started with the model.
78
 
79
  ```python
80
  # Instantiate the model
81
- model = ViTImage2Image(img_size=512, patch_size=16, emb_dim=768, num_heads=16, num_layers=8, hidden_dim=3072)
82
 
83
  # Move model to GPU if available
84
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
@@ -118,19 +118,50 @@ Images and their corresponding style semantic maps were resized to fit the input
118
 
119
  #### Training Hyperparameters
120
 
121
- - Precision:fp32
122
- - Embedded dimensions: 768
123
- - Hidden dimensions: 3072
124
- - Attention Type: Linear Attention
125
- - Number of attention heads: 16
126
- - Number of attention layers: 8
127
- - Number of transformer encoder layers (feed-forward): 8
128
- - Number of transformer decoder layers (feed-forward): 8
129
- - Activation function(s): ReLU, GeLU
130
- - Patch Size: 8
131
- - Swin Window Size: 7
132
- - Swin Shift Size: 2
133
- - Style Transfer Module: AdaIN
134
 
135
  #### Speeds, Sizes, Times
136
 
@@ -139,6 +170,7 @@ Images and their corresponding style semantic maps were resized to fit the input
139
  - v1_2: 200M params
140
  - v1_3: 93M params
141
  - v2_1: 2.9M params
142
 
143
  **Training hardware:** Each of the models were trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
144
 
@@ -155,6 +187,10 @@ Images and their corresponding style semantic maps were resized to fit the input
155
  - v1_2: 764 MB
156
  - v1_3: 355 MB
157
  - v2_2: 11 MB
158
 
159
  ## Evaluation Data, Metrics & Results
160
 
@@ -166,8 +202,11 @@ Evaluation was performed on real-time footage captured from Grand Theft Auto V,
166
 
167
  ### Metrics
168
 
169
- - PSNR (Peak Signal-to-Noise Ratio)
170
- - Combined loss (L1 loss + Total Variation loss)
171
 
172
  ### Results
173
 
@@ -209,101 +248,94 @@ The objective of RealFormer is to attain the maximum level of detail to the real
209
  **Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction at lower parameters. The three other predecessors attained a similar level of accuracy without the LbMhA layers. The general architecture is as follows:
210
 
211
  ```python
212
- DataParallel(
213
- (module): ViTImage2Image(
214
- (patch_embed): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
215
- (encoder_layers): ModuleList(
216
- (0-7): 8 x TransformerEncoderBlock(
217
- (attn): LocationBasedMultiheadAttention(
218
- (q_proj): Linear(in_features=768, out_features=768, bias=True)
219
- (k_proj): Linear(in_features=768, out_features=768, bias=True)
220
- (v_proj): Linear(in_features=768, out_features=768, bias=True)
221
- (out_proj): Linear(in_features=768, out_features=768, bias=True)
222
- (dropout): Dropout(p=0.1, inplace=False)
223
- )
224
- (ff): Sequential(
225
- (0): Linear(in_features=768, out_features=3072, bias=True)
226
- (1): ReLU()
227
- (2): Linear(in_features=3072, out_features=768, bias=True)
228
- )
229
- (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
230
- (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
231
- (adain): AdaIN(
232
- (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
233
- (fc): Linear(in_features=768, out_features=1536, bias=True)
234
  )
235
  (dropout): Dropout(p=0.1, inplace=False)
236
  )
237
  )
238
- (decoder_layers): ModuleList(
239
- (0-7): 8 x TransformerDecoderBlock(
240
- (attn1): LocationBasedMultiheadAttention(
241
- (q_proj): Linear(in_features=768, out_features=768, bias=True)
242
- (k_proj): Linear(in_features=768, out_features=768, bias=True)
243
- (v_proj): Linear(in_features=768, out_features=768, bias=True)
244
- (out_proj): Linear(in_features=768, out_features=768, bias=True)
245
- (dropout): Dropout(p=0.1, inplace=False)
246
- )
247
- (attn2): LocationBasedMultiheadAttention(
248
- (q_proj): Linear(in_features=768, out_features=768, bias=True)
249
- (k_proj): Linear(in_features=768, out_features=768, bias=True)
250
- (v_proj): Linear(in_features=768, out_features=768, bias=True)
251
- (out_proj): Linear(in_features=768, out_features=768, bias=True)
252
- (dropout): Dropout(p=0.1, inplace=False)
253
- )
254
- (ff): Sequential(
255
- (0): Linear(in_features=768, out_features=3072, bias=True)
256
- (1): ReLU()
257
- (2): Linear(in_features=3072, out_features=768, bias=True)
258
- )
259
- (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
260
- (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
261
- (norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
262
- (norm4): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
263
- (adain1): AdaIN(
264
- (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
265
- (fc): Linear(in_features=768, out_features=1536, bias=True)
266
- )
267
- (adain2): AdaIN(
268
- (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
269
- (fc): Linear(in_features=768, out_features=1536, bias=True)
270
  )
271
  (dropout): Dropout(p=0.1, inplace=False)
272
  )
273
- )
274
- (swin_layers): ModuleList(
275
- (0-7): 8 x SwinTransformerBlock(
276
- (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
277
  (attn): MultiheadAttention(
278
  (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
279
  )
280
- (mlp): Sequential(
281
- (0): Linear(in_features=768, out_features=3072, bias=True)
282
- (1): GELU(approximate='none')
283
- (2): Linear(in_features=3072, out_features=768, bias=True)
284
- )
285
- (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
286
  )
287
  )
288
- (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
289
- (mlp_head): Sequential(
290
- (0): Linear(in_features=768, out_features=3072, bias=True)
291
- (1): GELU(approximate='none')
292
- (2): Linear(in_features=3072, out_features=768, bias=True)
293
- )
294
- (refinement): RefinementBlock(
295
- (conv): Conv2d(768, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
296
- (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
297
- (relu): ReLU(inplace=True)
298
- )
299
- (style_encoder): Sequential(
300
- (0): Conv2d(3, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
301
- (1): ReLU()
302
- (2): AdaptiveAvgPool2d(output_size=1)
303
- (3): Flatten(start_dim=1, end_dim=-1)
304
- (4): Linear(in_features=768, out_features=768, bias=True)
305
  )
306
  )
307
  )
308
  ```
309
 
 
78
 
79
  ```python
80
  # Instantiate the model
81
+ model = RealFormerv3(img_size=256, patch_size=8, emb_dim=768, num_heads=42, num_layers=16, hidden_dim=3072)
82
 
83
  # Move model to GPU if available
84
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
118
 
119
  #### Training Hyperparameters
120
 
121
+ **v1**
122
+ - **Precision**: fp32
123
+ - **Embedded dimensions**: 768
124
+ - **Hidden dimensions**: 3072
125
+ - **Attention Type**: Linear Attention
126
+ - **Number of attention heads**: 16
127
+ - **Number of attention layers**: 8
128
+ - **Number of transformer encoder layers (feed-forward)**: 8
129
+ - **Number of transformer decoder layers (feed-forward)**: 8
130
+ - **Activation function(s)**: ReLU, GELU
131
+ - **Patch Size**: 8
132
+ - **Swin Window Size**: 7
133
+ - **Swin Shift Size**: 2
134
+ - **Style Transfer Module**: AdaIN (Adaptive Instance Normalization)
135
+
136
+ **v2**
137
+ - **Precision**: fp32
138
+ - **Embedded dimensions**: 768
139
+ - **Hidden dimensions**: 3072
140
+ - **Attention Type**: Location-Based Multi-Head Attention (Linear Attention)
141
+ - **Number of attention heads**: 16
142
+ - **Number of attention layers**: 8
143
+ - **Number of transformer encoder layers (feed-forward)**: 8
144
+ - **Number of transformer decoder layers (feed-forward)**: 8
145
+ - **Activation function(s)**: ReLU, GELU
146
+ - **Patch Size**: 16
147
+ - **Swin Window Size**: 7
148
+ - **Swin Shift Size**: 2
149
+ - **Style Transfer Module**: AdaIN
150
+
151
+ **v3**
152
+ - **Precision:** FP32, FP16, BF16, INT8
153
+ - **Embedding Dimensions:** 768
154
+ - **Hidden Dimensions:** 3072
155
+ - **Attention Type:** Location-Based Multi-Head Attention (Linear Attention)
156
+ - **Number of Attention Heads:** 42
157
+ - **Number of Attention Layers:** 16
158
+ - **Number of Transformer Encoder Layers (Feed-Forward):** 16
159
+ - **Number of Transformer Decoder Layers (Feed-Forward):** 16
160
+ - **Activation Functions:** ReLU, GELU
161
+ - **Patch Size:** 8
162
+ - **Swin Window Size:** 7
163
+ - **Swin Shift Size:** 2
164
+ - **Style Transfer Module:** Style Adaptive Layer Normalization (SALN)
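
A quick way to sanity-check the v3 settings is to derive the patch count and per-head width from them. The sketch below assumes square, non-overlapping patches and the usual convention that multi-head attention splits the embedding evenly across heads. The helper name `derived_shapes` is hypothetical and not part of the RealFormer code.

```python
def derived_shapes(img_size: int, patch_size: int, emb_dim: int, num_heads: int) -> dict:
    """Derive token count and per-head width from ViT-style hyperparameters.

    Assumes square, non-overlapping patches and an even split of the
    embedding across attention heads (assumptions, not confirmed against
    the RealFormer source).
    """
    if img_size % patch_size != 0:
        raise ValueError("img_size must be a multiple of patch_size")
    patches_per_side = img_size // patch_size
    head_dim, remainder = divmod(emb_dim, num_heads)
    return {
        "num_patches": patches_per_side ** 2,
        "head_dim": head_dim,
        "heads_divide_evenly": remainder == 0,
    }


# Values from the v3 instantiation call in this README.
v3 = derived_shapes(img_size=256, patch_size=8, emb_dim=768, num_heads=42)
# Values from the previous instantiation call, for comparison.
previous = derived_shapes(img_size=512, patch_size=16, emb_dim=768, num_heads=16)

print(v3)
print(previous)
```

Both configurations yield 1024 patches. Note that 768 is not a multiple of 42, so if v3 relies on an even per-head split (as `torch.nn.MultiheadAttention` requires), the head count is worth confirming against the training code.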
165
 
166
  #### Speeds, Sizes, Times
167
 
 
170
  - v1_2: 200M params
171
  - v1_3: 93M params
172
  - v2_1: 2.9M params
173
+ - v3: 252.6M params
174
 
175
  **Training hardware:** Each of the models were trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
176
 
 
187
  - v1_2: 764 MB
188
  - v1_3: 355 MB
189
  - v2_2: 11 MB
190
+ - v3: 1.01 GB
191
+ - v3_fp16: 505 MB
192
+ - v3_bf16: 505 MB
193
+ - v3_int8: 344 MB
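
The v3 file sizes follow roughly from the parameter count multiplied by the bytes per parameter. This is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes) and that the weights dominate the file. `checkpoint_size_mb` is a hypothetical helper, not part of the repository.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def checkpoint_size_mb(num_params: float, precision: str) -> float:
    """Estimate checkpoint size in decimal megabytes from a parameter count."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


V3_PARAMS = 252.6e6  # v3 parameter count as listed in this README

for precision in BYTES_PER_PARAM:
    print(f"v3 {precision}: ~{checkpoint_size_mb(V3_PARAMS, precision):.0f} MB")
```

The FP32 and half-precision estimates (about 1010 MB and 505 MB) line up with the listed sizes. The listed INT8 file is larger than a pure one-byte-per-parameter estimate. That would be consistent with some layers staying in higher precision, though the README does not say so.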
194
 
195
  ## Evaluation Data, Metrics & Results
196
 
 
202
 
203
  ### Metrics
204
 
205
+ - Peak Signal-to-Noise Ratio (PSNR)
206
+ - Cosine Similarity Score (CSS)
207
+ - L1 Loss
208
+ - Contrastive Loss (CL)
209
+ - Combined loss (L1 loss + PSNR + CSS + CL)
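
For reference, the first three metrics can be computed as in the sketch below. It works on flat lists of pixel values scaled to [0, 1]. That range, the function names, and the omission of the contrastive term and of any loss weights are assumptions, since the README does not specify them.

```python
import math
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equally sized pixel sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def cosine_similarity(pred: Sequence[float], target: Sequence[float]) -> float:
    """Cosine of the angle between the two sequences viewed as vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return dot / norm if norm else 0.0


# Toy four-pixel example (illustrative values only).
pred = [0.2, 0.4, 0.6, 0.8]
target = [0.25, 0.35, 0.65, 0.75]
print(l1_loss(pred, target), psnr(pred, target), cosine_similarity(pred, target))
```

PSNR and cosine similarity both increase as outputs improve, so a combined loss would typically include them in negated or `1 - x` form. How RealFormer weights and combines the terms is not stated here.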
210
 
211
  ### Results
212
 
 
248
  **Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction at lower parameters. The three other predecessors attained a similar level of accuracy without the LbMhA layers. The general architecture is as follows:
249
 
250
  ```python
251
+ RealFormerv3(
252
+ (patch_embed): DynamicPatchEmbedding(
253
+ (proj): Conv2d(2048, 768, kernel_size=(1, 1), stride=(1, 1))
254
+ )
255
+ (encoder_layers): ModuleList(
256
+ (0-7): 8 x TransformerEncoderBlock(
257
+ (attn): CrossAttentionLayer(
258
+ (attn): MultiheadAttention(
259
+ (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
260
  )
261
  (dropout): Dropout(p=0.1, inplace=False)
262
  )
263
+ (ff): Sequential(
264
+ (0): Linear(in_features=768, out_features=3072, bias=True)
265
+ (1): ReLU()
266
+ (2): Linear(in_features=3072, out_features=768, bias=True)
267
+ )
268
+ (norm1): StyleAdaptiveLayerNorm(
269
+ (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
270
+ (fc): Linear(in_features=768, out_features=1536, bias=True)
271
+ )
272
+ (norm2): StyleAdaptiveLayerNorm(
273
+ (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
274
+ (fc): Linear(in_features=768, out_features=1536, bias=True)
275
+ )
276
+ (dropout): Dropout(p=0.1, inplace=False)
277
  )
278
+ )
279
+ (decoder_layers): ModuleList(
280
+ (0-7): 8 x TransformerDecoderBlock(
281
+ (attn1): CrossAttentionLayer(
282
+ (attn): MultiheadAttention(
283
+ (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
284
  )
285
  (dropout): Dropout(p=0.1, inplace=False)
286
  )
287
+ (attn2): CrossAttentionLayer(
288
  (attn): MultiheadAttention(
289
  (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
290
  )
291
+ (dropout): Dropout(p=0.1, inplace=False)
292
+ )
293
+ (ff): Sequential(
294
+ (0): Linear(in_features=768, out_features=3072, bias=True)
295
+ (1): ReLU()
296
+ (2): Linear(in_features=3072, out_features=768, bias=True)
297
+ )
298
+ (norm1): StyleAdaptiveLayerNorm(
299
+ (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
300
+ (fc): Linear(in_features=768, out_features=1536, bias=True)
301
+ )
302
+ (norm2): StyleAdaptiveLayerNorm(
303
+ (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
304
+ (fc): Linear(in_features=768, out_features=1536, bias=True)
305
+ )
306
+ (norm3): StyleAdaptiveLayerNorm(
307
+ (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
308
+ (fc): Linear(in_features=768, out_features=1536, bias=True)
309
  )
310
  )
311
+ )
312
+ (swin_layers): ModuleList(
313
+ (0-7): 8 x SwinTransformerBlock(
314
+ (attn): MultiheadAttention(
315
+ (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
316
+ )
317
+ (mlp): Sequential(
318
+ (0): Linear(in_features=768, out_features=3072, bias=True)
319
+ (1): GELU(approximate='none')
320
+ (2): Linear(in_features=3072, out_features=768, bias=True)
321
+ )
322
+ (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
323
+ (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
324
  )
325
  )
326
+ (refinement): RefinementBlock(
327
+ (conv): Conv2d(768, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
328
+ (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
329
+ (relu): ReLU(inplace=True)
330
+ )
331
+ (final_layer): Conv2d(3, 2048, kernel_size=(1, 1), stride=(1, 1))
332
+ (style_encoder): Sequential(
333
+ (0): Conv2d(2048, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
334
+ (1): ReLU()
335
+ (2): AdaptiveAvgPool2d(output_size=1)
336
+ (3): Flatten(start_dim=1, end_dim=-1)
337
+ (4): Linear(in_features=768, out_features=768, bias=True)
338
+ )
339
  )
340
  ```
341