aoxo/RealFormer

Image-to-Image

English

art


aoxo committed on Oct 6, 2024

Commit 9a48f44 (verified)

1 Parent(s): b9a5707

Update README.md

Files changed (1)

README.md +129 -97


@@ -78,7 +78,7 @@ Use the code below to get started with the model.
 ```python
 # Instantiate the model
-model = ViTImage2Image(img_size=512, patch_size=16, emb_dim=768, num_heads=16, num_layers=8, hidden_dim=3072)
 # Move model to GPU if available
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
@@ -118,19 +118,50 @@ Images and their corresponding style semantic maps were resized to fit the input
 #### Training Hyperparameters
-- Precision:fp32
-- Embedded dimensions: 768
-- Hidden dimensions: 3072
-- Attention Type: Linear Attention
-- Number of attention heads: 16
-- Number of attention layers: 8
-- Number of transformer encoder layers (feed-forward): 8
-- Number of transformer decoder layers (feed-forward): 8
-- Activation function(s): ReLU, GeLU
-- Patch Size: 8
-- Swin Window Size: 7
-- Swin Shift Size: 2
-- Style Transfer Module: AdaIN
 #### Speeds, Sizes, Times
@@ -139,6 +170,7 @@ Images and their corresponding style semantic maps were resized to fit the input
 - v1_2: 200M params
 - v1_3: 93M params
 - v2_1: 2.9M params
 **Training hardware:** Each of the models was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
@@ -155,6 +187,10 @@ Images and their corresponding style semantic maps were resized to fit the input
 - v1_2: 764 MB
 - v1_3: 355 MB
 - v2_2: 11 MB
 ## Evaluation Data, Metrics & Results
@@ -166,8 +202,11 @@ Evaluation was performed on real-time footage captured from Grand Theft Auto V,
 ### Metrics
-- PSNR (Peak Signal-to-Noise Ratio)
-- Combined loss (L1 loss + Total Variation loss)
 ### Results
@@ -209,101 +248,94 @@ The objective of RealFormer is to attain the maximum level of detail to the real
 **Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction with fewer parameters. Its three predecessors attained a similar level of accuracy without the LbMhA layers. The general architecture is as follows:
 ```python
-DataParallel(
-  (module): ViTImage2Image(
-    (patch_embed): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
-    (encoder_layers): ModuleList(
-      (0-7): 8 x TransformerEncoderBlock(
-        (attn): LocationBasedMultiheadAttention(
-          (q_proj): Linear(in_features=768, out_features=768, bias=True)
-          (k_proj): Linear(in_features=768, out_features=768, bias=True)
-          (v_proj): Linear(in_features=768, out_features=768, bias=True)
-          (out_proj): Linear(in_features=768, out_features=768, bias=True)
-          (dropout): Dropout(p=0.1, inplace=False)
-        )
-        (ff): Sequential(
-          (0): Linear(in_features=768, out_features=3072, bias=True)
-          (1): ReLU()
-          (2): Linear(in_features=3072, out_features=768, bias=True)
-        )
-        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
-        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
-        (adain): AdaIN(
-          (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
-          (fc): Linear(in_features=768, out_features=1536, bias=True)
         )
         (dropout): Dropout(p=0.1, inplace=False)
       )
     )
-    (decoder_layers): ModuleList(
-      (0-7): 8 x TransformerDecoderBlock(
-        (attn1): LocationBasedMultiheadAttention(
-          (q_proj): Linear(in_features=768, out_features=768, bias=True)
-          (k_proj): Linear(in_features=768, out_features=768, bias=True)
-          (v_proj): Linear(in_features=768, out_features=768, bias=True)
-          (out_proj): Linear(in_features=768, out_features=768, bias=True)
-          (dropout): Dropout(p=0.1, inplace=False)
-        )
-        (attn2): LocationBasedMultiheadAttention(
-          (q_proj): Linear(in_features=768, out_features=768, bias=True)
-          (k_proj): Linear(in_features=768, out_features=768, bias=True)
-          (v_proj): Linear(in_features=768, out_features=768, bias=True)
-          (out_proj): Linear(in_features=768, out_features=768, bias=True)
-          (dropout): Dropout(p=0.1, inplace=False)
-        )
-        (ff): Sequential(
-          (0): Linear(in_features=768, out_features=3072, bias=True)
-          (1): ReLU()
-          (2): Linear(in_features=3072, out_features=768, bias=True)
-        )
-        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
-        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
-        (norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
-        (norm4): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
-        (adain1): AdaIN(
-          (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
-          (fc): Linear(in_features=768, out_features=1536, bias=True)
-        )
-        (adain2): AdaIN(
-          (norm): InstanceNorm1d(768, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
-          (fc): Linear(in_features=768, out_features=1536, bias=True)
         )
         (dropout): Dropout(p=0.1, inplace=False)
       )
-    )
-    (swin_layers): ModuleList(
-      (0-7): 8 x SwinTransformerBlock(
-        (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
         (attn): MultiheadAttention(
           (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
         )
-        (mlp): Sequential(
-          (0): Linear(in_features=768, out_features=3072, bias=True)
-          (1): GELU(approximate='none')
-          (2): Linear(in_features=3072, out_features=768, bias=True)
-        )
-        (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
       )
     )
-    (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
-    (mlp_head): Sequential(
-      (0): Linear(in_features=768, out_features=3072, bias=True)
-      (1): GELU(approximate='none')
-      (2): Linear(in_features=3072, out_features=768, bias=True)
-    )
-    (refinement): RefinementBlock(
-      (conv): Conv2d(768, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
-      (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
-      (relu): ReLU(inplace=True)
-    )
-    (style_encoder): Sequential(
-      (0): Conv2d(3, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
-      (1): ReLU()
-      (2): AdaptiveAvgPool2d(output_size=1)
-      (3): Flatten(start_dim=1, end_dim=-1)
-      (4): Linear(in_features=768, out_features=768, bias=True)
     )
   )
 )
 ```

 ```python
 # Instantiate the model
+model = RealFormerv3(img_size=256, patch_size=8, emb_dim=768, num_heads=42, num_layers=16, hidden_dim=3072)
 # Move model to GPU if available
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 #### Training Hyperparameters
+**v1**
+- **Precision**: fp32
+- **Embedded dimensions**: 768
+- **Hidden dimensions**: 3072
+- **Attention Type**: Linear Attention
+- **Number of attention heads**: 16
+- **Number of attention layers**: 8
+- **Number of transformer encoder layers (feed-forward)**: 8
+- **Number of transformer decoder layers (feed-forward)**: 8
+- **Activation function(s)**: ReLU, GELU
+- **Patch Size**: 8
+- **Swin Window Size**: 7
+- **Swin Shift Size**: 2
+- **Style Transfer Module**: AdaIN (Adaptive Instance Normalization)
+**v2**
+- **Precision**: fp32
+- **Embedded dimensions**: 768
+- **Hidden dimensions**: 3072
+- **Attention Type**: Location-Based Multi-Head Attention (Linear Attention)
+- **Number of attention heads**: 16
+- **Number of attention layers**: 8
+- **Number of transformer encoder layers (feed-forward)**: 8
+- **Number of transformer decoder layers (feed-forward)**: 8
+- **Activation function(s)**: ReLU, GELU
+- **Patch Size**: 16
+- **Swin Window Size**: 7
+- **Swin Shift Size**: 2
+- **Style Transfer Module**: AdaIN
+**v3**
+- **Precision:** FP32, FP16, BF16, INT8
+- **Embedding Dimensions:** 768
+- **Hidden Dimensions:** 3072
+- **Attention Type:** Location-Based Multi-Head Attention (Linear Attention)
+- **Number of Attention Heads:** 42
+- **Number of Attention Layers:** 16
+- **Number of Transformer Encoder Layers (Feed-Forward):** 16
+- **Number of Transformer Decoder Layers (Feed-Forward):** 16
+- **Activation Functions:** ReLU, GELU
+- **Patch Size:** 8
+- **Swin Window Size:** 7
+- **Swin Shift Size:** 2
+- **Style Transfer Module:** Style Adaptive Layer Normalization (SALN)
 #### Speeds, Sizes, Times
 - v1_2: 200M params
 - v1_3: 93M params
 - v2_1: 2.9M params
+- v3: 252.6M params
 **Training hardware:** Each of the models was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
 - v1_2: 764 MB
 - v1_3: 355 MB
 - v2_2: 11 MB
+- v3: 1.01 GB
+- v3_fp16: 505 MB
+- v3_bf16: 505 MB
+- v3_int8: 344 MB
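
The v3 checkpoint sizes listed here can be sanity-checked as parameter count times bytes per parameter. The sketch below is an estimate only: it assumes the decimal convention 1 MB = 10^6 bytes, ignores any file-format overhead, and uses a hypothetical helper name, `checkpoint_size_mb`, that does not come from the RealFormer code.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def checkpoint_size_mb(num_params: float, precision: str) -> float:
    """Estimate raw weight storage in decimal megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


V3_PARAMS = 252.6e6  # parameter count stated for v3

for precision in ("fp32", "fp16", "bf16", "int8"):
    size = checkpoint_size_mb(V3_PARAMS, precision)
    print(f"v3 {precision}: ~{size:.0f} MB")
```

The fp32 and fp16/bf16 estimates (roughly 1010 MB and 505 MB) agree with the listed sizes. The int8 estimate (roughly 253 MB) is smaller than the listed size for v3_int8, which would be consistent with only part of the weights being quantized, although the README does not say so.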
 ## Evaluation Data, Metrics & Results
 ### Metrics
+- Peak Signal-to-Noise Ratio (PSNR)
+- Cosine Similarity Score (CSS)
+- L1 Loss
+- Contrastive Loss (CL)
+- Combined loss (L1 loss + PSNR + CSS + CL)
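
For reference, PSNR is conventionally defined as 10 * log10(MAX^2 / MSE). The following is a minimal pure-Python sketch of that definition, assuming pixel values scaled to [0, 1] and images passed as flat sequences of floats. The function name `psnr` and its `max_val` parameter are illustrative and are not taken from the RealFormer training code.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)


# Example: every pixel is off by 0.1, so the MSE is about 0.01.
print(round(psnr([0.5, 0.5], [0.6, 0.4]), 2))
```

Higher PSNR means a closer match, so a loss that includes it would need to invert or negate the term in some way. The README does not specify how the combined loss weights or transforms its components.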
 ### Results
 **Architecture:** The latest model, v2_1, introduces Location-based Multi-head Attention (LbMhA) to improve feature extraction at lower parameters. The three other predecessors attained a similar level of accuracy without the LbMhA layers. The general architecture is as follows:
 ```python
+RealFormerv3(
+  (patch_embed): DynamicPatchEmbedding(
+    (proj): Conv2d(2048, 768, kernel_size=(1, 1), stride=(1, 1))
+  )
+  (encoder_layers): ModuleList(
+    (0-7): 8 x TransformerEncoderBlock(
+      (attn): CrossAttentionLayer(
+        (attn): MultiheadAttention(
+          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
         )
         (dropout): Dropout(p=0.1, inplace=False)
       )
+      (ff): Sequential(
+        (0): Linear(in_features=768, out_features=3072, bias=True)
+        (1): ReLU()
+        (2): Linear(in_features=3072, out_features=768, bias=True)
+      )
+      (norm1): StyleAdaptiveLayerNorm(
+        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (fc): Linear(in_features=768, out_features=1536, bias=True)
+      )
+      (norm2): StyleAdaptiveLayerNorm(
+        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (fc): Linear(in_features=768, out_features=1536, bias=True)
+      )
+      (dropout): Dropout(p=0.1, inplace=False)
     )
+  )
+  (decoder_layers): ModuleList(
+    (0-7): 8 x TransformerDecoderBlock(
+      (attn1): CrossAttentionLayer(
+        (attn): MultiheadAttention(
+          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
         )
         (dropout): Dropout(p=0.1, inplace=False)
       )
+      (attn2): CrossAttentionLayer(
         (attn): MultiheadAttention(
           (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
         )
+        (dropout): Dropout(p=0.1, inplace=False)
+      )
+      (ff): Sequential(
+        (0): Linear(in_features=768, out_features=3072, bias=True)
+        (1): ReLU()
+        (2): Linear(in_features=3072, out_features=768, bias=True)
+      )
+      (norm1): StyleAdaptiveLayerNorm(
+        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (fc): Linear(in_features=768, out_features=1536, bias=True)
+      )
+      (norm2): StyleAdaptiveLayerNorm(
+        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (fc): Linear(in_features=768, out_features=1536, bias=True)
+      )
+      (norm3): StyleAdaptiveLayerNorm(
+        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+        (fc): Linear(in_features=768, out_features=1536, bias=True)
       )
     )
+  )
+  (swin_layers): ModuleList(
+    (0-7): 8 x SwinTransformerBlock(
+      (attn): MultiheadAttention(
+        (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+      )
+      (mlp): Sequential(
+        (0): Linear(in_features=768, out_features=3072, bias=True)
+        (1): GELU(approximate='none')
+        (2): Linear(in_features=3072, out_features=768, bias=True)
+      )
+      (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+      (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
     )
   )
+  (refinement): RefinementBlock(
+    (conv): Conv2d(768, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
+    (bn): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
+    (relu): ReLU(inplace=True)
+  )
+  (final_layer): Conv2d(3, 2048, kernel_size=(1, 1), stride=(1, 1))
+  (style_encoder): Sequential(
+    (0): Conv2d(2048, 768, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
+    (1): ReLU()
+    (2): AdaptiveAvgPool2d(output_size=1)
+    (3): Flatten(start_dim=1, end_dim=-1)
+    (4): Linear(in_features=768, out_features=768, bias=True)
+  )
 )
 ```
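
The constructor arguments shown for `RealFormerv3` imply some simple shape arithmetic that is worth checking before instantiating the model. The sketch below assumes non-overlapping square patches and the rule enforced by `torch.nn.MultiheadAttention` that the embedding dimension must be divisible by the number of heads. The helper names `num_patches` and `head_dim` are hypothetical, and whether `RealFormerv3` handles `num_heads=42` differently internally is not stated in the README.

```python
def num_patches(img_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be a multiple of patch_size")
    return (img_size // patch_size) ** 2


def head_dim(emb_dim: int, num_heads: int) -> int:
    """Per-head width; raises if emb_dim does not split evenly across heads."""
    if emb_dim % num_heads != 0:
        raise ValueError(
            f"emb_dim={emb_dim} is not divisible by num_heads={num_heads}"
        )
    return emb_dim // num_heads


print(num_patches(256, 8))  # v3 arguments: img_size=256, patch_size=8
print(head_dim(768, 16))    # v1/v2 arguments: 16 heads
try:
    head_dim(768, 42)       # v3 arguments: 42 heads
except ValueError as err:
    print(err)
```

Under these assumptions, 768 does not divide evenly by 42, so the head count listed for v3 may need to be verified against the actual model code.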