Update README.md

README.md (changed)

@@ -78,7 +78,7 @@ Use the code below to get started with the model.
 
 ```python
 # Instantiate the model
-model =
+model = RealFormerAGA(img_size=256, patch_size=8, emb_dim=768, num_heads=32, num_layers=16, hidden_dim=3072)
 
 # Move model to GPU if available
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
@@ -168,6 +168,21 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
 - Swin Shift Size: 2
 - Style Transfer Module: Style Adaptive Layer Normalization (SALN)
 
+**v4**
+- Precision: FP32, FP16, BF16, INT8
+- Embedding Dimensions: 768
+- Hidden Dimensions: 3072
+- Attention Type: Location-Based Multi-Head Attention (Linear Attention) and Cross-Attention (pretrained attention-conditioned)
+- Number of Attention Heads: 32
+- Number of Attention Layers: 16
+- Number of Transformer Encoder Layers (Feed-Forward): 16
+- Number of Transformer Decoder Layers (Feed-Forward): 16
+- Activation Functions: ReLU, GeLU
+- Patch Size: 8
+- Swin Window Size: 7
+- Swin Shift Size: 2
+- Style Transfer Module: Style Adaptive Layer Normalization (SALN)
+
 #### Speeds, Sizes, Times
 
-**Model size:** There are currently five versions of the model:
+**Model size:** There are currently six versions of the model:
@@ -176,6 +191,7 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
 - v1_3: 93M params
 - v2_1: 2.9M params
 - v3: 252.6M params
+- v4: 454.2M params
 
-**Training hardware:** Each of the models were trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
+**Training hardware:** Each of the models was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
 
@@ -196,6 +212,10 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
 - v3_fp16: 505M
 - v3_bf16: 505M
 - v3_int8: 344M
+- v4: 1.69 GB
+- v4_fp16: 866M
+- v4_bf16: 866M
+- v4_int8: 578M
 
 ## Evaluation Data, Metrics & Results
 
@@ -203,7 +223,7 @@ This section covers information on how the model was evaluated at each stage.
 
 ### Evaluation Data
 
-Evaluation was performed on real-time footage captured from Grand Theft Auto V, Cyberpunk 2077
+Evaluation was performed on real-time footage captured from Grand Theft Auto IV, Grand Theft Auto V, Cyberpunk 2077, Watch Dogs, Marvel's Spider-Man, Far Cry 6, Red Dead Redemption 2 and Control.
 
 ### Metrics
 
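Both v3 and the new v4 list Style Adaptive Layer Normalization (SALN) as the style transfer module, but the card does not define it. A minimal sketch, assuming the common formulation in which a style embedding predicts the per-channel scale and shift of a layer norm (the class name, initialization, and single linear projection here are illustrative assumptions, not the model's actual code):

```python
import numpy as np

class SALN:
    """Style-Adaptive Layer Normalization (illustrative sketch).

    Normalizes each token, then scales and shifts it with parameters
    predicted from a style embedding instead of learned constants.
    """

    def __init__(self, style_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical affine map from the style vector to (gamma, beta).
        self.W = rng.standard_normal((style_dim, 2 * feat_dim)) * 0.02
        self.b = np.zeros(2 * feat_dim)

    def __call__(self, x, style):
        # x: (tokens, feat_dim), style: (style_dim,)
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + 1e-5)       # standard LayerNorm core
        gamma, beta = np.split(style @ self.W + self.b, 2)
        return (1.0 + gamma) * x_hat + beta          # style-conditioned affine
```

The `1 + gamma` form keeps the module close to an identity map at initialization, a common choice for adaptive normalization layers.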
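The training-hardware note says the linear attention modules ran as ring (distributed) attention across the 2 x T4s. One reason this combination works cleanly is that linear attention reduces the keys and values to a summary that is a plain sum, so per-device shards can be accumulated exactly as they pass around the ring. A single-process simulation of that idea (the feature map and function names are illustrative, not taken from the repo):

```python
import numpy as np

def phi(x):
    # Positive feature map (ELU + 1 style) commonly used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # Full (non-distributed) linear attention over all keys/values.
    s = phi(k).T @ v                 # (d, d) key-value summary
    z = phi(k).sum(axis=0)           # (d,) normalizer
    return (phi(q) @ s) / (phi(q) @ z)[:, None]

def ring_linear_attention(q, k_shards, v_shards):
    # Simulate the ring: each "device" holds one (k, v) shard and passes
    # its partial summary along; accumulation is exact because the
    # linear-attention summary is a plain sum over keys.
    d = q.shape[1]
    s, z = np.zeros((d, d)), np.zeros(d)
    for k_i, v_i in zip(k_shards, v_shards):   # one ring step per shard
        s += phi(k_i).T @ v_i
        z += phi(k_i).sum(axis=0)
    return (phi(q) @ s) / (phi(q) @ z)[:, None]
```

Because the accumulation is exact, the sharded version reproduces the full computation bit-for-bit up to floating-point reordering, which is what makes the multi-GPU implementation a pure engineering change rather than an approximation.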
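The new v4 checkpoint sizes are consistent with the listed 454.2M parameter count when read in binary units; a quick sanity check (not from the README):

```python
# Derive v4 checkpoint sizes from the parameter count given in the
# model card (454.2M). Reported sizes match binary units (GiB/MiB).
params = 454.2e6

fp32_gib = params * 4 / 2**30    # 4 bytes per parameter
fp16_mib = params * 2 / 2**20    # 2 bytes per parameter (same for bf16)

print(f"fp32: {fp32_gib:.2f} GiB")       # -> fp32: 1.69 GiB
print(f"fp16/bf16: {fp16_mib:.0f} MiB")  # -> fp16/bf16: 866 MiB
```

The INT8 figure (578M) is larger than one byte per parameter (~433 MiB), presumably because some tensors are kept in higher precision, as is typical for INT8 export.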