Update README.md

README.md (changed)

@@ -78,7 +78,7 @@ Use the code below to get started with the model.
 
 ```python
 # Instantiate the model
-model =
+model = RealFormerAGA(img_size=256, patch_size=8, emb_dim=768, num_heads=32, num_layers=16, hidden_dim=3072)
 
 # Move model to GPU if available
 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
@@ -168,6 +168,21 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
 - Swin Shift Size: 2
 - Style Transfer Module: Style Adaptive Layer Normalization (SALN)
 
+**v4**
+- Precision: FP32, FP16, BF16, INT8
+- Embedding Dimensions: 768
+- Hidden Dimensions: 3072
+- Attention Type: Location-Based Multi-Head Attention (Linear Attention) and Cross-Attention (pretrained attention-conditioned)
+- Number of Attention Heads: 32
+- Number of Attention Layers: 16
+- Number of Transformer Encoder Layers (Feed-Forward): 16
+- Number of Transformer Decoder Layers (Feed-Forward): 16
+- Activation Functions: ReLU, GeLU
+- Patch Size: 8
+- Swin Window Size: 7
+- Swin Shift Size: 2
+- Style Transfer Module: Style Adaptive Layer Normalization (SALN)
+
 #### Speeds, Sizes, Times
 
-**Model size:** There are currently five versions of the model:
+**Model size:** There are currently six versions of the model:
@@ -176,6 +191,7 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
 - v1_3: 93M params
 - v2_1: 2.9M params
 - v3: 252.6M params
+- v4: 454.2M params
 
-**Training hardware:** Each of the models were trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
+**Training hardware:** Each of the models was trained on 2 x T4 GPUs (multi-GPU training). For this reason, linear attention modules were implemented as ring (distributed) attention during training.
 
@@ -196,6 +212,10 @@ Images and their corresponding style semantic maps were resized to **512 x 512**
 - v3_fp16: 505M
 - v3_bf16: 505M
 - v3_int8: 344M
+- v4: 1.69 GB
+- v4_fp16: 866M
+- v4_bf16: 866M
+- v4_int8: 578M
 
 ## Evaluation Data, Metrics & Results
 
@@ -203,7 +223,7 @@ This section covers information on how the model was evaluated at each stage.
 
 ### Evaluation Data
 
-Evaluation was performed on real-time footage captured from Grand Theft Auto V, Cyberpunk 2077
+Evaluation was performed on real-time footage captured from Grand Theft Auto IV, Grand Theft Auto V, Cyberpunk 2077, Watch Dogs, Marvel's Spider-Man, Far Cry 6, Red Dead Redemption 2 and Control.
 
 ### Metrics
 
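Both v3 and the new v4 list Style Adaptive Layer Normalization (SALN) as the style transfer module, but the card does not define it. A minimal sketch, assuming the common formulation in which a style embedding predicts the per-channel scale and shift of a layer norm (the class name, initialization, and single linear projection here are illustrative assumptions, not the model's actual code):

```python
import numpy as np

class SALN:
    """Style-Adaptive Layer Normalization (illustrative sketch).

    Normalizes each token, then scales and shifts it with parameters
    predicted from a style embedding instead of learned constants.
    """

    def __init__(self, style_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical affine map from the style vector to (gamma, beta).
        self.W = rng.standard_normal((style_dim, 2 * feat_dim)) * 0.02
        self.b = np.zeros(2 * feat_dim)

    def __call__(self, x, style):
        # x: (tokens, feat_dim), style: (style_dim,)
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_hat = (x - mu) / np.sqrt(var + 1e-5)       # standard LayerNorm core
        gamma, beta = np.split(style @ self.W + self.b, 2)
        return (1.0 + gamma) * x_hat + beta          # style-conditioned affine
```

The `1 + gamma` form keeps the module close to an identity map at initialization, a common choice for adaptive normalization layers.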
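The training-hardware note says the linear attention modules ran as ring (distributed) attention across the 2 x T4s. One reason this combination works cleanly is that linear attention reduces the keys and values to a summary that is a plain sum, so per-device shards can be accumulated exactly as they pass around the ring. A single-process simulation of that idea (the feature map and function names are illustrative, not taken from the repo):

```python
import numpy as np

def phi(x):
    # Positive feature map (ELU + 1 style) commonly used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # Full (non-distributed) linear attention over all keys/values.
    s = phi(k).T @ v                 # (d, d) key-value summary
    z = phi(k).sum(axis=0)           # (d,) normalizer
    return (phi(q) @ s) / (phi(q) @ z)[:, None]

def ring_linear_attention(q, k_shards, v_shards):
    # Simulate the ring: each "device" holds one (k, v) shard and passes
    # its partial summary along; accumulation is exact because the
    # linear-attention summary is a plain sum over keys.
    d = q.shape[1]
    s, z = np.zeros((d, d)), np.zeros(d)
    for k_i, v_i in zip(k_shards, v_shards):   # one ring step per shard
        s += phi(k_i).T @ v_i
        z += phi(k_i).sum(axis=0)
    return (phi(q) @ s) / (phi(q) @ z)[:, None]
```

Because the accumulation is exact, the sharded version reproduces the full computation bit-for-bit up to floating-point reordering, which is what makes the multi-GPU implementation a pure engineering change rather than an approximation.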
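The new v4 checkpoint sizes are consistent with the listed 454.2M parameter count when read in binary units; a quick sanity check (not from the README):

```python
# Derive v4 checkpoint sizes from the parameter count given in the
# model card (454.2M). Reported sizes match binary units (GiB/MiB).
params = 454.2e6

fp32_gib = params * 4 / 2**30    # 4 bytes per parameter
fp16_mib = params * 2 / 2**20    # 2 bytes per parameter (same for bf16)

print(f"fp32: {fp32_gib:.2f} GiB")       # -> fp32: 1.69 GiB
print(f"fp16/bf16: {fp16_mib:.0f} MiB")  # -> fp16/bf16: 866 MiB
```

The INT8 figure (578M) is larger than one byte per parameter (~433 MiB), presumably because some tensors are kept in higher precision, as is typical for INT8 export.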