Update README.md

### Model Sources [optional]

- **Dataset:** [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling), [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v)
- **Repository:** [Swin Transformer](https://github.com/microsoft/Swin-Transformer)
- **Paper:** [Ze Liu et al. (2021)](https://arxiv.org/abs/2103.14030)

### Training Data

The model was first trained on the [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling); the decoder layers were then frozen to fine-tune it on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames of footage from video games such as WatchDogs 2, Grand Theft Auto V, and CyberPunk, as well as several Hollywood films and high-definition photos. The latter comprises ~25,000 high-definition semantic-segmentation-map / rendered-frame pairs captured in-game from Grand Theft Auto V together with a UNet-based semantic segmentation model.

### Training Procedure

- Optimizer: Adam
- Learning rate: 0.001
- Batch size: 8
- Steps per epoch: 3,125
- Number of epochs: 100
- Total number of steps: 312,500
- Loss function: combined L1 loss, perceptual loss, style-transfer loss, and total-variation loss
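The listed numbers are self-consistent: ~25,000 pairs at batch size 8 gives 3,125 steps per epoch, and 100 epochs gives 312,500 steps. As a rough illustration of how such a combined objective can be assembled — the per-term weights are assumptions, and the perceptual/style terms are only stubbed out since the card does not name a feature backbone:

```python
import numpy as np

def l1_loss(pred, target):
    # Mean absolute error between predicted and target frames
    return np.abs(pred - target).mean()

def total_variation_loss(img):
    # Penalizes high-frequency noise: absolute differences between
    # neighboring pixels along height and width (img shape: H x W x C)
    dh = np.abs(img[1:, :, :] - img[:-1, :, :]).mean()
    dw = np.abs(img[:, 1:, :] - img[:, :-1, :]).mean()
    return dh + dw

def combined_loss(pred, target, w_l1=1.0, w_tv=1e-4):
    # Perceptual and style-transfer terms would compare feature maps and
    # Gram matrices from a pretrained backbone; omitted here because the
    # card does not specify which backbone is used.
    return w_l1 * l1_loss(pred, target) + w_tv * total_variation_loss(pred)
```

The weights `w_l1` and `w_tv` are illustrative defaults, not values from the card.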

#### Preprocessing

Images and their corresponding style semantic maps were resized to the model's input-output window dimensions (512 x 512). Bit depth was corrected to 24-bit (3-channel) for images with a depth greater than 24 bits.
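A minimal sketch of this preprocessing step using Pillow; the function name and the choice of resampling filter are assumptions, since the card specifies only the target size and bit depth:

```python
from PIL import Image

def preprocess(path_or_img, size=(512, 512)):
    # Accept either a file path or an already-open PIL image
    img = Image.open(path_or_img) if isinstance(path_or_img, str) else path_or_img
    # Correct bit depth: collapse alpha / >24-bit modes to 3-channel,
    # 8-bits-per-channel RGB (i.e. 24-bit)
    if img.mode != "RGB":
        img = img.convert("RGB")
    # Resize to the 512 x 512 input-output window
    return img.resize(size, Image.BILINEAR)
```

The same routine would be applied to both the rendered frames and their style semantic maps.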

#### Training Hyperparameters

- Precision: fp32
- Embedding dimensions: 768
- Hidden dimensions: 3072
- Attention type: Linear Attention
- Number of attention heads: 16
- Number of attention layers: 8
- Number of transformer encoder layers (feed-forward): 8
- Number of transformer decoder layers (feed-forward): 8
- Activation function: ReLU
- Patch size: 8
- Swin window size: 7
- Swin shift size: 2

#### Speeds, Sizes, Times [optional]
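For reference, the architecture settings above can be gathered into a single configuration object. This is an illustrative sketch only — the field names are not the repository's actual API:

```python
from dataclasses import dataclass

@dataclass
class RealFormerConfig:
    # Values taken from the hyperparameter list above
    precision: str = "fp32"
    embed_dim: int = 768
    hidden_dim: int = 3072           # 4x embed_dim, the usual ViT FFN ratio
    attention_type: str = "linear"
    num_heads: int = 16
    num_attention_layers: int = 8
    num_encoder_layers: int = 8
    num_decoder_layers: int = 8
    activation: str = "relu"
    patch_size: int = 8
    swin_window_size: int = 7
    swin_shift_size: int = 2

    @property
    def head_dim(self) -> int:
        # Per-head dimension: 768 / 16 = 48
        return self.embed_dim // self.num_heads

    def tokens_per_image(self, image_size: int = 512) -> int:
        # A 512 x 512 input with 8 x 8 patches yields a 64 x 64 token grid
        side = image_size // self.patch_size
        return side * side
```

With the 512 x 512 window from the preprocessing section, this configuration implies 4,096 tokens per image.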