Update README.md

### Model Sources [optional]

- **Dataset:** [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling), [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v)
- **Repository:** [Swin Transformer](https://github.com/microsoft/Swin-Transformer)
- **Paper:** [Ze Liu et al. (2021)](https://arxiv.org/abs/2103.14030)

### Training Data

The model was first trained on the [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling); the decoder layers were then frozen to fine-tune it on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames of footage from video games such as WatchDogs 2, Grand Theft Auto V, and CyberPunk, as well as several Hollywood films and high-definition photos. The latter comprises ~25,000 high-definition semantic-segmentation-map / rendered-frame pairs captured in-game from Grand Theft Auto V together with a UNet-based semantic segmentation model.

### Training Procedure

- Optimizer: Adam
- Learning rate: 0.001
- Batch size: 8
- Steps per epoch: 3,125
- Number of epochs: 100
- Total number of steps: 312,500
- Loss function: combined L1 loss, perceptual loss, style-transfer loss, and total-variation loss
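The listed numbers are self-consistent: ~25,000 pairs at batch size 8 gives 3,125 steps per epoch, and 100 epochs gives 312,500 steps. As a rough illustration of how such a combined objective can be assembled — the per-term weights are assumptions, and the perceptual/style terms are only stubbed out since the card does not name a feature backbone:

```python
import numpy as np

def l1_loss(pred, target):
    # Mean absolute error between predicted and target frames
    return np.abs(pred - target).mean()

def total_variation_loss(img):
    # Penalizes high-frequency noise: absolute differences between
    # neighboring pixels along height and width (img shape: H x W x C)
    dh = np.abs(img[1:, :, :] - img[:-1, :, :]).mean()
    dw = np.abs(img[:, 1:, :] - img[:, :-1, :]).mean()
    return dh + dw

def combined_loss(pred, target, w_l1=1.0, w_tv=1e-4):
    # Perceptual and style-transfer terms would compare feature maps and
    # Gram matrices from a pretrained backbone; omitted here because the
    # card does not specify which backbone is used.
    return w_l1 * l1_loss(pred, target) + w_tv * total_variation_loss(pred)
```

The weights `w_l1` and `w_tv` are illustrative defaults, not values from the card.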

#### Preprocessing

Images and their corresponding style semantic maps were resized to the model's input-output window dimensions (512 x 512). Bit depth was corrected to 24-bit (3-channel) for images with a depth greater than 24 bits.
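A minimal sketch of this preprocessing step using Pillow; the function name and the choice of resampling filter are assumptions, since the card specifies only the target size and bit depth:

```python
from PIL import Image

def preprocess(path_or_img, size=(512, 512)):
    # Accept either a file path or an already-open PIL image
    img = Image.open(path_or_img) if isinstance(path_or_img, str) else path_or_img
    # Correct bit depth: collapse alpha / >24-bit modes to 3-channel,
    # 8-bits-per-channel RGB (i.e. 24-bit)
    if img.mode != "RGB":
        img = img.convert("RGB")
    # Resize to the 512 x 512 input-output window
    return img.resize(size, Image.BILINEAR)
```

The same routine would be applied to both the rendered frames and their style semantic maps.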

#### Training Hyperparameters

- Precision: fp32
- Embedding dimensions: 768
- Hidden dimensions: 3072
- Attention type: Linear Attention
- Number of attention heads: 16
- Number of attention layers: 8
- Number of transformer encoder layers (feed-forward): 8
- Number of transformer decoder layers (feed-forward): 8
- Activation function: ReLU
- Patch size: 8
- Swin window size: 7
- Swin shift size: 2

#### Speeds, Sizes, Times [optional]
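For reference, the architecture settings above can be gathered into a single configuration object. This is an illustrative sketch only — the field names are not the repository's actual API:

```python
from dataclasses import dataclass

@dataclass
class RealFormerConfig:
    # Values taken from the hyperparameter list above
    precision: str = "fp32"
    embed_dim: int = 768
    hidden_dim: int = 3072           # 4x embed_dim, the usual ViT FFN ratio
    attention_type: str = "linear"
    num_heads: int = 16
    num_attention_layers: int = 8
    num_encoder_layers: int = 8
    num_decoder_layers: int = 8
    activation: str = "relu"
    patch_size: int = 8
    swin_window_size: int = 7
    swin_shift_size: int = 2

    @property
    def head_dim(self) -> int:
        # Per-head dimension: 768 / 16 = 48
        return self.embed_dim // self.num_heads

    def tokens_per_image(self, image_size: int = 512) -> int:
        # A 512 x 512 input with 8 x 8 patches yields a 64 x 64 token grid
        side = image_size // self.patch_size
        return side * side
```

With the 512 x 512 window from the preprocessing section, this configuration implies 4,096 tokens per image.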