aoxo committed
Commit 56e28d9 · verified · 1 Parent(s): 6645fd2

Update README.md

Files changed (1): README.md +4 −4
README.md CHANGED
@@ -116,15 +116,17 @@ visualize_tensor(output, "Output Image")
 
 #### Preprocessing
 
+![preprocessing](preprocessing.png)
+
 **Preprocessing of Large-Scale Image Data for Photorealism Enhancement**
 This section details our methodology for preprocessing a large-scale dataset of approximately **117 million game-rendered frames** from **9 AAA video games** and **1.24 billion real-world images** from Mapillary Vistas and Cityscapes, all in 4K resolution. The goal is to pair game frames with real images that exhibit the highest cosine similarity based on structural and visual features, ensuring alignment of fine details like object positions, level of detail and motion blur.
 
 Images and their corresponding style semantic maps were resized to **512 x 512** pixels and corrected to a **24-bit** depth (3 channels) if they exceeded this depth. We employ a novel **feature-mapped channel-split PSNR matching** approach using **EfficientNet** feature extraction, channel splitting, and dual metric computation of PSNR and cosine similarity. **Locality-Sensitive Hashing** (LSH) aids in efficiently identifying the **top-10 nearest neighbors** for each frame. This resulted in a massive dataset of **1.17** billion frame-image pairs and **12.4 billion** image-frame pairs. The final selection process involves assessing similarity consistency across channels to ensure accurate pairings. This scalable preprocessing pipeline enables efficient pairing while preserving critical visual details, laying the foundation for subsequent **contrastive learning** to enhance **photorealism in game-rendered frames**.
 
-![preprocessing](preprocessing.png)
-
 #### Training
 
+![training](training.png)
+
 RealFormer was trained using a large-scale, temporally aware image–frame pairing pipeline designed to maximize photorealistic alignment between synthetic and real-world imagery. Training leverages both **intra-frame reconstruction** and **inter-frame consistency** to improve spatial realism and temporal coherence.
 
 During training, consecutive frames ((t, t+1)) are sampled from video sequences and passed through a dynamic patch embedding module. In parallel, a large buffer of real-world images is maintained, from which the **top-K nearest neighbors** are selected using **cosine similarity scores** computed over high-level visual features. These similarity scores are then used to weight a **multi-scale style encoder**, producing temporally indexed style feature vectors that guide normalization through **Style Adaptive Layer Normalization (SALN)**.
@@ -133,8 +135,6 @@ Encoded patches are processed by a stack of **Swin Transformer encoder blocks**,
 
 Training optimizes a **composite objective** consisting of pixel-level reconstruction loss (L1), perceptual loss, style loss, contrastive loss, and total variation regularization. This multi-objective setup ensures fidelity, realism, and stability across diverse visual domains, while scaling efficiently to billions of image–frame associations.
 
-![training](training.png)
-
 #### Training Hyperparameters
 
 **v1**
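For reference, the **feature-mapped channel-split PSNR matching** described in the diff above combines per-channel PSNR with cosine similarity over extracted features, then checks that the similarity is consistent across channels. The following is a minimal NumPy sketch of that scoring step only: the feature vectors stand in for the EfficientNet embeddings, and the 50 dB normalizer, the 10 dB consistency window, and the blending weight `alpha` are illustrative assumptions, not values from the model card.

```python
import numpy as np

def channel_psnr(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> list:
    """Per-channel PSNR between two H x W x 3 images (the channel-split step)."""
    scores = []
    for c in range(a.shape[-1]):
        mse = np.mean((a[..., c].astype(np.float64) - b[..., c].astype(np.float64)) ** 2)
        mse = max(mse, 1e-12)  # guard against log(0) for identical channels
        scores.append(10.0 * np.log10(data_range ** 2 / mse))
    return scores

def cosine_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Cosine similarity between two feature vectors
    (stand-ins here for EfficientNet embeddings)."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))

def match_score(frame, real, feat_frame, feat_real, alpha=0.5):
    """Blend normalized mean per-channel PSNR with feature cosine similarity,
    and flag whether the per-channel PSNRs agree closely enough to trust the pair.
    alpha, the 50 dB cap, and the 10 dB window are assumed values for illustration."""
    psnrs = channel_psnr(frame, real)
    psnr_norm = min(float(np.mean(psnrs)) / 50.0, 1.0)   # ~50 dB treated as near-identical
    consistent = (max(psnrs) - min(psnrs)) < 10.0        # cross-channel consistency check
    score = alpha * psnr_norm + (1.0 - alpha) * cosine_similarity(feat_frame, feat_real)
    return score, consistent
```

In the pipeline the README describes, a score like this would be computed only for the LSH-retrieved top-10 candidates per frame rather than for all pairs.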
 
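The training section of the diff lists a **composite objective** of L1, perceptual, style, contrastive, and total-variation terms. Below is a NumPy sketch of how such a weighted sum could be assembled; the Gram-matrix style term, the placeholder scalars for the perceptual and contrastive terms, and the weights are illustrative assumptions, as the released loss implementation and weightings are not given here.

```python
import numpy as np

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Pixel-level reconstruction term."""
    return float(np.mean(np.abs(pred - target)))

def total_variation(img: np.ndarray) -> float:
    """TV regularizer: penalizes abrupt neighboring-pixel changes (H x W x C)."""
    dh = np.abs(img[1:, :, :] - img[:-1, :, :]).mean()
    dw = np.abs(img[:, 1:, :] - img[:, :-1, :]).mean()
    return float(dh + dw)

def gram(features: np.ndarray) -> np.ndarray:
    """Gram matrix over flattened spatial positions (a common style representation)."""
    c = features.shape[-1]
    f = features.reshape(-1, c)
    return f.T @ f / f.shape[0]

def style_loss(feat_pred: np.ndarray, feat_style: np.ndarray) -> float:
    return float(np.mean((gram(feat_pred) - gram(feat_style)) ** 2))

def composite_loss(pred, target, feat_pred, feat_style,
                   perc=0.0, contrast=0.0,
                   w=(1.0, 0.1, 0.05, 0.1, 1e-4)):
    """Weighted sum of the five terms named in the README. `perc` and `contrast`
    stand in for perceptual and contrastive terms computed elsewhere; the
    weights `w` are illustrative, not the released values."""
    terms = (l1_loss(pred, target), perc,
             style_loss(feat_pred, feat_style), contrast,
             total_variation(pred))
    return float(sum(wi * ti for wi, ti in zip(w, terms)))
```

With identical prediction and target and matching style features, every term vanishes and the loss is zero, which makes the sketch easy to sanity-check before wiring in real feature extractors.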