aoxo committed
Commit 56e28d9 · verified · 1 Parent(s): 6645fd2

Update README.md

Files changed (1): README.md +4 −4
README.md CHANGED
@@ -116,15 +116,17 @@ visualize_tensor(output, "Output Image")
 
 #### Preprocessing
 
+![preprocessing](preprocessing.png)
+
 **Preprocessing of Large-Scale Image Data for Photorealism Enhancement**
 This section details our methodology for preprocessing a large-scale dataset of approximately **117 million game-rendered frames** from **9 AAA video games** and **1.24 billion real-world images** from Mapillary Vistas and Cityscapes, all in 4K resolution. The goal is to pair game frames with real images that exhibit the highest cosine similarity based on structural and visual features, ensuring alignment of fine details like object positions, level of detail and motion blur.
 
 Images and their corresponding style semantic maps were resized to **512 x 512** pixels and corrected to a **24-bit** depth (3 channels) if they exceeded this depth. We employ a novel **feature-mapped channel-split PSNR matching** approach using **EfficientNet** feature extraction, channel splitting, and dual metric computation of PSNR and cosine similarity. **Locality-Sensitive Hashing** (LSH) aids in efficiently identifying the **top-10 nearest neighbors** for each frame. This resulted in a massive dataset of **1.17** billion frame-image pairs and **12.4 billion** image-frame pairs. The final selection process involves assessing similarity consistency across channels to ensure accurate pairings. This scalable preprocessing pipeline enables efficient pairing while preserving critical visual details, laying the foundation for subsequent **contrastive learning** to enhance **photorealism in game-rendered frames**.
 
-![preprocessing](preprocessing.png)
-
 #### Training
 
+![training](training.png)
+
 RealFormer was trained using a large-scale, temporally aware image–frame pairing pipeline designed to maximize photorealistic alignment between synthetic and real-world imagery. Training leverages both **intra-frame reconstruction** and **inter-frame consistency** to improve spatial realism and temporal coherence.
 
 During training, consecutive frames ((t, t+1)) are sampled from video sequences and passed through a dynamic patch embedding module. In parallel, a large buffer of real-world images is maintained, from which the **top-K nearest neighbors** are selected using **cosine similarity scores** computed over high-level visual features. These similarity scores are then used to weight a **multi-scale style encoder**, producing temporally indexed style feature vectors that guide normalization through **Style Adaptive Layer Normalization (SALN)**.
@@ -133,8 +135,6 @@ Encoded patches are processed by a stack of **Swin Transformer encoder blocks**,
 
 Training optimizes a **composite objective** consisting of pixel-level reconstruction loss (L1), perceptual loss, style loss, contrastive loss, and total variation regularization. This multi-objective setup ensures fidelity, realism, and stability across diverse visual domains, while scaling efficiently to billions of image–frame associations.
 
-![training](training.png)
-
 #### Training Hyperparameters
 
 **v1**
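For reference, the **feature-mapped channel-split PSNR matching** described in the diff above combines per-channel PSNR with cosine similarity over extracted features, then checks that the similarity is consistent across channels. The following is a minimal NumPy sketch of that scoring step only: the feature vectors stand in for the EfficientNet embeddings, and the 50 dB normalizer, the 10 dB consistency window, and the blending weight `alpha` are illustrative assumptions, not values from the model card.

```python
import numpy as np

def channel_psnr(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> list:
    """Per-channel PSNR between two H x W x 3 images (the channel-split step)."""
    scores = []
    for c in range(a.shape[-1]):
        mse = np.mean((a[..., c].astype(np.float64) - b[..., c].astype(np.float64)) ** 2)
        mse = max(mse, 1e-12)  # guard against log(0) for identical channels
        scores.append(10.0 * np.log10(data_range ** 2 / mse))
    return scores

def cosine_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Cosine similarity between two feature vectors
    (stand-ins here for EfficientNet embeddings)."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))

def match_score(frame, real, feat_frame, feat_real, alpha=0.5):
    """Blend normalized mean per-channel PSNR with feature cosine similarity,
    and flag whether the per-channel PSNRs agree closely enough to trust the pair.
    alpha, the 50 dB cap, and the 10 dB window are assumed values for illustration."""
    psnrs = channel_psnr(frame, real)
    psnr_norm = min(float(np.mean(psnrs)) / 50.0, 1.0)   # ~50 dB treated as near-identical
    consistent = (max(psnrs) - min(psnrs)) < 10.0        # cross-channel consistency check
    score = alpha * psnr_norm + (1.0 - alpha) * cosine_similarity(feat_frame, feat_real)
    return score, consistent
```

In the pipeline the README describes, a score like this would be computed only for the LSH-retrieved top-10 candidates per frame rather than for all pairs.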
 
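The training section of the diff lists a **composite objective** of L1, perceptual, style, contrastive, and total-variation terms. Below is a NumPy sketch of how such a weighted sum could be assembled; the Gram-matrix style term, the placeholder scalars for the perceptual and contrastive terms, and the weights are illustrative assumptions, as the released loss implementation and weightings are not given here.

```python
import numpy as np

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Pixel-level reconstruction term."""
    return float(np.mean(np.abs(pred - target)))

def total_variation(img: np.ndarray) -> float:
    """TV regularizer: penalizes abrupt neighboring-pixel changes (H x W x C)."""
    dh = np.abs(img[1:, :, :] - img[:-1, :, :]).mean()
    dw = np.abs(img[:, 1:, :] - img[:, :-1, :]).mean()
    return float(dh + dw)

def gram(features: np.ndarray) -> np.ndarray:
    """Gram matrix over flattened spatial positions (a common style representation)."""
    c = features.shape[-1]
    f = features.reshape(-1, c)
    return f.T @ f / f.shape[0]

def style_loss(feat_pred: np.ndarray, feat_style: np.ndarray) -> float:
    return float(np.mean((gram(feat_pred) - gram(feat_style)) ** 2))

def composite_loss(pred, target, feat_pred, feat_style,
                   perc=0.0, contrast=0.0,
                   w=(1.0, 0.1, 0.05, 0.1, 1e-4)):
    """Weighted sum of the five terms named in the README. `perc` and `contrast`
    stand in for perceptual and contrastive terms computed elsewhere; the
    weights `w` are illustrative, not the released values."""
    terms = (l1_loss(pred, target), perc,
             style_loss(feat_pred, feat_style), contrast,
             total_variation(pred))
    return float(sum(wi * ti for wi, ti in zip(w, terms)))
```

With identical prediction and target and matching style features, every term vanishes and the loss is zero, which makes the sketch easy to sanity-check before wiring in real feature extractors.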