---
license: mit
datasets:
- UserNae3/LLVIP
pipeline_tag: image-to-image
---

# Conditional GAN for Visible → Infrared (LLVIP)

> **High-fidelity Visible-to-Infrared Translation using a Conditional GAN with Multi-Loss Optimization**

---

## Overview

This project implements a **Conditional Generative Adversarial Network (cGAN)** trained to translate **visible-light (RGB)** images into **infrared (IR)** representations. It leverages **multi-loss optimization**, combining pixel, perceptual, adversarial, and edge-based objectives, to generate sharp, realistic IR outputs that preserve both **scene structure** and **thermal contrast**.

The highest weight is given to the **L1 loss**, ensuring that overall brightness and object boundaries remain consistent between the visible and infrared domains.

---

## Dataset

- **Dataset:** [LLVIP Dataset](https://huggingface.co/datasets/UserNae3/LLVIP)

Paired **visible (RGB)** and **infrared (IR)** images under diverse lighting and background conditions.

---

## Model Architecture

- **Type:** Conditional GAN (cGAN)
- **Direction:** *Visible → Infrared*
- **Framework:** TensorFlow
- **Pipeline Tag:** `image-to-image`
- **License:** MIT

### Generator

- U-Net encoder–decoder with skip connections
- Conditioned on the RGB input
- Output: single-channel IR image

### Discriminator

- PatchGAN discriminator that evaluates the realism of local patches for fine-detail learning

---

## ⚙️ Training Configuration

| Setting | Value |
|----------|--------|
| **Epochs** | 100 |
| **Steps per Epoch** | 376 |
| **Batch Size** | 4 |
| **Optimizer** | Adam (β₁ = 0.5, β₂ = 0.999) |
| **Learning Rate** | 2e-4 |
| **Precision** | Mixed (32) |
| **Hardware** | NVIDIA T4 (Kaggle GPU Runtime) |

---

## Multi-Loss Function Design

| Loss Type | Description | Weight (λ) | Purpose |
|------------|--------------|-------------|----------|
| **L1 Loss** | Pixel-wise mean absolute error between generated and real IR | **100** | Ensures global brightness & shape consistency |
| **Perceptual Loss (VGG)** | Feature loss from `block5_conv4` of a pretrained VGG-19 | **10** | Captures high-level texture and semantic alignment |
| **Adversarial Loss** | Binary cross-entropy loss from the PatchGAN discriminator | **1** | Encourages realistic IR texture generation |
| **Edge Loss** | Sobel/gradient difference between real & generated images | **5** | Enhances sharpness and edge clarity |

---

The **total generator loss** is computed as:

\[
L_{G} = \lambda_{L1}\,L_{L1} + \lambda_{\text{perc}}\,L_{\text{perc}} + \lambda_{\text{adv}}\,L_{\text{adv}} + \lambda_{\text{edge}}\,L_{\text{edge}}
\]

## Evaluation Metrics

| Metric | Definition | Result |
|---------|-------------|--------|
| **L1 Loss** | Mean absolute difference between generated and ground-truth IR | **0.0611** |
| **PSNR (Peak Signal-to-Noise Ratio)** | Measures reconstruction quality (higher is better) | **24.3096 dB** |
| **SSIM (Structural Similarity Index Measure)** | Perceptual similarity between generated & target images | **0.8386** |

---

## Model Architectures

| Model | Visualization |
|-------|---------------|
| **Generator** | ![Generator Architecture](generator.png) |
| **Discriminator** | ![Discriminator Architecture](discriminator.png) |
| **Combined GAN** | ![GAN Architecture Combined](gan_architecture_combined.png) |

---

## Data Exploration

We analysed the LLVIP dataset and found that ~70% of image pairs are captured below 50 lux and ~30% at 50–200 lux. The average pedestrian height in the IR channel was X pixels; outliers shorter than 20 pixels were excluded.
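The weighted generator objective described above can be sketched framework-agnostically. This is a minimal NumPy illustration, not the released training code: the helper names (`edge_loss`, `generator_loss`) are hypothetical, and the edge term uses simple finite-difference gradients as a stand-in for the Sobel filter mentioned in the loss table.

```python
import numpy as np

# Loss weights (λ) from the multi-loss table above.
LAMBDA_L1, LAMBDA_PERC, LAMBDA_ADV, LAMBDA_EDGE = 100.0, 10.0, 1.0, 5.0

def edge_loss(real, fake):
    """Gradient-difference edge term: L1 distance between the
    finite-difference gradients of the real and generated IR images
    (a simple stand-in for the Sobel-based version described above)."""
    dy_r, dx_r = np.gradient(np.asarray(real, dtype=np.float64))
    dy_f, dx_f = np.gradient(np.asarray(fake, dtype=np.float64))
    return np.mean(np.abs(dy_r - dy_f)) + np.mean(np.abs(dx_r - dx_f))

def generator_loss(l1, perc, adv, edge):
    """Total generator objective:
    L_G = λ_L1 * L_L1 + λ_perc * L_perc + λ_adv * L_adv + λ_edge * L_edge"""
    return (LAMBDA_L1 * l1 + LAMBDA_PERC * perc
            + LAMBDA_ADV * adv + LAMBDA_EDGE * edge)
```

For example, per-term values of 0.01 (L1), 0.1 (perceptual), 0.5 (adversarial), and 0.02 (edge) give a weighted total of 100·0.01 + 10·0.1 + 1·0.5 + 5·0.02 = 2.6, showing how strongly the L1 term dominates early training.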
## Visual Results

### Training Progress (Sample Evolution)

*Training progress samples*

### ✨ Final Convergence Samples

| Early Epochs (Blurry, Low Brightness) | Later Epochs (Sharper, High Contrast) |
|--------------------------------------|---------------------------------------|
| | |

### Comparison: Input vs Ground Truth vs Generated

| RGB Input | Ground Truth IR | Predicted IR |
|-----------|-----------------|--------------|
| | | |
| | | |

---

## Loss Curves

### Generator & Discriminator Loss

*Training loss curve*

### Validation Loss per Epoch

*Validation loss curve*

All training metrics are logged in:

```bash
/
├── logs.log
└── loss_summary.csv
```

---

## Observations

- The model **captures IR brightness and object distinction**, but early epochs show slight blur due to the L1-dominant stages.
- **Contrast and edge sharpness improve** after ~70 epochs as the adversarial and perceptual losses gain influence.
- Background variation in LLVIP introduces challenges; future fine-tuning on domain-aligned subsets could further improve realism.
- We compared three variants:
  - (i) U-Net regression (L1 only) → SSIM = 0.80
  - (ii) cGAN with L1 + adversarial → SSIM = 0.83
  - (iii) cGAN with L1 + adversarial + perceptual + edge (our final model) → SSIM = 0.8386

---

## Future Work

- Apply a **feature-matching loss** for smoother discriminator gradients
- Add **temporal or sequence consistency** for video IR translation
- Adaptive loss balancing with epoch-based dynamic weighting

---

## Acknowledgements

- **LLVIP Dataset** for paired RGB–IR samples
- **TensorFlow** and **VGG-19** for perceptual feature extraction
- **Kaggle GPU** for high-performance model training
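As a closing note, the PSNR figure reported in the evaluation table can be recomputed from image pairs with a short helper. This is an illustrative sketch, not part of the released code; it assumes images scaled to `[0, max_val]`.

```python
import numpy as np

def psnr(real, fake, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val].
    Returns infinity for identical images (zero MSE)."""
    mse = np.mean((np.asarray(real, dtype=np.float64)
                   - np.asarray(fake, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For instance, a uniform pixel error of 0.1 on images in [0, 1] gives an MSE of 0.01 and therefore 10·log10(1/0.01) = 20 dB, a useful sanity check against the 24.31 dB reported above.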