---
license: mit
datasets:
- UserNae3/LLVIP
pipeline_tag: image-to-image
---

# Conditional GAN for Visible → Infrared (LLVIP)

> **High-fidelity Visible-to-Infrared Translation using a Conditional GAN with Multi-Loss Optimization**

---

## Overview

This project implements a **Conditional Generative Adversarial Network (cGAN)** trained to translate **visible-light (RGB)** images into **infrared (IR)** representations.

It uses **multi-loss optimization**, combining pixel, perceptual, adversarial, and edge-based objectives, to generate sharp, realistic IR outputs that preserve both **scene structure** and **thermal contrast**.

The **L1 loss** receives the highest weight, keeping overall brightness and object boundaries consistent between the visible and infrared domains.

---

## Dataset

- **Dataset:** [LLVIP Dataset](https://huggingface.co/datasets/UserNae3/LLVIP)
  Paired **visible (RGB)** and **infrared (IR)** images captured under diverse lighting and background conditions.

---

## Model Architecture

- **Type:** Conditional GAN (cGAN)
- **Direction:** *Visible → Infrared*
- **Framework:** TensorFlow
- **Pipeline Tag:** `image-to-image`
- **License:** MIT
### Generator

- U-Net encoder–decoder with skip connections
- Conditioned on the RGB input
- Output: single-channel IR image
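To illustrate the skip-connection idea, here is a minimal, framework-agnostic sketch in NumPy (not the actual model code, which is built in TensorFlow): encoder feature maps are concatenated with upsampled decoder feature maps along the channel axis, so fine spatial detail from the encoder reaches the decoder directly. The shapes are hypothetical.

```python
import numpy as np

# Hypothetical 256x256 RGB conditioning input, NHWC layout.
rgb = np.random.rand(1, 256, 256, 3)

def downsample(x):
    # Stand-in for a stride-2 encoder stage (here: 2x2 average pooling).
    return x.reshape(x.shape[0], x.shape[1] // 2, 2,
                     x.shape[2] // 2, 2, x.shape[3]).mean(axis=(2, 4))

def upsample(x):
    # Stand-in for a decoder upsampling stage (nearest-neighbour).
    return x.repeat(2, axis=1).repeat(2, axis=2)

enc1 = downsample(rgb)                        # (1, 128, 128, 3)
enc2 = downsample(enc1)                       # (1, 64, 64, 3)

dec1 = upsample(enc2)                         # (1, 128, 128, 3)
# The skip connection: concatenate encoder features with decoder features.
skip = np.concatenate([dec1, enc1], axis=-1)  # (1, 128, 128, 6)

print(skip.shape)
```

Note how the channel count doubles at the concatenation; in the real generator each such tensor would then pass through further convolutions.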
### Discriminator

- PatchGAN discriminator that classifies local image patches as real or fake, pushing the generator toward fine detail

---
## ⚙️ Training Configuration

| Setting | Value |
|----------|--------|
| **Epochs** | 100 |
| **Steps per Epoch** | 376 |
| **Batch Size** | 4 |
| **Optimizer** | Adam (β₁ = 0.5, β₂ = 0.999) |
| **Learning Rate** | 2e-4 |
| **Precision** | Mixed (32) |
| **Hardware** | NVIDIA T4 (Kaggle GPU Runtime) |
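The optimizer settings above (lr = 2e-4, β₁ = 0.5, β₂ = 0.999) can be read as a single Adam update step. The NumPy sketch below is purely illustrative; the actual training uses TensorFlow's built-in Adam optimizer, and the gradient values are made up.

```python
import numpy as np

# Hyperparameters from the table above.
lr, beta1, beta2, eps = 2e-4, 0.5, 0.999, 1e-8

theta = np.zeros(3)                   # parameters
m = np.zeros(3)                       # first-moment (mean) estimate
v = np.zeros(3)                       # second-moment (variance) estimate
grad = np.array([0.1, -0.2, 0.3])     # toy gradient
t = 1                                 # step counter

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)            # bias correction
v_hat = v / (1 - beta2**t)
theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # on step 1 each parameter moves ~lr against its gradient sign
```

The low β₁ = 0.5 (versus the common 0.9) is a standard GAN choice: it shortens the momentum memory, which tends to stabilize the adversarial game.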

---

## Multi-Loss Function Design

| Loss Type | Description | Weight (λ) | Purpose |
|------------|--------------|-------------|----------|
| **L1 Loss** | Pixel-wise mean absolute error between generated and real IR | **100** | Ensures global brightness & shape consistency |
| **Perceptual Loss (VGG)** | Feature loss from `block5_conv4` of pretrained VGG-19 | **10** | Captures high-level texture and semantic alignment |
| **Adversarial Loss** | Binary cross-entropy loss from the PatchGAN discriminator | **1** | Encourages realistic IR texture generation |
| **Edge Loss** | Sobel/gradient difference between real & generated images | **5** | Enhances sharpness and edge clarity |

---

The **total generator loss** is computed as:

\[
L_{G} = \lambda_{L1}\,L_{L1} + \lambda_{\text{perc}}\,L_{\text{perc}} + \lambda_{\text{adv}}\,L_{\text{adv}} + \lambda_{\text{edge}}\,L_{\text{edge}}
\]
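As a rough sketch of how this weighted objective is assembled, the L1 and gradient-based edge terms can be computed directly in NumPy. The perceptual and adversarial terms are placeholders here, since they require the VGG-19 and discriminator networks; the weights are those from the table above.

```python
import numpy as np

def l1_loss(fake, real):
    # Pixel-wise mean absolute error.
    return np.mean(np.abs(fake - real))

def edge_loss(fake, real):
    # Gradient-difference variant of the edge term: compare horizontal
    # and vertical image gradients of the two images.
    gx = np.abs(np.diff(fake, axis=1)) - np.abs(np.diff(real, axis=1))
    gy = np.abs(np.diff(fake, axis=0)) - np.abs(np.diff(real, axis=0))
    return np.mean(np.abs(gx)) + np.mean(np.abs(gy))

rng = np.random.default_rng(0)
real_ir = rng.random((256, 256))   # ground-truth IR (toy data)
fake_ir = rng.random((256, 256))   # generator output (toy data)

perc = 0.0   # placeholder: VGG-19 feature loss
adv = 0.0    # placeholder: discriminator BCE loss

total = (100 * l1_loss(fake_ir, real_ir)   # lambda_L1
         + 10 * perc                       # lambda_perc
         + 1 * adv                         # lambda_adv
         + 5 * edge_loss(fake_ir, real_ir))  # lambda_edge
print(total)
```

With λ_L1 = 100 the pixel term dominates early training, which matches the blurry-but-well-exposed samples seen in the first epochs below.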
## Evaluation Metrics

| Metric | Definition | Result |
|---------|-------------|--------|
| **L1 Loss** | Mean absolute difference between generated and ground-truth IR | **0.0611** |
| **PSNR (Peak Signal-to-Noise Ratio)** | Measures reconstruction quality (higher is better) | **24.3096 dB** |
| **SSIM (Structural Similarity Index Measure)** | Perceptual similarity between generated & target images | **0.8386** |
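For reference, the metric formulas for images scaled to [0, 1] look like this. Note this is a simplified sketch: SSIM is shown in its global (single-window) form, whereas reported SSIM values are typically computed over local windows (e.g. with `tf.image.ssim` or scikit-image), so the numbers will not match the table exactly.

```python
import numpy as np

def psnr(fake, real, max_val=1.0):
    # Peak Signal-to-Noise Ratio in dB (higher is better).
    mse = np.mean((fake - real) ** 2)
    return 10 * np.log10(max_val**2 / mse)

def global_ssim(x, y, max_val=1.0):
    # Simplified single-window SSIM with the standard stability constants.
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
real = rng.random((64, 64))
fake = np.clip(real + rng.normal(0, 0.05, real.shape), 0, 1)

print(f"PSNR: {psnr(fake, real):.2f} dB, SSIM: {global_ssim(fake, real):.4f}")
```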

---
## Model Architectures

| Model | Visualization |
|-------|---------------|
| **Generator** |  |
| **Discriminator** |  |
| **Combined GAN** |  |

---

## Data Exploration

We analysed the LLVIP dataset and found that ~70% of image pairs are captured at under 50 lux and ~30% at 50–200 lux.

The average pedestrian height in the IR channel was X pixels; outliers under 20 pixels in height were excluded.

## Visual Results

### Training Progress (Sample Evolution)

<img src="ezgif-58298bca2da920.gif" alt="Training Progress" width="700"/>

### ✨ Final Convergence Samples

| Early Epochs (Blurry, Low Brightness) | Later Epochs (Sharper, High Contrast) |
|--------------------------------------|---------------------------------------|
| <img src="./epoch_007.png" width="550"/> | <img src="epoch_100.png" width="550"/> |
### Comparison: Input vs Ground Truth vs Generated

| RGB Input / Ground Truth IR / Predicted IR |
|--------------------------------------------|
| <img src="test_1179.png" width="750"/> |
| <img src="test_001.png" width="750"/> |
| <img src="test_4884.png" width="750"/> |
| <img src="test_5269.png" width="750"/> |
| <img src="test_5361.png" width="750"/> |
| <img src="test_7255.png" width="750"/> |
| <img src="test_7362.png" width="750"/> |
| <img src="test_12015.png" width="750"/> |

---

## Loss Curves

### Generator & Discriminator Loss

<img src="./train_loss_curve.png" alt="Training Loss Curve" width="600"/>

### Validation Loss per Epoch

<img src="./val_loss_curve.png" alt="Validation Loss Curve" width="600"/>
All training metrics are logged in:

```bash
/
├── logs.log
└── loss_summary.csv
```

---

|
## Observations

- The model **captures IR brightness and object distinction**, but early epochs show slight blur while the L1 term dominates.
- **Contrast and edge sharpness improve** after ~70 epochs as the adversarial and perceptual losses gain influence.
- Background variation in LLVIP introduces challenges; future fine-tuning on domain-aligned subsets could further improve realism.
- We compared three variants: (i) U-Net regression (L1 only) → SSIM = 0.80; (ii) cGAN with L1 + adversarial → SSIM = 0.83; (iii) cGAN with L1 + adversarial + perceptual + edge (our final model) → SSIM = 0.8386.

---

## Future Work

- Apply **feature matching loss** for smoother discriminator gradients
- Add **temporal or sequence consistency** for video IR translation
- Introduce **adaptive loss balancing** with epoch-based dynamic weighting

---

## Acknowledgements

- **LLVIP Dataset** for paired RGB–IR samples
- **TensorFlow** and **VGG-19** for perceptual feature extraction
- **Kaggle GPU** for high-performance model training