---
license: mit
datasets:
- UserNae3/LLVIP
pipeline_tag: image-to-image
---
# Conditional GAN for Visible → Infrared (LLVIP)
> **High-fidelity Visible-to-Infrared Translation using a Conditional GAN with Multi-Loss Optimization**
---
## Overview
This project implements a **Conditional Generative Adversarial Network (cGAN)** trained to translate **visible-light (RGB)** images into **infrared (IR)** representations.
It leverages **multi-loss optimization** — combining perceptual, pixel, adversarial, and edge-based objectives — to generate sharp, realistic IR outputs that preserve both **scene structure** and **thermal contrast**.
Greater emphasis is placed on the **L1 loss**, ensuring that overall brightness and object boundaries remain consistent between the visible and infrared domains.
---
## Dataset
- **Dataset:** [LLVIP Dataset](https://huggingface.co/datasets/UserNae3/LLVIP)
Paired **visible (RGB)** and **infrared (IR)** images under diverse lighting and background conditions.
---
## Model Architecture
- **Type:** Conditional GAN (cGAN)
- **Direction:** *Visible → Infrared*
- **Framework:** TensorFlow
- **Pipeline Tag:** `image-to-image`
- **License:** MIT
### Generator
- U-Net encoder–decoder with skip connections
- Conditioned on RGB input
- Output: single-channel IR image
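The generator above can be sketched in TensorFlow as follows. This is a minimal illustration, not the exact trained network: filter counts, depth, and the 256×256 input size are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def downsample(filters):
    # Conv -> BatchNorm -> LeakyReLU, halving spatial resolution
    return tf.keras.Sequential([
        layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
    ])

def upsample(filters):
    # Transposed conv doubling spatial resolution
    return tf.keras.Sequential([
        layers.Conv2DTranspose(filters, 4, strides=2, padding="same",
                               use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
    ])

def build_generator(input_shape=(256, 256, 3)):
    inputs = layers.Input(shape=input_shape)   # RGB conditioning input
    skips, x = [], inputs
    for f in [64, 128, 256, 512]:              # encoder
        x = downsample(f)(x)
        skips.append(x)
    for f, skip in zip([256, 128, 64], reversed(skips[:-1])):  # decoder
        x = upsample(f)(x)
        x = layers.Concatenate()([x, skip])    # U-Net skip connection
    # Final layer restores full resolution; tanh keeps the single-channel
    # IR output in [-1, 1]
    out = layers.Conv2DTranspose(1, 4, strides=2, padding="same",
                                 activation="tanh")(x)
    return tf.keras.Model(inputs, out)
```

The skip connections carry fine spatial detail from the encoder directly to the decoder, which is what lets the translation preserve scene structure.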
### Discriminator
- PatchGAN discriminator: scores local image patches for realism, encouraging fine-detail learning
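A PatchGAN-style discriminator, conditioned on the RGB input as in pix2pix, might look like the sketch below. Filter counts and depth are assumptions; the key idea is that the output is a grid of per-patch realism logits rather than a single score.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(ir_shape=(256, 256, 1), rgb_shape=(256, 256, 3)):
    ir = layers.Input(shape=ir_shape)    # real or generated IR image
    rgb = layers.Input(shape=rgb_shape)  # conditioning RGB input
    x = layers.Concatenate()([ir, rgb])
    for f in [64, 128, 256]:
        x = layers.Conv2D(f, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    # One logit per overlapping patch (no sigmoid; pair with
    # from_logits=True binary cross-entropy)
    out = layers.Conv2D(1, 4, strides=1, padding="same")(x)
    return tf.keras.Model([ir, rgb], out)
```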
---
## ⚙️ Training Configuration
| Setting | Value |
|----------|--------|
| **Epochs** | 100 |
| **Steps per Epoch** | 376 |
| **Batch Size** | 4 |
| **Optimizer** | Adam (β₁ = 0.5, β₂ = 0.999) |
| **Learning Rate** | 2e-4 |
| **Precision** | Mixed (32) |
| **Hardware** | NVIDIA T4 (Kaggle GPU Runtime) |
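The optimizer settings in the table translate directly to Keras; the mixed-precision policy name below is an assumption about how the "Mixed" precision mode was configured.

```python
import tensorflow as tf

# Mixed precision: float16 compute with float32 variable storage
# (assumed configuration)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Adam with lr = 2e-4, β₁ = 0.5, β₂ = 0.999, as in the table
gen_opt = tf.keras.optimizers.Adam(learning_rate=2e-4,
                                   beta_1=0.5, beta_2=0.999)
disc_opt = tf.keras.optimizers.Adam(learning_rate=2e-4,
                                    beta_1=0.5, beta_2=0.999)
```

The low β₁ = 0.5 (instead of the default 0.9) is the standard GAN choice from DCGAN, which stabilizes adversarial training.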
---
## Multi-Loss Function Design
| Loss Type | Description | Weight (λ) | Purpose |
|------------|--------------|-------------|----------|
| **L1 Loss** | Pixel-wise mean absolute error between generated and real IR | **100** | Ensures global brightness & shape consistency |
| **Perceptual Loss (VGG)** | Feature loss from `block5_conv4` of pretrained VGG-19 | **10** | Captures high-level texture and semantic alignment |
| **Adversarial Loss** | Binary cross-entropy loss from PatchGAN discriminator | **1** | Encourages realistic IR texture generation |
| **Edge Loss** | Sobel/gradient difference between real & generated images | **5** | Enhances sharpness and edge clarity |
---
The **total generator loss** is computed as:
\[
L_{G} = \lambda_{L1}\,L_{L1} + \lambda_{\text{perc}}\,L_{\text{perc}} + \lambda_{\text{adv}}\,L_{\text{adv}} + \lambda_{\text{edge}}\,L_{\text{edge}}
\]
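The weighted sum above can be sketched as follows, using the λ values from the table. This is a simplified illustration: VGG input preprocessing is omitted, and the exact Sobel formulation is an assumption.

```python
import tensorflow as tf

def make_feature_extractor(weights="imagenet"):
    # VGG-19 features from block5_conv4 for the perceptual loss
    vgg = tf.keras.applications.VGG19(include_top=False, weights=weights)
    return tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def generator_loss(feat, disc_fake_logits, real_ir, fake_ir):
    # λ_L1 = 100: pixel-wise mean absolute error
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))
    # λ_perc = 10: feature-space MAE (single IR channel tiled to 3 for VGG)
    perc = tf.reduce_mean(tf.abs(
        feat(tf.tile(real_ir, [1, 1, 1, 3])) -
        feat(tf.tile(fake_ir, [1, 1, 1, 3]))))
    # λ_adv = 1: BCE against the PatchGAN logits, labels = "real"
    adv = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)
    # λ_edge = 5: Sobel gradient difference for edge sharpness
    edge = tf.reduce_mean(tf.abs(
        tf.image.sobel_edges(real_ir) - tf.image.sobel_edges(fake_ir)))
    return 100.0 * l1 + 10.0 * perc + 1.0 * adv + 5.0 * edge
```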
## Evaluation Metrics
| Metric | Definition | Result |
|---------|-------------|--------|
| **L1 Loss** | Mean absolute difference between generated and ground truth IR | **0.0611** |
| **PSNR (Peak Signal-to-Noise Ratio)** | Measures reconstruction quality (higher is better) | **24.3096 dB** |
| **SSIM (Structural Similarity Index Measure)** | Perceptual similarity between generated & target images | **0.8386** |
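These metrics map directly onto TensorFlow's built-ins; the `max_val=1.0` below assumes images scaled to [0, 1].

```python
import tensorflow as tf

def evaluate(real_ir, fake_ir):
    # L1: mean absolute difference
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))
    # PSNR and SSIM, averaged over the batch
    psnr = tf.reduce_mean(tf.image.psnr(real_ir, fake_ir, max_val=1.0))
    ssim = tf.reduce_mean(tf.image.ssim(real_ir, fake_ir, max_val=1.0))
    return float(l1), float(psnr), float(ssim)
```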
---
## Architecture Diagrams
| Model | Visualization |
|-------|---------------|
| **Generator** | ![Generator Architecture](generator.png) |
| **Discriminator** | ![Discriminator Architecture](discriminator.png) |
| **Combined GAN** | ![GAN Architecture Combined](gan_architecture_combined.png) |
---
## Data Exploration
We analysed the LLVIP dataset and found that ~70% of image pairs are captured at < 50 lux and ~30% at 50–200 lux.
The average pedestrian height in the IR channel was X pixels; outliers shorter than 20 pixels were excluded.
## Visual Results
### Training Progress (Sample Evolution)
<img src="ezgif-58298bca2da920.gif" alt="Training Progress" width="700"/>
### ✨ Final Convergence Samples
| Early Epochs (Blurry, Low Brightness) | Later Epochs (Sharper, High Contrast) |
|--------------------------------------|---------------------------------------|
| <img src="./epoch_007.png" width="550"/> | <img src="epoch_100.png" width="550"/> |
### Comparison: Input vs Ground Truth vs Generated
| RGB Input · Ground Truth IR · Predicted IR |
|--------------------------------------------|
| <img src="test_1179.png" width="750"/> |
| <img src="test_001.png" width="750"/> |
| <img src="test_4884.png" width="750"/> |
| <img src="test_5269.png" width="750"/> |
| <img src="test_5361.png" width="750"/> |
| <img src="test_7255.png" width="750"/> |
| <img src="test_7362.png" width="750"/> |
| <img src="test_12015.png" width="750"/> |
---
## Loss Curves
### Generator & Discriminator Loss
<img src="./train_loss_curve.png" alt="Training Loss Curve" width="600"/>
### Validation Loss per Epoch
<img src="./val_loss_curve.png" alt="Validation Loss Curve" width="600"/>
All training metrics are logged in:

```bash
/
├── logs.log
└── loss_summary.csv
```

---
## Observations
- The model **captures IR brightness and object distinction**, but early epochs show slight blur from the L1-dominated objective.
- **Contrast and edge sharpness improve** after ~70 epochs as the adversarial and perceptual losses gain influence.
- Background variation in LLVIP remains challenging; fine-tuning on domain-aligned subsets could further improve realism.
- We compared three variants: (i) U-Net regression (L1 only) → SSIM = 0.80; (ii) cGAN with L1 + adversarial → SSIM = 0.83; (iii) cGAN with L1 + adversarial + perceptual + edge (our final model) → SSIM = 0.8386.
---
## Future Work
- Apply **feature matching loss** for smoother discriminator gradients
- Add **temporal or sequence consistency** for video IR translation
- Adaptive loss balancing with epoch-based dynamic weighting
---
## Acknowledgements
- **LLVIP Dataset** for paired RGB–IR samples
- **TensorFlow** and **VGG-19** for perceptual feature extraction
- **Kaggle GPU runtime** for high-performance model training