---
license: mit
datasets:
- UserNae3/LLVIP
pipeline_tag: image-to-image
---
# Conditional GAN for Visible → Infrared (LLVIP)
> **High-fidelity Visible-to-Infrared Translation using a Conditional GAN with Multi-Loss Optimization**
---
## Overview
This project implements a **Conditional Generative Adversarial Network (cGAN)** trained to translate **visible-light (RGB)** images into **infrared (IR)** representations.
It leverages **multi-loss optimization** — combining perceptual, pixel, adversarial, and edge-based objectives — to generate sharp, realistic IR outputs that preserve both **scene structure** and **thermal contrast**.
Greater emphasis is placed on the **L1 loss**, ensuring that overall brightness and object boundaries remain consistent between the visible and infrared domains.
---
## Dataset
- **Dataset:** [LLVIP Dataset](https://huggingface.co/datasets/UserNae3/LLVIP)
Paired **visible (RGB)** and **infrared (IR)** images under diverse lighting and background conditions.
---
## Model Architecture
- **Type:** Conditional GAN (cGAN)
- **Direction:** *Visible → Infrared*
- **Framework:** TensorFlow
- **Pipeline Tag:** `image-to-image`
- **License:** MIT
### Generator
- U-Net encoder–decoder with skip connections
- Conditioned on RGB input
- Output: single-channel IR image
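The generator above can be sketched in TensorFlow as follows. This is a minimal illustration, not the exact trained network: filter counts, depth, and the 256×256 input size are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def downsample(filters):
    # Conv -> BatchNorm -> LeakyReLU, halving spatial resolution
    return tf.keras.Sequential([
        layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
    ])

def upsample(filters):
    # Transposed conv doubling spatial resolution
    return tf.keras.Sequential([
        layers.Conv2DTranspose(filters, 4, strides=2, padding="same",
                               use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
    ])

def build_generator(input_shape=(256, 256, 3)):
    inputs = layers.Input(shape=input_shape)   # RGB conditioning input
    skips, x = [], inputs
    for f in [64, 128, 256, 512]:              # encoder
        x = downsample(f)(x)
        skips.append(x)
    for f, skip in zip([256, 128, 64], reversed(skips[:-1])):  # decoder
        x = upsample(f)(x)
        x = layers.Concatenate()([x, skip])    # U-Net skip connection
    # Final layer restores full resolution; tanh keeps the single-channel
    # IR output in [-1, 1]
    out = layers.Conv2DTranspose(1, 4, strides=2, padding="same",
                                 activation="tanh")(x)
    return tf.keras.Model(inputs, out)
```

The skip connections carry fine spatial detail from the encoder directly to the decoder, which is what lets the translation preserve scene structure.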
### Discriminator
- PatchGAN discriminator: scores local image patches for realism, encouraging fine-detail learning
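A PatchGAN-style discriminator, conditioned on the RGB input as in pix2pix, might look like the sketch below. Filter counts and depth are assumptions; the key idea is that the output is a grid of per-patch realism logits rather than a single score.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(ir_shape=(256, 256, 1), rgb_shape=(256, 256, 3)):
    ir = layers.Input(shape=ir_shape)    # real or generated IR image
    rgb = layers.Input(shape=rgb_shape)  # conditioning RGB input
    x = layers.Concatenate()([ir, rgb])
    for f in [64, 128, 256]:
        x = layers.Conv2D(f, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    # One logit per overlapping patch (no sigmoid; pair with
    # from_logits=True binary cross-entropy)
    out = layers.Conv2D(1, 4, strides=1, padding="same")(x)
    return tf.keras.Model([ir, rgb], out)
```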
---
## ⚙️ Training Configuration
| Setting | Value |
|----------|--------|
| **Epochs** | 100 |
| **Steps per Epoch** | 376 |
| **Batch Size** | 4 |
| **Optimizer** | Adam (β₁ = 0.5, β₂ = 0.999) |
| **Learning Rate** | 2e-4 |
| **Precision** | Mixed (32) |
| **Hardware** | NVIDIA T4 (Kaggle GPU Runtime) |
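The optimizer settings in the table translate directly to Keras; the mixed-precision policy name below is an assumption about how the "Mixed" precision mode was configured.

```python
import tensorflow as tf

# Mixed precision: float16 compute with float32 variable storage
# (assumed configuration)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Adam with lr = 2e-4, β₁ = 0.5, β₂ = 0.999, as in the table
gen_opt = tf.keras.optimizers.Adam(learning_rate=2e-4,
                                   beta_1=0.5, beta_2=0.999)
disc_opt = tf.keras.optimizers.Adam(learning_rate=2e-4,
                                    beta_1=0.5, beta_2=0.999)
```

The low β₁ = 0.5 (instead of the default 0.9) is the standard GAN choice from DCGAN, which stabilizes adversarial training.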
---
## Multi-Loss Function Design
| Loss Type | Description | Weight (λ) | Purpose |
|------------|--------------|-------------|----------|
| **L1 Loss** | Pixel-wise mean absolute error between generated and real IR | **100** | Ensures global brightness & shape consistency |
| **Perceptual Loss (VGG)** | Feature loss from `block5_conv4` of pretrained VGG-19 | **10** | Captures high-level texture and semantic alignment |
| **Adversarial Loss** | Binary cross-entropy loss from PatchGAN discriminator | **1** | Encourages realistic IR texture generation |
| **Edge Loss** | Sobel/gradient difference between real & generated images | **5** | Enhances sharpness and edge clarity |
---
The **total generator loss** is computed as:
\[
L_{G} = \lambda_{L1}\,L_{L1} + \lambda_{\text{perc}}\,L_{\text{perc}} + \lambda_{\text{adv}}\,L_{\text{adv}} + \lambda_{\text{edge}}\,L_{\text{edge}}
\]
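The weighted sum above can be sketched as follows, using the λ values from the table. This is a simplified illustration: VGG input preprocessing is omitted, and the exact Sobel formulation is an assumption.

```python
import tensorflow as tf

def make_feature_extractor(weights="imagenet"):
    # VGG-19 features from block5_conv4 for the perceptual loss
    vgg = tf.keras.applications.VGG19(include_top=False, weights=weights)
    return tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def generator_loss(feat, disc_fake_logits, real_ir, fake_ir):
    # λ_L1 = 100: pixel-wise mean absolute error
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))
    # λ_perc = 10: feature-space MAE (single IR channel tiled to 3 for VGG)
    perc = tf.reduce_mean(tf.abs(
        feat(tf.tile(real_ir, [1, 1, 1, 3])) -
        feat(tf.tile(fake_ir, [1, 1, 1, 3]))))
    # λ_adv = 1: BCE against the PatchGAN logits, labels = "real"
    adv = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)
    # λ_edge = 5: Sobel gradient difference for edge sharpness
    edge = tf.reduce_mean(tf.abs(
        tf.image.sobel_edges(real_ir) - tf.image.sobel_edges(fake_ir)))
    return 100.0 * l1 + 10.0 * perc + 1.0 * adv + 5.0 * edge
```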
## Evaluation Metrics
| Metric | Definition | Result |
|---------|-------------|--------|
| **L1 Loss** | Mean absolute difference between generated and ground truth IR | **0.0611** |
| **PSNR (Peak Signal-to-Noise Ratio)** | Measures reconstruction quality (higher is better) | **24.3096 dB** |
| **SSIM (Structural Similarity Index Measure)** | Perceptual similarity between generated & target images | **0.8386** |
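These metrics map directly onto TensorFlow's built-ins; the `max_val=1.0` below assumes images scaled to [0, 1].

```python
import tensorflow as tf

def evaluate(real_ir, fake_ir):
    # L1: mean absolute difference
    l1 = tf.reduce_mean(tf.abs(real_ir - fake_ir))
    # PSNR and SSIM, averaged over the batch
    psnr = tf.reduce_mean(tf.image.psnr(real_ir, fake_ir, max_val=1.0))
    ssim = tf.reduce_mean(tf.image.ssim(real_ir, fake_ir, max_val=1.0))
    return float(l1), float(psnr), float(ssim)
```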
---
## Architecture Diagrams
| Model | Visualization |
|-------|---------------|
| **Generator** | ![Generator Architecture](generator.png) |
| **Discriminator** | ![Discriminator Architecture](discriminator.png) |
| **Combined GAN** | ![GAN Architecture Combined](gan_architecture_combined.png) |
---
## Data Exploration
We analysed the LLVIP dataset and found that ~70% of image pairs are captured at < 50 lux and ~30% at 50–200 lux.
The average pedestrian height in the IR channel was X pixels; outliers shorter than 20 pixels were excluded.
## Visual Results
### Training Progress (Sample Evolution)
<img src="ezgif-58298bca2da920.gif" alt="Training Progress" width="700"/>
### ✨ Final Convergence Samples
| Early Epochs (Blurry, Low Brightness) | Later Epochs (Sharper, High Contrast) |
|--------------------------------------|---------------------------------------|
| <img src="./epoch_007.png" width="550"/> | <img src="epoch_100.png" width="550"/> |
### Comparison: Input vs Ground Truth vs Generated
| RGB Input · Ground Truth IR · Predicted IR |
|--------------------------------------------|
| <img src="test_1179.png" width="750"/> |
| <img src="test_001.png" width="750"/> |
| <img src="test_4884.png" width="750"/> |
| <img src="test_5269.png" width="750"/> |
| <img src="test_5361.png" width="750"/> |
| <img src="test_7255.png" width="750"/> |
| <img src="test_7362.png" width="750"/> |
| <img src="test_12015.png" width="750"/> |
---
## Loss Curves
### Generator & Discriminator Loss
<img src="./train_loss_curve.png" alt="Training Loss Curve" width="600"/>
### Validation Loss per Epoch
<img src="./val_loss_curve.png" alt="Validation Loss Curve" width="600"/>
All training metrics are logged in:

```bash
/
├── logs.log
└── loss_summary.csv
```

---
## Observations
- The model **captures IR brightness and object distinction**, but early epochs show slight blur from the L1-dominated objective.
- **Contrast and edge sharpness improve** after ~70 epochs as the adversarial and perceptual losses gain influence.
- Background variation in LLVIP remains challenging; fine-tuning on domain-aligned subsets could further improve realism.
- We compared three variants: (i) U-Net regression (L1 only) → SSIM = 0.80; (ii) cGAN with L1 + adversarial → SSIM = 0.83; (iii) cGAN with L1 + adversarial + perceptual + edge (our final model) → SSIM = 0.8386.
---
## Future Work
- Apply **feature matching loss** for smoother discriminator gradients
- Add **temporal or sequence consistency** for video IR translation
- Adaptive loss balancing with epoch-based dynamic weighting
---
## Acknowledgements
- **LLVIP Dataset** for paired RGB–IR samples
- **TensorFlow** and **VGG-19** for perceptual feature extraction
- **Kaggle GPU runtime** for high-performance model training