---
tags:
- pytorch
- vae
- diffusion
- image-generation
- cc3m
license: mit
datasets:
- pixparse/cc3m-wds
library_name: diffusers
---

# UNet-Style VAE for 256x256 Image Reconstruction

This model is a UNet-style Variational Autoencoder (VAE) trained on the [CC3M](https://huggingface.co/datasets/pixparse/cc3m-wds) dataset for high-quality image reconstruction and generation. It integrates adversarial, perceptual, and identity-preserving loss terms to improve semantic and visual fidelity.

## Architecture

- **Encoder/Decoder**: Multi-scale UNet architecture
- **Latent Space**: 8-channel latent bottleneck with reparameterization (mu, logvar)
- **Losses**:
  - L1 reconstruction loss
  - KL divergence with annealing
  - LPIPS perceptual loss (VGG backbone)
  - Identity loss via MoCo-v2 embeddings
  - Adversarial loss via a patch discriminator with spectral normalization

$$
\mathcal{L}_{total} = \mathcal{L}_{recon} + \mathcal{L}_{LPIPS} + 0.5 \cdot \mathcal{L}_{GAN} + 0.1 \cdot \mathcal{L}_{ID} + 10^{-6} \cdot \mathcal{L}_{KL}
$$
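As a hedged sketch of how these pieces fit together (the weights come from the formula above; the function names and placeholder tensors are illustrative, not the actual training code), the reparameterization step, the KL term, and the weighted loss sum could look like:

```python
import torch


def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable
    std = torch.exp(0.5 * logvar)
    return mu + torch.randn_like(std) * std


def kl_term(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # KL(q(z|x) || N(0, I)), summed over latent dims, averaged over the batch
    return -0.5 * torch.mean(
        torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=[1, 2, 3])
    )


def total_loss(recon, lpips, gan, ident, kl):
    # Weighted sum matching the formula above
    return recon + lpips + 0.5 * gan + 0.1 * ident + 1e-6 * kl
```

With zero-mean, unit-variance latents the KL term vanishes, which is the fixed point the 10⁻⁶ weight gently pulls the posterior toward.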

## Reconstructions

| Input | Output |
|-------|--------|
|  |  |

## Training Config

| Hyperparameter   | Value                              |
|------------------|------------------------------------|
| Dataset          | CC3M (850k images)                 |
| Image Resolution | 256 x 256                          |
| Batch Size       | 16                                 |
| Optimizer        | AdamW                              |
| Learning Rate    | 5e-5                               |
| Precision        | bf16 (mixed precision)             |
| Total Steps      | 210,000                            |
| GAN Start Step   | 50,000                             |
| KL Annealing     | Yes (10% of training)              |
| Augmentations    | Crop, flip, jitter, blur, rotation |

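The table states only that the KL weight is annealed over 10% of training, not the ramp shape. A minimal sketch, assuming a linear ramp from 0 to the final weight of 10⁻⁶ over the first 10% of the 210,000 steps:

```python
def kl_weight(step: int, total_steps: int = 210_000,
              max_weight: float = 1e-6, anneal_frac: float = 0.10) -> float:
    # Linearly ramp the KL weight over the first `anneal_frac` of training,
    # then hold it at `max_weight` (the linear shape is an assumption).
    anneal_steps = max(1, int(total_steps * anneal_frac))
    return max_weight * min(1.0, step / anneal_steps)
```

Delaying the full KL weight this way lets the reconstruction terms shape the latent space before the prior-matching pressure kicks in.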
Trained using a cosine learning rate schedule with gradient clipping and automatic mixed precision (`torch.cuda.amp`).
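A minimal sketch of one optimizer step under these settings (the model and loss are stand-ins, not the actual VAE or full loss; note that with bf16 autocast, unlike fp16, no `GradScaler` is needed):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # placeholder for the actual VAE
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=210_000)
device_type = "cuda" if torch.cuda.is_available() else "cpu"


def train_step(batch: torch.Tensor) -> float:
    opt.zero_grad(set_to_none=True)
    # bf16 autocast region; gradients stay in fp32, so no loss scaling
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        recon = model(batch)
        loss = F.l1_loss(recon, batch)  # reconstruction term only, for illustration
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    sched.step()  # cosine decay of the learning rate each step
    return loss.item()
```

The clip norm of 1.0 is an assumption; the card confirms gradient clipping but not its threshold.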