---
datasets:
- shreenithi20/fmnist-8x8-latents
---

# Fashion MNIST Text-to-Image Diffusion Model

A transformer-based diffusion model trained on Fashion MNIST latent representations for text-to-image generation.

## Model Information

- **Architecture**: Transformer-based diffusion model
- **Input**: 8×8×4 VAE latents
- **Conditioning**: Text embeddings (class labels)
- **Training Steps**: 8,500
- **Dataset**: [Fashion MNIST 8×8 Latents](https://huggingface.co/datasets/shreenithi20/fmnist-8x8-latents)
- **Framework**: PyTorch

## Checkpoints

- `model-1000.safetensors`: early training (1k steps)
- `model-3000.safetensors`: mid training (3k steps)
- `model-5000.safetensors`: late training (5k steps)
- `model-8500.safetensors`: final model (8.5k steps)

A loading sketch for these checkpoints is included at the end of this card.

## Usage

```python
import torch
from transformers import AutoModel

# Load the model
model = AutoModel.from_pretrained("shreenithi20/fmnist-t2i-diffusion")
model.eval()

# Fashion MNIST class indices used as conditioning
# (0 = T-shirt/top, 1 = trouser, ..., 9 = ankle boot)
class_labels = torch.tensor([0, 3, 9])

# Generate latents (decode them with the VAE to obtain images)
with torch.no_grad():
    generated_latents = model.generate(
        text_embeddings=class_labels,
        num_inference_steps=25,
        guidance_scale=7.5
    )
```

## Model Architecture

- **Patch Size**: 1×1
- **Embedding Dimension**: 384
- **Transformer Layers**: 12
- **Attention Heads**: 6
- **Cross-Attention Heads**: 4
- **MLP Multiplier**: 4
- **Timesteps**: Continuous, drawn from a beta distribution (see the sampling sketch below)
- **Beta Distribution**: a=1.0, b=2.5

## Training Details

- **Learning Rate**: 1e-3 (constant)
- **Batch Size**: 128
- **Optimizer**: AdamW
- **Mixed Precision**: Yes
- **Gradient Accumulation**: 1 (i.e., none)

A minimal training-step sketch with these settings closes out this card.

## Results

The model generates high-quality Fashion MNIST images conditioned on class labels. Generation happens at 8×8 latent resolution; the resulting latents are then decoded to 64×64-pixel images.
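Since `generate` returns latents rather than pixels, a decoding step is needed. Below is a minimal sketch, assuming the dataset's latents come from a Stable Diffusion-style `AutoencoderKL` (8× spatial downsampling, 4 latent channels, which matches the 8×8×4 → 64×64 shapes) and the usual 0.18215 latent scale; the VAE checkpoint name and scale factor are assumptions, not confirmed by this repo.

```python
import torch
from diffusers import AutoencoderKL

# Assumption: latents were encoded with an SD-style VAE; the checkpoint
# below is illustrative, not necessarily the one used for the dataset.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

with torch.no_grad():
    # Undo the assumed 0.18215 latent scaling, then decode.
    images = vae.decode(generated_latents / 0.18215).sample  # (B, 3, 64, 64), roughly in [-1, 1]

images = (images.clamp(-1, 1) + 1) / 2  # rescale to [0, 1] for saving or display
```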
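## Loading an Intermediate Checkpoint

To compare training stages, the intermediate checkpoints listed above can be loaded into the model from the Usage snippet. A hedged sketch; it assumes the published state-dict keys match the instantiated model class:

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download a specific checkpoint file from the repo, then load its weights
# into the already-instantiated `model` from the Usage snippet.
ckpt_path = hf_hub_download("shreenithi20/fmnist-t2i-diffusion", "model-5000.safetensors")
model.load_state_dict(load_file(ckpt_path))
model.eval()
```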
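## Timestep Sampling Sketch

The Model Architecture section notes continuous timesteps drawn from Beta(a=1.0, b=2.5). The snippet below illustrates that schedule (it is not code from the training run); with these parameters the density is skewed toward t = 0:

```python
import torch

# Beta(concentration1=a, concentration0=b) over the open interval (0, 1).
dist = torch.distributions.Beta(concentration1=1.0, concentration0=2.5)
t = dist.sample((128,))  # one continuous timestep per example in a batch of 128
```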
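## Training-Step Sketch

A minimal training step consistent with the hyperparameters under Training Details (AdamW, constant 1e-3 learning rate, fp16 autocast, no gradient accumulation). `dataloader` and the model's loss interface are hypothetical placeholders, not this repo's actual training code:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # constant LR, no scheduler
scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16 mixed precision

for latents, labels in dataloader:  # hypothetical loader yielding batches of 128
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model.training_loss(latents, labels)  # hypothetical loss interface
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```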