---
datasets:
- shreenithi20/fmnist-8x8-latents
---
|
|
# Fashion MNIST Text-to-Image Diffusion Model |
|
|
|
|
|
A transformer-based latent diffusion model for text-to-image generation, trained on Fashion MNIST VAE latents and conditioned on class-label text embeddings.
|
|
|
|
|
## Model Information |
|
|
|
|
|
- **Architecture**: Transformer-based diffusion model |
|
|
- **Input**: 8×8×4 VAE latents |
|
|
- **Conditioning**: Text embeddings (class labels) |
|
|
- **Training Steps**: 8,500 |
|
|
- **Dataset**: [Fashion MNIST 8×8 Latents](https://huggingface.co/datasets/shreenithi20/fmnist-8x8-latents) |
|
|
- **Framework**: PyTorch |
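
For a quick look at the training data, the latents dataset can be loaded with the `datasets` library. This is a minimal sketch: the split name is an assumption, so inspect `ds.features` for the actual schema.

```python
from datasets import load_dataset

# Load the 8x8x4 latent dataset used for training.
# The "train" split name is an assumption; check the dataset repo or
# ds.features to confirm the actual splits and columns.
ds = load_dataset("shreenithi20/fmnist-8x8-latents", split="train")
print(ds)  # shows the columns and number of examples
```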
|
|
|
|
|
## Checkpoints |
|
|
|
|
|
- `model-1000.safetensors`: Early training (1k steps) |
|
|
- `model-3000.safetensors`: Mid training (3k steps) |
|
|
- `model-5000.safetensors`: Advanced training (5k steps) |
|
|
- `model-8500.safetensors`: Final model (8.5k steps) |
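
To compare checkpoints, the raw weights can be loaded directly with `safetensors`. A minimal sketch, assuming each file stores a flat state dict matching this repo's architecture (`model` below is assumed to be an instance of it):

```python
from safetensors.torch import load_file

# Load an intermediate checkpoint into an existing model instance.
# Assumes the file stores a flat state dict keyed by parameter names.
state_dict = load_file("model-5000.safetensors")
model.load_state_dict(state_dict)
model.eval()
```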
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
import torch
from transformers import AutoModel

# Load the model (a custom architecture hosted on the Hub; depending on
# how it is registered, trust_remote_code=True may be required)
model = AutoModel.from_pretrained("shreenithi20/fmnist-t2i-diffusion")
model.eval()

# Generate latents conditioned on class-label text embeddings
with torch.no_grad():
    generated_latents = model.generate(
        text_embeddings=class_labels,  # embeddings for the desired classes
        num_inference_steps=25,
        guidance_scale=7.5,
    )
```
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
- **Patch Size**: 1×1 |
|
|
- **Embedding Dimension**: 384 |
|
|
- **Transformer Layers**: 12 |
|
|
- **Attention Heads**: 6 |
|
|
- **Cross Attention Heads**: 4 |
|
|
- **MLP Multiplier**: 4 |
|
|
- **Timesteps**: Continuous, sampled from a beta distribution


- **Beta Distribution**: a = 1.0, b = 2.5
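
For illustration, continuous timesteps with these beta parameters can be sampled with `torch.distributions`; how a timestep maps to a noise level is model-specific, so this sketch only shows the sampling itself.

```python
import torch

# Sample continuous timesteps t in [0, 1] from Beta(a=1.0, b=2.5).
# With a = 1.0 and b = 2.5 the density is 2.5 * (1 - t)^1.5, so
# samples are skewed toward t = 0.
dist = torch.distributions.Beta(torch.tensor(1.0), torch.tensor(2.5))
t = dist.sample((128,))  # one timestep per example in a batch of 128
```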
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Learning Rate**: 1e-3 (constant schedule)
|
|
- **Batch Size**: 128 |
|
|
- **Optimizer**: AdamW |
|
|
- **Mixed Precision**: Yes |
|
|
- **Gradient Accumulation**: 1 |
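
A hypothetical training-step sketch matching these settings; the `model` forward returning a diffusion loss and the `dataloader` are assumptions for illustration, not code from this repo.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # constant lr
scaler = torch.amp.GradScaler("cuda")  # mixed precision

for latents, labels in dataloader:  # batches of 128
    optimizer.zero_grad(set_to_none=True)
    with torch.amp.autocast("cuda"):
        loss = model(latents, labels)  # assumed to return the diffusion loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # no gradient accumulation: step every batch
    scaler.update()
```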
|
|
|
|
|
## Results |
|
|
|
|
|
The model generates high-quality Fashion MNIST images conditioned on class labels; its 8×8×4 output latents can be decoded to 64×64-pixel images.
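
This card does not name the VAE used to produce the latents. As one sketch under that assumption, any KL autoencoder with 4 latent channels and an 8× spatial factor matches the shapes (8×8×4 latents → 64×64 RGB); the Stable Diffusion VAE is one such model.

```python
import torch
from diffusers import AutoencoderKL

# Assumption: a 4-channel, 8x-downsampling KL VAE. The actual VAE used
# to build the latent dataset is not specified in this card.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

with torch.no_grad():
    # generated_latents: (batch, 4, 8, 8) from the Usage example above.
    # Depending on how the latents were stored, they may first need to
    # be divided by vae.config.scaling_factor.
    images = vae.decode(generated_latents).sample  # (batch, 3, 64, 64) in [-1, 1]
```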