---
datasets:
- shreenithi20/fmnist-8x8-latents
---
# Fashion MNIST Text-to-Image Diffusion Model

A transformer-based diffusion model trained on Fashion MNIST latent representations for text-to-image generation.

## Model Information

- **Architecture**: Transformer-based diffusion model
- **Input**: 8×8×4 VAE latents
- **Conditioning**: Text embeddings (class labels)
- **Training Steps**: 8,500
- **Dataset**: [Fashion MNIST 8×8 Latents](https://huggingface.co/datasets/shreenithi20/fmnist-8x8-latents)
- **Framework**: PyTorch

## Checkpoints

- `model-1000.safetensors`: Early training (1k steps)
- `model-3000.safetensors`: Mid training (3k steps)  
- `model-5000.safetensors`: Advanced training (5k steps)
- `model-8500.safetensors`: Final model (8.5k steps) 

## Usage

```python
import torch
from transformers import AutoModel

# Load the model (final 8.5k-step checkpoint)
model = AutoModel.from_pretrained("shreenithi20/fmnist-t2i-diffusion")
model.eval()

# Fashion MNIST class labels (0-9) serve as the text conditioning
class_labels = torch.tensor([0, 1, 2])  # e.g. T-shirt/top, Trouser, Pullover

# Generate 8×8×4 latents
with torch.no_grad():
    generated_latents = model.generate(
        text_embeddings=class_labels,
        num_inference_steps=25,
        guidance_scale=7.5,
    )
```
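
The `guidance_scale` parameter corresponds to classifier-free guidance, assuming the standard formulation: the model is evaluated with and without conditioning, and the two noise predictions are combined. A minimal sketch with dummy tensors standing in for the model's predictions:

```python
import torch

def apply_cfg(pred_uncond, pred_cond, guidance_scale):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by `guidance_scale`."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Dummy noise predictions shaped like the 8x8x4 latents
uncond = torch.zeros(1, 4, 8, 8)
cond = torch.ones(1, 4, 8, 8)

# With scale > 1 the guided prediction overshoots the conditional one
guided = apply_cfg(uncond, cond, guidance_scale=7.5)
```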

## Model Architecture

- **Patch Size**: 1×1
- **Embedding Dimension**: 384
- **Transformer Layers**: 12
- **Attention Heads**: 6
- **Cross Attention Heads**: 4
- **MLP Multiplier**: 4
- **Timesteps**: Continuous (beta distribution)
- **Beta Distribution**: a=1.0, b=2.5
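
Drawing continuous timesteps from the stated Beta(a=1.0, b=2.5) distribution can be sketched as follows (an illustration of the listed schedule, not the training code itself):

```python
import torch

# Beta(a=1.0, b=2.5) concentrates probability mass toward t=0,
# matching the parameters listed above
beta = torch.distributions.Beta(torch.tensor(1.0), torch.tensor(2.5))

batch_size = 128
t = beta.sample((batch_size,))  # continuous timesteps in (0, 1)
```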

## Training Details

- **Learning Rate**: 1e-3 (Constant)
- **Batch Size**: 128
- **Optimizer**: AdamW
- **Mixed Precision**: Yes
- **Gradient Accumulation**: 1
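
Those hyperparameters translate into a setup along these lines. The model and loss below are stand-ins; only the optimizer settings and batch size come from the card, and mixed precision (`torch.autocast`) is omitted for brevity:

```python
import torch

# Stand-in denoiser; the real model is the transformer described above
model = torch.nn.Linear(4 * 8 * 8, 4 * 8 * 8)

# AdamW with the constant 1e-3 learning rate from the card
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One step on a dummy batch of flattened 8x8x4 latents (batch size 128)
latents = torch.randn(128, 4 * 8 * 8)
target = torch.randn(128, 4 * 8 * 8)

optimizer.zero_grad()
loss = torch.nn.functional.mse_loss(model(latents), target)
loss.backward()
optimizer.step()
```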

## Results

The model generates recognizable Fashion MNIST images conditioned on class labels; the 8×8×4 latents it produces are decoded to 64×64 pixel images by the VAE used to create the training dataset.