Commit b3bf2eb by shreenithi20 · verified · 1 Parent(s): c93e607

Update README.md

Files changed (1)
  1. README.md +64 -60
README.md CHANGED
@@ -1,61 +1,65 @@
- # Fashion MNIST Text-to-Image Diffusion Model
-
- A transformer-based diffusion model trained on Fashion MNIST latent representations for text-to-image generation.
-
- ## Model Information
-
- - **Architecture**: Transformer-based diffusion model
- - **Input**: 8×8×4 VAE latents
- - **Conditioning**: Text embeddings (class labels)
- - **Training Steps**: 8,500
- - **Dataset**: [Fashion MNIST 8×8 Latents](https://huggingface.co/datasets/shreenithi20/fmnist-8x8-latents)
- - **Framework**: PyTorch
-
- ## Checkpoints
-
- - `model-1000.safetensors`: Early training (1k steps)
- - `model-3000.safetensors`: Mid training (3k steps)
- - `model-5000.safetensors`: Advanced training (5k steps)
- - `model-8500.safetensors`: Final model (8.5k steps)
-
- ## Usage
-
- ```python
- from transformers import AutoConfig, AutoModel
- import torch
-
- # Load model
- model = AutoModel.from_pretrained("shreenithi20/fmnist-t2i-diffusion")
- model.eval()
-
- # Generate images
- with torch.no_grad():
-     generated_latents = model.generate(
-         text_embeddings=class_labels,
-         num_inference_steps=25,
-         guidance_scale=7.5
-     )
- ```
-
- ## Model Architecture
-
- - **Patch Size**: 1×1
- - **Embedding Dimension**: 384
- - **Transformer Layers**: 12
- - **Attention Heads**: 6
- - **Cross Attention Heads**: 4
- - **MLP Multiplier**: 4
- - **Timesteps**: Continuous (beta distribution)
- - **Beta Distribution**: a=1.0, b=2.5
-
- ## Training Details
-
- - **Learning Rate**: 1e-3 (Constant)
- - **Batch Size**: 128
- - **Optimizer**: AdamW
- - **Mixed Precision**: Yes
- - **Gradient Accumulation**: 1
-
- ## Results
-
+ ---
+ datasets:
+ - shreenithi20/fmnist-8x8-latents
+ ---
+ # Fashion MNIST Text-to-Image Diffusion Model
+
+ A transformer-based diffusion model trained on Fashion MNIST latent representations for text-to-image generation.
+
+ ## Model Information
+
+ - **Architecture**: Transformer-based diffusion model
+ - **Input**: 8×8×4 VAE latents
+ - **Conditioning**: Text embeddings (class labels)
+ - **Training Steps**: 8,500
+ - **Dataset**: [Fashion MNIST 8×8 Latents](https://huggingface.co/datasets/shreenithi20/fmnist-8x8-latents)
+ - **Framework**: PyTorch
+
+ ## Checkpoints
+
+ - `model-1000.safetensors`: Early training (1k steps)
+ - `model-3000.safetensors`: Mid training (3k steps)
+ - `model-5000.safetensors`: Advanced training (5k steps)
+ - `model-8500.safetensors`: Final model (8.5k steps)
+
+ ## Usage
+
+ ```python
+ from transformers import AutoConfig, AutoModel
+ import torch
+
+ # Load model
+ model = AutoModel.from_pretrained("shreenithi20/fmnist-t2i-diffusion")
+ model.eval()
+
+ # Generate images
+ with torch.no_grad():
+     generated_latents = model.generate(
+         text_embeddings=class_labels,
+         num_inference_steps=25,
+         guidance_scale=7.5
+     )
+ ```
+
+ ## Model Architecture
+
+ - **Patch Size**: 1×1
+ - **Embedding Dimension**: 384
+ - **Transformer Layers**: 12
+ - **Attention Heads**: 6
+ - **Cross Attention Heads**: 4
+ - **MLP Multiplier**: 4
+ - **Timesteps**: Continuous (beta distribution)
+ - **Beta Distribution**: a=1.0, b=2.5
+
+ ## Training Details
+
+ - **Learning Rate**: 1e-3 (Constant)
+ - **Batch Size**: 128
+ - **Optimizer**: AdamW
+ - **Mixed Precision**: Yes
+ - **Gradient Accumulation**: 1
+
+ ## Results
+
  The model generates high-quality Fashion MNIST images conditioned on class labels, with 8×8 latent resolution that can be decoded to 64×64 pixel images.
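
The hyperparameters listed in the README imply a few concrete shapes that may be useful when reading the checkpoints. The following is a minimal sketch, not the author's code: `ModelCardConfig`, `derived_shapes`, and `sample_timesteps` are hypothetical names introduced here, and the sketch assumes that the embedding dimension is split evenly across heads and that training timesteps are drawn directly from Beta(a, b) on [0, 1]. The README does not say whether a small timestep means low or high noise, so no such convention is assumed.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCardConfig:
    """Values copied from the README; the class itself is hypothetical."""
    latent_size: int = 8            # 8x8 latent grid
    latent_channels: int = 4        # 4 VAE channels
    patch_size: int = 1             # 1x1 patches
    embed_dim: int = 384
    attention_heads: int = 6
    cross_attention_heads: int = 4
    mlp_multiplier: int = 4
    beta_a: float = 1.0
    beta_b: float = 2.5
    decoded_size: int = 64          # decoded image is 64x64 pixels


def derived_shapes(cfg: ModelCardConfig) -> dict:
    """Compute sizes implied by the config (assumes an even split across heads)."""
    tokens_per_side = cfg.latent_size // cfg.patch_size
    return {
        "num_tokens": tokens_per_side ** 2,                          # 64
        "patch_dim": cfg.patch_size ** 2 * cfg.latent_channels,      # 4
        "self_attn_head_dim": cfg.embed_dim // cfg.attention_heads,  # 64
        "cross_attn_head_dim": cfg.embed_dim // cfg.cross_attention_heads,  # 96
        "mlp_hidden_dim": cfg.embed_dim * cfg.mlp_multiplier,        # 1536
        "vae_upsample_factor": cfg.decoded_size // cfg.latent_size,  # 8
    }


def sample_timesteps(cfg: ModelCardConfig, n: int, seed: int = 0) -> list:
    """Draw n continuous timesteps in [0, 1] from Beta(beta_a, beta_b)."""
    rng = random.Random(seed)
    return [rng.betavariate(cfg.beta_a, cfg.beta_b) for _ in range(n)]


if __name__ == "__main__":
    cfg = ModelCardConfig()
    for name, value in derived_shapes(cfg).items():
        print(f"{name}: {value}")
    print("example timesteps:", sample_timesteps(cfg, 5))
```

With a=1.0 and b=2.5 the Beta distribution has mean a / (a + b) ≈ 0.286, so under these assumptions sampled timesteps concentrate toward the lower end of [0, 1].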