# Fine-tuned Stable Diffusion Text-to-Image Model
This is a fine-tuned Stable Diffusion model for text-to-image generation, trained on the Flickr30k dataset.
## Model Architecture
- Base Model: Stable Diffusion v1.5 (runwayml/stable-diffusion-v1-5)
- Fine-tuned Component: UNet (`UNet2DConditionModel`)
- Frozen Components: VAE, CLIP Text Encoder
- Scheduler: DDPM Scheduler
## Training Details
- Dataset: Flickr30k (500 samples)
- Training Steps: 500
- Epochs: 1
- Final Loss: 0.1645
- Training Time: 7.5 hours
- Batch Size: 1
- Learning Rate: 1e-5
- Optimizer: AdamW
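The figures above are mutually consistent, which a quick back-of-the-envelope check makes explicit (plain Python, illustrative only):

```python
# Sanity check: with batch size 1 and a single epoch over 500 samples,
# one epoch is exactly the 500 reported training steps.
dataset_size = 500   # Flickr30k samples used
batch_size = 1
epochs = 1
steps = (dataset_size // batch_size) * epochs
print(steps)  # 500

# 7.5 hours over 500 steps works out to 54 seconds per step.
training_hours = 7.5
sec_per_step = training_hours * 3600 / steps
print(sec_per_step)  # 54.0
```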
## Performance
The model was trained to understand and generate images from text descriptions, with particular strength in:
- Dynamic action scenes
- Multiple object interactions
- Facial expressions and emotions
- Various artistic styles
- Environmental details
## Example Generations
The model can generate images for prompts like:
- "A cat jumping from a train and smiling"
- "A beautiful sunset over the ocean"
- "Two people walking in a park"
- "A modern city skyline at night"
## Usage
```python
from diffusers import StableDiffusionPipeline
import torch

# Load the fine-tuned model
pipe = StableDiffusionPipeline.from_pretrained(
    "kunaliitkgp09/flickr30k-text-to-image",
    torch_dtype=torch.float32,
)

# Generate an image from a text prompt
prompt = "A cat jumping from a train and smiling"
image = pipe(
    prompt,
    num_inference_steps=20,
    guidance_scale=7.5,
    width=512,
    height=512,
).images[0]
image.save("generated_image.png")
```
## Model Files
- `unet/`: Fine-tuned UNet weights
- `config.json`: Model configuration
- `README.md`: This file
- `model_index.json`: Pipeline configuration
## Training Process
The model was fine-tuned using a diffusion training approach:
1. Load the pre-trained Stable Diffusion v1.5 components
2. Freeze the VAE and text encoder for efficiency
3. Fine-tune only the UNet on Flickr30k image-caption pairs
4. Train with the noise-prediction loss under the DDPM scheduler
5. Save checkpoints every 50 steps
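The noise-prediction objective in the steps above can be sketched in a few lines. This is an illustrative NumPy reimplementation, not the actual training code (which used diffusers' `DDPMScheduler` and the real UNet); `toy_predictor` is a hypothetical stand-in for the UNet, and the linear beta schedule values are the common DDPM defaults:

```python
import numpy as np

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddpm_loss(x0, t, a_bar, predict_noise, rng):
    """Noise-prediction (epsilon) MSE loss at timestep t."""
    eps = rng.standard_normal(x0.shape)
    # Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    x_t = np.sqrt(a_bar[t]) * x0 + np.sqrt(1.0 - a_bar[t]) * eps
    return np.mean((predict_noise(x_t, t) - eps) ** 2)

# Toy usage: a hypothetical predictor that always outputs zeros
rng = np.random.default_rng(0)
a_bar = alpha_bars()
x0 = rng.standard_normal((4, 4))  # stands in for a latent image
toy_predictor = lambda x_t, t: np.zeros_like(x_t)
loss = ddpm_loss(x0, t=500, a_bar=a_bar, predict_noise=toy_predictor, rng=rng)
```

During real training, `t` is sampled uniformly per example and the loss is backpropagated through the UNet only, since the VAE and text encoder are frozen.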
## Comparison with Base Model
While base Stable Diffusion v1.5 provides excellent general-purpose text-to-image generation, this fine-tuned version has been adapted to the visual patterns and caption styles found in Flickr30k, so it may perform better on similar scenes and descriptions.
## Limitations
- Trained on a limited subset (500 samples) of Flickr30k
- Single-epoch training; additional training could improve quality
- Inherits the limitations of the base Stable Diffusion model
- May work best with prompt styles similar to Flickr30k captions
## Citation
If you use this model, please cite:
```bibtex
@misc{flickr30k-text-to-image,
  title={Fine-tuned Stable Diffusion for Text-to-Image Generation on Flickr30k},
  author={Kunal Dhanda},
  year={2024},
  url={https://huggingface.co/kunaliitkgp09/flickr30k-text-to-image}
}
```
## License
This model is based on Stable Diffusion v1.5 and follows the same licensing terms (CreativeML Open RAIL-M).