# Fine-tuned Stable Diffusion Text-to-Image Model
This is a fine-tuned Stable Diffusion model for text-to-image generation, trained on the Flickr30k dataset.
## Model Architecture
- Base Model: Stable Diffusion v1.5 (runwayml/stable-diffusion-v1-5)
- Fine-tuned Component: UNet (`UNet2DConditionModel`)
- Frozen Components: VAE, CLIP Text Encoder
- Scheduler: DDPM Scheduler
## Training Details
- Dataset: Flickr30k (500 samples)
- Training Steps: 500
- Epochs: 1
- Final Loss: 0.1645
- Training Time: 7.5 hours
- Batch Size: 1
- Learning Rate: 1e-5
- Optimizer: AdamW
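The figures above are mutually consistent, which a quick back-of-the-envelope check makes explicit (plain Python, illustrative only):

```python
# Sanity check: with batch size 1 and a single epoch over 500 samples,
# one epoch is exactly the 500 reported training steps.
dataset_size = 500   # Flickr30k samples used
batch_size = 1
epochs = 1
steps = (dataset_size // batch_size) * epochs
print(steps)  # 500

# 7.5 hours over 500 steps works out to 54 seconds per step.
training_hours = 7.5
sec_per_step = training_hours * 3600 / steps
print(sec_per_step)  # 54.0
```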
## Performance
The model was trained to understand and generate images from text descriptions, with particular strength in:
- Dynamic action scenes
- Multiple object interactions
- Facial expressions and emotions
- Various artistic styles
- Environmental details
## Example Generations
The model can generate images for prompts like:
- "A cat jumping from a train and smiling"
- "A beautiful sunset over the ocean"
- "Two people walking in a park"
- "A modern city skyline at night"
## Usage
```python
from diffusers import StableDiffusionPipeline
import torch

# Load the fine-tuned model
pipe = StableDiffusionPipeline.from_pretrained(
    "kunaliitkgp09/flickr30k-text-to-image",
    torch_dtype=torch.float32,
)

# Generate an image from a text prompt
prompt = "A cat jumping from a train and smiling"
image = pipe(
    prompt,
    num_inference_steps=20,
    guidance_scale=7.5,
    width=512,
    height=512,
).images[0]
image.save("generated_image.png")
```
## Model Files
- `unet/`: Fine-tuned UNet weights
- `config.json`: Model configuration
- `README.md`: This file
- `model_index.json`: Pipeline configuration
## Training Process
The model was fine-tuned using a diffusion training approach:
1. Load the pre-trained Stable Diffusion v1.5 components
2. Freeze the VAE and text encoder for efficiency
3. Fine-tune only the UNet on Flickr30k image-caption pairs
4. Train with the noise-prediction loss under the DDPM scheduler
5. Save checkpoints every 50 steps
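The noise-prediction objective in the steps above can be sketched in a few lines. This is an illustrative NumPy reimplementation, not the actual training code (which used diffusers' `DDPMScheduler` and the real UNet); `toy_predictor` is a hypothetical stand-in for the UNet, and the linear beta schedule values are the common DDPM defaults:

```python
import numpy as np

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddpm_loss(x0, t, a_bar, predict_noise, rng):
    """Noise-prediction (epsilon) MSE loss at timestep t."""
    eps = rng.standard_normal(x0.shape)
    # Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps
    x_t = np.sqrt(a_bar[t]) * x0 + np.sqrt(1.0 - a_bar[t]) * eps
    return np.mean((predict_noise(x_t, t) - eps) ** 2)

# Toy usage: a hypothetical predictor that always outputs zeros
rng = np.random.default_rng(0)
a_bar = alpha_bars()
x0 = rng.standard_normal((4, 4))  # stands in for a latent image
toy_predictor = lambda x_t, t: np.zeros_like(x_t)
loss = ddpm_loss(x0, t=500, a_bar=a_bar, predict_noise=toy_predictor, rng=rng)
```

During real training, `t` is sampled uniformly per example and the loss is backpropagated through the UNet only, since the VAE and text encoder are frozen.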
## Comparison with Base Model
While base Stable Diffusion v1.5 provides excellent general-purpose text-to-image generation, this fine-tuned version has been adapted to the visual patterns and caption styles found in Flickr30k, so it may perform better on similar scenes and descriptions.
## Limitations
- Trained on a limited subset (500 samples) of Flickr30k
- Single-epoch training; additional training could improve quality
- Inherits the limitations of the base Stable Diffusion model
- May work best with prompt styles similar to Flickr30k captions
## Citation
If you use this model, please cite:
```bibtex
@misc{flickr30k-text-to-image,
  title={Fine-tuned Stable Diffusion for Text-to-Image Generation on Flickr30k},
  author={Kunal Dhanda},
  year={2024},
  url={https://huggingface.co/kunaliitkgp09/flickr30k-text-to-image}
}
```
## License
This model is based on Stable Diffusion v1.5 and follows the same licensing terms (CreativeML Open RAIL-M).