
Fine-tuned Stable Diffusion Text-to-Image Model

This is a fine-tuned Stable Diffusion model for text-to-image generation, trained on the Flickr30k dataset.

Model Architecture

  • Base Model: Stable Diffusion v1.5 (runwayml/stable-diffusion-v1-5)
  • Fine-tuned Component: UNet (UNet2DConditionModel)
  • Frozen Components: VAE, CLIP Text Encoder
  • Scheduler: DDPM Scheduler
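The frozen/trainable split above can be illustrated with a minimal sketch. Small nn.Linear layers stand in for the real AutoencoderKL (VAE), CLIPTextModel (text encoder), and UNet2DConditionModel so the snippet runs without downloading weights; the `requires_grad_` freezing pattern is the same for the real components.

```python
import torch.nn as nn

# Toy stand-ins for the pipeline components. In the real pipeline these are
# AutoencoderKL (VAE), CLIPTextModel (text encoder), and UNet2DConditionModel;
# small linear layers keep the sketch runnable offline.
vae = nn.Linear(8, 8)
text_encoder = nn.Linear(8, 8)
unet = nn.Linear(8, 8)

# Freeze the VAE and text encoder; only the UNet receives gradient updates
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()

# Count parameters on each side of the split
frozen = sum(p.numel() for m in (vae, text_encoder) for p in m.parameters())
trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
```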

Training Details

  • Dataset: Flickr30k (500 samples)
  • Training Steps: 500
  • Epochs: 1
  • Final Loss: 0.1645
  • Training Time: 7.5 hours
  • Batch Size: 1
  • Learning Rate: 1e-5
  • Optimizer: AdamW
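The optimizer setup implied by the hyperparameters above might look like the following sketch; a small linear layer stands in for the fine-tuned UNet so it is self-contained.

```python
import torch

# Hyperparameters from the list above (AdamW, learning rate 1e-5).
# A small linear layer stands in for the UNet2DConditionModel.
unet = torch.nn.Linear(4, 4)
learning_rate = 1e-5
optimizer = torch.optim.AdamW(unet.parameters(), lr=learning_rate)
```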

Performance

The model was fine-tuned to generate images from text descriptions, with intended strengths (reflecting the content of Flickr30k captions) in:

  • Dynamic action scenes
  • Multiple object interactions
  • Facial expressions and emotions
  • Various artistic styles
  • Environmental details

Example Generations

The model can generate images for prompts like:

  • "A cat jumping from a train and smiling"
  • "A beautiful sunset over the ocean"
  • "Two people walking in a park"
  • "A modern city skyline at night"

Usage

from diffusers import StableDiffusionPipeline
import torch

# Load the fine-tuned model
pipe = StableDiffusionPipeline.from_pretrained(
    "kunaliitkgp09/flickr30k-text-to-image",
    torch_dtype=torch.float32
)
# pipe = pipe.to("cuda")  # optionally move to GPU for faster inference

# Generate image
prompt = "A cat jumping from a train and smiling"
image = pipe(
    prompt,
    num_inference_steps=20,
    guidance_scale=7.5,
    width=512,
    height=512
).images[0]

image.save("generated_image.png")

Model Files

  • unet/: Fine-tuned UNet2DConditionModel weights
  • config.json: Model configuration
  • README.md: This file
  • model_index.json: Pipeline configuration

Training Process

The model was fine-tuned using a diffusion training approach:

  1. Load pre-trained Stable Diffusion v1.5 components
  2. Freeze VAE and text encoder for efficiency
  3. Fine-tune only the UNet on Flickr30k image-caption pairs
  4. Use noise prediction loss with DDPM scheduler
  5. Save checkpoints every 50 steps

Comparison with Base Model

While the base Stable Diffusion v1.5 model provides strong general-purpose text-to-image generation, this fine-tuned version has been adapted to the visual patterns and caption styles of the Flickr30k dataset, and may therefore perform better on similar scenes and descriptions.

Limitations

  • Trained on a limited subset (500 samples) of Flickr30k
  • Single-epoch training; additional training could improve quality
  • Inherits limitations of base Stable Diffusion model
  • May work best with prompt styles similar to Flickr30k captions

Citation

If you use this model, please cite:

@misc{flickr30k-text-to-image,
  title={Fine-tuned Stable Diffusion for Text-to-Image Generation on Flickr30k},
  author={Kunal Dhanda},
  year={2024},
  url={https://huggingface.co/kunaliitkgp09/flickr30k-text-to-image}
}

License

This model is based on Stable Diffusion v1.5 and is distributed under the same licensing terms (CreativeML Open RAIL-M).
