dixisouls

metadata

license: mit
datasets:
  - none-yet/anime-captions
base_model:
  - CompVis/stable-diffusion-v1-4
pipeline_tag: text-to-image
tags:
  - diffusers
  - stable-diffusion
  - text-to-image

Anime-Diffusion UNet

A UNet2DConditionModel fine-tuned for anime-style image generation, based on Stable Diffusion v1.4.

Model Details

Architecture: UNet2DConditionModel from CompVis/stable-diffusion-v1-4
EMA Decay: 0.9995
Output Resolution: 512×512
Prediction Type: epsilon
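
To give a feel for what an EMA decay of 0.9995 means, here is a minimal sketch of a fixed-decay exponential moving average over model weights. It is an illustration only: the card does not say how the EMA was implemented (for example, whether a decay warmup was used), and the helper name ema_update and the toy scalar "weight" are hypothetical.

```python
def ema_update(ema_params, params, decay=0.9995):
    """Blend the current weights into the running average, in place.

    Assumption: a plain fixed-decay EMA; the actual training code may differ.
    """
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy example: the live weight jumps to 1.0 while the EMA copy starts at 0.0.
ema = {"w": 0.0}
for _ in range(2000):
    ema_update(ema, {"w": 1.0})

# After n updates the EMA has closed a fraction 1 - decay**n of the gap.
closed_fraction = ema["w"]
```

With this decay the average has a time constant of about 1 / (1 - 0.9995) = 2,000 steps, so after 2,000 updates the EMA copy has covered roughly 63% of the distance to the new value.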

Companion Models (required for inference)

Component	Model ID
VAE	stabilityai/sd-vae-ft-mse
Text Encoder	openai/clip-vit-large-patch14
Tokenizer	openai/clip-vit-large-patch14

Training Details

Dataset: none-yet/anime-captions (~337k image-caption pairs)
Steps: 10,000
Batch Size: 128 (32 per GPU × 4 GPUs)
Learning Rate: 1e-4 with cosine schedule (500 warmup steps)
Optimizer: AdamW (weight decay 0.01)
Mixed Precision: fp16
Noise Schedule: DDPM, 1,000 timesteps with a linear beta schedule
Gradient Clipping: 1.0
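
The learning-rate settings above (1e-4 peak, 500 warmup steps, 10,000 total steps) can be sketched as a small function. This assumes linear warmup from zero followed by a half-cosine decay to zero, which is the common shape for a "cosine schedule with warmup"; the card does not state the floor value or the exact implementation, and the function name lr_at is hypothetical.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=500, total_steps=10_000):
    """Learning rate at a given optimizer step.

    Assumptions: linear warmup from 0, then cosine decay to 0 at total_steps.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points across the run.
schedule = {step: lr_at(step) for step in (0, 250, 500, 5250, 10_000)}
```

Under these assumptions the rate reaches its 1e-4 peak at step 500, is back to half of the peak at step 5,250 (the midpoint of the decay phase), and ends at zero on step 10,000.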

Usage

import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import CLIPTextModel, CLIPTokenizer

# Load the companion models and the base UNet (its weights are replaced below)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# Load fine-tuned EMA weights
weights_path = hf_hub_download(repo_id="dixisouls/anime-diffusion", filename="model.safetensors")
unet.load_state_dict(load_file(weights_path))

# Use DDIMScheduler for inference, configured to match the training noise schedule
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
)
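
The snippet above stops once the scheduler is built. A minimal sampling loop that ties the components together might look like the sketch below. It is not the author's inference code: the function name generate, the step count, guidance scale, seed and device defaults are all assumptions, and it assumes classifier-free guidance works with this checkpoint as it does with the Stable Diffusion v1.4 base.

```python
import torch


@torch.no_grad()
def generate(prompt, tokenizer, text_encoder, vae, unet, scheduler,
             steps=50, guidance_scale=7.5, seed=0, device="cpu"):
    """Sample one 512x512 image as an HxWx3 float array in [0, 1] (illustrative)."""
    # Move the networks to the target device and switch to inference mode.
    for module in (text_encoder, vae, unet):
        module.to(device).eval()

    def encode(text):
        # Tokenize to the fixed CLIP context length and return hidden states.
        ids = tokenizer(
            text,
            padding="max_length",
            max_length=tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt",
        ).input_ids.to(device)
        return text_encoder(ids)[0]

    # Empty prompt first, real prompt second, for classifier-free guidance.
    cond = torch.cat([encode(""), encode(prompt)])

    # Start from Gaussian noise in latent space: 4 channels at 1/8 resolution.
    generator = torch.Generator().manual_seed(seed)
    latents = torch.randn((1, 4, 64, 64), generator=generator).to(device)
    latents = latents * scheduler.init_noise_sigma

    scheduler.set_timesteps(steps)
    for t in scheduler.timesteps:
        # Run the unconditional and conditional branches in one batch.
        model_input = scheduler.scale_model_input(torch.cat([latents] * 2), t)
        noise_pred = unet(model_input, t, encoder_hidden_states=cond).sample
        noise_uncond, noise_text = noise_pred.chunk(2)
        noise = noise_uncond + guidance_scale * (noise_text - noise_uncond)
        # One DDIM update toward the previous (less noisy) timestep.
        latents = scheduler.step(noise, t, latents).prev_sample

    # Undo the latent scaling (0.18215 for SD v1 VAEs) and decode to pixels.
    image = vae.decode(latents / vae.config.scaling_factor).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    return image[0].permute(1, 2, 0).cpu().numpy()
```

With the objects loaded earlier, a call such as generate("a girl with silver hair under cherry blossoms", tokenizer, text_encoder, vae, unet, scheduler, device="cuda") would return an array that can be saved with PIL after scaling to 0-255; the prompt here is only an example.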

See the companion Hugging Face Space for a full interactive demo.

Limitations

Trained exclusively on anime-style images; not suitable for photorealistic generation
Fixed output resolution of 512×512
Single-subject prompts work best; complex multi-character scenes may be inconsistent