---
license: mit
datasets:
- none-yet/anime-captions
base_model:
- CompVis/stable-diffusion-v1-4
pipeline_tag: text-to-image
tags:
- diffusers
- stable-diffusion
- text-to-image
---

# Anime-Diffusion UNet

A UNet2DConditionModel fine-tuned for anime-style image generation, based on Stable Diffusion v1.4.

## Model Details

- **Architecture:** UNet2DConditionModel from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)
- **EMA Decay:** 0.9995
- **Output Resolution:** 512×512
- **Prediction Type:** epsilon

### Companion Models (required for inference)

| Component | Model ID |
|-----------|----------|
| VAE | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) |
| Text Encoder | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |
| Tokenizer | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |

## Training Details

- **Dataset:** [none-yet/anime-captions](https://huggingface.co/datasets/none-yet/anime-captions) (~337k image-caption pairs)
- **Steps:** 10,000
- **Batch Size:** 128 (32 per GPU × 4 GPUs)
- **Learning Rate:** 1e-4 with cosine schedule (500 warmup steps)
- **Optimizer:** AdamW (weight decay 0.01)
- **Mixed Precision:** fp16
- **Noise Schedule:** DDPM, 1000 linear timesteps
- **Gradient Clipping:** 1.0

## Usage

```python
import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import CLIPTextModel, CLIPTokenizer

# Load models
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# Load fine-tuned EMA weights
weights_path = hf_hub_download(repo_id="dixisouls/anime-diffusion", filename="model.safetensors")
unet.load_state_dict(load_file(weights_path))

# Use DDIMScheduler for inference
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
)
```

See the companion [HuggingFace Space](https://huggingface.co/spaces/dixisouls/stable-anime) for a full interactive demo.

## Limitations

- Trained exclusively on anime-style images; not suitable for photorealistic generation
- Fixed output resolution of 512×512
- Single-subject prompts work best; complex multi-character scenes may be inconsistent
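
## Training Schedule Sketch

The numbers listed under Training Details can be cross-checked with a few lines of arithmetic. The sketch below recomputes the effective batch size, the total number of samples seen, and the approximate number of passes over the dataset. It also evaluates a learning-rate curve at a few steps. The exact schedule is an assumption: the card only states "cosine schedule (500 warmup steps)", so the code assumes a linear warmup from 0 followed by a half-cosine decay to 0. The function name `lr_at` and the constant names are illustrative and do not come from the training code.

```python
import math

# Values taken from the Training Details section of this card.
TOTAL_STEPS = 10_000
WARMUP_STEPS = 500
PEAK_LR = 1e-4
PER_GPU_BATCH = 32
NUM_GPUS = 4
DATASET_SIZE = 337_000  # approximate; the card says "~337k" pairs


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to PEAK_LR, then a half-cosine
    decay from PEAK_LR to 0 over the remaining steps.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold at 0 past the final step
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


effective_batch = PER_GPU_BATCH * NUM_GPUS    # 32 x 4 = 128
samples_seen = effective_batch * TOTAL_STEPS  # 128 x 10,000 = 1,280,000
approx_epochs = samples_seen / DATASET_SIZE   # roughly 3.8 passes

if __name__ == "__main__":
    print(f"effective batch size: {effective_batch}")
    print(f"samples seen: {samples_seen:,}")
    print(f"approximate epochs: {approx_epochs:.1f}")
    for step in (0, 250, 500, 5_250, 10_000):
        print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the model saw about 1.28M samples, or a little under four passes over the dataset, and the learning rate falls to half its peak at step 5,250. The `get_cosine_schedule_with_warmup` helper in `diffusers.optimization` produces a curve of the same shape with its default settings; whether training used that helper is not stated in this card.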