---
license: mit
datasets:
- none-yet/anime-captions
base_model:
- CompVis/stable-diffusion-v1-4
pipeline_tag: text-to-image
tags:
- diffusers
- stable-diffusion
- text-to-image
---

# Anime-Diffusion UNet

A UNet2DConditionModel fine-tuned for anime-style image generation, based on Stable Diffusion v1.4.

## Model Details

- **Architecture:** UNet2DConditionModel from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)
- **EMA Decay:** 0.9995
- **Output Resolution:** 512×512
- **Prediction Type:** epsilon
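
The EMA decay listed above can be read as a per-step update rule. The following is a minimal sketch, assuming a plain constant-decay exponential moving average with no warmup; the card does not say which EMA variant was used, and `ema_update` is a hypothetical helper written for illustration.

```python
EMA_DECAY = 0.9995  # value listed under Model Details


def ema_update(ema_params, params, decay=EMA_DECAY):
    """Return the EMA after one step: decay * ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# A constant-decay EMA averages over roughly 1 / (1 - decay) recent steps.
window_steps = 1.0 / (1.0 - EMA_DECAY)

# Toy example: two scalar "parameters" held fixed for three updates.
ema = [0.0, 0.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 2.0])

print(f"averaging window: about {window_steps:.0f} steps")
print(f"EMA after 3 updates: {ema}")
```

With a decay of 0.9995 the averaging window is on the order of 2,000 steps, so the EMA weights lag the raw weights by a noticeable fraction of a 10,000-step run.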

### Companion Models (required for inference)

| Component | Model ID |
|-----------|----------|
| VAE | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) |
| Text Encoder | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |
| Tokenizer | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |

## Training Details

- **Dataset:** [none-yet/anime-captions](https://huggingface.co/datasets/none-yet/anime-captions) (~337k image-caption pairs)
- **Steps:** 10,000
- **Batch Size:** 128 (32 per GPU × 4 GPUs)
- **Learning Rate:** 1e-4 with cosine schedule (500 warmup steps)
- **Optimizer:** AdamW (weight decay 0.01)
- **Mixed Precision:** fp16
- **Noise Schedule:** DDPM, 1,000 timesteps with a linear beta schedule
- **Gradient Clipping:** 1.0
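
The learning-rate and noise schedules listed above can be written out explicitly. This is a rough sketch under stated assumptions, not the training code: it assumes linear warmup followed by cosine decay to zero, and a beta range of 1e-4 to 0.02 (the diffusers scheduler defaults, which the card does not confirm were used in training). The function names are hypothetical.

```python
import math

TOTAL_STEPS = 10_000   # from Training Details
WARMUP_STEPS = 500     # from Training Details
PEAK_LR = 1e-4         # from Training Details
NUM_TIMESTEPS = 1000   # from Training Details
BETA_START, BETA_END = 1e-4, 0.02  # assumption: diffusers default beta range


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def alphas_cumprod():
    """Cumulative product of (1 - beta_t) for linearly spaced betas."""
    values, running = [], 1.0
    for t in range(NUM_TIMESTEPS):
        beta = BETA_START + (BETA_END - BETA_START) * t / (NUM_TIMESTEPS - 1)
        running *= 1.0 - beta
        values.append(running)
    return values


def add_noise(x0, eps, t, abar):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.

    With epsilon prediction, the UNet is trained to recover eps from x_t.
    """
    return math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps


abar = alphas_cumprod()
print(f"lr at step 250: {learning_rate(250):.2e}")
print(f"signal fraction at t=999: {abar[-1]:.2e}")
```

For scale, 10,000 steps at batch size 128 is 1.28M training samples, or roughly 3.8 passes over the ~337k pairs.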

## Usage

```python
import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import CLIPTextModel, CLIPTokenizer

# Load models
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# Load fine-tuned EMA weights
weights_path = hf_hub_download(repo_id="dixisouls/anime-diffusion", filename="model.safetensors")
unet.load_state_dict(load_file(weights_path))

# Use DDIMScheduler for inference
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
)
```

See the companion [Hugging Face Space](https://huggingface.co/spaces/dixisouls/stable-anime) for a full interactive demo.

## Limitations

- Trained exclusively on anime-style images; not suitable for photorealistic generation
- Fixed output resolution of 512×512
- Single-subject prompts work best; complex multi-character scenes may be inconsistent