---
license: mit
datasets:
- none-yet/anime-captions
base_model:
- CompVis/stable-diffusion-v1-4
pipeline_tag: text-to-image
tags:
- diffusers
- stable-diffusion
- text-to-image
---

# Anime-Diffusion UNet

A UNet2DConditionModel fine-tuned for anime-style image generation, based on Stable Diffusion v1.4.

## Model Details

- **Architecture:** UNet2DConditionModel from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)
- **EMA Decay:** 0.9995
- **Output Resolution:** 512×512
- **Prediction Type:** epsilon
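
The EMA decay above can be read as a smoothing window. The sketch below assumes the standard exponential-moving-average rule (`ema = decay * ema + (1 - decay) * weight`); the card states only the decay value, so the exact update used in training, including any decay warmup, is an assumption here. `ema_update` and the toy scalar weight are hypothetical names used for illustration.

```python
def ema_update(ema_params, model_params, decay=0.9995):
    """Blend the current weights into the running average, in place."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy example: a single scalar "weight" whose target jumps from 0.0 to 1.0.
ema = {"w": 0.0}
for _ in range(2000):  # 2000 = 1 / (1 - 0.9995), the nominal averaging window
    ema_update(ema, {"w": 1.0})

# After one averaging window the EMA has closed roughly 63% of the gap.
print(round(ema["w"], 3))
```

With a decay of 0.9995 the average effectively spans about 2,000 optimizer steps, which is a sizeable fraction of a 10,000-step run.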

### Companion Models (required for inference)

| Component | Model ID |
|-----------|----------|
| VAE | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) |
| Text Encoder | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |
| Tokenizer | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |

## Training Details

- **Dataset:** [none-yet/anime-captions](https://huggingface.co/datasets/none-yet/anime-captions) (~337k image-caption pairs)
- **Steps:** 10,000
- **Batch Size:** 128 (32 per GPU × 4 GPUs)
- **Learning Rate:** 1e-4 with cosine schedule (500 warmup steps)
- **Optimizer:** AdamW (weight decay 0.01)
- **Mixed Precision:** fp16
- **Noise Schedule:** DDPM with 1,000 timesteps and a linear beta schedule
- **Gradient Clipping:** 1.0
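
To make the numbers above concrete, here is a minimal sketch of a cosine learning-rate schedule with warmup, plus the arithmetic for how many passes over the dataset the run amounts to. It assumes linear warmup from zero and a half-cosine decay to zero; the card does not specify the warmup shape or the final learning rate, and `lr_at` is a hypothetical helper, not the training script's own code.

```python
import math

TOTAL_STEPS = 10_000
WARMUP_STEPS = 500
PEAK_LR = 1e-4


def lr_at(step):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Assumed: linear ramp from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumed: half-cosine decay from the peak down to 0 at the final step.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Effective batch size and approximate number of dataset passes.
global_batch = 32 * 4                      # 32 per GPU x 4 GPUs = 128
samples_seen = TOTAL_STEPS * global_batch  # 1,280,000 samples
epochs = samples_seen / 337_000            # dataset size is approximate (~337k)

for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{lr_at(step):.2e}")
print(f"approx. epochs: {epochs:.1f}")
```

Under these assumptions the run covers a little under four passes over the dataset.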

## Usage

```python
import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import CLIPTextModel, CLIPTokenizer

# Load the companion models and the base SD v1.4 UNet architecture
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# Overwrite the base UNet weights with the fine-tuned EMA weights
weights_path = hf_hub_download(repo_id="dixisouls/anime-diffusion", filename="model.safetensors")
unet.load_state_dict(load_file(weights_path))

# DDIM sampler configured to match training: 1000 linear timesteps, epsilon prediction
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
)
```
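
As a rough illustration of what `num_train_timesteps=1000` and `beta_schedule="linear"` imply, the sketch below builds a linear beta schedule in plain Python and applies the DDPM forward-noising formula to a scalar. The endpoints `BETA_START` and `BETA_END` are assumptions: they are the `diffusers` scheduler defaults, which the snippet above leaves untouched, and the card does not state which values were used in training. `add_noise` here is a hypothetical scalar helper, not the library method.

```python
import math

NUM_TRAIN_TIMESTEPS = 1000
BETA_START, BETA_END = 1e-4, 0.02  # assumed: diffusers defaults, not stated in this card

# Linear schedule: betas evenly spaced between the two endpoints.
betas = [
    BETA_START + (BETA_END - BETA_START) * t / (NUM_TRAIN_TIMESTEPS - 1)
    for t in range(NUM_TRAIN_TIMESTEPS)
]

# Cumulative product of (1 - beta): the fraction of signal variance left at step t.
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)


def add_noise(x0, eps, t):
    """DDPM forward process for a scalar: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    a = alphas_cumprod[t]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps


for t in (0, 499, 999):
    print(t, f"signal fraction: {alphas_cumprod[t]:.5f}")
```

With epsilon prediction, the UNet is trained to recover `eps` from the noised input at a given `t`; by the last timestep almost no signal remains under these assumed endpoints.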

See the companion [Hugging Face Space](https://huggingface.co/spaces/dixisouls/stable-anime) for a full interactive demo.

## Limitations

- Trained exclusively on anime-style images; not suitable for photorealistic generation
- Fixed output resolution of 512×512
- Single-subject prompts work best; complex multi-character scenes may be inconsistent