tags:
- diffusers
- stable-diffusion
- text-to-image
---
# Anime-Diffusion UNet

A UNet2DConditionModel fine-tuned for anime-style image generation, based on Stable Diffusion v1.4.
## Model Details

- **Architecture:** UNet2DConditionModel from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)
- **EMA Decay:** 0.9995
- **Output Resolution:** 512×512
- **Prediction Type:** epsilon
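Since the prediction type is epsilon, training minimizes the usual noise-regression MSE. A minimal sketch of that objective, not this repo's actual training script; the helper name is illustrative, and `scheduler.add_noise` follows the diffusers scheduler API:

```python
import torch
import torch.nn.functional as F

def ddpm_epsilon_loss(model, scheduler, latents, text_emb):
    # Sample Gaussian noise and a random timestep per example.
    noise = torch.randn_like(latents)
    t = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    # Forward-noise the latents, then regress the added noise (epsilon-prediction).
    noisy = scheduler.add_noise(latents, noise, t)
    pred = model(noisy, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)
```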
### Companion Models (required for inference)

| Component | Model ID |
|-----------|----------|
| VAE | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) |
| Text Encoder | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |
| Tokenizer | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |
## Training Details

- **Dataset:** [none-yet/anime-captions](https://huggingface.co/datasets/none-yet/anime-captions) (~337k image-caption pairs)
- **Steps:** 10,000
- **Batch Size:** 128 (32 per GPU × 4 GPUs)
- **Learning Rate:** 1e-4 with cosine schedule (500 warmup steps)
- **Optimizer:** AdamW (weight decay 0.01)
- **Mixed Precision:** fp16
- **Noise Schedule:** DDPM, 1000 linear timesteps
- **Gradient Clipping:** 1.0
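The released checkpoint is the EMA copy of the weights (decay 0.9995). The update behind that is the standard exponential moving average applied after each optimizer step; a minimal sketch with illustrative parameter names:

```python
def ema_update(ema_params, model_params, decay=0.9995):
    """Standard EMA update (sketch): ema <- decay * ema + (1 - decay) * current."""
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)
```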
## Usage

```python
import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import CLIPTextModel, CLIPTokenizer

# Load models
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# Load fine-tuned EMA weights
weights_path = hf_hub_download(repo_id="dixisouls/anime-diffusion", filename="model.safetensors")
unet.load_state_dict(load_file(weights_path))

# Use DDIMScheduler for inference
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
)
```
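To generate an image, the components above can drive a classifier-free-guidance DDIM sampling loop. A minimal sketch, assuming CUDA (or CPU) placement; the step count, guidance scale, and function name are illustrative, not part of this repo:

```python
import torch

@torch.no_grad()
def generate(prompt, unet, vae, text_encoder, tokenizer, scheduler,
             steps=50, guidance_scale=7.5, device="cuda"):
    # Encode the prompt and an empty prompt for classifier-free guidance.
    def encode(text):
        ids = tokenizer(text, padding="max_length",
                        max_length=tokenizer.model_max_length,
                        truncation=True, return_tensors="pt").input_ids
        return text_encoder(ids.to(device))[0]

    cond, uncond = encode(prompt), encode("")

    # Start from pure noise in the 64x64 latent space (512 / 8).
    latents = torch.randn(1, unet.config.in_channels, 64, 64, device=device)
    scheduler.set_timesteps(steps)
    latents = latents * scheduler.init_noise_sigma

    for t in scheduler.timesteps:
        # Predict noise for the unconditioned and conditioned embeddings in one batch.
        inp = scheduler.scale_model_input(torch.cat([latents] * 2), t)
        noise = unet(inp, t, encoder_hidden_states=torch.cat([uncond, cond])).sample
        n_uncond, n_cond = noise.chunk(2)
        noise = n_uncond + guidance_scale * (n_cond - n_uncond)
        latents = scheduler.step(noise, t, latents).prev_sample

    # Decode latents to an image tensor scaled into [0, 1].
    image = vae.decode(latents / vae.config.scaling_factor).sample
    return (image / 2 + 0.5).clamp(0, 1)
```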
See the companion [Hugging Face Space](https://huggingface.co/spaces/dixisouls/stable-anime) for a full interactive demo.
## Limitations

- Trained exclusively on anime-style images; not suitable for photorealistic generation
- Fixed output resolution of 512×512
- Single-subject prompts work best; complex multi-character scenes may be inconsistent
|