dixisouls committed · Commit 46bbc9b · Parent: d23ccdf

docs: update README

Files changed (1): README.md (+67 -1)
  - diffusers
  - stable-diffusion
  - text-to-image
---

# Anime-Diffusion UNet

A UNet2DConditionModel fine-tuned for anime-style image generation, based on Stable Diffusion v1.4.

## Model Details

- **Architecture:** UNet2DConditionModel from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)
- **EMA Decay:** 0.9995
- **Output Resolution:** 512×512
- **Prediction Type:** epsilon

### Companion Models (required for inference)

| Component | Model ID |
|-----------|----------|
| VAE | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) |
| Text Encoder | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |
| Tokenizer | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |

## Training Details

- **Dataset:** [none-yet/anime-captions](https://huggingface.co/datasets/none-yet/anime-captions) (~337k image-caption pairs)
- **Steps:** 10,000
- **Batch Size:** 128 (32 per GPU × 4 GPUs)
- **Learning Rate:** 1e-4 with cosine schedule (500 warmup steps)
- **Optimizer:** AdamW (weight decay 0.01)
- **Mixed Precision:** fp16
- **Noise Schedule:** DDPM, 1000 linear timesteps
- **Gradient Clipping:** 1.0
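The published checkpoint holds the EMA copy of the UNet weights. The exact training loop is not part of this repo, but with the decay of 0.9995 listed above, the per-step EMA update can be sketched as:

```python
import copy

import torch

EMA_DECAY = 0.9995  # decay used for this model (see Model Details)


@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = EMA_DECAY) -> None:
    """Blend the online weights into the EMA copy:
    ema <- decay * ema + (1 - decay) * online."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p.detach(), alpha=1.0 - decay)


# After each optimizer step during training, one would call:
#   ema_update(ema_unet, unet)
```

A decay of 0.9995 averages over roughly 1/(1-0.9995) = 2000 recent steps, which is why the EMA checkpoint is smoother than the raw final weights.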
## Usage

```python
import torch
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import CLIPTextModel, CLIPTokenizer

# Load the companion models and the base UNet architecture
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# Load the fine-tuned EMA weights into the UNet
weights_path = hf_hub_download(repo_id="dixisouls/anime-diffusion", filename="model.safetensors")
unet.load_state_dict(load_file(weights_path))

# Use DDIMScheduler for inference (settings match the DDPM training schedule)
scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_schedule="linear",
    clip_sample=False,
    prediction_type="epsilon",
)
```

See the companion [HuggingFace Space](https://huggingface.co/spaces/dixisouls/stable-anime) for a full interactive demo.
## Limitations

- Trained exclusively on anime-style images; not suitable for photorealistic generation
- Fixed output resolution of 512×512
- Single-subject prompts work best; complex multi-character scenes may be inconsistent