	---
	license: mit
	datasets:
	- none-yet/anime-captions
	base_model:
	- CompVis/stable-diffusion-v1-4
	pipeline_tag: text-to-image
	tags:
	- diffusers
	- stable-diffusion
	- text-to-image
	---

	# Anime-Diffusion UNet

	A UNet2DConditionModel fine-tuned for anime-style image generation, based on Stable Diffusion v1.4.

	## Model Details

	- Architecture: UNet2DConditionModel from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)
	- EMA Decay: 0.9995
	- Output Resolution: 512×512
	- Prediction Type: epsilon
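To make the EMA decay figure concrete, here is a minimal sketch of an exponential moving average over parameters, written in plain Python on a dictionary of floats. The helper names `ema_update` and `ema_horizon` are hypothetical, and whether training applied any decay warmup or update interval is not stated in this card, so treat this as an illustration of the formula only.

```python
EMA_DECAY = 0.9995  # value listed under Model Details


def ema_update(ema_params, model_params, decay=EMA_DECAY):
    """Blend the current model parameters into the running EMA, in place."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


def ema_horizon(decay=EMA_DECAY):
    """Rough number of recent steps the average spans: 1 / (1 - decay)."""
    return 1.0 / (1.0 - decay)


if __name__ == "__main__":
    ema = {"w": 0.0}
    for _ in range(2000):
        ema_update(ema, {"w": 1.0})
    print(ema["w"], ema_horizon())
```

With a decay of 0.9995 the horizon is about 2,000 steps, so after 2,000 updates toward a constant target the average has covered roughly 63% of the distance.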

	### Companion Models (required for inference)

	| Component | Model ID |
	|-----------|----------|
	| VAE | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) |
	| Text Encoder | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |
	| Tokenizer | [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) |

	## Training Details

	- Dataset: [none-yet/anime-captions](https://huggingface.co/datasets/none-yet/anime-captions) (~337k image-caption pairs)
	- Steps: 10,000
	- Batch Size: 128 (32 per GPU × 4 GPUs)
	- Learning Rate: 1e-4 with cosine schedule (500 warmup steps)
	- Optimizer: AdamW (weight decay 0.01)
	- Mixed Precision: fp16
	- Noise Schedule: DDPM, 1000 timesteps with a linear beta schedule
	- Gradient Clipping: 1.0
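The learning-rate entry above can be sketched as a function of the step count. This assumes a linear warmup from zero followed by a half-cosine decay to zero at the final step, which is how common helpers such as `get_cosine_schedule_with_warmup` in `diffusers.optimization` behave. The card does not say which implementation was used, and `lr_at` is a hypothetical name.

```python
import math

BASE_LR = 1e-4        # peak learning rate from Training Details
WARMUP_STEPS = 500    # warmup steps from Training Details
TOTAL_STEPS = 10_000  # total training steps from Training Details


def lr_at(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at `step`: linear warmup, then cosine decay to zero (assumed)."""
    if step < warmup:
        return base_lr * step / warmup
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 5_250, 10_000):
        print(step, lr_at(step))
```

Under these assumptions the rate peaks at 1e-4 on step 500, passes through half the peak at step 5,250 (the midpoint of the decay phase), and reaches zero at step 10,000.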

	## Usage

	```python
	import torch
	from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from transformers import CLIPTextModel, CLIPTokenizer

	# Load models
	tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
	text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
	vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
	unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

	# Load fine-tuned EMA weights
	weights_path = hf_hub_download(repo_id="dixisouls/anime-diffusion", filename="model.safetensors")
	unet.load_state_dict(load_file(weights_path))

	# Use DDIMScheduler for inference
	scheduler = DDIMScheduler(
	    num_train_timesteps=1000,
	    beta_schedule="linear",
	    clip_sample=False,
	    prediction_type="epsilon",
	)
	```
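For intuition about what the scheduler settings mean, the following is a scalar sketch of a linear beta schedule and a deterministic DDIM update (eta = 0) with epsilon prediction. It is not the `diffusers` implementation. The values `beta_start=1e-4` and `beta_end=0.02` are the `DDIMScheduler` defaults, which the snippet above does not override; that training used the same values is an assumption. The function names are hypothetical.

```python
import math

NUM_TRAIN_TIMESTEPS = 1000
BETA_START, BETA_END = 1e-4, 0.02  # diffusers defaults; assumed to match training


def alphas_cumprod(n=NUM_TRAIN_TIMESTEPS, beta_start=BETA_START, beta_end=BETA_END):
    """Cumulative product of (1 - beta_t) for linearly spaced betas."""
    out, prod = [], 1.0
    for i in range(n):
        beta = beta_start + (beta_end - beta_start) * i / (n - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, eps, a_t):
    """Forward process: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    return math.sqrt(a_t) * x0 + math.sqrt(1.0 - a_t) * eps


def ddim_step(x_t, eps_pred, a_t, a_prev):
    """One deterministic DDIM step (eta = 0) from an epsilon prediction."""
    # Recover the clean-sample estimate implied by the predicted noise.
    x0_pred = (x_t - math.sqrt(1.0 - a_t) * eps_pred) / math.sqrt(a_t)
    # Re-noise that estimate to the earlier, less noisy timestep.
    return math.sqrt(a_prev) * x0_pred + math.sqrt(1.0 - a_prev) * eps_pred


if __name__ == "__main__":
    acp = alphas_cumprod()
    x0, eps = 0.3, -1.2
    x_t = add_noise(x0, eps, acp[999])
    print(ddim_step(x_t, eps, acp[999], acp[979]))
```

If the predicted noise equals the true noise, a step lands exactly on the forward-process sample at the earlier timestep. That is why an accurate epsilon-predicting UNet lets DDIM skip most of the 1000 training timesteps. The timestep pair (999, 979) is only an illustration.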

	See the companion [Hugging Face Space](https://huggingface.co/spaces/dixisouls/stable-anime) for a full interactive demo.

	## Limitations

	- Trained exclusively on anime-style images; not suitable for photorealistic generation
	- Fixed output resolution of 512×512
	- Single-subject prompts work best; complex multi-character scenes may be inconsistent