Diffusion Transformer

A flow-matching-based diffusion transformer for anime image generation.
This project is for research purposes only.

Links

License

This project is licensed under CC BY-NC 4.0.
For research and non-commercial use only.

Training Environment

  • GPU: NVIDIA A100 40GB (Google Colab)
  • Dataset: ~4.8M anime images
  • Processed: ~1.8M images (epoch 0, ongoing)
  • Throughput: ~1.3 it/s
  • Samples below are from intermediate checkpoints; quality should improve as training continues.

Training & Samples

Images seen   12k   600k   1.2M   1.8M
Checkpoint    1k    50k    100k   150k

(Sample images: assets/1k.png, assets/50k.png, assets/100k.png, assets/150k.png)

# sampler conditional
prompt    = "1girl, red hair, school uniform, happy, red eyes, open mouth, detailed face"
steps     = 100
cfg_scale = 2.0
seed      = 1234
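
How settings like `steps`, `cfg_scale`, and `seed` typically interact in a flow-matching sampler can be sketched as below. This is a minimal illustration in plain Python, not the code in `app/sampling.py`: the Euler integrator, the time direction (t = 1 is noise, t = 0 is data), and the `velocity_fn` and `toy_velocity` names are all assumptions made for the example.

```python
import random


def cfg_velocity(v_cond, v_uncond, cfg_scale):
    """Classifier-free guidance: v_uncond + cfg_scale * (v_cond - v_uncond)."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity_fn, x, steps, cfg_scale):
    """Integrate dx/dt = v from t=1 (noise) down to t=0 with fixed Euler steps.

    velocity_fn(x, t, conditional) is a hypothetical stand-in for the model:
    it returns the predicted velocity with or without the prompt.
    """
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = cfg_velocity(velocity_fn(x, t, True), velocity_fn(x, t, False), cfg_scale)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, conditional):
    """Toy model (assumption): exact straight-path velocity toward a fixed target.

    The "conditional" branch pulls toward 1.0, the "unconditional" one toward 0.0.
    """
    target = 1.0 if conditional else 0.0
    return [(xi - target) / t for xi in x]


rng = random.Random(1234)                    # seed fixes the starting noise
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
sample = euler_sample(toy_velocity, noise, steps=100, cfg_scale=2.0)
```

With `cfg_scale = 2.0` the toy sampler lands beyond the conditional target, because guidance extrapolates away from the unconditional prediction; `cfg_scale = 1.0` would reduce to plain conditional sampling.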

Model Architecture

  • Backbone: Diffusion Transformer (DiT) with adaLN modulation
  • Parameters: ~550M
  • Framework: Flow Matching (velocity prediction)
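
The two ideas named above, adaLN modulation and velocity prediction, can be illustrated with a small sketch. It is written in plain Python on lists so it runs without `torch`; the function names are hypothetical, and the interpolation convention `x_t = (1 - t) * x0 + t * noise` is an assumption, since the exact formulation used in `app/model.py` is not stated here.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_modulate(x, shift, scale):
    """adaLN modulation as in DiT: norm(x) * (1 + scale) + shift.

    In a real block, shift and scale are produced from the conditioning
    embedding (timestep, text) by a small learned projection.
    """
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


def flow_matching_target(x0, noise, t):
    """Noisy input and velocity target under the assumed linear path.

    x_t = (1 - t) * x0 + t * noise, and the target is dx_t/dt = noise - x0.
    """
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]
    velocity = [e - a for a, e in zip(x0, noise)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
```

With zero `shift` and `scale`, `adaln_modulate` reduces to a plain layer norm, which is why adaLN-style blocks are often initialized that way.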

[Figure: architecture diagram]

Components

Component     Model
VAE           stabilityai/sd-vae-ft-mse
Text Encoder  openai/clip-vit-large-patch14
Tokenizer     openai/clip-vit-large-patch14
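
For reference, the pretrained components listed above can be loaded with the standard `diffusers` and `transformers` classes, roughly as follows. This is a sketch, not the contents of `app/sd_vae.py` or `app/clip.py`: the helper names are hypothetical, and conditioning on the per-token `last_hidden_state` is an assumption about how the prompt embedding is consumed.

```python
# Model IDs taken from the Components table.
COMPONENTS = {
    "vae": "stabilityai/sd-vae-ft-mse",
    "text_encoder": "openai/clip-vit-large-patch14",
    "tokenizer": "openai/clip-vit-large-patch14",
}


def load_components(device="cpu"):
    """Download (or load from cache) the frozen VAE, text encoder, and tokenizer."""
    # Imported lazily so this module can be imported without the heavy dependencies.
    from diffusers import AutoencoderKL
    from transformers import CLIPTextModel, CLIPTokenizer

    vae = AutoencoderKL.from_pretrained(COMPONENTS["vae"]).to(device).eval()
    text_encoder = CLIPTextModel.from_pretrained(COMPONENTS["text_encoder"]).to(device).eval()
    tokenizer = CLIPTokenizer.from_pretrained(COMPONENTS["tokenizer"])
    return vae, text_encoder, tokenizer


def encode_prompt(prompt, text_encoder, tokenizer):
    """Tokenize a tag prompt and return per-token CLIP embeddings."""
    import torch

    tokens = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,  # 77 tokens for CLIP
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        output = text_encoder(tokens.input_ids.to(text_encoder.device))
    return output.last_hidden_state
```

Calling `load_components()` downloads the weights on first use, so it needs network access or a populated Hugging Face cache.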

Sampler Details

  • Resolution: 512 × 512 (single bucket)
  • Noise Schedule: Log-SNR uniform sampling with resolution-dependent shift
  • CFG: Classifier-free guidance
  • Prompts are tag-based (comma-separated danbooru-style tags)
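
One common way to realize "log-SNR uniform sampling with a resolution-dependent shift" is sketched below. It is an illustration under stated assumptions rather than this project's exact schedule: it assumes the linear path `x_t = (1 - t) * x0 + t * noise`, a log-SNR range of [-10, 10], and a shift equal to `resolution / base_resolution` with a base of 256. None of these constants or function names come from the repository.

```python
import math
import random


def logsnr_to_t(logsnr):
    """Map log-SNR to time t for x_t = (1 - t) * x0 + t * noise.

    SNR = ((1 - t) / t) ** 2, so logsnr = 2 * log((1 - t) / t)
    and t = 1 / (1 + exp(logsnr / 2)).
    """
    return 1.0 / (1.0 + math.exp(logsnr / 2.0))


def shift_t(t, shift):
    """Push t toward the noisy end; equals lowering log-SNR by 2 * log(shift)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def sample_timestep(rng, resolution, base_resolution=256,
                    logsnr_min=-10.0, logsnr_max=10.0):
    """Draw a training timestep: uniform in log-SNR, then shifted by resolution."""
    logsnr = rng.uniform(logsnr_min, logsnr_max)
    shift = resolution / base_resolution  # assumed rule: larger images get more noise
    return shift_t(logsnr_to_t(logsnr), shift)


rng = random.Random(0)
timesteps = [sample_timestep(rng, resolution=512) for _ in range(5)]
```

The motivation for such a shift is that higher-resolution images keep more signal at a given noise level, so the schedule is moved toward noisier timesteps to compensate.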

Requirements

pip install torch transformers diffusers accelerate torchvision tqdm

Usage

python main.py

Project layout:

C:.
│  main.py
│  output.png
│  README.md
│  requirements.txt
│
├─app
│  │  clip.py
│  │  config.json
│  │  config.py
│  │  model.py
│  │  sampling.py
│  │  sd_vae.py
│  └─ __init__.py
│
├─assets
│      100k.png
│      150k.png
│      1k.png
│      50k.png
│
└─weights
       image.pth