Diffusion Transformer
A flow matching-based diffusion transformer for anime image generation.
This project is for research purposes only.
Links
- GitHub: https://github.com/FREEANIMA/diffusion_model_sampling
- Hugging Face: https://huggingface.co/honghong3/diffusion-transformer
License
This project is licensed under CC BY-NC 4.0.
For research and non-commercial use only.
Training Environment
- GPU: NVIDIA A100 40GB (Google Colab)
- Dataset: ~4.8M anime images
- Processed: ~1.8M images (epoch 0, ongoing)
- Throughput: ~1.3 it/s
- Samples below are from intermediate checkpoints; quality should improve as training continues.
Training & Samples
# Conditional sampling settings
prompt = "1girl, red hair, school uniform, happy, red eyes, open mouth, detailed face"
steps = 100
cfg_scale = 2.0
seed = 1234
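How `steps` and `cfg_scale` enter sampling can be illustrated with a minimal sketch. It assumes a straight-line path x_t = (1 - t)·x0 + t·noise, integrated from t = 1 (noise) to t = 0 (data) with plain Euler steps; the actual sampler in `app/sampling.py` may use a different convention or solver. `velocity_fn`, `euler_cfg_sample`, and `toy_velocity` are hypothetical names, and plain Python lists stand in for latent tensors.

```python
import random


def cfg_combine(v_cond, v_uncond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    velocity toward (and past) the conditional one."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_cfg_sample(velocity_fn, x, steps, cfg_scale):
    """Integrate dx/dt = v from t=1 down to t=0 with Euler steps.

    velocity_fn(x, t, conditional) is a hypothetical stand-in for the model.
    """
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v_cond = velocity_fn(x, t, True)     # prompt-conditioned prediction
        v_uncond = velocity_fn(x, t, False)  # empty-prompt prediction
        v = cfg_combine(v_cond, v_uncond, cfg_scale)
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # step toward t=0
    return x


def toy_velocity(x, t, conditional):
    """Exact velocity for a toy problem whose 'data' is a single point
    (1.0 when conditioned, 0.0 otherwise). Assumption for illustration."""
    target = 1.0 if conditional else 0.0
    return [(xi - target) / t for xi in x]


rng = random.Random(1234)  # same seed value as in the settings above
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
sample = euler_cfg_sample(toy_velocity, noise, steps=100, cfg_scale=2.0)
```

In this toy setup the result lands near 2.0 rather than 1.0, because a guidance scale of 2 extrapolates beyond the conditional target. That is the same mechanism by which larger `cfg_scale` values strengthen prompt adherence in a real model.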
Model Architecture
- Backbone: Diffusion Transformer (DiT) with adaLN modulation
- Parameters: ~550M
- Framework: Flow Matching (velocity prediction)
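As a rough illustration of velocity prediction, the sketch below builds one flow-matching training pair. It assumes the common linear interpolation x_t = (1 - t)·x0 + t·noise with target velocity noise - x0; the source does not state which convention the model uses. `make_training_pair` and `mse` are hypothetical helpers that operate on plain lists instead of tensors.

```python
import random


def make_training_pair(x0, t, rng):
    """Return (x_t, v_target, noise) for a clean sample x0 at time t in [0, 1]."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    # Linear path between data (t=0) and noise (t=1): an assumed convention.
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]
    # The velocity is the time derivative of x_t along that path.
    v_target = [n - a for a, n in zip(x0, noise)]
    return x_t, v_target, noise


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]  # stand-in for a VAE latent
x_t, v_target, noise = make_training_pair(x0, t=0.3, rng=rng)
loss = mse([0.0, 0.0, 0.0], v_target)  # loss of a dummy all-zero prediction
```

Under this convention a perfect velocity prediction recovers the clean sample directly, since x0 = x_t - t·v.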
Components
| Component | Model |
|---|---|
| VAE | stabilityai/sd-vae-ft-mse |
| Text Encoder | openai/clip-vit-large-patch14 |
| Tokenizer | openai/clip-vit-large-patch14 |
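A minimal sketch of loading these components from the Hugging Face Hub, assuming the standard `diffusers` and `transformers` classes are used; the repository's own `app/clip.py` and `app/sd_vae.py` wrappers may load them differently. `COMPONENTS`, `latent_shape`, and `load_components` are hypothetical names. The 8x spatial downsampling and 4 latent channels are properties of the Stable Diffusion VAE family.

```python
COMPONENTS = {
    "vae": "stabilityai/sd-vae-ft-mse",
    "text_encoder": "openai/clip-vit-large-patch14",
    "tokenizer": "openai/clip-vit-large-patch14",
}

VAE_DOWNSAMPLE = 8       # spatial reduction factor of the SD VAE
VAE_LATENT_CHANNELS = 4  # latent channels of the SD VAE


def latent_shape(height, width):
    """Shape (C, H, W) of the latent that the transformer operates on."""
    return (VAE_LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE)


def load_components(device="cpu"):
    """Download and return (vae, tokenizer, text_encoder) in eval mode."""
    # Imported lazily so latent_shape works without the heavy dependencies.
    from diffusers import AutoencoderKL
    from transformers import CLIPTextModel, CLIPTokenizer

    vae = AutoencoderKL.from_pretrained(COMPONENTS["vae"]).to(device).eval()
    tokenizer = CLIPTokenizer.from_pretrained(COMPONENTS["tokenizer"])
    text_encoder = (
        CLIPTextModel.from_pretrained(COMPONENTS["text_encoder"]).to(device).eval()
    )
    return vae, tokenizer, text_encoder
```

For a 512 x 512 image, `latent_shape(512, 512)` gives `(4, 64, 64)`.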
Sampler Details
- Resolution: 512 × 512 (single bucket)
- Noise Schedule: Log-SNR uniform sampling with resolution-dependent shift
- CFG: Classifier-free guidance
- Prompts are tag-based (comma-separated danbooru-style tags)
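One way to read "log-SNR uniform sampling with resolution-dependent shift" is sketched below. It assumes the linear path x_t = (1 - t)·x0 + t·noise, for which SNR(t) = ((1 - t)/t)², and it assumes the shift is the familiar timestep shift t' = s·t / (1 + (s - 1)·t), which is equivalent to subtracting 2·ln(s) from the log-SNR. The log-SNR range and the shift value are placeholders, not values taken from the project.

```python
import math
import random


def t_from_logsnr(logsnr):
    """Invert logsnr = 2*ln((1 - t)/t), giving t = 1 / (1 + exp(logsnr/2))."""
    return 1.0 / (1.0 + math.exp(logsnr / 2.0))


def shift_timestep(t, shift):
    """Timestep shift; shift > 1 moves t toward the noisy end (t = 1)."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def sample_timestep(rng, logsnr_min=-10.0, logsnr_max=10.0, shift=1.0):
    """Draw log-SNR uniformly, apply the shift in log-SNR space, map to t.

    The range and the shift are assumed placeholder values.
    """
    logsnr = rng.uniform(logsnr_min, logsnr_max)
    return t_from_logsnr(logsnr - 2.0 * math.log(shift))


rng = random.Random(1234)
timesteps = [sample_timestep(rng, shift=3.0) for _ in range(5)]
```

Shifting in log-SNR space and shifting the timestep directly give the same result, so either form can be used; a larger shift at higher resolutions spends more training on noisier timesteps.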
Requirements
pip install torch transformers diffusers accelerate torchvision tqdm
Usage
python main.py
Project Structure
.
│   main.py
│   output.png
│   README.md
│   requirements.txt
│
├───app
│       clip.py
│       config.json
│       config.py
│       model.py
│       sampling.py
│       sd_vae.py
│       __init__.py
│
├───assets
│       100k.png
│       150k.png
│       1k.png
│       50k.png
│
└───weights
        image.pth




