---
license: other
tags:
- text-to-image
- diffusion
- pixeldit
- nvidia
- pixel-space
- lora
base_model: nvidia/PixelDiT-1300M-1024px
---

# PixelDiT 1.3B: Diffusers-Compatible Pipeline

> **Two RTX 3060s. Infinite Lore. Zero Fear.**

Unofficial Hugging Face Diffusers-compatible conversion of NVIDIA's [PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) with dual text encoder support (Gemma-2-2B + Qwen3-2B), LoRA training, and ComfyUI integration.

All credit for the model architecture and weights goes to NVIDIA Research. This repo provides the pipeline wrapper, Qwen encoder integration, LoRA tooling, and scripts.

> **I do not own this model.** Original weights, architecture, and training are the work of NVIDIA Research. For non-commercial use only (NSCLv1).
---
## What is PixelDiT?
PixelDiT is a 1.3B-parameter **pixel-space** diffusion transformer: it has no VAE and generates images directly in pixel space. It runs on **4 GB of VRAM**.
- **Architecture**: MMDiT patch blocks + pixel pathway (PiT blocks)
- **Text encoders**: Gemma-2-2B (photorealistic) or Qwen3-2B (creative/fantasy)
- **Native resolution**: 1024×1024 (non-square supported)
- **Samplers**: Euler (default), Heun, LCM
- **Minimum steps**: 45–50; fewer than 45 produces garbage output
- **LoRA**: full PEFT-compatible LoRA training + inference
---
## Install
```bash
python3 -m venv .venv && source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install "diffusers>=0.31.0" "transformers>=4.40.0,<5.0.0" accelerate safetensors pillow peft
git clone https://github.com/madtunebk/pixeldit-diffusers
cd pixeldit-diffusers
python scripts/setup_diffusers_pixeldit.py
```
---
## Quick Start
```bash
# Gemma encoder (photorealistic, default)
python generate.py --prompt "a viking warrior on a cliff at sunset, cinematic"
# Portrait mode
python generate.py --height 1280 --width 768 --steps 60 --cfg 8.5 --prompt "your prompt"
# Qwen encoder (creative/fantasy)
python generate.py --encoder qwen --proj qwen_proj.pt --prompt "A giant hamster emperor in a battle fortress"
# With LoRA
python generate.py --lora lora_yarn_out/best --prompt "a dark anime woman in a field, yarn art style"
# LCM fast mode (8 steps)
python generate.py --scheduler lcm --steps 8 --cfg 2.0 --prompt "your prompt"
```
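If you want to drive these commands from another Python program, a thin wrapper can assemble the argument list for you. The sketch below is only an illustration: `build_generate_cmd` is a hypothetical helper, not part of this repo, and it uses only the flags shown in the examples above. It assumes you run it from the repo root inside the activated venv.

```python
import shlex
import subprocess
import sys


def build_generate_cmd(prompt, *, height=None, width=None, steps=None, cfg=None,
                       encoder=None, proj=None, lora=None, scheduler=None):
    """Assemble an argv list for generate.py from the flags used above."""
    cmd = [sys.executable, "generate.py", "--prompt", prompt]
    optional = {
        "--height": height, "--width": width, "--steps": steps, "--cfg": cfg,
        "--encoder": encoder, "--proj": proj, "--lora": lora,
        "--scheduler": scheduler,
    }
    for flag, value in optional.items():
        if value is not None:  # omit unset flags so generate.py uses its defaults
            cmd += [flag, str(value)]
    return cmd


# Same settings as the "Portrait mode" example
portrait = build_generate_cmd("your prompt", height=1280, width=768, steps=60, cfg=8.5)
print(shlex.join(portrait))

# Uncomment to actually launch the generation:
# subprocess.run(portrait, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with prompts that contain commas, quotes, or spaces.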
---
## Python API
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Available after running scripts/setup_diffusers_pixeldit.py
from diffusers.pipelines.pixeldit import PixelDiTPipeline

# Gemma-2-2B tokenizer and text encoder
encoder_id = "Efficient-Large-Model/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(encoder_id)
tokenizer.padding_side = "right"
text_encoder = (
    AutoModelForCausalLM.from_pretrained(encoder_id, torch_dtype=torch.bfloat16)
    .get_decoder()  # keep only the decoder stack, without the LM head
    .eval()
)

pipe = PixelDiTPipeline.from_pretrained(
    "madtune/pixeldit-diffusers",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep submodules on the GPU only while they run

image = pipe(
    "a viking warrior on a cliff overlooking the stormy sea at sunset",
    negative_prompt="blurry, low quality, deformed, watermark",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("out.jpg")
```
---
## LoRA
### Train a style LoRA
```bash
# 1. Download images (Pexels API key required)
python scripts/download_unsplash.py --query "yarn wool textile" --n 150 --out /data/lora_yarn
# 2. Precompute embeddings
python scripts/precompute_lora_data.py --images /data/lora_yarn --out /data/lora_yarn_cache --trigger "yarn art style" --recaption
# 3. Train
python scripts/train_lora.py --data /data/lora_yarn_cache --out lora_yarn_out/ --epochs 50 --batch 2
```
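Between steps 1 and 2 it can be worth checking that the download actually produced the number of images you asked for (`--n 150` above). The following is a minimal sketch using only the standard library. `count_images` and `check_dataset` are hypothetical helpers, not scripts from this repo, and the extension list is an assumption about which formats the precompute script accepts.

```python
from pathlib import Path

# Assumption: these are the image formats the repo's scripts can read.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def count_images(folder):
    """Return the number of image files directly inside `folder`."""
    return sum(
        1 for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )


def check_dataset(folder, expected):
    """Warn if the folder holds fewer images than requested; return the count."""
    found = count_images(folder)
    if found < expected:
        print(f"warning: {found} images in {folder}, expected {expected}")
    return found


if __name__ == "__main__":
    data_dir = Path("/data/lora_yarn")  # same path as in step 1
    if data_dir.is_dir():
        check_dataset(data_dir, expected=150)
```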
### Load LoRA in pipeline
```python
pipe.load_lora_weights("lora_yarn_out/best")
pipe.set_adapters(["default"], adapter_weights=[1.0])
# merge multiple LoRAs
pipe.load_lora_weights("lora_style/best", adapter_name="style")
pipe.load_lora_weights("lora_char/best", adapter_name="char")
pipe.set_adapters(["style", "char"], adapter_weights=[1.0, 0.7])
# bake into weights
pipe.fuse_lora()
```
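To make the adapter weights and the final fuse step more concrete, here is a toy sketch of the arithmetic under the standard LoRA formulation, where each adapter contributes a low-rank update `B @ A` and fusing adds the weighted sum of those updates to the base weight. This is a conceptual illustration in plain Python with made-up 2×2 numbers. It is not the code path used by this repo or by PEFT, which operate on tensors layer by layer.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(base, adapters):
    """Return base + sum(weight * scale * B @ A) over all adapters.

    Each adapter is a tuple (B, A, scale, weight): B is out x r, A is r x in,
    scale plays the role of alpha / r, and weight is the adapter weight.
    """
    fused = [row[:] for row in base]  # copy so the base weights stay untouched
    for B, A, scale, weight in adapters:
        delta = matmul(B, A)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                fused[i][j] += weight * scale * value
    return fused


base = [[1.0, 0.0], [0.0, 1.0]]                   # toy 2x2 layer weight
style = ([[1.0], [0.0]], [[0.0, 2.0]], 1.0, 1.0)  # rank-1 adapter, weight 1.0
char = ([[0.0], [1.0]], [[4.0, 0.0]], 1.0, 0.5)   # rank-1 adapter, weight 0.5

fused = fuse(base, [style, char])
print(fused)  # [[1.0, 2.0], [2.0, 1.0]]
```

Once the updates are baked into the base weights this way, inference no longer needs the separate adapter matrices, which is the practical reason to fuse.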
---
## Qwen Encoder
> **Coming soon.** Qwen3-2B integration (creative/fantasy prompts) is implemented in the pipeline, but the projection training scripts are not yet released. Watch this repo for updates.
---
## ComfyUI
```bash
ln -s /path/to/pixeldit-diffusers/comfyui_pixeldit /path/to/ComfyUI/custom_nodes/comfyui_pixeldit
```
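The same link can be created from Python, which is handy in a setup script. This is a sketch under assumptions: `link_custom_node` is a hypothetical helper, not something the repo ships, and on Windows creating symlinks may require elevated privileges or Developer Mode.

```python
import os
from pathlib import Path


def link_custom_node(repo_dir, comfyui_dir, name="comfyui_pixeldit"):
    """Symlink <repo_dir>/<name> into <comfyui_dir>/custom_nodes/<name>."""
    source = Path(repo_dir).resolve() / name
    target = Path(comfyui_dir).resolve() / "custom_nodes" / name
    if not source.is_dir():
        raise FileNotFoundError(f"node folder not found: {source}")
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        return target  # leave an existing install untouched
    os.symlink(source, target, target_is_directory=True)
    return target


# Example call (paths are placeholders, as in the shell command above):
# link_custom_node("/path/to/pixeldit-diffusers", "/path/to/ComfyUI")
```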
Three nodes appear under the **PixelDiT** category:
- **PixelDiT Text Encoder**: loads Gemma or any compatible encoder
- **PixelDiT Model Loader**: loads the transformer from HF
- **PixelDiT Sampler**: prompt-to-image sampling with all parameters exposed
---
## Scripts
| Script | Purpose |
|---|---|
| `generate.py` | Main generation script |
| `scripts/upscale_images.py` | RealESRGAN 4× upscale before LoRA precompute |
| `scripts/precompute_lora_data.py` | Precompute image+caption pairs for LoRA training |
| `scripts/train_lora.py` | LoRA fine-tuning |
| `scripts/download_unsplash.py` | Download images from Pexels by search query |
| `scripts/setup_diffusers_pixeldit.py` | Install pipeline into active venv's diffusers |

See `howto_lora.md` for the full LoRA training walkthrough.

---
## Credits
- **Original model & all credit**: [NVIDIA Research](https://huggingface.co/nvidia/PixelDiT-1300M-1024px)
- **Paper**: *PixelDiT: Pixel-Space Diffusion Transformers for Text-to-Image Generation* (NVIDIA)
- **This repo**: unofficial diffusers conversion, Qwen integration, LoRA tooling only