---
license: other
tags:
  - text-to-image
  - diffusion
  - pixeldit
  - nvidia
  - pixel-space
  - lora
base_model: nvidia/PixelDiT-1300M-1024px
---

![FourNeuron-PixelDiT Banner](assets/banner.png)

# PixelDiT 1.3B: Diffusers-Compatible Pipeline

> **Two RTX 3060s. Infinite Lore. Zero Fear.**

Unofficial HuggingFace diffusers-compatible conversion of NVIDIA's [PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) with dual text encoder support (Gemma-2-2B + Qwen3-2B), LoRA training, and ComfyUI integration.

All credit for the model architecture and weights goes to NVIDIA Research. This repo provides the pipeline wrapper, Qwen encoder integration, LoRA tooling, and scripts.

> **I do not own this model.** Original weights, architecture, and training are the work of NVIDIA Research. For non-commercial use only (NSCLv1).

---

## What is PixelDiT?

PixelDiT is a 1.3B-parameter **pixel-space** diffusion transformer: it uses no VAE and generates images directly in pixel space. It runs on **4GB VRAM**.
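
As a rough sanity check on the memory figure, the size of the transformer weights alone can be estimated from the parameter count. This is back-of-the-envelope arithmetic, assuming exactly 1.3 billion parameters and ignoring activations, the text encoder, and framework overhead; the function name is illustrative.

```python
# Rough weight-memory estimate. Assumes 1.3e9 parameters and ignores
# activations, the text encoder, and CUDA/framework overhead.
PARAMS = 1_300_000_000
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_gib(params: int, dtype: str) -> float:
    """Return the size of the raw weights in GiB for the given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gib(PARAMS, dtype):.2f} GiB")
```

In bfloat16 the weights come to roughly 2.4 GiB, while float32 weights alone would exceed 4 GiB, so the dtype choice matters on small cards.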

- **Architecture**: MMDiT patch blocks + pixel pathway (PiT blocks)
- **Text encoders**: Gemma-2-2B (photorealistic) or Qwen3-2B (creative/fantasy)
- **Native resolution**: 1024×1024 (non-square supported)
- **Samplers**: Euler (default), Heun, LCM
- **Minimum steps**: 45–50 (below 45 produces garbage output)
- **LoRA**: full PEFT-compatible LoRA training + inference
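
To illustrate the difference between the Euler and Heun samplers listed above, here is a toy sketch on a one-dimensional ODE. It is not the repo's scheduler code: it uses the generic textbook forms of both integrators, and `velocity` is a hypothetical stand-in for the model's prediction.

```python
import math


def velocity(x: float, t: float) -> float:
    """Hypothetical stand-in for a model prediction: dx/dt = -x."""
    return -x


def euler(x: float, steps: int) -> float:
    """First-order Euler: one velocity evaluation per step."""
    h = 1.0 / steps
    for i in range(steps):
        x = x + h * velocity(x, i * h)
    return x


def heun(x: float, steps: int) -> float:
    """Second-order Heun: average the slope at the start and at the Euler-predicted end."""
    h = 1.0 / steps
    for i in range(steps):
        t = i * h
        k1 = velocity(x, t)
        k2 = velocity(x + h * k1, t + h)
        x = x + 0.5 * h * (k1 + k2)
    return x


exact = math.exp(-1.0)  # closed-form solution at t = 1 for x(0) = 1
for steps in (10, 50):
    print(steps, abs(euler(1.0, steps) - exact), abs(heun(1.0, steps) - exact))
```

Heun evaluates the velocity twice per step, so at equal step counts it makes about twice as many model calls in exchange for a smaller integration error.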

---

## Install

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install "diffusers>=0.31.0" "transformers>=4.40.0,<5.0.0" accelerate safetensors pillow peft
git clone https://github.com/madtunebk/pixeldit-diffusers
cd pixeldit-diffusers
python scripts/setup_diffusers_pixeldit.py
```

---

## Quick Start

```bash
# Gemma encoder (photorealistic, default)
python generate.py --prompt "a viking warrior on a cliff at sunset, cinematic"

# Portrait mode
python generate.py --height 1280 --width 768 --steps 60 --cfg 8.5 --prompt "your prompt"

# Qwen encoder (creative/fantasy)
python generate.py --encoder qwen --proj qwen_proj.pt --prompt "A giant hamster emperor in a battle fortress"

# With LoRA
python generate.py --lora lora_yarn_out/best --prompt "a dark anime woman in a field, yarn art style"

# LCM fast mode (8 steps)
python generate.py --scheduler lcm --steps 8 --cfg 2.0 --prompt "your prompt"
```
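
For non-square outputs, one simple way to choose `--height` and `--width` is to keep the pixel count near the native 1024×1024 budget. The helper below is a sketch and not part of the repo: the function name and the rounding to multiples of 64 are assumptions (the portrait example above uses 1280 and 768, both multiples of 64), so verify which sizes the model actually accepts.

```python
import math


def pick_size(aspect: float, budget: int = 1024 * 1024, multiple: int = 64) -> tuple[int, int]:
    """Return (height, width) with height/width near `aspect` and area near `budget`.

    Rounding to `multiple` is an assumption, not a documented requirement.
    """
    def snap(value: float) -> int:
        # Round to the nearest multiple, never going below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    height = math.sqrt(budget * aspect)
    width = height / aspect
    return snap(height), snap(width)


print(pick_size(1.0))    # square target
print(pick_size(5 / 3))  # portrait target with a 5:3 height-to-width ratio
```

Because both sides are snapped independently, the resulting ratio only approximates the requested one.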

---

## Python API

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from diffusers.pipelines.pixeldit import PixelDiTPipeline

tokenizer = AutoTokenizer.from_pretrained("Efficient-Large-Model/gemma-2-2b-it")
tokenizer.padding_side = "right"
text_encoder = (
    # torch_dtype is accepted across the transformers range pinned in Install (>=4.40.0,<5.0.0)
    AutoModelForCausalLM.from_pretrained("Efficient-Large-Model/gemma-2-2b-it", torch_dtype=torch.bfloat16)
    .get_decoder().eval()
)

pipe = PixelDiTPipeline.from_pretrained(
    "madtune/pixeldit-diffusers",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe(
    "a viking warrior on a cliff overlooking the stormy sea at sunset",
    negative_prompt="blurry, low quality, deformed, watermark",
    height=1024, width=1024,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("out.jpg")
```
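
The `guidance_scale` and `negative_prompt` arguments correspond to classifier-free guidance. The sketch below shows the standard formulation on plain lists. It assumes the pipeline combines predictions in the usual way, which this README does not confirm, and the function name is illustrative.

```python
def apply_cfg(uncond: list[float], cond: list[float], scale: float) -> list[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions. In a real pipeline these would be model outputs for the
# negative (or empty) prompt and for the actual prompt.
uncond = [0.0, 1.0]
cond = [1.0, 1.0]

print(apply_cfg(uncond, cond, 1.0))  # scale 1.0 reproduces the conditional prediction
print(apply_cfg(uncond, cond, 7.5))  # larger scales move further from the unconditional one
```

Where the two predictions agree, the scale has no effect; where they differ, the gap is amplified by `scale`.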

---

## LoRA

### Train a style LoRA

```bash
# 1. Download images (Pexels API key required)
python scripts/download_unsplash.py --query "yarn wool textile" --n 150 --out /data/lora_yarn

# 2. Precompute embeddings
python scripts/precompute_lora_data.py --images /data/lora_yarn --out /data/lora_yarn_cache --trigger "yarn art style" --recaption

# 3. Train
python scripts/train_lora.py --data /data/lora_yarn_cache --out lora_yarn_out/ --epochs 50 --batch 2
```
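
To get a feel for the length of a run like this, the number of optimizer steps follows from the dataset size, batch size, and epoch count. This is a sketch that assumes one optimizer step per batch (no gradient accumulation), that a final partial batch is kept, and that all downloaded images end up in the cache; the training script may behave differently.

```python
import math


def total_steps(num_images: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming one step per batch and no dropped partial batch."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch * epochs


# Values from the commands above: --n 150, --batch 2, --epochs 50
print(total_steps(150, 2, 50))  # 75 steps per epoch * 50 epochs = 3750
```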

### Load LoRA in pipeline

```python
pipe.load_lora_weights("lora_yarn_out/best")
pipe.set_adapters(["default"], adapter_weights=[1.0])

# merge multiple LoRAs
pipe.load_lora_weights("lora_style/best", adapter_name="style")
pipe.load_lora_weights("lora_char/best",  adapter_name="char")
pipe.set_adapters(["style", "char"], adapter_weights=[1.0, 0.7])

# bake into weights
pipe.fuse_lora()
```
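
Conceptually, each adapter adds a low-rank update to a weight matrix, the adapter weights scale each update, and fusing bakes the sum into the base weights. The following is a minimal pure-Python sketch of that idea, not the PEFT or diffusers implementation; the toy matrices, the `alpha / rank` scaling, and all function names are illustrative assumptions.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(B: Matrix, A: Matrix, alpha: float, rank: int) -> Matrix:
    """Low-rank update (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(B, A)]


def fuse(W: Matrix, deltas: list[Matrix], weights: list[float]) -> Matrix:
    """Return a new matrix: W plus the weighted sum of adapter updates."""
    fused = [row[:] for row in W]
    for delta, weight in zip(deltas, weights):
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                fused[i][j] += weight * v
    return fused


W = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 base weight
style = lora_delta([[1.0], [0.0]], [[0.0, 2.0]], alpha=1.0, rank=1)
char = lora_delta([[0.0], [1.0]], [[1.0, 0.0]], alpha=1.0, rank=1)

print(fuse(W, [style, char], [1.0, 0.5]))
```

In this sketch the base matrix is left untouched and a fused copy is returned, which makes it easy to compare different adapter weightings before committing to one.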

---

## Qwen Encoder

> **Coming soon.** Qwen3-2B integration (creative/fantasy prompts) is implemented in the pipeline but projection training scripts are not yet released. Watch this repo for updates.

---

## ComfyUI

```bash
ln -s /path/to/pixeldit-diffusers/comfyui_pixeldit /path/to/ComfyUI/custom_nodes/comfyui_pixeldit
```

Three nodes are available under the **PixelDiT** category:
- **PixelDiT Text Encoder**: loads Gemma or any compatible encoder
- **PixelDiT Model Loader**: loads the transformer from HF
- **PixelDiT Sampler**: prompt → image, all params exposed

---

## Scripts

| Script | Purpose |
|---|---|
| `generate.py` | Main generation script |
| `scripts/upscale_images.py` | RealESRGAN 4× upscale before LoRA precompute |
| `scripts/precompute_lora_data.py` | Precompute image+caption pairs for LoRA training |
| `scripts/train_lora.py` | LoRA fine-tuning |
| `scripts/download_unsplash.py` | Download images from Pexels by search query |
| `scripts/setup_diffusers_pixeldit.py` | Install pipeline into active venv's diffusers |

See `howto_lora.md` for the full LoRA training walkthrough.

---

## Credits

- **Original model & all credit**: [NVIDIA Research](https://huggingface.co/nvidia/PixelDiT-1300M-1024px)
- **Paper**: *PixelDiT: Pixel-Space Diffusion Transformers for Text-to-Image Generation*, NVIDIA
- **This repo**: unofficial diffusers conversion, Qwen integration, LoRA tooling only