madtune committed
Commit fbc6fce · verified · 1 Parent(s): ecad6d3

Update README.md

Files changed (1)
  1. README.md +71 -36
README.md CHANGED
@@ -6,6 +6,7 @@ tags:
6
  - pixeldit
7
  - nvidia
8
  - pixel-space
 
9
  base_model: nvidia/PixelDiT-1300M-1024px
10
  ---
11
 
@@ -15,11 +16,11 @@ base_model: nvidia/PixelDiT-1300M-1024px
15
 
16
  > **Two RTX 3060s. Infinite Lore. Zero Fear.**
17
 
18
- Unofficial HuggingFace diffusers-compatible conversion of NVIDIA's [PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) with dual text encoder support (Gemma-2-2B + Qwen3-2B) and ComfyUI integration.
19
 
20
- All credit for the model architecture and weights goes to NVIDIA Research. This repo provides the pipeline wrapper, Qwen encoder integration, and tooling.
21
 
22
- > **I do not own this model.** Original weights, architecture, and training are the work of NVIDIA Research.
23
 
24
  ---
25
 
@@ -29,9 +30,10 @@ PixelDiT is a 1.3B parameter **pixel-space** diffusion transformer - no VAE, g
29
 
30
  - **Architecture**: MMDiT patch blocks + pixel pathway (PiT blocks)
31
  - **Text encoders**: Gemma-2-2B (photorealistic) or Qwen3-2B (creative/fantasy)
32
- - **Native resolution**: 1024×1024
33
- - **Sampler**: Flow matching (FlowMatchEulerDiscreteScheduler, shift=4.0)
34
  - **Minimum steps**: 45–50; below 45 produces garbage output
 
35
 
36
  ---
37
 
@@ -40,7 +42,7 @@ PixelDiT is a 1.3B parameter **pixel-space** diffusion transformer - no VAE, g
40
  ```bash
41
  python3 -m venv .venv && source .venv/bin/activate
42
  pip install torch --index-url https://download.pytorch.org/whl/cu121
43
- pip install diffusers transformers accelerate safetensors pillow
44
  git clone https://github.com/madtunebk/pixeldit-diffusers
45
  cd pixeldit-diffusers
46
  python scripts/setup_diffusers_pixeldit.py
@@ -48,9 +50,28 @@ python scripts/setup_diffusers_pixeldit.py
48
 
49
  ---
50
 
51
- ## Usage
52
 
53
- ### Gemma encoder (photorealistic)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
54
 
55
  ```python
56
  import torch
@@ -60,7 +81,7 @@ from diffusers.pipelines.pixeldit import PixelDiTPipeline
60
  tokenizer = AutoTokenizer.from_pretrained("Efficient-Large-Model/gemma-2-2b-it")
61
  tokenizer.padding_side = "right"
62
  text_encoder = (
63
- AutoModelForCausalLM.from_pretrained("Efficient-Large-Model/gemma-2-2b-it", torch_dtype=torch.float32)
64
  .get_decoder().eval()
65
  )
66
 
@@ -82,33 +103,46 @@ image = pipe(
82
  image.save("out.jpg")
83
  ```
84
 
85
- ### Qwen encoder (creative / fantasy / absurd realism)
86
 
87
- ```python
88
- # pip install -r requirements.txt first
89
- python generate.py --encoder qwen --proj qwen_proj.pt --prompt "your epic prompt"
90
- ```
91
 
92
- Qwen excels at complex world-building prompts. The more detail you give it, the better.
93
 
94
- ---
 
 
95
 
96
- ## generate.py: Quick Start
 
97
 
98
- ```bash
99
- # Gemma (default, photorealistic)
100
- python generate.py --prompt "a leopard in the jungle, National Geographic"
 
 
 
 
 
 
101
 
102
- # Qwen (creative, fantasy)
103
- python generate.py --encoder qwen --proj qwen_proj.pt --cfg 7.5 --steps 50 \
104
- --prompt "A giant fluffy hamster emperor inside a colossal mechanical battle fortress"
 
105
 
106
- # Batch mode (runs all PROMPTS list)
107
- python generate.py --encoder qwen --proj qwen_proj.pt
108
  ```
109
 
110
  ---
111
 
 
 
 
 
 
 
112
  ## ComfyUI
113
 
114
  ```bash
@@ -116,23 +150,24 @@ ln -s /path/to/pixeldit-diffusers/comfyui_pixeldit /path/to/ComfyUI/custom_nodes
116
  ```
117
 
118
  Three nodes under **PixelDiT** category:
119
- - **PixelDiT Text Encoder**: load Gemma or swap any compatible encoder
120
  - **PixelDiT Model Loader**: loads transformer from HF
121
  - **PixelDiT Sampler**: prompt → image, all params exposed
122
 
123
  ---
124
 
125
- ## LoRA fine-tuning
126
 
127
- ```python
128
- from peft import get_peft_model, LoraConfig
129
- from diffusers.pipelines.pixeldit import PixelDiTModel
 
 
 
 
 
130
 
131
- model = PixelDiTModel.from_pretrained("madtune/pixeldit-diffusers", subfolder="transformer")
132
- lora_cfg = LoraConfig(target_modules=["qkv_x", "qkv_y", "proj_x", "proj_y"])
133
- model = get_peft_model(model, lora_cfg)
134
- model.print_trainable_parameters()
135
- ```
136
 
137
  ---
138
 
@@ -140,4 +175,4 @@ model.print_trainable_parameters()
140
 
141
  - **Original model & all credit**: [NVIDIA Research](https://huggingface.co/nvidia/PixelDiT-1300M-1024px)
142
  - **Paper**: *PixelDiT: Pixel-Space Diffusion Transformers for Text-to-Image Generation*, NVIDIA
143
- - **This repo**: unofficial diffusers conversion, Qwen integration, and tooling only
 
6
  - pixeldit
7
  - nvidia
8
  - pixel-space
9
+ - lora
10
  base_model: nvidia/PixelDiT-1300M-1024px
11
  ---
12
 
 
16
 
17
  > **Two RTX 3060s. Infinite Lore. Zero Fear.**
18
 
19
+ Unofficial HuggingFace diffusers-compatible conversion of NVIDIA's [PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) with dual text encoder support (Gemma-2-2B + Qwen3-2B), LoRA training, and ComfyUI integration.
20
 
21
+ All credit for the model architecture and weights goes to NVIDIA Research. This repo provides the pipeline wrapper, Qwen encoder integration, LoRA tooling, and scripts.
22
 
23
+ > **I do not own this model.** Original weights, architecture, and training are the work of NVIDIA Research. For non-commercial use only (NSCLv1).
24
 
25
  ---
26
 
 
30
 
31
  - **Architecture**: MMDiT patch blocks + pixel pathway (PiT blocks)
32
  - **Text encoders**: Gemma-2-2B (photorealistic) or Qwen3-2B (creative/fantasy)
33
+ - **Native resolution**: 1024×1024 (non-square supported)
34
+ - **Samplers**: Euler (default), Heun, LCM
35
  - **Minimum steps**: 45–50; below 45 produces garbage output
36
+ - **LoRA**: full PEFT-compatible LoRA training + inference
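
To make the Euler and Heun entries in the sampler list concrete, here is a minimal sketch of both update rules on a shifted flow-matching noise schedule. It is an illustration under stated assumptions, not this repo's scheduler code: `shifted_sigmas`, `toy_velocity`, and `sample` are hypothetical names, the toy velocity field stands in for the transformer, and the `shift=4.0` default is taken from the previous README revision. LCM is not covered.

```python
def shifted_sigmas(num_steps, shift=4.0):
    """Return num_steps + 1 noise levels from 1.0 (pure noise) down to 0.0 (clean)."""
    sigmas = []
    for i in range(num_steps + 1):
        s = 1.0 - i / num_steps  # uniform grid before shifting
        # Time shift: spends more of the step budget at high noise levels.
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas


def toy_velocity(x, sigma, target=2.0):
    """Stand-in for the model: the exact velocity when all data sits at `target`."""
    return (x - target) / sigma


def sample(x, sigmas, velocity, method="euler"):
    """Integrate dx/dsigma = velocity(x, sigma) from sigmas[0] down to sigmas[-1]."""
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        dt = s_next - s  # negative: noise level decreases
        v = velocity(x, s)
        x_euler = x + dt * v  # first-order (Euler) update
        if method == "heun" and s_next > 0.0:
            # Second-order correction: average the slopes at both ends of the step.
            v_next = velocity(x_euler, s_next)
            x = x + dt * 0.5 * (v + v_next)
        else:
            # Plain Euler, also used for Heun's final step where sigma reaches 0.
            x = x_euler
    return x


sigmas = shifted_sigmas(50)
x_euler = sample(-1.3, sigmas, toy_velocity, method="euler")
x_heun = sample(-1.3, sigmas, toy_velocity, method="heun")
```

In this sketch Heun calls the velocity function twice per step (except the last) and Euler once, so for the same step count Heun costs roughly twice as many model evaluations.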
37
 
38
  ---
39
 
 
42
  ```bash
43
  python3 -m venv .venv && source .venv/bin/activate
44
  pip install torch --index-url https://download.pytorch.org/whl/cu121
45
+ pip install "diffusers>=0.31.0" "transformers>=4.40.0,<5.0.0" accelerate safetensors pillow peft
46
  git clone https://github.com/madtunebk/pixeldit-diffusers
47
  cd pixeldit-diffusers
48
  python scripts/setup_diffusers_pixeldit.py
 
50
 
51
  ---
52
 
53
+ ## Quick Start
54
 
55
+ ```bash
56
+ # Gemma encoder (photorealistic, default)
57
+ python generate.py --prompt "a viking warrior on a cliff at sunset, cinematic"
58
+
59
+ # Portrait mode
60
+ python generate.py --height 1280 --width 768 --steps 60 --cfg 8.5 --prompt "your prompt"
61
+
62
+ # Qwen encoder (creative/fantasy)
63
+ python generate.py --encoder qwen --proj qwen_proj.pt --prompt "A giant hamster emperor in a battle fortress"
64
+
65
+ # With LoRA
66
+ python generate.py --lora lora_yarn_out/best --prompt "a dark anime woman in a field, yarn art style"
67
+
68
+ # LCM fast mode (8 steps)
69
+ python generate.py --scheduler lcm --steps 8 --cfg 2.0 --prompt "your prompt"
70
+ ```
71
+
72
+ ---
73
+
74
+ ## Python API
75
 
76
  ```python
77
  import torch
 
81
  tokenizer = AutoTokenizer.from_pretrained("Efficient-Large-Model/gemma-2-2b-it")
82
  tokenizer.padding_side = "right"
83
  text_encoder = (
84
+ AutoModelForCausalLM.from_pretrained("Efficient-Large-Model/gemma-2-2b-it", dtype=torch.bfloat16)
85
  .get_decoder().eval()
86
  )
87
 
 
103
  image.save("out.jpg")
104
  ```
105
 
106
+ ---
107
 
108
+ ## LoRA
 
 
 
109
 
110
+ ### Train a style LoRA
111
 
112
+ ```bash
113
+ # 1. Download images (Pexels API key required)
114
+ python scripts/download_unsplash.py --query "yarn wool textile" --n 150 --out /data/lora_yarn
115
 
116
+ # 2. Precompute embeddings
117
+ python scripts/precompute_lora_data.py --images /data/lora_yarn --out /data/lora_yarn_cache --trigger "yarn art style" --recaption
118
 
119
+ # 3. Train
120
+ python scripts/train_lora.py --data /data/lora_yarn_cache --out lora_yarn_out/ --epochs 50 --batch 2
121
+ ```
122
+
123
+ ### Load LoRA in pipeline
124
+
125
+ ```python
126
+ pipe.load_lora_weights("lora_yarn_out/best")
127
+ pipe.set_adapters(["default"], adapter_weights=[1.0])
128
 
129
+ # merge multiple LoRAs
130
+ pipe.load_lora_weights("lora_style/best", adapter_name="style")
131
+ pipe.load_lora_weights("lora_char/best", adapter_name="char")
132
+ pipe.set_adapters(["style", "char"], adapter_weights=[1.0, 0.7])
133
 
134
+ # bake into weights
135
+ pipe.fuse_lora()
136
  ```
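
What the adapter weights and fusing amount to can be sketched with plain matrices. This is a toy illustration of the standard LoRA update (base weight plus a weighted sum of low-rank products), not the code diffusers or PEFT actually run; `matmul`, `fuse`, and the tiny 2×2 matrices are hypothetical, and the per-adapter `scale` stands in for the usual `alpha / rank` factor.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(base, adapters, weights):
    """Return base + sum over adapters of weight * scale * (up @ down)."""
    fused = [row[:] for row in base]  # copy so the base weight stays untouched
    for name, weight in weights.items():
        down, up, scale = adapters[name]
        delta = matmul(up, down)  # low-rank update with the same shape as base
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                fused[i][j] += weight * scale * value
    return fused


base = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    # name: (down projection, up projection, scale), all rank 1 here
    "style": ([[0.0, 2.0]], [[1.0], [0.0]], 1.0),
    "char": ([[1.0, 0.0]], [[0.0], [1.0]], 1.0),
}
# Mirrors adapter_weights=[1.0, 0.7] for ["style", "char"] in the snippet above.
fused = fuse(base, adapters, {"style": 1.0, "char": 0.7})
```

After fusing, the adapters no longer exist as separate tensors, so in this sketch changing an adapter weight means fusing again from the original base weight.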
137
 
138
  ---
139
 
140
+ ## Qwen Encoder
141
+
142
+ > **Coming soon.** Qwen3-2B integration (creative/fantasy prompts) is implemented in the pipeline but projection training scripts are not yet released. Watch this repo for updates.
143
+
144
+ ---
145
+
146
  ## ComfyUI
147
 
148
  ```bash
 
150
  ```
151
 
152
  Three nodes under **PixelDiT** category:
153
+ - **PixelDiT Text Encoder**: load Gemma or any compatible encoder
154
  - **PixelDiT Model Loader**: loads transformer from HF
155
  - **PixelDiT Sampler**: prompt → image, all params exposed
156
 
157
  ---
158
 
159
+ ## Scripts
160
 
161
+ | Script | Purpose |
162
+ |---|---|
163
+ | `generate.py` | Main generation script |
164
+ | `scripts/upscale_images.py` | RealESRGAN 4× upscale before LoRA precompute |
165
+ | `scripts/precompute_lora_data.py` | Precompute image+caption pairs for LoRA training |
166
+ | `scripts/train_lora.py` | LoRA fine-tuning |
167
+ | `scripts/download_unsplash.py` | Download images from Pexels by search query |
168
+ | `scripts/setup_diffusers_pixeldit.py` | Install pipeline into active venv's diffusers |
169
 
170
+ See `howto_lora.md` for the full LoRA training walkthrough.
 
 
 
 
171
 
172
  ---
173
 
 
175
 
176
  - **Original model & all credit**: [NVIDIA Research](https://huggingface.co/nvidia/PixelDiT-1300M-1024px)
177
  - **Paper**: *PixelDiT: Pixel-Space Diffusion Transformers for Text-to-Image Generation*, NVIDIA
178
+ - **This repo**: unofficial diffusers conversion, Qwen integration, LoRA tooling only