PixelDiT 1.3B: Diffusers-Compatible Pipeline

Two RTX 3060s. Infinite Lore. Zero Fear.

Unofficial HuggingFace diffusers-compatible conversion of NVIDIA's PixelDiT-1300M-1024px with dual text encoder support (Gemma-2-2B + Qwen3-2B) and ComfyUI integration.

All credit for the model architecture and weights goes to NVIDIA Research. This repo provides the pipeline wrapper, Qwen encoder integration, and tooling.

I do not own this model. Original weights, architecture, and training are the work of NVIDIA Research.


What is PixelDiT?

PixelDiT is a 1.3B-parameter pixel-space diffusion transformer: it uses no VAE and generates images directly in pixel space. It runs on 4 GB of VRAM.
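
As a rough sanity check on the memory figure, the weight footprint can be estimated from the parameter count alone. The sketch below assumes bfloat16 weights (2 bytes per parameter, matching the torch_dtype in the usage example) and ignores activations, the text encoder, and framework overhead, so it gives a lower bound rather than a measurement. The helper name weight_footprint_gib is illustrative only.

```python
def weight_footprint_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the raw weights in GiB."""
    return num_params * bytes_per_param / 2**30


# 1.3B parameters stored as bfloat16 (2 bytes each).
bf16_gib = weight_footprint_gib(1.3e9, 2)
# The same weights stored as float32 (4 bytes each), for comparison.
fp32_gib = weight_footprint_gib(1.3e9, 4)

print(f"bf16: {bf16_gib:.2f} GiB, fp32: {fp32_gib:.2f} GiB")
```

Under these assumptions the bfloat16 transformer weights alone fit within a 4 GB card, with the remainder left for activations.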

  • Architecture: MMDiT patch blocks + pixel pathway (PiT blocks)
  • Text encoders: Gemma-2-2B (photorealistic) or Qwen3-2B (creative/fantasy)
  • Native resolution: 1024×1024
  • Sampler: Flow matching (FlowMatchEulerDiscreteScheduler, shift=4.0)
  • Minimum steps: 45–50 (below 45 produces garbage output)
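
To make the shift=4.0 setting concrete, here is a minimal sketch of the static timestep shift used by flow-matching schedulers of this kind, written in plain Python rather than calling diffusers. The base grid (linearly spaced noise levels from 1.0 down to 1/num_steps) is an assumption for illustration, and the exact grid the scheduler builds may differ. The function names are hypothetical.

```python
def shift_sigma(sigma: float, shift: float = 4.0) -> float:
    """Apply the static flow-matching shift to a noise level in [0, 1]."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps: int, shift: float = 4.0) -> list[float]:
    """Linearly spaced noise levels from 1.0 down to 1/num_steps, then shifted."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in base]


sigmas = shifted_schedule(50)

# With shift > 1 the schedule spends more of its steps at high noise levels:
# the midpoint of the base grid (0.5) is mapped up to 0.8.
print(sigmas[0], sigmas[25], sigmas[-1])
```

A shift of 1.0 leaves the grid unchanged, so the parameter can be read as how strongly sampling is biased toward the noisy end of the trajectory.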

Install

python3 -m venv .venv && source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers accelerate safetensors pillow
git clone https://github.com/madtunebk/pixeldit-diffusers
cd pixeldit-diffusers
python scripts/setup_diffusers_pixeldit.py

Usage

Gemma encoder (photorealistic)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from diffusers.pipelines.pixeldit import PixelDiTPipeline

# Load the Gemma tokenizer and pad prompts on the right.
tokenizer = AutoTokenizer.from_pretrained("Efficient-Large-Model/gemma-2-2b-it")
tokenizer.padding_side = "right"
# Keep only the decoder stack (no LM head) to use as the text encoder.
text_encoder = (
    AutoModelForCausalLM.from_pretrained("Efficient-Large-Model/gemma-2-2b-it", torch_dtype=torch.float32)
    .get_decoder().eval()
)

pipe = PixelDiTPipeline.from_pretrained(
    "madtune/pixeldit-diffusers",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
)
# Offload idle components to the CPU so only the active one occupies VRAM.
pipe.enable_model_cpu_offload()

image = pipe(
    "a viking warrior on a cliff overlooking the stormy sea at sunset",
    negative_prompt="blurry, low quality, deformed, watermark",
    height=1024, width=1024,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("out.jpg")
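
The guidance_scale argument above controls classifier-free guidance. The sketch below shows the standard combination rule on plain Python lists. It is an assumption, not confirmed by this repo, that the pipeline combines its prompt and negative-prompt predictions in exactly this way, and apply_cfg is a hypothetical helper.

```python
def apply_cfg(uncond: list[float], cond: list[float], scale: float) -> list[float]:
    """Standard classifier-free guidance: start from the unconditional
    (negative-prompt) prediction and move toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy 3-element "predictions" standing in for full image-sized tensors.
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]

guided = apply_cfg(uncond, cond, scale=7.5)
print(guided)
```

With scale=1.0 the result equals the conditional prediction, and larger values exaggerate the difference between the two predictions.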

Qwen encoder (creative / fantasy / absurd realism)

# pip install -r requirements.txt first
python generate.py --encoder qwen --proj qwen_proj.pt --prompt "your epic prompt"

Qwen excels at complex world-building prompts. The more detail you give it, the better.


generate.py: Quick Start

# Gemma (default, photorealistic)
python generate.py --prompt "a leopard in the jungle, National Geographic"

# Qwen (creative, fantasy)
python generate.py --encoder qwen --proj qwen_proj.pt --cfg 7.5 --steps 50 \
  --prompt "A giant fluffy hamster emperor inside a colossal mechanical battle fortress"

# Batch mode (runs every prompt in the PROMPTS list)
python generate.py --encoder qwen --proj qwen_proj.pt
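
For scripted runs, the command lines above can be assembled programmatically. This sketch only builds an argument list from the flags shown in this section (--encoder, --proj, --cfg, --steps, --prompt). The helper build_command is hypothetical, and it assumes --cfg and --steps are accepted for both encoders. Nothing runs until the list is passed to subprocess.run.

```python
import shlex

MIN_STEPS = 45  # this README reports garbage output below 45 steps


def build_command(prompt, encoder=None, proj=None, cfg=7.5, steps=50):
    """Build an argv list for generate.py using only flags shown above."""
    if steps < MIN_STEPS:
        raise ValueError(f"steps must be at least {MIN_STEPS}, got {steps}")
    cmd = ["python", "generate.py"]
    if encoder is not None:
        cmd += ["--encoder", encoder]
    if proj is not None:
        cmd += ["--proj", proj]
    cmd += ["--cfg", str(cfg), "--steps", str(steps), "--prompt", prompt]
    return cmd


cmd = build_command("a leopard in the jungle", encoder="qwen", proj="qwen_proj.pt")
# shlex.join quotes the prompt so the printed line is safe to paste into a shell.
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```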

ComfyUI

ln -s /path/to/pixeldit-diffusers/comfyui_pixeldit /path/to/ComfyUI/custom_nodes/comfyui_pixeldit
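
The same link can be created from Python, which may help on systems without ln. This is a minimal sketch using pathlib. The function name link_custom_node is hypothetical, and the paths are placeholders, as in the command above.

```python
from pathlib import Path


def link_custom_node(repo_dir: Path, comfyui_dir: Path) -> Path:
    """Symlink <repo>/comfyui_pixeldit into <ComfyUI>/custom_nodes."""
    source = repo_dir / "comfyui_pixeldit"
    target = comfyui_dir / "custom_nodes" / "comfyui_pixeldit"
    if not source.is_dir():
        raise FileNotFoundError(f"{source} does not exist")
    target.parent.mkdir(parents=True, exist_ok=True)
    # Leave an existing link in place so the call can be repeated safely.
    if not target.is_symlink():
        # target_is_directory only has an effect on Windows.
        target.symlink_to(source.resolve(), target_is_directory=True)
    return target


# Example call with placeholder paths:
# link_custom_node(Path("/path/to/pixeldit-diffusers"), Path("/path/to/ComfyUI"))
```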

Three nodes appear under the PixelDiT category:

  • PixelDiT Text Encoder: loads Gemma, or swap in any compatible encoder
  • PixelDiT Model Loader: loads the transformer from Hugging Face
  • PixelDiT Sampler: prompt → image, all parameters exposed

LoRA fine-tuning

# Requires peft (pip install peft); it is not in the pip install line under Install.
from peft import get_peft_model, LoraConfig
from diffusers.pipelines.pixeldit import PixelDiTModel

model = PixelDiTModel.from_pretrained("madtune/pixeldit-diffusers", subfolder="transformer")
lora_cfg = LoraConfig(target_modules=["qkv_x", "qkv_y", "proj_x", "proj_y"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
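
For a sense of scale, the number of trainable parameters LoRA adds to one linear layer follows from its rank: r × (d_in + d_out). The sketch below uses r=8, which is peft's default when LoraConfig is given no r, together with a hypothetical hidden size of 1536 and a fused qkv projection three times as wide. The real PixelDiT dimensions are not stated here, so the numbers are illustrative and print_trainable_parameters() remains the source of the actual count.

```python
def lora_params(d_in: int, d_out: int, r: int = 8) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 1536  # hypothetical hidden size, for illustration only

qkv = lora_params(HIDDEN, 3 * HIDDEN)  # assumed fused query/key/value projection
proj = lora_params(HIDDEN, HIDDEN)     # attention output projection
full_qkv = HIDDEN * 3 * HIDDEN         # dense weights in the same qkv layer

print(qkv, proj, f"{qkv / full_qkv:.2%} of the dense qkv weights")
```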

Credits

  • Original model & all credit: NVIDIA Research
  • Paper: PixelDiT: Pixel-Space Diffusion Transformers for Text-to-Image Generation (NVIDIA)
  • This repo: unofficial diffusers conversion, Qwen integration, and tooling only