Commit c50740e by madtune (verified) · 1 parent: 33047ba

Upload README.md with huggingface_hub

  1. README.md +55 -20
README.md CHANGED
@@ -11,9 +11,9 @@ base_model: nvidia/PixelDiT-1300M-1024px
11
 
12
  # PixelDiT 1.3B — Diffusers-Compatible Conversion
13
 
14
- This is an **unofficial** HuggingFace-compatible conversion of NVIDIA's [PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) model.
15
 
16
- All credit goes to the original authors at NVIDIA. This repo only provides a `PreTrainedModel` wrapper to enable `from_pretrained`, `save_pretrained`, and LoRA fine-tuning via `peft`.
17
 
18
  > **I do not own this model.** Original weights, architecture, and training are the work of NVIDIA Research. Please refer to their [original repository](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) for license terms.
19
 
@@ -21,41 +21,67 @@ All credit goes to the original authors at NVIDIA. This repo only provides a `Pr
21
 
22
  ## What is PixelDiT?
23
 
24
- PixelDiT is a 1.3B parameter pixel-space diffusion transformer — no VAE, generates images directly in pixel space. Text conditioning uses Gemma-2-2B with a chi_prompt prefix to produce rich visual descriptions.
25
 
26
  - **Architecture**: MMDiT patch blocks + pixel pathway (PiT blocks)
27
- - **Text encoder**: Gemma-2-2B (`Efficient-Large-Model/gemma-2-2b-it`)
28
  - **Resolution**: up to 1024×1024
29
- - **Sampler**: Flow matching (DPM-Solver++ recommended, 20 steps)
30
 
31
  ---
32
 
33
- ## Usage
34
-
35
- ```python
36
- from pixeldit import PixelDiTPipeline
37
-
38
- pipe = PixelDiTPipeline(pretrained="madtune/pixeldit-diffusers")
39
- img = pipe("a white horse running in a meadow at sunset", height=512, width=512)[0]
40
- img.save("out.jpg")
41
- ```
42
 
43
- Install the package:
44
  ```bash
45
- git clone https://github.com/madtune/pixeldit-diffusers
46
  cd pixeldit-diffusers
 
47
  pip install transformers accelerate safetensors pillow
48
  ```
49
 
50
  ---
51
 
 
52
  ## LoRA fine-tuning
53
 
54
  ```python
55
- from pixeldit import PixelDiTModel
56
  from peft import get_peft_model, LoraConfig
 
57
 
58
- model = PixelDiTModel.from_pretrained("madtune/pixeldit-diffusers")
59
  lora_cfg = LoraConfig(target_modules=["qkv_x", "qkv_y", "proj_x", "proj_y"])
60
  model = get_peft_model(model, lora_cfg)
61
  model.print_trainable_parameters()
@@ -63,8 +89,17 @@ model.print_trainable_parameters()
63
 
64
  ---
65
 
66
  ## Credits
67
 
68
- - **Original model**: [NVIDIA Research](https://huggingface.co/nvidia/PixelDiT-1300M-1024px)
69
- - **Diffusers conversion**: [madtune](https://huggingface.co/madtune)
70
  - **Paper**: *PixelDiT: Pixel-Space Diffusion Transformers for Text-to-Image Generation* — NVIDIA
 
 
11
 
12
  # PixelDiT 1.3B — Diffusers-Compatible Conversion
13
 
14
+ This is an **unofficial** HuggingFace diffusers-compatible conversion of NVIDIA's [PixelDiT-1300M-1024px](https://huggingface.co/nvidia/PixelDiT-1300M-1024px).
15
 
16
+ All credit goes to the original authors at NVIDIA. This repo only provides a `DiffusionPipeline` wrapper to enable standard diffusers usage, `from_pretrained`, and LoRA fine-tuning via `peft`.
17
 
18
  > **I do not own this model.** Original weights, architecture, and training are the work of NVIDIA Research. Please refer to their [original repository](https://huggingface.co/nvidia/PixelDiT-1300M-1024px) for license terms.
19
 
 
21
 
22
  ## What is PixelDiT?
23
 
24
+ PixelDiT is a 1.3B parameter **pixel-space** diffusion transformer — no VAE, generates images directly in pixel space. Runs on **4GB VRAM** at 512px.
25
 
26
  - **Architecture**: MMDiT patch blocks + pixel pathway (PiT blocks)
27
+ - **Text encoder**: Gemma-2-2B with chi_prompt instruction prefix
28
  - **Resolution**: up to 1024×1024
29
+ - **Sampler**: Flow matching (FlowMatchEulerDiscreteScheduler, shift=4.0)
30
 
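The `shift=4.0` in the sampler line refers to a timestep shift applied to the flow-matching noise schedule. Below is a minimal sketch of the static shift commonly used with flow-matching Euler schedulers, assuming the usual form `sigma' = shift * sigma / (1 + (shift - 1) * sigma)`. The function names and the linear base schedule are illustrative assumptions and are not taken from this repo.

```python
def shift_sigma(sigma: float, shift: float = 4.0) -> float:
    """Remap a noise level in [0, 1]; shift > 1 pushes values toward high noise."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps: int = 20, shift: float = 4.0) -> list[float]:
    """Linear sigmas from 1.0 down to 1/num_steps, then shifted (illustrative only)."""
    base = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in base]


sigmas = shifted_schedule()
print([round(s, 3) for s in sigmas[:3]])
```

With `shift=1.0` the schedule is unchanged; larger values spend more of the 20 steps at high noise levels, which is generally considered helpful at higher resolutions.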
31
  ---
32
 
33
+ ## Install
34
 
 
35
  ```bash
36
+ git clone https://github.com/madtunebk/pixeldit-diffusers
37
  cd pixeldit-diffusers
38
+ python setup_diffusers_pixeldit.py
39
  pip install transformers accelerate safetensors pillow
40
  ```
41
 
42
  ---
43
 
44
+ ## Usage
45
+
46
+ ```python
47
+ import torch
48
+ from transformers import AutoTokenizer, AutoModelForCausalLM
49
+ from diffusers.pipelines.pixeldit import PixelDiTPipeline
50
+
51
+ tokenizer = AutoTokenizer.from_pretrained("Efficient-Large-Model/gemma-2-2b-it")
52
+ tokenizer.padding_side = "right"
53
+ text_encoder = (
54
+     AutoModelForCausalLM.from_pretrained("Efficient-Large-Model/gemma-2-2b-it", torch_dtype=torch.float32)
55
+     .get_decoder().eval()
56
+ )
57
+
58
+ pipe = PixelDiTPipeline.from_pretrained(
59
+     "madtune/pixeldit-diffusers",
60
+     text_encoder=text_encoder,
61
+     tokenizer=tokenizer,
62
+     torch_dtype=torch.bfloat16,
63
+ )
64
+ pipe.enable_model_cpu_offload()
65
+
66
+ image = pipe(
67
+     "a white horse galloping through a meadow at sunset, cinematic lighting",
68
+     negative_prompt="blurry, flat, low quality, cartoon",
69
+     height=512, width=512,
70
+     num_inference_steps=20,
71
+     guidance_scale=3.5,
72
+ ).images[0]
73
+ image.save("out.jpg")
74
+ ```
75
+
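The usage snippet loads the transformer in `torch.bfloat16` and enables CPU offload. As a rough back-of-the-envelope check on why the dtype matters on small GPUs, the sketch below estimates weight memory for the 1.3B-parameter transformer only. It ignores activations, the Gemma text encoder, and framework overhead, so treat it as an estimate under those assumptions and not as a measurement.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_gib(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GiB (no activations, no text encoder)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


PIXELDIT_PARAMS = 1.3e9  # parameter count stated in this README

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_gib(PIXELDIT_PARAMS, dtype):.2f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which leaves more headroom for activations on a small card.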
76
+ ---
77
+
78
  ## LoRA fine-tuning
79
 
80
  ```python
 
81
  from peft import get_peft_model, LoraConfig
82
+ from diffusers.pipelines.pixeldit import PixelDiTModel
83
 
84
+ model = PixelDiTModel.from_pretrained("madtune/pixeldit-diffusers", subfolder="transformer")
85
  lora_cfg = LoraConfig(target_modules=["qkv_x", "qkv_y", "proj_x", "proj_y"])
86
  model = get_peft_model(model, lora_cfg)
87
  model.print_trainable_parameters()
 
89
 
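To give a feel for what `print_trainable_parameters()` reports, here is a small sketch of how many parameters LoRA adds to a single linear layer. The layer shapes are hypothetical (the hidden size and the assumption that `qkv_x` is a fused Q/K/V projection are guesses from the module names, not values read from the PixelDiT config); the rank of 8 is the `peft` default for `LoraConfig.r`.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to Linear(d_in, d_out): A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 1024  # assumed hidden size, for illustration only
RANK = 8       # default rank in peft's LoraConfig

qkv_lora = lora_params(HIDDEN, 3 * HIDDEN, RANK)   # assumed fused QKV projection
proj_lora = lora_params(HIDDEN, HIDDEN, RANK)      # assumed output projection
qkv_full = HIDDEN * 3 * HIDDEN                     # frozen base weight, for comparison

print(f"LoRA on qkv: {qkv_lora} params ({100 * qkv_lora / qkv_full:.2f}% of the base weight)")
print(f"LoRA on proj: {proj_lora} params")
```

Because the adapter size grows with `r * (d_in + d_out)` while the base weight grows with `d_in * d_out`, the trainable fraction stays small for wide layers.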
90
  ---
91
 
92
+ ## Sample outputs
93
+
94
+ | Prompt | Image |
95
+ |--------|-------|
96
+ | *a viking warrior at sunset* | cinematic, photorealistic |
97
+ | *elemental goddess with fire and ice powers* | epic fantasy, 1024px |
98
+
99
+ ---
100
+
101
  ## Credits
102
 
103
+ - **Original model & all credit**: [NVIDIA Research](https://huggingface.co/nvidia/PixelDiT-1300M-1024px)
 
104
  - **Paper**: *PixelDiT: Pixel-Space Diffusion Transformers for Text-to-Image Generation* — NVIDIA
105
+ - **This repo**: unofficial diffusers conversion only, no claim of authorship