---
library_name: diffusers
pipeline_tag: unconditional-image-generation
license: mit
tags:
- diffusers
- rae
- rae-dit
- diffusion-transformer
- imagenet-256
- arxiv:2510.11690
---

# RAE-DiT-S ep14 Diffusers conversion

This is a Diffusers-format conversion of the public RAE Stage-2 ImageNet-256 checkpoint `DiTDH-S_ep14`, bundled with the public Stage-1 RAE `nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08`.

It is intended as a lightweight test artifact for the Diffusers RAE-DiT PR: https://github.com/huggingface/diffusers/pull/13231

## Source assets

- Stage-1 RAE: `nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08`
- Stage-2 upstream weights: `nyu-visionx/RAE-collections`, file `DiTs/Dinov2/wReg_base/ImageNet256/DiTDH-S_ep14/stage2_model.pt`
- Upstream code/configs: https://github.com/bytetriper/RAE, config `configs/stage2/training/ImageNet256/DiTDH-S_DINOv2-B.yaml`

## Usage

Until PR #13231 is merged, install Diffusers from the PR branch first:

```bash
pip install git+https://github.com/plugyawn/diffusers.git@rae-dit-training
```

Then run:

```python
import torch
from diffusers import RAEDiTPipeline

repo_id = "plugyawn/rae-dit-s-ep14-diffusers"
pipe = RAEDiTPipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16).to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(
    class_labels=207,
    num_inference_steps=25,
    guidance_scale=1.0,
    generator=generator,
).images[0]
image.save("rae_dit_class207.png")
```

`class_labels` takes ImageNet-1k class ids in the range 0 to 999; the example above uses 207 (golden retriever).

## Validation

The conversion was validated against the upstream implementation on an A100. With matched initial latent noise, class label, and schedule, the converted model's outputs agreed with upstream to within approximately `max_abs_error=1.10e-5` on transformer outputs and `max_abs_error=6.46e-5` on a fixed-seed 25-step decoded sample.
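
For readers who want to run a similar check, the sketch below shows one way a `max_abs_error` figure can be computed: the largest element-wise absolute difference between two outputs. It is not the script used for the numbers above. The helper name `max_abs_error` and the sample values are illustrative assumptions, and the inputs are plain Python floats so the snippet runs without a GPU. With real tensors, something like `tensor.flatten().tolist()` would produce a suitable input.

```python
from typing import Iterable


def max_abs_error(reference: Iterable[float], converted: Iterable[float]) -> float:
    """Return the largest element-wise absolute difference between two outputs."""
    ref = list(reference)
    conv = list(converted)
    if len(ref) != len(conv):
        raise ValueError("outputs must have the same number of elements")
    # default=0.0 covers the degenerate case of two empty outputs
    return max((abs(r - c) for r, c in zip(ref, conv)), default=0.0)


# Hypothetical flattened outputs standing in for upstream and converted models.
upstream_out = [0.125, -0.5, 0.75]
converted_out = [0.125, -0.5 + 1e-5, 0.75 - 2e-5]

err = max_abs_error(upstream_out, converted_out)
print(f"max_abs_error={err:.2e}")
```

The maximum (rather than the mean) is used because a single badly converted weight can produce a large error in a few elements that an average would hide.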