---
license: apache-2.0
tags:
- NextStep
- Image Tokenizer
---

# Improved Image Tokenizer

This is an improved image tokenizer for NextStep-1. The encoder is kept frozen and only the decoder is fine-tuned, so encoding is unchanged and the gains come entirely from decoding. The decoder refinement **improves performance** while preserving robust reconstruction quality. We **recommend using this Image Tokenizer** for optimal results with NextStep-1 models.

## Usage

```py
import torch
from PIL import Image
import torchvision.transforms as transforms

from autoencoder import AutoencoderKLNextStep

device = "cuda"
dtype = torch.bfloat16

model_path = "/path/to/vae_dir"
vae = AutoencoderKLNextStep.from_pretrained(model_path).to(device=device, dtype=dtype)

# Map a PIL image to a float tensor in [-1, 1]
pil2tensor = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
    ]
)

# Force 3 channels so the 3-channel Normalize above also works for grayscale/RGBA files
image = Image.open("/path/to/image.jpg").convert("RGB")
pixel_values = pil2tensor(image).unsqueeze(0).to(device=device, dtype=dtype)

# Inference only, so skip gradient tracking
with torch.no_grad():
    # encode
    latents = vae.encode(pixel_values).latent_dist.sample()

    # decode
    sampled_images = vae.decode(latents).sample


def tensor_to_pil(tensor):
    # Undo the [-1, 1] normalization and convert CHW float to HWC uint8
    image = tensor.detach().cpu().to(torch.float32)
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.mul(255).round().to(dtype=torch.uint8)
    image = image.permute(1, 2, 0).numpy()
    return Image.fromarray(image)


rec_image = tensor_to_pil(sampled_images[0])
rec_image.save("/path/to/output.jpg")
```
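
As a rough sanity check on what `vae.encode` should return, the tokenizer name `NextStep-1-f8ch16-Tokenizer` and the 32×32×16 latent shape reported for 256×256 inputs suggest an 8× spatial downsampling with 16 latent channels. The sketch below works under that assumption. The helper `expected_latent_shape` is hypothetical and not part of the released code, and both the channels-first ordering and the divisibility check are assumptions rather than documented behavior.

```python
def expected_latent_shape(height, width, downsample=8, channels=16):
    """Hypothetical helper: (C, H, W) latent shape for an assumed f8ch16 tokenizer."""
    # Assumption: inputs should be divisible by the downsampling factor.
    # How the real model treats other sizes is not specified in this README.
    if height % downsample or width % downsample:
        raise ValueError(
            f"height and width should be multiples of {downsample}, "
            f"got {height}x{width}"
        )
    return (channels, height // downsample, width // downsample)


print(expected_latent_shape(256, 256))  # (16, 32, 32)
print(expected_latent_shape(512, 384))  # (16, 64, 48)
```

Comparing this against `latents.shape[1:]` after encoding is a quick way to confirm that the checkpoint and the input size match expectations.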

## Evaluation

### Reconstruction Performance on ImageNet-1K 256×256

| Tokenizer                 | Latent Shape | PSNR ↑    | SSIM ↑   |
| ------------------------- | ------------ | --------- | -------- |
| **Discrete Tokenizers**   |              |           |          |
| SBER-MoVQGAN (270M)       | 32×32        | 27.04     | 0.74     |
| LlamaGen                  | 32×32        | 24.44     | 0.77     |
| VAR                       | 680          | 22.12     | 0.62     |
| TiTok-S-128               | 128          | 17.52     | 0.44     |
| Selftok                   | 1024         | 26.30     | 0.81     |
| **Continuous Tokenizers** |              |           |          |
| Stable Diffusion 1.5      | 32×32×4      | 25.18     | 0.73     |
| Stable Diffusion XL       | 32×32×4      | 26.22     | 0.77     |
| Stable Diffusion 3 Medium | 32×32×16     | 30.00     | 0.88     |
| Flux.1-dev                | 32×32×16     | 31.64     | 0.91     |
| **NextStep-1**            | **32×32×16** | **30.60** | **0.89** |
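
For readers who want to spot-check reconstructions against the PSNR column, here is a minimal NumPy sketch of the metric. It assumes 8-bit images with a peak value of 255. The exact evaluation protocol behind the table (resizing, cropping, averaging over the dataset) is not stated here, so numbers from this sketch are not guaranteed to match the table. The function name `psnr` is illustrative.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    if ref.shape != rec.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value**2 / mse))


# Toy check: an error of exactly one gray level everywhere gives MSE = 1
ref = np.full((256, 256, 3), 128, dtype=np.uint8)
rec = ref + 1
print(round(psnr(ref, rec), 2))  # 48.13
```

With the usage example above, `reference` would be the input image and `reconstruction` the saved output, both loaded as `uint8` arrays of the same size.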

### Robustness of NextStep-1-f8ch16-Tokenizer

Impact of noise perturbation on image tokenizer performance. The top panel displays quantitative metrics (rFID↓, PSNR↑, and SSIM↑) versus noise intensity. The bottom panel presents qualitative reconstruction examples at noise standard deviations of 0.2 and 0.5.

<div align='center'>
<img src="assets/robustness.png" class="interpolation-image" alt="arch." width="100%" />
</div>
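
The perturbation itself is simple to sketch. This README does not say where the noise is injected, so the example below assumes i.i.d. Gaussian noise added to the latents before decoding. The helper `perturb_latents`, the channels-first 16×32×32 shape, and the all-zero stand-in latents are illustrative assumptions.

```python
import numpy as np


def perturb_latents(latents, noise_std, seed=0):
    """Add i.i.d. Gaussian noise with standard deviation `noise_std` (assumed setup)."""
    rng = np.random.default_rng(seed)
    latents = np.asarray(latents, dtype=np.float32)
    noise = rng.standard_normal(latents.shape).astype(np.float32)
    return latents + noise_std * noise


# Stand-in latents; real ones would come from the encoder
latents = np.zeros((16, 32, 32), dtype=np.float32)
for std in (0.2, 0.5):
    noisy = perturb_latents(latents, std)
    # The empirical std of the perturbation should be close to the requested one
    print(std, noisy.shape, round(float((noisy - latents).std()), 2))
```

Under this assumption, the perturbed tensor would be decoded in place of the clean latents and the reconstruction scored with rFID, PSNR, and SSIM at each noise level.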