	---
	license: apache-2.0
	tags:
	- NextStep
	- Image Tokenizer
	---
	# Improved Image Tokenizer

	This is an improved image tokenizer for NextStep-1: the encoder is kept frozen and only the decoder is fine-tuned. The refined decoder improves performance while preserving robust reconstruction quality. We recommend using this image tokenizer with NextStep-1 models for best results.

	## Usage

	```py
	import torch
	from PIL import Image
	import numpy as np
	import torchvision.transforms as transforms

	from autoencoder import AutoencoderKLNextStep

	device = "cuda"
	dtype = torch.bfloat16

	model_path = "/path/to/vae_dir"
	vae = AutoencoderKLNextStep.from_pretrained(model_path).to(device=device, dtype=dtype)

	# Convert a PIL image to a tensor and rescale from [0, 1] to [-1, 1]
	pil2tensor = transforms.Compose(
	    [
	        transforms.ToTensor(),
	        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
	    ]
	)

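	# Force 3-channel RGB so the 3-channel Normalize above also works for grayscale or RGBA files
	image = Image.open("/path/to/image.jpg").convert("RGB")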
	pixel_values = pil2tensor(image).unsqueeze(0).to(device=device, dtype=dtype)

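	# Inference only, so skip gradient tracking
	with torch.no_grad():
	    # encode: sample a latent from the posterior distribution
	    latents = vae.encode(pixel_values).latent_dist.sample()

	    # decode: map the latent back to pixel space
	    sampled_images = vae.decode(latents).sample
	sampled_images = sampled_images.detach().cpu().to(torch.float32)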

	def tensor_to_pil(tensor):
	    # Expects a single (C, H, W) image with values in roughly [-1, 1]
	    image = tensor.detach().cpu().to(torch.float32)
	    image = (image / 2 + 0.5).clamp(0, 1)
	    image = image.mul(255).round().to(dtype=torch.uint8)
	    image = image.permute(1, 2, 0).numpy()
	    return Image.fromarray(image, mode="RGB")

	rec_image = tensor_to_pil(sampled_images[0])
	rec_image.save("/path/to/output.jpg")
	```
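
As a quick sanity check on tensor sizes, the sketch below estimates the latent shape for a given input resolution. It assumes that `f8ch16` in the model name means an 8× spatial downsampling factor and 16 latent channels, which matches the 32×32×16 latent shape listed for 256×256 inputs in the evaluation table, and that latents are laid out channels-first. `latent_shape` is a hypothetical helper, not part of this repository, and rejecting sizes that are not multiples of 8 is a conservative choice of the sketch rather than a documented requirement.

```python
def latent_shape(height, width, factor=8, channels=16):
    """Return the assumed (channels, height, width) of the latent for one image.

    Assumption: the encoder downsamples each spatial side by `factor` and
    emits `channels` latent channels (read off the "f8ch16" model name).
    """
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    return (channels, height // factor, width // factor)


print(latent_shape(256, 256))  # (16, 32, 32)
```

If the assumption holds, `latents.shape[1:]` from the usage example above should equal `latent_shape(*pixel_values.shape[-2:])`.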

	## Evaluation

	### Reconstruction Performance on ImageNet-1K 256×256

	| Tokenizer                 | Latent Shape | PSNR ↑ | SSIM ↑ |
	| ------------------------- | ------------ | ------ | ------ |
	| **Discrete Tokenizers**   |              |        |        |
	| SBER-MoVQGAN (270M)       | 32×32        | 27.04  | 0.74   |
	| LlamaGen                  | 32×32        | 24.44  | 0.77   |
	| VAR                       | 680          | 22.12  | 0.62   |
	| TiTok-S-128               | 128          | 17.52  | 0.44   |
	| Selftok                   | 1024         | 26.30  | 0.81   |
	| **Continuous Tokenizers** |              |        |        |
	| Stable Diffusion 1.5      | 32×32×4      | 25.18  | 0.73   |
	| Stable Diffusion XL       | 32×32×4      | 26.22  | 0.77   |
	| Stable Diffusion 3 Medium | 32×32×16     | 30.00  | 0.88   |
	| Flux.1-dev                | 32×32×16     | 31.64  | 0.91   |
	| NextStep-1                | 32×32×16     | 30.60  | 0.89   |
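
For readers who want to check reconstructions of their own images, here is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE). It is not the evaluation script behind the table: the exact protocol (resizing, cropping, data range, color space) is not stated here, so numbers from this sketch may not be directly comparable. The `psnr` helper and the toy arrays are illustrative assumptions.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value**2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    if ref.shape != rec.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value**2 / mse))


# Toy example: every pixel is off by one grey level, so MSE = 1.
original = np.full((256, 256, 3), 128, dtype=np.uint8)
reconstructed = np.full((256, 256, 3), 129, dtype=np.uint8)
print(f"{psnr(original, reconstructed):.2f} dB")  # 48.13 dB
```

To apply it to the usage example, pass `np.asarray(image)` and `np.asarray(rec_image)`, which are both 8-bit RGB arrays of the same size.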

	### Robustness of NextStep-1-f8ch16-Tokenizer

	Impact of noise perturbation on image tokenizer performance. The top panel plots quantitative metrics (rFID ↓, PSNR ↑, and SSIM ↑) against noise intensity. The bottom panel shows qualitative reconstruction examples at noise standard deviations of 0.2 and 0.5.

	<div align='center'>
	<img src="assets/robustness.png" class="interpolation-image" alt="arch." width="100%" />
	</div>
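
A perturbation of this kind can be sketched as follows. This card does not say where the noise is injected, so the sketch assumes zero-mean Gaussian noise added directly to the latents before decoding. `perturb_latents` is a hypothetical helper, and the zero array is only a stand-in for real encoder output.

```python
import numpy as np


def perturb_latents(latents, noise_std, rng):
    """Add zero-mean Gaussian noise with standard deviation `noise_std`."""
    noise = rng.standard_normal(latents.shape).astype(latents.dtype)
    return latents + noise_std * noise


rng = np.random.default_rng(0)
latents = np.zeros((1, 16, 32, 32), dtype=np.float32)  # stand-in for encoder output
for noise_std in (0.2, 0.5):
    noisy = perturb_latents(latents, noise_std, rng)
    # The empirical std of the added noise should be close to noise_std.
    print(noise_std, noisy.shape, round(float((noisy - latents).std()), 2))
```

With the PyTorch latents from the usage example, the equivalent step would be `latents + noise_std * torch.randn_like(latents)`, followed by `vae.decode(...)` as before.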