license: apache-2.0
pipeline_tag: image-to-image
paper: https://huggingface.co/papers/2509.01109
repo_url: https://github.com/xtudbxk/GPSToken
project_page: https://openreview.net/forum?id=BxoEDR2yQM

GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

This model was presented in the paper GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation.

Abstract

Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively.

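arXiv version: https://arxiv.org/abs/2509.01109 | GitHub Repository: https://github.com/xtudbxk/GPSToken | Project Page: https://openreview.net/forum?id=BxoEDR2yQM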

Zhengqiang Zhang¹,² | Rongyuan Wu¹,² | Lingchen Sun¹,² | Lei Zhang¹,²,⁺

¹ The Hong Kong Polytechnic University | ² OPPO Research Institute | ⁺ Corresponding Author

Motivation: Beyond Fixed Grids

Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks flexibility in handling regions with varying shapes, textures, and locations. We propose GPSToken, a Gaussian Parameterized Spatially-adaptive Tokenization framework, enabling non-uniform tokenization via parametric 2D Gaussians. Our method:

Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
Represents each region as a 2D Gaussian (mean for position, covariance for shape) and texture features;
Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation;
Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.
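
The representation and rendering steps in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' renderer: the function name `splat_tokens`, the per-pixel weight normalization, and the two-dimensional feature vectors are assumptions made for the example. Each token carries a mean, a 2x2 covariance, and a feature vector, and every pixel receives a mix of token features weighted by the unnormalized Gaussian density.

```python
import numpy as np

def splat_tokens(means, covs, feats, height, width, eps=1e-8):
    """Render N Gaussian tokens into an (H, W, C) feature map.

    means: (N, 2) token centers as (x, y) in pixel coordinates
    covs:  (N, 2, 2) symmetric positive-definite covariance matrices
    feats: (N, C) texture feature vectors
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2)
    diff = pixels[None] - means[:, None, None, :]             # (N, H, W, 2)
    inv = np.linalg.inv(covs)                                  # (N, 2, 2)
    # Squared Mahalanobis distance from every pixel to every token.
    maha = np.einsum("nhwi,nij,nhwj->nhw", diff, inv, diff)
    weights = np.exp(-0.5 * maha)                              # (N, H, W)
    # Assumption: normalize per pixel so each output is a weighted average.
    weights = weights / (weights.sum(axis=0, keepdims=True) + eps)
    return np.einsum("nhw,nc->hwc", weights, feats)

# Two hypothetical tokens: one isotropic, one elongated and tilted.
means = np.array([[4.0, 4.0], [12.0, 12.0]])
covs = np.stack([np.eye(2) * 4.0, np.array([[9.0, 3.0], [3.0, 4.0]])])
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
fmap = splat_tokens(means, covs, feats, 16, 16)
print(fmap.shape)  # (16, 16, 2)
```

Every operation here is smooth in the means, covariances, and features, so the same computation written in an autodiff framework admits gradients with respect to all three.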

Core Highlights

✅ Spatially-Adaptive Representation

The image is iteratively split into entropy-balanced regions of varying positions and shapes, with finer partitions where textures are complex. Each region is then represented by a 2D Gaussian (mean for position, covariance for shape) together with its texture features.
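
A greedy version of this splitting idea can be sketched as follows. It is an illustration under stated assumptions, not the paper's algorithm: the histogram-based Shannon entropy, the entropy-times-area priority, the split along the longer side, the Gaussian initialization heuristic, and the names `region_entropy`, `entropy_partition`, and `regions_to_gaussians` are all choices made for this example.

```python
import heapq
import numpy as np

def region_entropy(patch, bins=32):
    """Shannon entropy (bits) of the intensity histogram of a patch in [0, 1]."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def entropy_partition(image, num_regions):
    """Greedily split a grayscale image into at most `num_regions` rectangles.

    The region with the largest entropy * area is halved along its longer
    side, so textured areas end up covered by smaller regions.
    Regions are (y0, x0, y1, x1) with exclusive upper bounds.
    """
    h, w = image.shape

    def score(y0, x0, y1, x1):
        return region_entropy(image[y0:y1, x0:x1]) * (y1 - y0) * (x1 - x0)

    heap = [(-score(0, 0, h, w), (0, 0, h, w))]  # min-heap, so negate scores
    done = []                                     # single-pixel regions
    while heap and len(heap) + len(done) < num_regions:
        _, (y0, x0, y1, x1) = heapq.heappop(heap)
        if y1 - y0 >= x1 - x0 and y1 - y0 >= 2:
            mid = (y0 + y1) // 2
            parts = [(y0, x0, mid, x1), (mid, x0, y1, x1)]
        elif x1 - x0 >= 2:
            mid = (x0 + x1) // 2
            parts = [(y0, x0, y1, mid), (y0, mid, y1, x1)]
        else:
            done.append((y0, x0, y1, x1))
            continue
        for region in parts:
            heapq.heappush(heap, (-score(*region), region))
    return done + [region for _, region in heap]

def regions_to_gaussians(regions, sigma_scale=0.5):
    """Initialize one axis-aligned Gaussian per region (assumed heuristic)."""
    means, covs = [], []
    for y0, x0, y1, x1 in regions:
        means.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
        sx, sy = sigma_scale * (x1 - x0), sigma_scale * (y1 - y0)
        covs.append(np.diag([sx ** 2, sy ** 2]))
    return np.array(means), np.stack(covs)

# Synthetic test image: flat left half, random texture on the right half.
rng = np.random.default_rng(0)
image = np.full((64, 64), 0.5)
image[:, 32:] = rng.random((64, 32))
regions = entropy_partition(image, 16)
means, covs = regions_to_gaussians(regions)
in_textured_half = sum(1 for (_, x0, _, _) in regions if x0 >= 32)
print(len(regions), in_textured_half)
```

On this synthetic image most of the regions land in the textured half, which is the behavior the entropy criterion is meant to produce. In the method described here, such a partition only provides the starting point; the transformer then refines the Gaussian parameters.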

✅ Dynamic & Scalable

GPSToken also supports:

User-Controllable Adjustment: Manually allocate more tokens to regions of interest for finer reconstruction.
Variable Token Count: Increase or decrease the token count per image to balance efficiency and fidelity.
Scalable to Higher Resolution: Maintain comparable performance at higher resolutions without retraining.

✅ Spatial-Texture Disentanglement

Each token encodes a disentangled representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation for downstream tasks like generation.
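
To make the disentanglement concrete, the sketch below keeps geometry and texture in separate fields and builds the covariance from two scales and a rotation angle. The `GaussianToken` class, the scale-plus-angle parameterization, and the 16-dimensional texture vector are assumptions for illustration; the paper's exact parameter layout may differ.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass
class GaussianToken:
    mean: np.ndarray     # (2,) position as (x, y)
    scales: np.ndarray   # (2,) standard deviations along the principal axes
    angle: float         # rotation of the principal axes, in radians
    texture: np.ndarray  # (C,) texture feature vector

    def covariance(self):
        """Sigma = R diag(scales^2) R^T, positive definite for scales > 0."""
        c, s = np.cos(self.angle), np.sin(self.angle)
        rot = np.array([[c, -s], [s, c]])
        return rot @ np.diag(self.scales ** 2) @ rot.T

token = GaussianToken(
    mean=np.array([10.0, 20.0]),
    scales=np.array([3.0, 1.0]),
    angle=np.pi / 4,
    texture=np.zeros(16),
)
# Edit the geometry only: shift the token 5 pixels to the right.
moved = replace(token, mean=token.mean + np.array([5.0, 0.0]))
print(np.round(token.covariance(), 6))
print(moved.texture is token.texture)  # True: the texture is untouched
```

Because the spatial fields and the texture vector never mix, a layout model can operate on the means and covariances alone while a second model fills in textures conditioned on that layout.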

✅ SOTA Performance

Achieves PSNR = 28.81, SSIM = 0.809, rFID = 0.22, and FID = 1.65 on $256\times 256$ ImageNet reconstruction with only 256 tokens, the best scores among the methods compared.

Experimental Results

1. Image Reconstruction ($256\times 256$ on the ImageNet validation set)

GPSToken-L256 achieves the best score on every metric in the table, including against the fixed-grid VAVAE (16x16), which uses the same 256 tokens.

Method	Token Count	Params (M)	PSNR ↑	SSIM ↑	LPIPS ↓	rFID ↓	FID ↓
SDXL-VAE	32x32	83.6	25.55	0.727	0.066	0.73	2.35
VAVAE	16x16	69.8	25.76	0.742	0.050	0.27	1.74
DCAE	8x8	323.4	23.62	0.644	0.092	0.98	2.59
TiTok-B64	64	204.8	17.01	0.390	0.263	1.75	2.50
TiTok-S128	128	83.7	17.66	0.413	0.220	1.73	3.25
MAETok	128	173.9	23.25	0.626	0.096	0.65	2.01
FlexTok	256	949.7	17.69	0.475	0.257	4.02	4.88
GPSToken-S64	64	127.5	22.18	0.578	0.111	1.31	3.02
GPSToken-M128	128	127.8	24.06	0.657	0.080	0.65	2.18
GPSToken-L256	256	128.7	28.81	0.809	0.043	0.22	1.65
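
The table mixes grid notation (such as 16x16) with flat counts, so comparing methods at an equal budget means converting both to a number of tokens. The helper below does that for a few rows copied from the table, assuming each grid cell corresponds to one token; the name `token_budget` is only for this example.

```python
def token_budget(spec: str) -> int:
    """Convert a grid such as '16x16' or a plain count such as '256' to tokens."""
    total = 1
    for part in spec.lower().split("x"):
        total *= int(part)
    return total

# (method, token count as written in the table, rFID), copied from the table.
rows = [
    ("SDXL-VAE", "32x32", 0.73),
    ("VAVAE", "16x16", 0.27),
    ("DCAE", "8x8", 0.98),
    ("TiTok-B64", "64", 1.75),
    ("GPSToken-S64", "64", 1.31),
    ("GPSToken-L256", "256", 0.22),
]

by_budget = {}
for method, spec, rfid in rows:
    by_budget.setdefault(token_budget(spec), []).append((method, rfid))

for budget in sorted(by_budget):
    print(budget, by_budget[budget])
```

Grouped this way, VAVAE and GPSToken-L256 share the 256-token budget, while DCAE, TiTok-B64, and GPSToken-S64 share the 64-token budget.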

2. Spatial-Adaptivity Visualization

Gaussian tokens automatically concentrate on high-complexity regions.

3. User-Controllable Adaptivity

Tokens can be manually steered toward regions that matter to the user.

4. Variable Token Count of GPSToken

The number of tokens used to encode an image can be increased or decreased.

5. Scales to Higher Resolutions

GPSToken generalizes to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, using models trained only on $256\times 256$ images.

Method	Tokens	PSNR ↑	SSIM ↑	LPIPS ↓	rFID ↓	rec. sFID ↓
512×512
SDXL-VAE	64×64	28.42	0.817	0.059	0.271	1.36
VQVAE-f16	32×32	21.83	0.604	0.172	2.29	7.95
GPSToken-M128	512	26.74	0.764	0.073	0.367	1.93
GPSToken-L256	1024	32.00	0.887	0.039	0.175	0.699
1024×1024
SDXL-VAE	128×128	33.27	0.909	0.057	0.113	0.561
VQVAE-f16	64×64	25.41	0.744	0.169	1.40	4.98
GPSToken-M128	2048	31.22	0.873	0.072	0.236	1.24
GPSToken-L256	4096	37.71	0.955	0.031	0.055	0.276
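
The token counts in the table are consistent with scaling the base budget by the increase in image area relative to the $256\times 256$ training resolution. The snippet below reproduces that arithmetic; it only mirrors the numbers in the table and makes no claim about how the model processes larger images internally. The function name `scaled_token_count` is an assumption for this example.

```python
def scaled_token_count(base_tokens: int, resolution: int, base_resolution: int = 256) -> int:
    """Scale a token budget by the area ratio between two square resolutions."""
    if resolution % base_resolution != 0:
        raise ValueError("resolution must be a multiple of base_resolution")
    factor = resolution // base_resolution
    return base_tokens * factor * factor

for name, base in [("GPSToken-M128", 128), ("GPSToken-L256", 256)]:
    print(name, [scaled_token_count(base, r) for r in (256, 512, 1024)])
# GPSToken-M128 [128, 512, 2048]
# GPSToken-L256 [256, 1024, 4096]
```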

Quick Start

Model Zoo

Models	Token Count	Download (Hugging Face)
GPSToken-S64	64	xtudbxk/GPSToken
GPSToken-M128	128	xtudbxk/GPSToken
GPSToken-L256	256	xtudbxk/GPSToken

The models can also be downloaded directly from the Hugging Face repository.
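
One way to fetch the checkpoints programmatically is the `snapshot_download` function from the `huggingface_hub` package (installed with `pip install huggingface_hub`). The sketch below downloads the whole repository because the individual checkpoint file names are not listed here; the helper name `download_gpstoken` and the local directory are arbitrary choices for the example.

```python
from pathlib import Path

REPO_ID = "xtudbxk/GPSToken"  # repository named in the Model Zoo table

def download_gpstoken(local_dir: str = "checkpoints/GPSToken") -> Path:
    """Download every file in the repository and return the local folder."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)

# Example (requires network access):
# model_dir = download_gpstoken()
# print(sorted(p.name for p in model_dir.iterdir()))
```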

Inference scripts

python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]
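
For batch runs it can be convenient to assemble the same command from Python. The sketch below only rebuilds the command line shown above from the flags it documents; the helper name `build_inference_command` and the example paths are placeholders, and the script name is copied verbatim from that command.

```python
import shlex

def build_inference_command(model_path, data_path, output,
                            config="configs/gpstoken_l256.yaml", data_size=256):
    """Return the argument list for the inference command shown above."""
    return [
        "python3", "inference_gsptoken.py",
        "--model_path", str(model_path),
        "--data_path", str(data_path),
        "--config", str(config),
        "--data_size", str(data_size),
        "--output", str(output),
    ]

# Placeholder paths; replace them with real locations before running.
cmd = build_inference_command("path/to/model", "path/to/images", "path/to/output")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```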

Citation

@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
      title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation}, 
      author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
      year={2025},
      eprint={2509.01109},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01109}, 
}

Contact

Please open an issue or contact Zhengqiang Zhang at zhengqiang.zhang@connect.polyu.hk.