metadata
license: apache-2.0
pipeline_tag: image-to-image
paper: https://huggingface.co/papers/2509.01109
repo_url: https://github.com/xtudbxk/GPSToken
project_page: https://openreview.net/forum?id=BxoEDR2yQM

GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

This model was presented in the paper GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation.

Abstract

Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively.

arxiv version | GitHub Repository | Project Page

Zhengqiang Zhang1,2 | Rongyuan Wu1,2 | Lingchen Sun1,2 | Lei Zhang1,2,+

1 The Hong Kong Polytechnic University 2 OPPO Research Institute + Corresponding Author

Motivation: Beyond Fixed Grids

Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks flexibility in handling regions with varying shapes, textures, and locations. We propose GPSToken, a Gaussian Parameterized Spatially-adaptive Tokenization framework, enabling non-uniform tokenization via parametric 2D Gaussians. Our method:

  • Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
  • Represents each region as a 2D Gaussian (mean for position, covariance for shape) and texture features;
  • Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation;
  • Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.
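
To make the rendering step concrete, here is a minimal NumPy sketch of splatting Gaussian-parameterized tokens into a 2D feature map. It is an illustration under stated assumptions, not the repository's renderer: the function name `splat_tokens`, the normalized weighted-sum blending rule, and the pixel-coordinate convention are all hypothetical, and the sketch covers the forward pass only (an autodiff framework would be needed for end-to-end training).

```python
import numpy as np

def splat_tokens(means, covs, feats, height, width, eps=1e-8):
    """Render N Gaussian tokens into a (height, width, C) feature map.

    means: (N, 2) centers as (x, y) in pixel coordinates
    covs:  (N, 2, 2) symmetric positive-definite covariance matrices
    feats: (N, C) texture feature vector of each token
    """
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)   # (H, W, 2)
    diff = grid[None, :, :, :] - means[:, None, None, :]    # (N, H, W, 2)
    inv_covs = np.linalg.inv(covs)                          # (N, 2, 2)
    # Squared Mahalanobis distance from every pixel to every Gaussian.
    maha = np.einsum("nhwi,nij,nhwj->nhw", diff, inv_covs, diff)
    weights = np.exp(-0.5 * maha)                           # (N, H, W)
    # Assumed blending rule: weight-normalized sum of token features.
    blended = np.einsum("nhw,nc->hwc", weights, feats)
    return blended / (weights.sum(axis=0)[..., None] + eps)

# Two isotropic tokens, one on the left and one on the right of an 8x8 map.
means = np.array([[2.0, 4.0], [6.0, 4.0]])
covs = np.stack([np.eye(2), np.eye(2)])
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
fmap = splat_tokens(means, covs, feats, height=8, width=8)
```

Each pixel receives a mixture of token features weighted by how close it lies to each Gaussian, so a feature map of any resolution can be produced from the same set of tokens.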

Core Highlights

✅ Spatially-Adaptive Representation

  • Iteratively split the image into entropy-balanced regions of varying positions and shapes, with finer partitions in complex textures, and represent each region with a 2D Gaussian (mean for position, covariance for shape) and corresponding texture features.
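
The iterative split can be sketched as a greedy procedure. The version below is a hypothetical illustration, not the algorithm from the paper: it assumes axis-aligned rectangular regions, scores each region by histogram entropy times area, and always halves the highest-scoring region along its longer side. The helper `box_to_gaussian`, which uses half the box extent as the standard deviation, is likewise an assumed initialization.

```python
import numpy as np

def region_entropy(gray, box, bins=32):
    """Shannon entropy (bits) of the intensity histogram inside box=(x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    hist, _ = np.histogram(gray[y0:y1, x0:x1], bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def _score(gray, box, bins):
    x0, y0, x1, y1 = box
    if max(x1 - x0, y1 - y0) < 2:
        return -1.0  # a single pixel cannot be split further
    return region_entropy(gray, box, bins) * (x1 - x0) * (y1 - y0)

def entropy_partition(gray, n_regions, bins=32):
    """Greedily split the region with the largest entropy-times-area score.

    gray: 2D array with values in [0, 1].
    """
    h, w = gray.shape
    boxes = [(0, 0, w, h)]
    while len(boxes) < n_regions:
        scores = [_score(gray, b, bins) for b in boxes]
        i = int(np.argmax(scores))
        if scores[i] < 0:
            break  # nothing left to split
        x0, y0, x1, y1 = boxes.pop(i)
        if x1 - x0 >= y1 - y0:   # halve along the longer side
            xm = (x0 + x1) // 2
            boxes += [(x0, y0, xm, y1), (xm, y0, x1, y1)]
        else:
            ym = (y0 + y1) // 2
            boxes += [(x0, y0, x1, ym), (x0, ym, x1, y1)]
    return boxes

def box_to_gaussian(box):
    """Assumed initialization: mean at the box center, std = half the extent."""
    x0, y0, x1, y1 = box
    mean = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
    cov = np.diag([((x1 - x0) / 2.0) ** 2, ((y1 - y0) / 2.0) ** 2])
    return mean, cov

# Flat left half, noisy right half: most regions should land on the right.
rng = np.random.default_rng(0)
image = np.zeros((32, 32))
image[:, 16:] = rng.random((32, 16))
boxes = entropy_partition(image, n_regions=8)
```

In this toy image the flat half is never split again after the first cut, so seven of the eight regions cover the textured half.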

✅ Dynamic & Scalable

Furthermore, GPSToken supports:

  • User-Controllable Adjustment: Manually allocate more tokens to user-interest areas for finer reconstruction.
  • Variable Token Count: Increase or decrease token count of each image for better efficiency-fidelity balance.
  • Scalable to Higher Resolution: Maintain comparable performance at higher resolutions without retraining.

✅ Spatial-Texture Disentanglement

  • Each token encodes a disentangled representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation for downstream tasks like generation.
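
As a rough picture of what this disentanglement allows, the sketch below stores geometry and texture in separate fields so that either can be edited without touching the other. The class `GaussianToken` and its field layout are hypothetical and chosen only for illustration; the actual token format used by the model may differ.

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class GaussianToken:
    mean: np.ndarray     # (2,) position as (x, y)
    cov: np.ndarray      # (2, 2) shape of the region
    texture: np.ndarray  # (C,) texture feature vector

    def translated(self, dx, dy):
        """Move the token spatially; the texture vector is left unchanged."""
        return GaussianToken(self.mean + np.array([dx, dy]),
                             self.cov.copy(), self.texture.copy())

    def with_texture(self, texture):
        """Swap the texture vector; the spatial layout is left unchanged."""
        return GaussianToken(self.mean.copy(), self.cov.copy(),
                             np.asarray(texture, dtype=np.float64))

tok = GaussianToken(mean=np.array([10.0, 20.0]),
                    cov=np.eye(2) * 4.0,
                    texture=np.arange(4.0))
moved = tok.translated(5.0, -3.0)
restyled = tok.with_texture(np.ones(4))
```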

✅ SOTA Performance

  • Achieves PSNR = 28.81, SSIM = 0.809, rFID = 0.22, and FID = 1.65 on image reconstruction with only 256 tokens, outperforming every baseline listed under Experimental Results.

Experimental Results

1. Image Reconstruction ($256\times 256$ on the ImageNet validation set)

At the same token count, GPSToken outperforms fixed-grid methods.

| Method | Token Count | Params (M) | PSNR | SSIM | LPIPS | rFID | FID |
|---|---|---|---|---|---|---|---|
| SDXL-VAE | 32x32 | 83.6 | 25.55 | 0.727 | 0.066 | 0.73 | 2.35 |
| VAVAE | 16x16 | 69.8 | 25.76 | 0.742 | 0.050 | 0.27 | 1.74 |
| DCAE | 8x8 | 323.4 | 23.62 | 0.644 | 0.092 | 0.98 | 2.59 |
| TiTok-B64 | 64 | 204.8 | 17.01 | 0.390 | 0.263 | 1.75 | 2.50 |
| TiTok-S128 | 128 | 83.7 | 17.66 | 0.413 | 0.220 | 1.73 | 3.25 |
| MAETok | 128 | 173.9 | 23.25 | 0.626 | 0.096 | 0.65 | 2.01 |
| FlexTok | 256 | 949.7 | 17.69 | 0.475 | 0.257 | 4.02 | 4.88 |
| GPSToken-S64 | 64 | 127.5 | 22.18 | 0.578 | 0.111 | 1.31 | 3.02 |
| GPSToken-M128 | 128 | 127.8 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |
| GPSToken-L256 | 256 | 128.7 | 28.81 | 0.809 | 0.043 | 0.22 | 1.65 |
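
For readers who want to sanity-check the PSNR column on their own reconstructions, the standard definition is $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The helper below is a generic sketch of that formula; the value range (`max_value=1.0`) is an assumption, and the authors' exact evaluation protocol (color space, cropping, data range) is not specified here.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# A constant error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
ref = np.zeros((4, 4, 3))
rec = np.full((4, 4, 3), 0.1)
value = psnr(ref, rec)
```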

2. Spatial-Adaptivity Visualization

Gaussian tokens automatically concentrate on high-complexity regions.

3. User-Controllable Adaptivity

Tokens can be manually guided to focus on regions of user interest.

4. Variable Token Count of GPS-Tokens

The number of tokens used to encode an image can be increased or decreased.

5. Scales to Higher Resolutions

GPSToken generalizes to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, with models trained only on $256\times 256$.

| Method | Tokens | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ |
|---|---|---|---|---|---|---|
| **512×512** | | | | | | |
| SDXL-VAE | 64×64 | 28.42 | 0.817 | 0.059 | 0.271 | 1.36 |
| VQVAE-f16 | 32×32 | 21.83 | 0.604 | 0.172 | 2.29 | 7.95 |
| GPSToken-M128 | 512 | 26.74 | 0.764 | 0.073 | 0.367 | 1.93 |
| GPSToken-L256 | 1024 | 32.00 | 0.887 | 0.039 | 0.175 | 0.699 |
| **1024×1024** | | | | | | |
| SDXL-VAE | 128×128 | 33.27 | 0.909 | 0.057 | 0.113 | 0.561 |
| VQVAE-f16 | 64×64 | 25.41 | 0.744 | 0.169 | 1.40 | 4.98 |
| GPSToken-M128 | 2048 | 31.22 | 0.873 | 0.072 | 0.236 | 1.24 |
| GPSToken-L256 | 4096 | 37.71 | 0.955 | 0.031 | 0.055 | 0.276 |
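
The GPSToken token counts in this table are consistent with scaling the base count in proportion to image area relative to $256\times 256$. That rule is inferred from the numbers above rather than stated explicitly, and the helper name `scaled_token_count` is hypothetical:

```python
def scaled_token_count(base_tokens, resolution, base_resolution=256):
    """Token count when the number of tokens grows with image area."""
    if resolution % base_resolution != 0:
        raise ValueError("resolution must be a multiple of base_resolution")
    scale = resolution // base_resolution
    return base_tokens * scale * scale

# Reproduces the "Tokens" column for the GPSToken rows.
counts = {
    ("M128", 512): scaled_token_count(128, 512),
    ("L256", 512): scaled_token_count(256, 512),
    ("M128", 1024): scaled_token_count(128, 1024),
    ("L256", 1024): scaled_token_count(256, 1024),
}
```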

Quick Start

Model Zoo

| Models | Token Count | Download (Hugging Face) |
|---|---|---|
| GPSToken-S64 | 64 | xtudbxk/GPSToken |
| GPSToken-M128 | 128 | xtudbxk/GPSToken |
| GPSToken-L256 | 256 | xtudbxk/GPSToken |

The models can also be downloaded directly from the Hugging Face repository.
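
One way to fetch the files programmatically is `snapshot_download` from the `huggingface_hub` package, sketched below. The helper name `download_gpstoken` and the default target directory are assumptions; the names of the checkpoint files inside the repository are not listed here, so inspect the downloaded folder to find the one to pass as `--model_path`.

```python
REPO_ID = "xtudbxk/GPSToken"

def download_gpstoken(local_dir="checkpoints/GPSToken"):
    """Download every file in the model repository and return the local path."""
    # Imported lazily so that defining this helper does not require the package.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)

# Example (requires network access and `pip install huggingface_hub`):
# model_dir = download_gpstoken()
```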

Inference scripts

python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]

CITATION

@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
      title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation}, 
      author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
      year={2025},
      eprint={2509.01109},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01109}, 
}

CONTACT

Please open an issue or contact Zhengqiang at zhengqiang.zhang@connect.polyu.hk
