| --- |
| license: apache-2.0 |
| pipeline_tag: image-to-image |
| paper: https://huggingface.co/papers/2509.01109 |
| repo_url: https://github.com/xtudbxk/GPSToken |
| project_page: https://openreview.net/forum?id=BxoEDR2yQM |
| --- |
| |
| # GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation |
|
|
| This model was presented in the paper [GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation](https://huggingface.co/papers/2509.01109). |
|
|
| ## Abstract |
| Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. |
|
|
| [arXiv version](https://arxiv.org/abs/2509.01109) | [GitHub Repository](https://github.com/xtudbxk/GPSToken) | [Project Page](https://openreview.net/forum?id=BxoEDR2yQM) |
|
|
| [Zhengqiang Zhang](https://scholar.google.com.hk/citations?hl=zh-CN&user=UX26wSMAAAAJ)<sup>1,2</sup> | [Rongyuan Wu](https://scholar.google.com.hk/citations?hl=zh-CN&user=A-U8zE8AAAAJ)<sup>1,2</sup> | [Lingchen Sun](https://scholar.google.com/citations?hl=zh-CN&tzom=-480&user=ZCDjTn8AAAAJ)<sup>1,2</sup> | [Lei Zhang](https://scholar.google.com.hk/citations?hl=zh-CN&user=tAK5l1IAAAAJ)<sup>1,2,+</sup> |
|
|
| <sup>1</sup> The Hong Kong Polytechnic University <sup>2</sup> OPPO Research Institute <sup>+</sup> Corresponding Author |
|
|
| ## Motivation: Beyond Fixed Grids |
| Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks flexibility in handling regions with varying shapes, textures, and locations. |
| We propose **GPSToken**, a **G**aussian **P**arameterized **S**patially-adaptive **Token**ization framework, enabling non-uniform tokenization via parametric 2D Gaussians. Our method: |
| - Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm; |
| - Represents each region as a 2D Gaussian (mean for position, covariance for shape) and texture features; |
| - Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation; |
| - Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training. |
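
The rendering step in the last bullet can be illustrated with a minimal pure-Python sketch. It assumes each token carries a 2D Gaussian (mean and covariance) plus a feature vector, and that the feature map at each pixel is a weight-normalised blend of token features. The names `GaussianToken`, `gaussian_weight`, and `splat`, the `[0, 1]` coordinate convention, and the normalisation are illustrative assumptions, not the authors' renderer. The same arithmetic written in an autograd framework would be differentiable with respect to the Gaussian parameters.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GaussianToken:
    """Hypothetical token layout: 2D Gaussian geometry plus a texture vector."""
    mean: Tuple[float, float]        # (x, y) centre in normalised [0, 1] coordinates
    cov: Tuple[float, float, float]  # (s_xx, s_xy, s_yy) entries of the covariance
    feature: List[float]             # texture feature vector


def gaussian_weight(tok: GaussianToken, x: float, y: float) -> float:
    """Unnormalised Gaussian response exp(-0.5 * d^T Sigma^-1 d) at (x, y)."""
    sxx, sxy, syy = tok.cov
    det = sxx * syy - sxy * sxy
    if sxx <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x - tok.mean[0], y - tok.mean[1]
    # Closed-form inverse of the 2x2 covariance matrix.
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * maha)


def splat(tokens: List[GaussianToken], height: int, width: int, eps: float = 1e-8):
    """Render tokens into an H x W x C feature map (nested lists)."""
    channels = len(tokens[0].feature)
    fmap = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for i in range(height):
        y = (i + 0.5) / height  # pixel-centre coordinate
        for j in range(width):
            x = (j + 0.5) / width
            weights = [gaussian_weight(t, x, y) for t in tokens]
            total = sum(weights) + eps
            for c in range(channels):
                blended = sum(w * t.feature[c] for w, t in zip(weights, tokens))
                fmap[i][j][c] = blended / total
    return fmap
```

In this sketch, a token with a small covariance influences only nearby pixels, while a token with a large or elongated covariance covers a wide or anisotropic region, which is what lets the token layout follow image content.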
|
|
| ## Core Highlights |
|
|
| #### Spatially-Adaptive Representation |
| - Iteratively splits the image into entropy-balanced regions of varying positions and shapes, with finer partitions in regions of complex texture, and represents each region with a 2D Gaussian (mean for position, covariance for shape) and corresponding texture features. |
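
The iterative splitting described above can be sketched as a greedy procedure. This is only an illustration under stated assumptions: it uses the Shannon entropy of pixel values times region area as the complexity score, splits rectangles in half along their longer side, and initialises each Gaussian from the rectangle's centre and extent. The exact entropy measure and splitting rule of GPSToken are not given in this document, and `region_entropy`, `entropy_partition`, and `region_to_gaussian` are hypothetical names.

```python
import math
from collections import Counter


def region_entropy(img, top, left, h, w):
    """Shannon entropy (in bits) of the pixel values inside a rectangle."""
    counts = Counter(img[r][c] for r in range(top, top + h)
                     for c in range(left, left + w))
    n = h * w
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def entropy_partition(img, num_regions):
    """Greedily split the highest-scoring rectangle until num_regions exist."""
    regions = [(0, 0, len(img), len(img[0]))]  # (top, left, height, width)
    while len(regions) < num_regions:
        splittable = [r for r in regions if r[2] > 1 or r[3] > 1]
        if not splittable:
            break
        # Assumed score: entropy * area, so large textured regions split first.
        target = max(splittable,
                     key=lambda r: region_entropy(img, *r) * r[2] * r[3])
        top, left, h, w = target
        regions.remove(target)
        if h >= w:  # split along the longer side
            regions += [(top, left, h // 2, w),
                        (top + h // 2, left, h - h // 2, w)]
        else:
            regions += [(top, left, h, w // 2),
                        (top, left + w // 2, h, w - w // 2)]
    return regions


def region_to_gaussian(region):
    """Initialise (mean, covariance) from a rectangle, in pixel units."""
    top, left, h, w = region
    mean = (left + w / 2.0, top + h / 2.0)
    # A uniform distribution over an interval of length L has variance L^2 / 12.
    cov = (w * w / 12.0, 0.0, h * h / 12.0)  # (s_xx, s_xy, s_yy)
    return mean, cov
```

On an image whose left half is flat and whose right half is textured, this sketch places most regions on the right, which mirrors the "finer partitions in complex textures" behaviour.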
|
|
| #### Dynamic & Scalable |
| GPSToken also supports: |
| - **User-Controllable Adjustment**: Manually allocate more tokens to regions of user interest for finer reconstruction. |
| - **Variable Token Count**: Increase or decrease the number of tokens per image to balance efficiency and fidelity. |
| - **Scalable to Higher Resolution**: Maintain comparable performance at higher resolutions without retraining. |
|
|
| #### Spatial-Texture Disentanglement |
| - Each token encodes a **disentangled** representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation for downstream tasks like generation. |
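
A small sketch of what independent manipulation could look like, assuming a token stores its geometry and its texture in separate fields. The class `GPSTokenSketch`, the five-number geometry layout, and the helpers `shift` and `swap_texture` are hypothetical and only illustrate the idea; the actual parameterisation used by GPSToken may differ.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class GPSTokenSketch:
    """Hypothetical token: spatial geometry and texture kept in separate fields."""
    geometry: Tuple[float, float, float, float, float]  # (mu_x, mu_y, s_xx, s_xy, s_yy)
    texture: Tuple[float, ...]                          # texture feature vector


def shift(token: GPSTokenSketch, dx: float, dy: float) -> GPSTokenSketch:
    """Move the Gaussian centre; covariance and texture are left untouched."""
    mu_x, mu_y, *cov = token.geometry
    return replace(token, geometry=(mu_x + dx, mu_y + dy, *cov))


def swap_texture(token: GPSTokenSketch, texture) -> GPSTokenSketch:
    """Replace the texture vector; the spatial layout is left untouched."""
    return replace(token, texture=tuple(texture))
```

Keeping the two parts separate is what makes a two-stage pipeline possible: a layout can be produced first, and textures can then be filled in conditioned on that layout.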
|
|
| #### SOTA Performance |
| - Achieves **PSNR = 28.81, SSIM = 0.809, rFID = 0.22, FID = 1.65** on image reconstruction with only **256 tokens**, outperforming all compared methods in the reconstruction table on these metrics. |
|
|
| ## Experimental Results |
|
|
| #### 1. Image Reconstruction ($256\times 256$ on the ImageNet validation set) |
|
|
| With the same token count, GPSToken-L256 outperforms the fixed-grid VAVAE (16x16 = 256 tokens) on every reported metric. |
|
|
| | Method | Token Count | Params (M) | PSNR | SSIM | LPIPS | rFID | FID | |
| |------------------|-------------|-----------|-------|--------|--------|-------|-------| |
| | SDXL-VAE | 32x32 | 83.6 | 25.55 | 0.727 | 0.066 | 0.73 | 2.35 | |
| | VAVAE | 16x16 | 69.8 | 25.76 | 0.742 | 0.050 | 0.27 | 1.74 | |
| | DCAE | 8x8 | 323.4 | 23.62 | 0.644 | 0.092 | 0.98 | 2.59 | |
| | TiTok-B64 | 64 | 204.8 | 17.01 | 0.390 | 0.263 | 1.75 | 2.50 | |
| | TiTok-S128 | 128 | 83.7 | 17.66 | 0.413 | 0.220 | 1.73 | 3.25 | |
| | MAETok | 128 | 173.9 | 23.25 | 0.626 | 0.096 | 0.65 | 2.01 | |
| | FlexTok | 256 | 949.7 | 17.69 | 0.475 | 0.257 | 4.02 | 4.88 | |
| | **GPSToken-S64** | 64 | 127.5 | 22.18 | 0.578 | 0.111 | 1.31 | 3.02 | |
| | **GPSToken-M128**| 128 | 127.8 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 | |
| | **GPSToken-L256**| 256 | 128.7 | 28.81 | 0.809 | 0.043 | 0.22 | 1.65 | |
|
|
| #### 2. Spatial-Adaptivity Visualization |
| Gaussian tokens automatically concentrate on high-complexity regions. |
| <img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/appendix_reconv_gs.jpg" width="80%"> |
|
|
| #### 3. User-Controllable Adaptivity |
| Tokens can be manually guided to focus on regions of user interest. |
|  |
|
|
| #### 4. Variable Token Count of GPS-Tokens |
| We can **increase** or **decrease** the number of tokens used to encode an image. |
|  |
|
|
| #### 5. Scales to Higher Resolutions |
| GPSToken generalizes to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, using models trained only on $256\times 256$ images. |
|
|
| | Method | Tokens | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ | |
| |------------------|------------|--------|--------|---------|------------|-------------| |
| | **512×512** | | | | | | | |
| | SDXL-VAE | 64×64 | 28.42 | 0.817 | 0.059 | 0.271 | 1.36 | |
| | VQVAE-f16 | 32×32 | 21.83 | 0.604 | 0.172 | 2.29 | 7.95 | |
| | GPSToken-M128 | 512 | 26.74 | 0.764 | 0.073 | 0.367 | 1.93 | |
| | GPSToken-L256 | 1024 | 32.00 | 0.887 | 0.039 | 0.175 | 0.699 | |
| | **1024×1024** | | | | | | | |
| | SDXL-VAE | 128×128 | 33.27 | 0.909 | 0.057 | 0.113 | 0.561 | |
| | VQVAE-f16 | 64×64 | 25.41 | 0.744 | 0.169 | 1.40 | 4.98 | |
| | GPSToken-M128 | 2048 | 31.22 | 0.873 | 0.072 | 0.236 | 1.24 | |
| | GPSToken-L256 | 4096 | 37.71 | 0.955 | 0.031 | 0.055 | 0.276 | |
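
The token counts in this table are consistent with scaling the base count in proportion to image area relative to the $256\times 256$ training resolution. The helper below reproduces that arithmetic; `scaled_token_count` is a hypothetical name, and the proportional-scaling rule is inferred from the table rather than taken from the released code.

```python
def scaled_token_count(base_tokens: int, resolution: int, base_resolution: int = 256) -> int:
    """Token count if tokens grow in proportion to image area."""
    if resolution % base_resolution != 0:
        raise ValueError("resolution is assumed to be a multiple of the base resolution")
    scale = resolution // base_resolution
    return base_tokens * scale * scale


# Reproduce the token counts listed for the two GPSToken models in the table.
for name, base in (("GPSToken-M128", 128), ("GPSToken-L256", 256)):
    for res in (512, 1024):
        print(f"{name} @ {res}x{res}: {scaled_token_count(base, res)} tokens")
```

Under this rule the number of tokens per unit of image area stays constant, so GPSToken-L256 at $1024\times 1024$ uses 4096 tokens, compared with the $128\times 128 = 16384$ latent positions of SDXL-VAE at the same resolution.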
|
|
| ## Quick Start |
| #### Model Zoo |
|
|
| |Models|Token Count|Download (Hugging Face)| |
| |---|---|---| |
| |GPSToken-S64|64|[xtudbxk/GPSToken](https://huggingface.co/xtudbxk/GPSToken)| |
| |GPSToken-M128|128|[xtudbxk/GPSToken](https://huggingface.co/xtudbxk/GPSToken)| |
| |GPSToken-L256|256|[xtudbxk/GPSToken](https://huggingface.co/xtudbxk/GPSToken)| |
|
|
| The models can also be downloaded directly from the [Hugging Face repository](https://huggingface.co/xtudbxk/GPSToken). |
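
For scripted downloads, one option is the `snapshot_download` function from the `huggingface_hub` package. The sketch below assumes that package is installed; the helper name `download_gpstoken` and the default target directory are illustrative choices, and the file names of the individual checkpoints inside the repository are not listed here.

```python
REPO_ID = "xtudbxk/GPSToken"


def download_gpstoken(local_dir: str = "checkpoints/GPSToken") -> str:
    """Fetch the whole model repository and return the local directory path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_gpstoken()` requires network access; the returned directory can then be searched for the checkpoint to pass as `--model_path` to the inference script.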
|
|
| #### Inference scripts |
| ```bash |
| python3 inference_gpstoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx] |
| ``` |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive, |
| title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation}, |
| author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang}, |
| year={2025}, |
| eprint={2509.01109}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV}, |
| url={https://arxiv.org/abs/2509.01109}, |
| } |
| ``` |
|
|
| ## Contact |
|
|
| Please open an issue or contact Zhengqiang at [zhengqiang.zhang@connect.polyu.hk](mailto:zhengqiang.zhang@connect.polyu.hk). |
|
|
|  |