---
license: apache-2.0
pipeline_tag: image-to-image
---
# GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation
[Paper](https://huggingface.co/papers/2509.01109) | [Code](https://github.com/xtudbxk/GPSToken)
This is the official Hugging Face model repository for GPSToken, as presented in the paper "GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation".
## Abstract
Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. We propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively.
## News
- **2025.09.19**: GPSToken has been accepted to [NeurIPS 2025](https://openreview.net/forum?id=BxoEDR2yQM)!
- **2025.09.16**: Uploaded the models to [Hugging Face](https://huggingface.co/xtudbxk/GPSToken).
- **2025.09.05**: Updated the code for higher resolutions, including GPS-token merging (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/gpstoken.py#L113)) to reduce boundary artifacts and a resized GroupNorm layer (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/vqvae.py#L310)) to ease color shifts.
## Motivation: Beyond Fixed Grids
Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks flexibility in handling regions with varying shapes, textures, and locations.
We propose **GPSToken**, a **G**aussian **P**arameterized **S**patially-adaptive **Token**ization framework, enabling non-uniform tokenization via parametric 2D Gaussians. Our method:
- Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
- Represents each region as a 2D Gaussian (mean for position, covariance for shape) and texture features;
- Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation;
- Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/gpstoken.jpg" width="90%">
</div>
## Core Highlights
#### Spatially-Adaptive Representation
- Iteratively split the image into entropy-balanced regions of varying positions and shapes, with finer partitions in complex textures, and represent each region with a 2D Gaussian (mean for position, covariance for shape) and its corresponding texture features.
#### Dynamic & Scalable
Furthermore, GPSToken supports:
- **User-Controllable Adjustment**: Manually allocate more tokens to regions of user interest for finer reconstruction.
- **Variable Token Count**: Increase or decrease the token count per image for a better efficiency-fidelity balance.
- **Scalable to Higher Resolution**: Maintain comparable performance at higher resolutions without retraining.
#### Spatial-Texture Disentanglement
- Each token encodes a **disentangled** representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation for downstream tasks like generation.
#### SOTA Performance
- Achieves **PSNR = 28.81, SSIM = 0.809, rFID = 0.22, and FID = 1.65** on image reconstruction with only **256 tokens**, outperforming prior methods.
## GPS-Tokens: Mathematical Form and CUDA-Based Rendering Algorithm
Each token is represented by a **bounded 2D Gaussian function** and an individual feature vector, encoding spatial geometry and texture separately.
#### Standard 2D Gaussian (Unnormalized)
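The core form of the $i$-th Gaussian is:
$$g_i(x, y) = \exp\left(-\frac{1}{2(1-\rho_i^2)}\left[\frac{(x-\mu_{x,i})^2}{\sigma_{x,i}^2} - \frac{2\rho_i (x-\mu_{x,i})(y-\mu_{y,i})}{\sigma_{x,i}\,\sigma_{y,i}} + \frac{(y-\mu_{y,i})^2}{\sigma_{y,i}^2}\right]\right)$$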
- $(\mu_{x,i}, \mu_{y,i})$: center (position)
- $\sigma_{x,i}, \sigma_{y,i} > 0$: standard deviations (scale)
- $\rho_i \in (-1, 1)$: correlation coefficient (orientation); the endpoints are excluded because the Gaussian degenerates at $|\rho_i| = 1$
> This is the unnormalized density, which avoids the costly computation of the normalization constant $Z$.
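As a minimal sketch, the unnormalized density defined by these parameters can be evaluated pointwise as follows. The function name `gaussian_2d` and its argument order are illustrative and not taken from the repository; the exponent is the standard bivariate Gaussian form and assumes $|\rho| < 1$.

```python
import math

def gaussian_2d(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Unnormalized 2D Gaussian; equals 1.0 at the center (mu_x, mu_y)."""
    if not (sigma_x > 0 and sigma_y > 0 and -1 < rho < 1):
        raise ValueError("need sigma_x, sigma_y > 0 and |rho| < 1")
    dx = (x - mu_x) / sigma_x  # offsets in units of standard deviation
    dy = (y - mu_y) / sigma_y
    q = (dx * dx - 2.0 * rho * dx * dy + dy * dy) / (1.0 - rho * rho)
    return math.exp(-0.5 * q)

# Value at the center, and one standard deviation away along x (rho = 0).
center = gaussian_2d(0.5, 0.5, 0.5, 0.5, 0.1, 0.2, 0.3)
one_sigma_x = gaussian_2d(0.6, 0.5, 0.5, 0.5, 0.1, 0.2, 0.0)
```

Because no normalization constant is applied, the peak value is always 1 regardless of the scales, and the value one standard deviation from the center along an axis is $e^{-1/2}$ when $\rho = 0$.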
#### Bounded Support for Efficiency
To focus on local regions and enable fast GPU rendering, we define the **modified splatting kernel**:

- $s$: spatial support factor (empirically set to $s=5$), which covers more than 99.999% of the Gaussian mass, so the truncation error is negligible.
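One plausible form of such a kernel, assumed here for illustration, truncates the unnormalized Gaussian $g_i$ of the $i$-th token to an axis-aligned box: $\hat{g}_i(x, y) = g_i(x, y)$ when $|x - \mu_{x,i}| \le s\,\sigma_{x,i}$ and $|y - \mu_{y,i}| \le s\,\sigma_{y,i}$, and $0$ otherwise. The coverage figure can be checked with a few lines of arithmetic under this box reading (taking $\rho = 0$ for simplicity) and under an alternative elliptical reading with Mahalanobis radius $s$. Both readings and both helper names are assumptions, not the repository's code.

```python
import math

def box_coverage(s):
    """Mass of an uncorrelated 2D Gaussian inside |dx| <= s*sigma_x, |dy| <= s*sigma_y."""
    one_axis = math.erf(s / math.sqrt(2.0))  # P(|Z| <= s) for a standard normal Z
    return one_axis ** 2

def ellipse_coverage(s):
    """Mass of a 2D Gaussian inside the ellipse of Mahalanobis radius s."""
    return 1.0 - math.exp(-0.5 * s * s)

s = 5.0
print(f"box:     {box_coverage(s):.7%}")
print(f"ellipse: {ellipse_coverage(s):.7%}")
```

Under either reading, the covered mass at $s = 5$ exceeds 99.999%.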
#### Token Representation
An image is encoded as $l$ GPS-tokens: $\mathbf{z} = \{\mathbf{z}_1, \dots, \mathbf{z}_l\}$, where each $\mathbf{z}_i = \\{\mathbf{g}_i, \mathbf{f}_i\\}$ contains:
| Component | Symbol & Type | Role |
|---------------|-----------------------------------|-------------------------------|
| **Geometry** | $\mathbf{g}_i = (\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$ | Spatial layout (2D Gaussian params) |
| **Texture** | $\mathbf{f}_i \in \mathbb{R}^{c-5}$ | Visual features (from CNN/Transformer) |
**Disentangled design**: geometry and texture can be manipulated independently. Here $c$ is the total dimension of one token, of which 5 entries hold the Gaussian parameters.
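To make the layout concrete, here is a small sketch of one token as a data structure. The `GPSTokenVec` class, the packing order (geometry first, then texture), and the value `c = 16` are assumptions for illustration; the repository may store tokens differently.

```python
from dataclasses import dataclass
from typing import List

GEOM_DIM = 5  # (mu_x, mu_y, sigma_x, sigma_y, rho)

@dataclass
class GPSTokenVec:
    geometry: List[float]  # 5 Gaussian parameters
    texture: List[float]   # c - 5 texture features

    @classmethod
    def from_flat(cls, z: List[float]) -> "GPSTokenVec":
        """Split a flat c-dimensional vector into geometry and texture parts."""
        if len(z) <= GEOM_DIM:
            raise ValueError("token must have more than 5 entries")
        return cls(geometry=list(z[:GEOM_DIM]), texture=list(z[GEOM_DIM:]))

    def to_flat(self) -> List[float]:
        """Concatenate geometry and texture back into one vector."""
        return self.geometry + self.texture

c = 16  # hypothetical token dimension
flat = [0.5, 0.5, 0.1, 0.2, 0.0] + [0.0] * (c - GEOM_DIM)
token = GPSTokenVec.from_flat(flat)
# Move the token horizontally; the texture part is left untouched.
token.geometry[0] = 0.25
```

Keeping the two parts in separate fields mirrors the disentangled design: an edit to position or shape never touches the texture vector, and vice versa.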
#### CUDA-Based Rendering Algorithm
We implement a **CUDA-accelerated rendering algorithm** to parallelize the forward and backward processes of the bounded Gaussian splatting kernel. Implementation details are provided in the `gscuda` folder.
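For intuition about what such a kernel computes, the following pure-Python reference sketch renders tokens into a feature map. It is not the `gscuda` implementation: it assumes pixel-unit coordinates, a box-shaped support of half-width $s\sigma$ per axis, and a plain weighted sum of texture features at each pixel (the actual kernel may normalize or blend differently). The function name `splat` is hypothetical.

```python
import math

def splat(tokens, height, width, s=5.0):
    """Render tokens (mu_x, mu_y, sigma_x, sigma_y, rho, features) to an H x W x C map."""
    channels = len(tokens[0][5])
    fmap = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for mu_x, mu_y, sx, sy, rho, feat in tokens:
        # Only visit pixels inside the bounded support of this Gaussian.
        x0 = max(0, math.ceil(mu_x - s * sx))
        x1 = min(width - 1, math.floor(mu_x + s * sx))
        y0 = max(0, math.ceil(mu_y - s * sy))
        y1 = min(height - 1, math.floor(mu_y + s * sy))
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                dx, dy = (x - mu_x) / sx, (y - mu_y) / sy
                q = (dx * dx - 2.0 * rho * dx * dy + dy * dy) / (1.0 - rho * rho)
                w = math.exp(-0.5 * q)  # unnormalized Gaussian weight
                for ch in range(channels):
                    fmap[y][x][ch] += w * feat[ch]
    return fmap

tokens = [
    (2.0, 2.0, 0.5, 0.5, 0.0, [1.0, 0.0]),  # small isotropic token
    (5.0, 4.0, 1.5, 0.5, 0.4, [0.0, 1.0]),  # wider, tilted token
]
feature_map = splat(tokens, height=8, width=8)
```

Restricting the inner loops to the support box makes the cost of each token proportional to its footprint rather than to the whole image, which is the property a parallel GPU version would exploit.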
## Framework: From Image to GPS-Tokens
GPSToken pipeline: **Initialization → Refinement → Rendering → Reconstruction**
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/framework.jpg" width="90%">
</div>
#### Spatially-adaptive Token Initialization
We use an iterative algorithm to partition the image into regions based on texture complexity. Each region's location and size initialize the Gaussian parameters of corresponding GPS-tokens, enabling a coarse spatially-adaptive representation.
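One way to picture this initialization is a greedy, quadtree-style split. The sketch below is an illustration under stated assumptions, not the repository's algorithm: it scores a region by its gray-level histogram entropy times its area, repeatedly halves the highest-scoring region along its longer side, and initializes each Gaussian from the region center with $\sigma$ set to half the region extent. All helper names are hypothetical.

```python
import math
from collections import Counter

def region_entropy(img, box):
    """Shannon entropy (bits) of the gray-level histogram inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    counts = Counter(img[y][x] for y in range(y0, y1) for x in range(x0, x1))
    n = (x1 - x0) * (y1 - y0)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def partition(img, num_regions):
    """Greedily split the region with the largest entropy-times-area score."""
    def score(b):
        return region_entropy(img, b) * (b[2] - b[0]) * (b[3] - b[1])

    boxes = [(0, 0, len(img[0]), len(img))]
    while len(boxes) < num_regions:
        splittable = [b for b in boxes if max(b[2] - b[0], b[3] - b[1]) > 1]
        if not splittable:
            break
        x0, y0, x1, y1 = max(splittable, key=score)
        boxes.remove((x0, y0, x1, y1))
        if x1 - x0 >= y1 - y0:  # halve the longer side
            xm = (x0 + x1) // 2
            boxes += [(x0, y0, xm, y1), (xm, y0, x1, y1)]
        else:
            ym = (y0 + y1) // 2
            boxes += [(x0, y0, x1, ym), (x0, ym, x1, y1)]
    return boxes

def init_gaussian(box):
    """Region center and half-extent as (mu_x, mu_y, sigma_x, sigma_y, rho)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2, (x1 - x0) / 2, (y1 - y0) / 2, 0.0)

# Toy 8x8 image: flat left half, textured right half.
img = [[0] * 4 + [(7 * x + 13 * y) % 16 for x in range(4)] for y in range(8)]
regions = partition(img, num_regions=6)
gaussians = [init_gaussian(b) for b in regions]
```

On the toy image the flat left half stays a single large region while the textured right half receives the remaining five, which is the "finer partitions in complex textures" behavior in miniature.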
#### Spatially-adaptive Token Refinement
After obtaining the initialized Gaussian parameters, we employ a transformer-based encoder to refine these parameters to achieve fine-grained spatial adaptation, while simultaneously extracting the corresponding texture features $\mathbf{f}$ for each region using RoIAlign layers. After encoder refinement, the parameters better match local textures.
#### End-to-end Reconstruction
During decoding, we first render the GPS-tokens into a 2D feature map, then decode it into the reconstructed image. Following existing works, we use a combination of reconstruction loss $L_{\text{rec}}$, perceptual loss $L_{\text{perc}}$, and adversarial loss $L_{\text{adv}}$ during training.
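Written out, such a combined objective typically takes the weighted form below. The weights $\lambda_{\text{perc}}$ and $\lambda_{\text{adv}}$ are placeholders introduced here for illustration; their values are not given in this document.

$$L = L_{\text{rec}} + \lambda_{\text{perc}}\, L_{\text{perc}} + \lambda_{\text{adv}}\, L_{\text{adv}}$$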
## Experimental Results
#### 1. Image Reconstruction ($256\times 256$ on the ImageNet val set)
GPSToken outperforms fixed-grid methods at the same token count.
| Method | Token Count | Params (M) | PSNR | SSIM | LPIPS | rFID | FID |
|------------------|-------------|-----------|-------|--------|--------|-------|-------|
| SDXL-VAE | 32x32 | 83.6 | 25.55 | 0.727 | 0.066 | 0.73 | 2.35 |
| VAVAE | 16x16 | 69.8 | 25.76 | 0.742 | 0.050 | 0.27 | 1.74 |
| DCAE | 8x8 | 323.4 | 23.62 | 0.644 | 0.092 | 0.98 | 2.59 |
| TiTok-B64 | 64 | 204.8 | 17.01 | 0.390 | 0.263 | 1.75 | 2.50 |
| TiTok-S128 | 128 | 83.7 | 17.66 | 0.413 | 0.220 | 1.73 | 3.25 |
| MAETok | 128 | 173.9 | 23.25 | 0.626 | 0.096 | 0.65 | 2.01 |
| FlexTok | 256 | 949.7 | 17.69 | 0.475 | 0.257 | 4.02 | 4.88 |
| **GPSToken-S64** | 64 | 127.5 | 22.18 | 0.578 | 0.111 | 1.31 | 3.02 |
| **GPSToken-M128**| 128 | 127.8 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |
| **GPSToken-L256**| 256 | 128.7 | 28.81 | 0.809 | 0.043 | 0.22 | 1.65 |
#### 2. Spatial-Adaptivity Visualization
Gaussian tokens automatically concentrate on high-complexity regions.
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/appendix_reconv_gs.jpg" width="80%">
</div>
> *From left to right*: visualization of initialized GS params, visualization of refined GS params, reconstructed images, GT images.
#### 3. User-Controllable Adaptivity
We can manually guide tokens to focus on regions of user interest.
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application.jpg">
</div>
> *From left to right*: input image, visualization of initialized GS params, reconstructed image, visualization of adjusted GS params, reconstructed image using adjusted GS params.
#### 4. Variable Token Count of GPS-Tokens
We can **increase** or **decrease** the number of tokens used to encode an image.
<div align="center">
<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application2.jpg">
</div>
> We use GPSToken-M128, which is trained only with 128 tokens, for demonstration.
#### 5. Scales to Higher Resolutions
GPSToken generalizes to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, with models trained only on $256\times 256$.
| Method | Tokens | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ |
|------------------|------------|--------|--------|---------|------------|-------------|
| **512×512** | | | | | | |
| SDXL-VAE | 64×64 | 28.42 | 0.817 | 0.059 | 0.271 | 1.36 |
| VQVAE-f16 | 32×32 | 21.83 | 0.604 | 0.172 | 2.29 | 7.95 |
| GPSToken-M128 | 512 | 26.74 | 0.764 | 0.073 | 0.367 | 1.93 |
| GPSToken-L256 | 1024 | 32.00 | 0.887 | 0.039 | 0.175 | 0.699 |
| **1024×1024** | | | | | | |
| SDXL-VAE | 128×128 | 33.27 | 0.909 | 0.057 | 0.113 | 0.561 |
| VQVAE-f16 | 64×64 | 25.41 | 0.744 | 0.169 | 1.40 | 4.98 |
| GPSToken-M128 | 2048 | 31.22 | 0.873 | 0.072 | 0.236 | 1.24 |
| GPSToken-L256 | 4096 | 37.71 | 0.955 | 0.031 | 0.055 | 0.276 |
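The GPSToken token counts in this table are consistent with keeping the number of tokens per pixel fixed, so the count grows with image area. A quick check of that arithmetic follows; the helper name `scaled_token_count` is illustrative, and the fixed-density rule is inferred from the table rather than stated by the authors.

```python
def scaled_token_count(base_tokens, resolution, base_resolution=256):
    """Token count at a square resolution, keeping tokens per pixel constant."""
    scale = resolution / base_resolution
    return round(base_tokens * scale * scale)

for name, base in [("GPSToken-M128", 128), ("GPSToken-L256", 256)]:
    for res in (512, 1024):
        print(f"{name} @ {res}x{res}: {scaled_token_count(base, res)} tokens")
```

Doubling the side length quadruples the count, which reproduces the 512/2048 and 1024/4096 entries for the two models.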
## Quick Start
### Model Zoo
One can download the models directly from Hugging Face:
| Models | Token Count | Hugging Face Link |
|---------------|-------------|---------------------------------------------------------------------------------------------------|
| GPSToken-S64 | 64 | [`xtudbxk/GPSToken/tree/main/GPSToken-S64`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-S64) |
| GPSToken-M128 | 128 | [`xtudbxk/GPSToken/tree/main/GPSToken-M128`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-M128) |
| GPSToken-L256 | 256 | [`xtudbxk/GPSToken/tree/main/GPSToken-L256`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-L256) |
### Inference scripts
```bash
python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]
```
## CITATION
If you find our work useful for your research or development, please cite our paper:
```bibtex
@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation},
author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
year={2025},
eprint={2509.01109},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.01109},
}
```
## CONTACT
Please open an issue or contact Zhengqiang at [zhengqiang.zhang@connect.polyu.hk](mailto:zhengqiang.zhang@connect.polyu.hk).