---
license: apache-2.0
pipeline_tag: image-to-image
---

# GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

📚 [Paper](https://huggingface.co/papers/2509.01109) | 💻 [Code](https://github.com/xtudbxk/GPSToken)

This is the official Hugging Face model repository for GPSToken, as presented in the paper "GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation".

## Abstract

Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. We propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively.

## News

- **2025.09.19**: GPSToken has been accepted to [NeurIPS 2025](https://openreview.net/forum?id=BxoEDR2yQM)! 🎉🎉🎉
- **2025.09.16**: Uploaded the models to [HuggingFace](https://huggingface.co/xtudbxk/GPSToken).
- **2025.09.05**: Updated the code for higher resolutions, including GPS-token merging (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/gpstoken.py#L113)) to reduce boundary artifacts and a resized GroupNorm layer (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/vqvae.py#L310)) to ease color shifts.

## Motivation: Beyond Fixed Grids

Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks flexibility in handling regions with varying shapes, textures, and locations.
We propose **GPSToken**, a **G**aussian **P**arameterized **S**patially-adaptive **Token**ization framework, enabling non-uniform tokenization via parametric 2D Gaussians. Our method:
- Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
- Represents each region as a 2D Gaussian (mean for position, covariance for shape) and texture features;
- Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation;
- Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.

<div align="center">
  <img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/gpstoken.jpg" width="90%">
</div>

## Core Highlights

#### ✅ Spatially-Adaptive Representation
- Iteratively split the image into entropy-balanced regions of varying positions and shapes -- finer partitions in complex textures -- and represent each region with a 2D Gaussian (mean for position, variance for extent) and corresponding texture features.

#### ✅ Dynamic & Scalable
Furthermore, GPSToken supports:
- **User-Controllable Adjustment**: Manually allocate more tokens to regions of user interest for finer reconstruction.
- **Variable Token Count**: Increase or decrease the token count per image for a better efficiency-fidelity balance.
- **Scalable to Higher Resolution**: Maintain comparable performance at higher resolutions without retraining.

#### ✅ Spatial-Texture Disentanglement
- Each token encodes a **disentangled** representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation for downstream tasks like generation.

#### ✅ SOTA Performance
- Achieves **PSNR = 28.81, SSIM = 0.809, rFID = 0.22, and FID = 1.65** on image reconstruction with only **256 tokens**, outperforming prior methods.

## GPS-Tokens: Mathematical Form and CUDA-Based Rendering Algorithm

Each token consists of a **bounded 2D Gaussian function** and an individual feature vector, encoding spatial geometry and texture separately.

#### 📐 Standard 2D Gaussian (Unnormalized)

The core form of the $i$-th Gaussian is:

![Standard 2D Gaussian](https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_i%28x%2C%20y%29%20%3D%20%5Cexp%5Cleft%28-%5Cfrac%7B1%7D%7B2%281-%5Crho_i%5E2%29%7D%20%5Cleft%28%20%5Cfrac%7B%28x-%5Cmu_%7Bx%2Ci%7D%29%5E2%7D%7B%5Csigma_%7Bx%2Ci%7D%5E2%7D%20-%20%5Cfrac%7B2%5Crho_i%28x-%5Cmu_%7Bx%2Ci%7D%29%28y-%5Cmu_%7By%2Ci%7D%29%7D%7B%5Csigma_%7Bx%2Ci%7D%5Csigma_%7By%2Ci%7D%7D%20+%20%5Cfrac%7B%28y-%5Cmu_%7By%2Ci%7D%29%5E2%7D%7B%5Csigma_%7By%2Ci%7D%5E2%7D%20%5Cright%29%5Cright%29)

- $(\mu_{x,i}, \mu_{y,i})$: center (position)
- $\sigma_{x,i}, \sigma_{y,i} > 0$: standard deviations (scale)   
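- $\rho_i \in (-1, 1)$: correlation coefficient (orientation); the endpoints are excluded because $1-\rho_i^2$ appears in the denominator

> This is the unnormalized density, which avoids the costly computation of the normalization constant $Z$.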
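
As a minimal sketch, the formula above can be evaluated on a pixel grid with NumPy. The function name `gaussian_2d` and the example parameter values are illustrative assumptions and are not taken from the released code.

```python
import numpy as np

def gaussian_2d(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Unnormalized 2D Gaussian; equals 1 at the center (mu_x, mu_y)."""
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    # Quadratic form inside the exponent of the formula above.
    q = dx**2 - 2.0 * rho * dx * dy + dy**2
    return np.exp(-q / (2.0 * (1.0 - rho**2)))

# Evaluate one Gaussian on a 32x32 grid (pixel coordinates are an assumption).
ys, xs = np.mgrid[0:32, 0:32].astype(np.float64)
p = gaussian_2d(xs, ys, mu_x=16.0, mu_y=12.0, sigma_x=4.0, sigma_y=2.0, rho=0.5)
```

Because no normalization constant is applied, the response peaks at exactly 1 regardless of the scale parameters.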

#### 📐 Bounded Support for Efficiency

To focus on local regions and enable fast GPU rendering, we define the **modified splatting kernel**:

![Bounded Gaussian Kernel](https://latex.codecogs.com/png.latex?%5Cmathbf%7Bg%7D_i%28x%2C%20y%29%20%3D%20%5Cbegin%7Bcases%7D%20%5Chat%7Bp%7D_i%28x%2C%20y%29%2C%20%26%20%5Ctext%7Bif%20%7D%20%7Cx%20-%20%5Cmu_%7Bx%2Ci%7D%7C%20%5Cleq%20s%5Csigma_%7Bx%2Ci%7D%20%5Ctext%7B%20and%20%7D%20%7Cy%20-%20%5Cmu_%7By%2Ci%7D%7C%20%5Cleq%20s%5Csigma_%7By%2Ci%7D%20%5C%5C%200%2C%20%26%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D)

- $s$: spatial support factor (empirically set to $s=5$)
  → Covers >99.999% of the Gaussian mass, so the truncation error is negligible.
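
The truncation and the coverage figure can be checked with a short sketch. The names `bounded_kernel` and `box_mass_lower_bound` are hypothetical. The bound uses the fact that each marginal of a bivariate Gaussian is a 1D Gaussian, so the mass outside the box is at most the sum of the two one-axis tail masses.

```python
import math
import numpy as np

def bounded_kernel(x, y, mu_x, mu_y, sigma_x, sigma_y, rho, s=5.0):
    """Unnormalized Gaussian, set to zero outside the s-sigma box."""
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    q = dx**2 - 2.0 * rho * dx * dy + dy**2
    p = np.exp(-q / (2.0 * (1.0 - rho**2)))
    inside = (np.abs(dx) <= s) & (np.abs(dy) <= s)
    return np.where(inside, p, 0.0)

def box_mass_lower_bound(s=5.0):
    """Lower bound on the normalized Gaussian mass inside the box."""
    one_axis = math.erf(s / math.sqrt(2.0))  # mass within +/- s sigma in 1D
    return 1.0 - 2.0 * (1.0 - one_axis)      # union bound over both axes

center = bounded_kernel(16.0, 16.0, 16.0, 16.0, 1.0, 1.0, 0.3)
outside = bounded_kernel(22.0, 16.0, 16.0, 16.0, 1.0, 1.0, 0.3)
coverage = box_mass_lower_bound(5.0)
```

With $s=5$ the bound stays above 0.99999 for any admissible correlation, which agrees with the coverage stated above.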

#### 🧩 Token Representation

An image is encoded as $l$ GPS-tokens: $\mathbf{z} = \{\mathbf{z}_1, \dots, \mathbf{z}_l\}$,  where each $\mathbf{z}_i = \\{\mathbf{g}_i, \mathbf{f}_i\\}$ contains:

| Component     | Symbol & Type                     | Role                          |
|---------------|-----------------------------------|-------------------------------|
| **Geometry**  | $\mathbf{g}_i = (\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$ | Spatial layout (2D Gaussian params)  |
| **Texture**   | $\mathbf{f}_i \in \mathbb{R}^{c-5}$ | Visual features (from CNN/Transformer) |

**Disentangled design**: geometry and texture can be manipulated independently.  
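
A small sketch of what independent manipulation could look like, assuming tokens are stored as an `(l, c)` array with the five geometry parameters first. That storage order, the helper names, and the sizes `l = 128`, `c = 16` are assumptions for illustration only.

```python
import numpy as np

GEOMETRY_DIM = 5  # (mu_x, mu_y, sigma_x, sigma_y, rho)

def split_tokens(z):
    """Split (l, c) tokens into (l, 5) geometry and (l, c - 5) texture."""
    return z[:, :GEOMETRY_DIM], z[:, GEOMETRY_DIM:]

def shift_tokens(z, dx, dy):
    """Translate every Gaussian center; texture features are untouched."""
    out = z.copy()
    out[:, 0] += dx
    out[:, 1] += dy
    return out

rng = np.random.default_rng(0)
z = rng.normal(size=(128, 16))  # hypothetical: l = 128 tokens, c = 16
g, f = split_tokens(z)
z_shifted = shift_tokens(z, dx=2.0, dy=-1.0)
```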

#### ⚑ CUDA-Based Rendering Algorithm
We implement a **CUDA-accelerated rendering algorithm** to parallelize the forward and backward processes of the bounded Gaussian splatting kernel. Implementation details are provided in the `gscuda` folder.
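
For intuition, here is a slow NumPy reference for the forward pass. It is a sketch under stated assumptions, not the `gscuda` implementation: it assumes pixel-unit coordinates and a plain weighted sum of texture features over tokens, and the actual aggregation rule may differ.

```python
import numpy as np

def render_tokens(geometry, texture, height, width, s=5.0):
    """Splat l tokens into a (d, height, width) feature map.

    geometry: (l, 5) rows of (mu_x, mu_y, sigma_x, sigma_y, rho)
    texture:  (l, d) feature vectors
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    out = np.zeros((texture.shape[1], height, width))
    for (mu_x, mu_y, sx, sy, rho), feat in zip(geometry, texture):
        dx = (xs - mu_x) / sx
        dy = (ys - mu_y) / sy
        q = dx**2 - 2.0 * rho * dx * dy + dy**2
        w = np.exp(-q / (2.0 * (1.0 - rho**2)))
        # Bounded support: zero weight outside the s-sigma box.
        w = np.where((np.abs(dx) <= s) & (np.abs(dy) <= s), w, 0.0)
        out += feat[:, None, None] * w[None]
    return out

geometry = np.array([[4.0, 4.0, 1.0, 1.0, 0.0]])
texture = np.array([[2.0, -1.0]])
fmap = render_tokens(geometry, texture, height=8, width=16)
```

Each token only touches pixels inside its box, which is what makes a per-token or per-tile parallel GPU kernel practical.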

## 🏗️ Framework: From Image to GPS-Tokens

GPSToken pipeline: **Initialization → Refinement → Rendering → Reconstruction**
<div align="center">
  <img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/framework.jpg" width="90%">
</div>

#### Spatially-adaptive Token Initialization
We use an iterative algorithm to partition the image into regions based on texture complexity. Each region's location and size initialize the Gaussian parameters of corresponding GPS-tokens, enabling a coarse spatially-adaptive representation.
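
The idea can be sketched as a greedy loop that keeps splitting the region with the largest entropy score. Everything specific here is an assumption made for illustration: the histogram entropy, the halving along the longer side, the score (per-pixel entropy times area), and the mapping from region size to standard deviation. The released code may use different criteria.

```python
import heapq
import numpy as np

def region_score(patch, bins=32):
    """Histogram entropy (bits per pixel) times area; eps breaks ties by size."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / patch.size
    return (float(-(p * np.log2(p)).sum()) + 1e-6) * patch.size

def partition(image, num_regions):
    """Greedily split a grayscale image in [0, 1] into num_regions boxes."""
    h, w = image.shape
    assert num_regions <= h * w
    heap = [(-region_score(image), 0, 0, h, w)]  # max-heap via negated score
    while len(heap) < num_regions:
        _, top, left, rh, rw = heapq.heappop(heap)
        if rh >= rw:  # halve along the longer side
            parts = [(top, left, rh // 2, rw), (top + rh // 2, left, rh - rh // 2, rw)]
        else:
            parts = [(top, left, rh, rw // 2), (top, left + rw // 2, rh, rw - rw // 2)]
        for t, l, ph, pw in parts:
            heapq.heappush(heap, (-region_score(image[t:t + ph, l:l + pw]), t, l, ph, pw))
    return [r[1:] for r in heap]  # (top, left, height, width)

def init_gaussians(regions):
    """Center -> mean, half side length -> sigma (assumed), rho = 0."""
    return np.array([[l + rw / 2, t + rh / 2, rw / 2, rh / 2, 0.0]
                     for t, l, rh, rw in regions])

rng = np.random.default_rng(0)
image = np.full((32, 32), 0.5)         # flat left half
image[:, 16:] = rng.random((32, 16))   # noisy right half
regions = partition(image, num_regions=8)
gaussians = init_gaussians(regions)
```

In this toy image the textured half receives more, smaller regions than the flat half, which is the coarse spatial adaptivity described above.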

#### Spatially-adaptive Token Refinement
After obtaining the initialized Gaussian parameters, we employ a transformer-based encoder to refine these parameters to achieve fine-grained spatial adaptation, while simultaneously extracting the corresponding texture features $\mathbf{f}$ for each region using RoIAlign layers. After encoder refinement, the parameters better match local textures.

#### End-to-end Reconstruction
During decoding, we first render the GPS-tokens into a 2D feature map and then decode this map into the reconstructed image. Following existing works, we train with a combination of reconstruction loss $L_{\text{rec}}$, perceptual loss $L_{\text{perc}}$, and adversarial loss $L_{\text{adv}}$.

## 📊 Experimental Results

#### 1. Image Reconstruction ($256\times 256$ on the ImageNet validation set)

GPSToken outperforms fixed-grid methods with the same token count.

| Method           | Token Count | Params (M) | PSNR  | SSIM   | LPIPS  | rFID  | FID   |
|------------------|-------------|-----------|-------|--------|--------|-------|-------|
| SDXL-VAE         | 32x32       | 83.6      | 25.55 | 0.727  | 0.066  | 0.73  | 2.35  |
| VAVAE            | 16x16       | 69.8      | 25.76 | 0.742  | 0.050  | 0.27  | 1.74  |
| DCAE             | 8x8         | 323.4     | 23.62 | 0.644  | 0.092  | 0.98  | 2.59  |
| TiTok-B64        | 64          | 204.8     | 17.01 | 0.390  | 0.263  | 1.75  | 2.50  |
| TiTok-S128       | 128         | 83.7      | 17.66 | 0.413  | 0.220  | 1.73  | 3.25  |
| MAETok           | 128         | 173.9     | 23.25 | 0.626  | 0.096  | 0.65  | 2.01  |
| FlexTok          | 256         | 949.7     | 17.69 | 0.475  | 0.257  | 4.02  | 4.88  |
| **GPSToken-S64** | 64          | 127.5     | 22.18 | 0.578  | 0.111  | 1.31  | 3.02  |
| **GPSToken-M128**| 128         | 127.8     | 24.06 | 0.657  | 0.080  | 0.65  | 2.18  |
| **GPSToken-L256**| 256         | 128.7     | 28.81 | 0.809  | 0.043  | 0.22  | 1.65  |

#### 2. Spatial-Adaptivity Visualization
Gaussian tokens automatically concentrate on high-complexity regions.
<div align="center">
  <img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/appendix_reconv_gs.jpg" width="80%">
</div>
> *From left to right*: visualization of initialized GS params, visualization of refined GS params, reconstructed images, GT images.

#### 3. User-Controllable Adaptivity
We can manually guide tokens to focus on regions of user interest.
<div align="center">
  <img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application.jpg">
</div>
> *From left to right*: input image, visualization of initialized GS params, reconstructed image, visualization of adjusted GS params, reconstructed image using the adjusted GS params.

#### 4. Variable Token Count of GPS-Tokens
We can **increase** or **decrease** the number of tokens used to encode an image.
<div align="center">
  <img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application2.jpg">
</div>
> We use GPSToken-M128, which is trained only with 128 tokens, for demonstration.

#### 5. Scales to Higher Resolutions
GPSToken generalizes to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, with models trained only on $256\times 256$.

| Method           | Tokens     | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ |
|------------------|------------|--------|--------|---------|--------|-------------|
| **512×512**      |            |        |        |         |        |             |
| SDXL-VAE         | 64×64      | 28.42  | 0.817  | 0.059   | 0.271  | 1.36        |
| VQVAE-f16        | 32×32      | 21.83  | 0.604  | 0.172   | 2.29   | 7.95        |
| GPSToken-M128    | 512        | 26.74  | 0.764  | 0.073   | 0.367  | 1.93        |
| GPSToken-L256    | 1024       | 32.00  | 0.887  | 0.039   | 0.175  | 0.699       |
| **1024×1024**    |            |        |        |         |        |             |
| SDXL-VAE         | 128×128    | 33.27  | 0.909  | 0.057   | 0.113  | 0.561       |
| VQVAE-f16        | 64×64      | 25.41  | 0.744  | 0.169   | 1.40   | 4.98        |
| GPSToken-M128    | 2048       | 31.22  | 0.873  | 0.072   | 0.236  | 1.24        |
| GPSToken-L256    | 4096       | 37.71  | 0.955  | 0.031   | 0.055  | 0.276       |
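
The token counts in this table are consistent with scaling the base count by image area relative to $256\times 256$. The helper below reproduces that arithmetic. The function name is hypothetical, and reading the scaling as one base-token budget per $256\times 256$ tile is an assumption rather than a description of the released code.

```python
def tokens_at_resolution(base_tokens, resolution, base_resolution=256):
    """Token count if each base_resolution tile receives base_tokens tokens."""
    tiles_per_side = resolution // base_resolution
    return base_tokens * tiles_per_side ** 2

counts = {
    ("GPSToken-M128", 512): tokens_at_resolution(128, 512),
    ("GPSToken-M128", 1024): tokens_at_resolution(128, 1024),
    ("GPSToken-L256", 512): tokens_at_resolution(256, 512),
    ("GPSToken-L256", 1024): tokens_at_resolution(256, 1024),
}
```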

## 🚀 Quick Start

### Model Zoo

One can download the models directly from Hugging Face:

| Models        | Token Count | Hugging Face Link                                                                                 |
|---------------|-------------|---------------------------------------------------------------------------------------------------|
| GPSToken-S64  | 64          | [`xtudbxk/GPSToken/tree/main/GPSToken-S64`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-S64) |
| GPSToken-M128 | 128         | [`xtudbxk/GPSToken/tree/main/GPSToken-M128`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-M128) |
| GPSToken-L256 | 256         | [`xtudbxk/GPSToken/tree/main/GPSToken-L256`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-L256) |
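
One possible way to fetch a single variant programmatically is `snapshot_download` from the `huggingface_hub` package, restricted to one folder of the repository. This is a sketch: the helper names and the `checkpoints` target directory are assumptions, and the folder names are taken from the links in the table above.

```python
def checkpoint_patterns(variant):
    """Glob patterns matching one model folder in the repository."""
    return [f"{variant}/*"]

def download_checkpoint(variant="GPSToken-L256", local_dir="checkpoints"):
    """Download one model folder and return the local snapshot path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(
        repo_id="xtudbxk/GPSToken",
        allow_patterns=checkpoint_patterns(variant),
        local_dir=local_dir,
    )

# Example (requires network access):
# path = download_checkpoint("GPSToken-M128")
```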

### Inference scripts
```bash
python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]
```

## CITATION

If you find our work useful for your research or development, please cite our paper:
```bibtex
@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
      title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation}, 
      author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
      year={2025},
      eprint={2509.01109},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.01109}, 
}
```

## CONTACT

Please open an issue or contact Zhengqiang at [zhengqiang.zhang@connect.polyu.hk](mailto:zhengqiang.zhang@connect.polyu.hk).