	---
	license: apache-2.0
	pipeline_tag: image-to-image
	---

	# GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

	📚 [Paper](https://huggingface.co/papers/2509.01109) \| 💻 [Code](https://github.com/xtudbxk/GPSToken)

	This is the official Hugging Face model repository for GPSToken, as presented in the paper "GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation".

	## Abstract

	Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. We propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively.

	## News

	- 2025.09.19: GPSToken has been accepted to [NeurIPS 2025](https://openreview.net/forum?id=BxoEDR2yQM)! 🎉🎉🎉
	- 2025.09.16: Models uploaded to [Hugging Face](https://huggingface.co/xtudbxk/GPSToken).
	- 2025.09.05: Updated the code for higher resolutions, including GPS-token merging (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/gpstoken.py#L113)) to reduce boundary artifacts and a resized GroupNorm layer (see [here](https://github.com/xtudbxk/GPSToken/blob/main/models/vqvae.py#L310)) to ease color shifts.

	## Motivation: Beyond Fixed Grids

	Effective and efficient tokenization is crucial for image representation and generation. Conventional uniform 2D/1D grid tokenization lacks flexibility in handling regions with varying shapes, textures, and locations.
	We propose GPSToken, a Gaussian Parameterized Spatially-adaptive Tokenization framework, enabling non-uniform tokenization via parametric 2D Gaussians. Our method:
	- Partitions images into complexity-balanced regions of varying shapes and positions using an entropy-driven algorithm;
	- Represents each region as a 2D Gaussian (mean for position, covariance for shape) and texture features;
	- Trains a transformer to optimize Gaussian parameters and texture features for content-aware adaptation;
	- Reconstructs the image via a differentiable splatting-based renderer, enabling end-to-end training.

	<div align="center">
	<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/gpstoken.jpg" width="90%">
	</div>

	## Core Highlights

	#### ✅ Spatially-Adaptive Representation
	- Iteratively split the image into entropy-balanced regions of varying positions and shapes -- finer partitions in complex textures -- and represent each region with a 2D Gaussian (mean for position, variance for extent) and corresponding texture features.

	#### ✅ Dynamic & Scalable
	Furthermore, GPSToken supports:
	- User-Controllable Adjustment: Manually allocate more tokens to regions of interest for finer reconstruction.
	- Variable Token Count: Increase or decrease the token count per image for a better efficiency-fidelity balance.
	- Scalable to Higher Resolution: Maintains comparable performance at higher resolutions without retraining.

	#### ✅ Spatial-Texture Disentanglement
	- Each token encodes a disentangled representation: Gaussian parameters for spatial geometry and a separate vector for textural features, enabling independent manipulation for downstream tasks like generation.

	#### ✅ SOTA Performance
	- Achieves PSNR = 28.81, SSIM = 0.809, rFID = 0.22, and FID = 1.65 on $256\times 256$ ImageNet reconstruction with only 256 tokens, the best values among the compared methods.

	## GPS-Tokens: Mathematical Form and CUDA-Based Rendering Algorithm

	Each token is represented by a bounded 2D Gaussian function and an individual feature vector, encoding spatial geometry and texture separately.

	#### 📐 Standard 2D Gaussian (Unnormalized)

	The core form of the $i$-th Gaussian is:

	![Standard 2D Gaussian](https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_i%28x%2C%20y%29%20%3D%20%5Cexp%5Cleft%28-%5Cfrac%7B1%7D%7B2%281-%5Crho_i%5E2%29%7D%20%5Cleft%28%20%5Cfrac%7B%28x-%5Cmu_%7Bx%2Ci%7D%29%5E2%7D%7B%5Csigma_%7Bx%2Ci%7D%5E2%7D%20-%20%5Cfrac%7B2%5Crho_i%28x-%5Cmu_%7Bx%2Ci%7D%29%28y-%5Cmu_%7By%2Ci%7D%29%7D%7B%5Csigma_%7Bx%2Ci%7D%5Csigma_%7By%2Ci%7D%7D%20+%20%5Cfrac%7B%28y-%5Cmu_%7By%2Ci%7D%29%5E2%7D%7B%5Csigma_%7By%2Ci%7D%5E2%7D%20%5Cright%29%5Cright%29)

	- $(\mu_{x,i}, \mu_{y,i})$: center (position)
	- $\sigma_{x,i}, \sigma_{y,i} > 0$: standard deviations (scale)
	- $\rho_i \in (-1, 1)$: correlation coefficient (orientation); the endpoints are excluded because the formula divides by $1-\rho_i^2$

	> This is the unnormalized density, which avoids computing the normalization constant $Z$.
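
As a minimal sketch, the formula above can be evaluated at a single point in plain Python. The function name and argument order are illustrative and are not taken from the repository.

```python
import math

def unnormalized_gaussian(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Evaluate the unnormalized 2D Gaussian at (x, y)."""
    if sigma_x <= 0 or sigma_y <= 0:
        raise ValueError("sigma_x and sigma_y must be positive")
    if not -1.0 < rho < 1.0:
        # The formula divides by (1 - rho^2), so |rho| = 1 is not allowed.
        raise ValueError("rho must lie strictly between -1 and 1")
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    quad = dx * dx - 2.0 * rho * dx * dy + dy * dy
    return math.exp(-quad / (2.0 * (1.0 - rho * rho)))

# The value at the center is always 1 because no normalization is applied.
peak = unnormalized_gaussian(0.5, 0.5, 0.5, 0.5, 0.1, 0.2, 0.3)
# One standard deviation away along x (rho = 0) gives exp(-1/2).
one_sigma = unnormalized_gaussian(0.6, 0.5, 0.5, 0.5, 0.1, 0.2, 0.0)
```

Because the peak value is 1 regardless of $\sigma$ and $\rho$, the kernel acts as a spatial weight in $[0, 1]$ rather than as a probability density.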

	#### 📏 Bounded Support for Efficiency

	To focus on local regions and enable fast GPU rendering, we define the modified splatting kernel:

	![Bounded Gaussian Kernel](https://latex.codecogs.com/png.latex?%5Cmathbf%7Bg%7D_i%28x%2C%20y%29%20%3D%20%5Cbegin%7Bcases%7D%20%5Chat%7Bp%7D_i%28x%2C%20y%29%2C%20%26%20%5Ctext%7Bif%20%7D%20%7Cx%20-%20%5Cmu_%7Bx%2Ci%7D%7C%20%5Cleq%20s%5Csigma_%7Bx%2Ci%7D%20%5Ctext%7B%20and%20%7D%20%7Cy%20-%20%5Cmu_%7By%2Ci%7D%7C%20%5Cleq%20s%5Csigma_%7By%2Ci%7D%20%5C%5C%200%2C%20%26%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D)

	- $s$: spatial support factor (empirically set to $s=5$)
	→ Covers >99.999% of Gaussian mass, negligible truncation error.
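
The bounded kernel and the coverage figure can be checked with a short sketch. The helper names are illustrative. The mass computation assumes $\rho = 0$, where the mass inside the box factorizes into two 1D integrals.

```python
import math

def bounded_kernel(x, y, mu_x, mu_y, sigma_x, sigma_y, rho, s=5.0):
    """Unnormalized Gaussian inside the axis-aligned box of half-width s*sigma, else 0."""
    if abs(x - mu_x) > s * sigma_x or abs(y - mu_y) > s * sigma_y:
        return 0.0
    dx = (x - mu_x) / sigma_x
    dy = (y - mu_y) / sigma_y
    quad = dx * dx - 2.0 * rho * dx * dy + dy * dy
    return math.exp(-quad / (2.0 * (1.0 - rho * rho)))

def box_mass_uncorrelated(s):
    """Fraction of Gaussian mass inside the box when rho = 0."""
    return math.erf(s / math.sqrt(2.0)) ** 2

inside = bounded_kernel(0.52, 0.5, 0.5, 0.5, 0.01, 0.01, 0.0)   # 2 sigma from the center
outside = bounded_kernel(0.56, 0.5, 0.5, 0.5, 0.01, 0.01, 0.0)  # 6 sigma from the center
mass = box_mass_uncorrelated(5.0)                                # slightly above 0.99999
```

For $\rho \neq 0$ the marginals are still 1D Gaussians with the same $\sigma_x$ and $\sigma_y$, so a union bound over the two axes gives a comparable lower bound on the covered mass.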

	#### 🧩 Token Representation

	An image is encoded as $l$ GPS-tokens: $\mathbf{z} = \{\mathbf{z}_1, \dots, \mathbf{z}_l\}$, where each $\mathbf{z}_i = \\{\mathbf{g}_i, \mathbf{f}_i\\}$ contains:

	| Component | Symbol & Type | Role |
	|---------------|-----------------------------------|-------------------------------|
	| Geometry | $\mathbf{g}_i = (\mu_x, \mu_y, \sigma_x, \sigma_y, \rho)$ | Spatial layout (2D Gaussian params) |
	| Texture | $\mathbf{f}_i \in \mathbb{R}^{c-5}$ | Visual features (from CNN/Transformer) |

	Disentangled design: geometry and texture can be manipulated independently.
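
To make the layout concrete, the sketch below splits one token of length $c$ into its 5 geometry values and $c-5$ texture values. Storing the geometry first is an assumption made for illustration, as are the function name and the example width $c = 8$. The repository may store tokens differently.

```python
def split_token(token):
    """Split a flat token of length c into a geometry dict and a texture list."""
    if len(token) <= 5:
        raise ValueError("a token needs 5 geometry values plus at least 1 texture value")
    mu_x, mu_y, sigma_x, sigma_y, rho = token[:5]
    geometry = {"mu_x": mu_x, "mu_y": mu_y, "sigma_x": sigma_x, "sigma_y": sigma_y, "rho": rho}
    texture = list(token[5:])
    return geometry, texture

# Hypothetical token with c = 8: 5 geometry values followed by 3 texture values.
token = [0.5, 0.25, 0.1, 0.2, 0.3, 1.0, -2.0, 0.5]
geometry, texture = split_token(token)

# Geometry can be edited without touching the texture, e.g. moving the token to the right.
moved_geometry = dict(geometry, mu_x=0.75)
```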

	#### ⚡ CUDA-Based Rendering Algorithm
	We implement a CUDA-accelerated rendering algorithm to parallelize the forward and backward processes of the bounded Gaussian splatting kernel. Implementation details are provided in the `gscuda` folder.
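
For readers who want to follow the rendering step without CUDA, here is a slow pure-Python reference sketch. It assumes that the feature map is the kernel-weighted sum of token features, $\sum_i \mathbf{g}_i(x, y)\,\mathbf{f}_i$, and that pixel centers use coordinates normalized to $[0, 1]$. Both are assumptions for illustration and may differ from the `gscuda` implementation.

```python
import math

def render(tokens, height, width, s=5.0):
    """Splat tokens into a height x width feature map (nested lists, channels last).

    Each token is ((mu_x, mu_y, sigma_x, sigma_y, rho), feature_list).
    """
    channels = len(tokens[0][1])
    fmap = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for (mu_x, mu_y, sigma_x, sigma_y, rho), feat in tokens:
        # Visit only the pixels whose centers fall inside the bounded support.
        col_lo = max(0, math.ceil((mu_x - s * sigma_x) * width - 0.5))
        col_hi = min(width - 1, math.floor((mu_x + s * sigma_x) * width - 0.5))
        row_lo = max(0, math.ceil((mu_y - s * sigma_y) * height - 0.5))
        row_hi = min(height - 1, math.floor((mu_y + s * sigma_y) * height - 0.5))
        for row in range(row_lo, row_hi + 1):
            dy = ((row + 0.5) / height - mu_y) / sigma_y
            for col in range(col_lo, col_hi + 1):
                dx = ((col + 0.5) / width - mu_x) / sigma_x
                quad = dx * dx - 2.0 * rho * dx * dy + dy * dy
                weight = math.exp(-quad / (2.0 * (1.0 - rho * rho)))
                for ch in range(channels):
                    fmap[row][col][ch] += weight * feat[ch]
    return fmap

# One narrow token at the image center, rendered onto a 3 x 3 grid.
tokens = [((0.5, 0.5, 0.05, 0.05, 0.0), [2.0, -1.0])]
fmap = render(tokens, height=3, width=3)
```

The center pixel receives the full feature vector (weight 1), while the corner pixels lie outside the $5\sigma$ box and stay at zero. Restricting the loops to the box is what keeps the per-token cost local.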

	## 🏗️ Framework: From Image to GPS-Tokens

	GPSToken pipeline: Initialization → Refinement → Rendering → Reconstruction
	<div align="center">
	<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/framework.jpg" width="90%">
	</div>

	#### Spatially-adaptive Token Initialization
	We use an iterative algorithm to partition the image into regions based on texture complexity. Each region's location and size initialize the Gaussian parameters of corresponding GPS-tokens, enabling a coarse spatially-adaptive representation.
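
The idea can be sketched with a greedy splitter. This is not the repository's algorithm. It assumes grayscale intensities in $[0, 1)$, scores a region by histogram entropy times area, halves the highest-scoring region along its longer side, and initializes each Gaussian from the region's center and half-extent. All names and choices here are illustrative.

```python
import heapq
import math
from collections import Counter

def region_entropy(image, top, left, height, width, bins=16):
    """Shannon entropy (bits) of quantized intensities inside a region."""
    counts = Counter(
        min(int(image[r][c] * bins), bins - 1)
        for r in range(top, top + height)
        for c in range(left, left + width)
    )
    total = height * width
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def partition(image, num_regions):
    """Split the most complex region in half until num_regions regions exist."""
    rows, cols = len(image), len(image[0])

    def score(region):
        top, left, h, w = region
        return region_entropy(image, top, left, h, w) * h * w

    whole = (0, 0, rows, cols)
    heap = [(-score(whole), whole)]  # min-heap on negated score
    while len(heap) < num_regions:
        neg_score, (top, left, h, w) = heapq.heappop(heap)
        if neg_score == 0:  # every remaining region is flat: stop early
            heapq.heappush(heap, (neg_score, (top, left, h, w)))
            break
        if h >= w:
            halves = [(top, left, h // 2, w), (top + h // 2, left, h - h // 2, w)]
        else:
            halves = [(top, left, h, w // 2), (top, left + w // 2, h, w - w // 2)]
        for half in halves:
            heapq.heappush(heap, (-score(half), half))
    return [region for _, region in heap]

def init_gaussian(region, rows, cols):
    """Region center -> mean, half-extent -> sigma, rho = 0 (all assumptions)."""
    top, left, h, w = region
    return ((left + w / 2) / cols, (top + h / 2) / rows, w / (2 * cols), h / (2 * rows), 0.0)

# 8 x 8 test image: flat top half, textured bottom half.
image = [[0.5] * 8 for _ in range(4)] + [
    [((r * 8 + c) % 16) / 16 for c in range(8)] for r in range(4, 8)
]
regions = sorted(partition(image, 3))
gaussians = [init_gaussian(region, 8, 8) for region in regions]
```

On this toy image the flat top half stays a single region while the textured bottom half is split in two, which mirrors the "finer partitions in complex textures" behavior described above.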

	#### Spatially-adaptive Token Refinement
	After obtaining the initialized Gaussian parameters, we employ a transformer-based encoder to refine these parameters to achieve fine-grained spatial adaptation, while simultaneously extracting the corresponding texture features $\mathbf{f}$ for each region using RoIAlign layers. After encoder refinement, the parameters better match local textures.

	#### End-to-end Reconstruction
	During decoding, we first render the GPS-tokens into a 2D feature map and then decode it into the reconstructed image. Following existing works, we use a combination of reconstruction loss $L_{\text{rec}}$, perceptual loss $L_{\text{perc}}$, and adversarial loss $L_{\text{adv}}$ during training.
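
A common way to combine such terms is a weighted sum. The weights $\lambda_{\text{perc}}$ and $\lambda_{\text{adv}}$ below are placeholders introduced for illustration. Their values, and whether the training code uses exactly this form, are not stated here.

$$
L_{\text{total}} = L_{\text{rec}} + \lambda_{\text{perc}} L_{\text{perc}} + \lambda_{\text{adv}} L_{\text{adv}}
$$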

	## 📊 Experimental Results

	#### 1. Image Reconstruction ($256\times 256$ on the ImageNet validation set)

	GPSToken outperforms fixed-grid methods with the same token count.

	| Method | Token Count | Params (M) | PSNR | SSIM | LPIPS | rFID | FID |
	|------------------|-------------|-----------|-------|--------|--------|-------|-------|
	| SDXL-VAE | 32x32 | 83.6 | 25.55 | 0.727 | 0.066 | 0.73 | 2.35 |
	| VAVAE | 16x16 | 69.8 | 25.76 | 0.742 | 0.050 | 0.27 | 1.74 |
	| DCAE | 8x8 | 323.4 | 23.62 | 0.644 | 0.092 | 0.98 | 2.59 |
	| TiTok-B64 | 64 | 204.8 | 17.01 | 0.390 | 0.263 | 1.75 | 2.50 |
	| TiTok-S128 | 128 | 83.7 | 17.66 | 0.413 | 0.220 | 1.73 | 3.25 |
	| MAETok | 128 | 173.9 | 23.25 | 0.626 | 0.096 | 0.65 | 2.01 |
	| FlexTok | 256 | 949.7 | 17.69 | 0.475 | 0.257 | 4.02 | 4.88 |
	| GPSToken-S64 | 64 | 127.5 | 22.18 | 0.578 | 0.111 | 1.31 | 3.02 |
	| GPSToken-M128 | 128 | 127.8 | 24.06 | 0.657 | 0.080 | 0.65 | 2.18 |
	| GPSToken-L256 | 256 | 128.7 | 28.81 | 0.809 | 0.043 | 0.22 | 1.65 |

	#### 2. Spatial-Adaptivity Visualization
	Gaussian tokens automatically concentrate on high-complexity regions.
	<div align="center">
	<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/appendix_reconv_gs.jpg" width="80%">
	</div>
	> From left to right: visualization of initialized GS params, visualization of refined GS params, reconstructed images, GT images.

	#### 3. User-Controllable Adaptivity
	We can manually guide tokens to focus on regions of user interest.
	<div align="center">
	<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application.jpg">
	</div>
	> From left to right: input image, visualization of initialized GS params, reconstructed image, visualization of adjusted GS params, reconstructed image using adjusted GS params.

	#### 4. Variable Token Count of GPS-Tokens
	We can increase or decrease the number of tokens used to encode an image.
	<div align="center">
	<img src="https://huggingface.co/xtudbxk/GPSToken/resolve/main/figures/further_application2.jpg">
	</div>
	> We use GPSToken-M128, which is trained only with 128 tokens, for demonstration.

	#### 5. Scales to Higher Resolutions
	GPSToken generalizes to higher resolutions, e.g., $512\times 512$ or $1024\times 1024$, with models trained only on $256\times 256$.

	| Method | Tokens | PSNR ↑ | SSIM ↑ | LPIPS ↓ | rFID ↓ | rec. sFID ↓ |
	|------------------|------------|--------|--------|---------|------------|-------------|
	| 512×512 | | | | | | |
	| SDXL-VAE | 64×64 | 28.42 | 0.817 | 0.059 | 0.271 | 1.36 |
	| VQVAE-f16 | 32×32 | 21.83 | 0.604 | 0.172 | 2.29 | 7.95 |
	| GPSToken-M128 | 512 | 26.74 | 0.764 | 0.073 | 0.367 | 1.93 |
	| GPSToken-L256 | 1024 | 32.00 | 0.887 | 0.039 | 0.175 | 0.699 |
	| 1024×1024 | | | | | | |
	| SDXL-VAE | 128×128 | 33.27 | 0.909 | 0.057 | 0.113 | 0.561 |
	| VQVAE-f16 | 64×64 | 25.41 | 0.744 | 0.169 | 1.40 | 4.98 |
	| GPSToken-M128 | 2048 | 31.22 | 0.873 | 0.072 | 0.236 | 1.24 |
	| GPSToken-L256 | 4096 | 37.71 | 0.955 | 0.031 | 0.055 | 0.276 |
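
The GPSToken token counts in this table are consistent with keeping the number of tokens per unit area constant relative to the $256\times 256$ training setting. The small helper below reproduces them. It is an illustrative reading of the table, not a function from the repository.

```python
def scaled_token_count(base_tokens, resolution, base_resolution=256):
    """Token count when tokens grow in proportion to image area."""
    if resolution % base_resolution != 0:
        raise ValueError("resolution must be a multiple of base_resolution")
    factor = resolution // base_resolution
    return base_tokens * factor * factor

# Reproduce the GPSToken rows of the table above.
counts = {
    ("GPSToken-M128", 512): scaled_token_count(128, 512),
    ("GPSToken-L256", 512): scaled_token_count(256, 512),
    ("GPSToken-M128", 1024): scaled_token_count(128, 1024),
    ("GPSToken-L256", 1024): scaled_token_count(256, 1024),
}
```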

	## 🚀 Quick Start

	### Model Zoo

	One can download the models directly from Hugging Face:

	| Models | Token Count | Hugging Face Link |
	|---------------|-------------|---------------------------------------------------------------------------------------------------|
	| GPSToken-S64 | 64 | [`xtudbxk/GPSToken/tree/main/GPSToken-S64`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-S64) |
	| GPSToken-M128 | 128 | [`xtudbxk/GPSToken/tree/main/GPSToken-M128`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-M128) |
	| GPSToken-L256 | 256 | [`xtudbxk/GPSToken/tree/main/GPSToken-L256`](https://huggingface.co/xtudbxk/GPSToken/tree/main/GPSToken-L256) |

	### Inference scripts
	```bash
	python3 inference_gsptoken.py --model_path [model_path] --data_path [data_path] --config configs/gpstoken_l256.yaml --data_size 256 --output [xxx]
	```
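
If you prefer to fetch a model folder programmatically before running the script, one option is `snapshot_download` from the `huggingface_hub` package. The sketch below is an assumption about a convenient workflow, not part of the repository. The helper names and the `checkpoints` directory are hypothetical, and the file names inside each model folder are not specified here.

```python
def checkpoint_pattern(variant):
    """Glob pattern selecting one model folder inside the xtudbxk/GPSToken repo."""
    known = {"GPSToken-S64", "GPSToken-M128", "GPSToken-L256"}
    if variant not in known:
        raise ValueError(f"unknown variant: {variant}")
    return f"{variant}/*"

def download_checkpoint(variant, local_dir="checkpoints"):
    """Download one model folder and return the local snapshot path.

    Requires `pip install huggingface_hub` and network access.
    """
    # Imported lazily so that checkpoint_pattern works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="xtudbxk/GPSToken",
        allow_patterns=[checkpoint_pattern(variant)],
        local_dir=local_dir,
    )
```

Calling `download_checkpoint("GPSToken-L256")` returns the local directory. The checkpoint inside its `GPSToken-L256` subfolder can then be supplied as `[model_path]` in the command above.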

	## CITATION

	If you find our work useful for your research, please consider citing our paper:
	```bibtex
	@misc{zhang2025gpstokengaussianparameterizedspatiallyadaptive,
	title={GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation},
	author={Zhengqiang Zhang and Rongyuan Wu and Lingchen Sun and Lei Zhang},
	year={2025},
	eprint={2509.01109},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2509.01109},
	}
	```

	## CONTACT

	Please open an issue or contact Zhengqiang at [zhengqiang.zhang@connect.polyu.hk](mailto:zhengqiang.zhang@connect.polyu.hk).