---
license: mit
language:
- en
tags:
- motion
- human-motion
- motion-to-text
- vq-vae
- gpt2
datasets:
- humanml3d
pipeline_tag: text-generation
library_name: transformers
base_model:
- openai-community/gpt2
---
# GeoMotionGPT
GeoMotionGPT is a motion-to-text model that converts human motion sequences into natural language descriptions.
## Model Components
This model integrates two components:
### 1. Motion Tokenizer (DVQ-GSST)
- **Architecture**: Decoder-only Vector Quantizer with Gumbel-Softmax Straight-Through quantization
- **Codebook Size**: 512 tokens
- **Input**: 263-dimensional motion features (HumanML3D format)
- **Temporal Downsampling**: 8x (3 downsampling layers with stride 2)
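The 8x factor comes from the three stride-2 layers, each of which halves the temporal length. Assuming standard floor behavior at each layer, the expected token count can be sketched as:

```python
def token_length(T: int, n_layers: int = 3, stride: int = 2) -> int:
    """Sequence length after repeated stride-2 downsampling, flooring at each layer."""
    for _ in range(n_layers):
        T = T // stride
    return T

print(token_length(116))  # -> 14 tokens for a 116-frame clip
print(token_length(100))  # -> 12
```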
### 2. Language Model (Fine-tuned GPT-2)
- **Base Model**: GPT-2 (124M parameters)
- **Task**: Motion-to-Text generation
- **Training**: Fine-tuned with orthogonality regularization (λ=0.01)
- **Total Vocab**: 50772 tokens (50257 text + 512 motion + 3 special)
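One way to picture the extended vocabulary is that motion codes occupy the IDs immediately after GPT-2's text vocabulary (this layout is an assumption here, following how `add_tokens` appends new tokens after the existing vocabulary):

```python
TEXT_VOCAB_SIZE = 50257   # GPT-2's original vocabulary
CODEBOOK_SIZE = 512       # motion codebook entries

def motion_code_to_token_id(code: int) -> int:
    """Map a codebook index (0..511) to its assumed slot in the extended vocab."""
    assert 0 <= code < CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + code

print(motion_code_to_token_id(0))    # -> 50257
print(motion_code_to_token_id(511))  # -> 50768
```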
## Quick Start: Motion Tokenization
```python
from transformers import AutoModelForCausalLM
import torch
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "zy22b/GeoMotionGPT",
    trust_remote_code=True
)

# Access the motion tokenizer
motion_tokenizer = model.motion_tokenizer

# Example: tokenize motion (batch, time, 263)
motion = torch.randn(1, 100, 263)  # random motion features
with torch.no_grad():
    tokens = motion_tokenizer.encode(motion)  # -> (batch, time//8)
print(f"Motion tokens shape: {tokens.shape}")

# Example: decode tokens back to motion
with torch.no_grad():
    reconstructed = motion_tokenizer.decode(tokens)  # -> (batch, time, 263)
print(f"Reconstructed shape: {reconstructed.shape}")
```
## Usage with HumanML3D Data
```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM
# Load model
model = AutoModelForCausalLM.from_pretrained(
    "zy22b/GeoMotionGPT",
    trust_remote_code=True
)
motion_tokenizer = model.motion_tokenizer

# Load a HumanML3D motion file
motion = np.load("datasets/humanml3d/new_joint_vecs/000000.npy")  # (T, 263)

# Load normalization parameters
mean = np.load("datasets/humanml3d/Mean.npy")
std = np.load("datasets/humanml3d/Std.npy")

# Normalize
motion_norm = (motion - mean) / std

# Convert to tensor and add a batch dimension
motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0)  # (1, T, 263)

# Tokenize
with torch.no_grad():
    tokens = motion_tokenizer.encode(motion_tensor)

print(f"Input shape: {motion_tensor.shape}")  # e.g., torch.Size([1, 116, 263])
print(f"Token shape: {tokens.shape}")         # e.g., torch.Size([1, 14])
print(f"Tokens: {tokens[0].tolist()}")        # e.g., [138, 104, 508, ...]
```
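Because the tokenizer operates on normalized features, anything decoded back from tokens must be denormalized with the same `Mean.npy`/`Std.npy` statistics before use. A minimal numpy sketch of the round trip, with synthetic data standing in for real HumanML3D files:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 116, 263
motion = rng.normal(size=(T, D))           # stands in for a raw HumanML3D clip
mean = rng.normal(size=(D,))               # stands in for Mean.npy
std = np.abs(rng.normal(size=(D,))) + 0.1  # stands in for Std.npy (kept away from 0)

motion_norm = (motion - mean) / std        # what the tokenizer consumes
motion_back = motion_norm * std + mean     # undo normalization after decoding

print(np.allclose(motion_back, motion))    # -> True
```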
## Full Motion-to-Text Generation
The following is a complete, self-contained script for motion-to-text generation.
**Requirements:**
```bash
pip install torch transformers numpy safetensors huggingface_hub
```
**Complete Code:**
```python
"""
GeoMotionGPT: Complete Motion-to-Text Generation
Self-contained script - requires only: torch, transformers, numpy, safetensors
"""
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer, GPT2LMHeadModel, GPT2Config
from safetensors.torch import load_file
class GeoMotionGPTGenerator:
    """Complete motion-to-text generator built on Hugging Face models."""

    def __init__(self, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = torch.device(device)
        self.text_vocab_size = 50257
        self.motion_codebook_size = 512

        # Load the motion tokenizer from the Hugging Face Hub
        print("Loading motion tokenizer from HuggingFace...")
        hf_model = AutoModelForCausalLM.from_pretrained(
            "zy22b/GeoMotionGPT",
            trust_remote_code=True
        )
        self.motion_tokenizer = hf_model.motion_tokenizer.to(self.device)
        self.motion_tokenizer.eval()

        # Load the GPT-2 tokenizer
        print("Loading GPT2 tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Add motion tokens to the tokenizer
        motion_tokens = [f'<motion_id_{i}>' for i in range(self.motion_codebook_size)]
        special_tokens = ['<start_of_motion>', '<end_of_motion>', '<masked_motion>', '<pad_motion>']
        self.tokenizer.add_tokens(motion_tokens)
        self.tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})

        # Build the language model using transformers' GPT2LMHeadModel
        print("Building language model...")
        config = GPT2Config(
            vocab_size=50772,  # 50257 text + 512 motion + 3 special
            n_positions=1024,
            n_embd=768,
            n_layer=12,
            n_head=12,
        )
        self.language_model = GPT2LMHeadModel(config).to(self.device)

        # Load weights
        print("Loading language model weights...")
        self._load_weights()
        self.language_model.eval()
        print("Model ready!")

    def _load_weights(self):
        """Load language model weights from the Hugging Face Hub."""
        from huggingface_hub import hf_hub_download

        # Download the checkpoint
        weights_path = hf_hub_download(
            repo_id="zy22b/GeoMotionGPT",
            filename="model.safetensors"
        )
        state_dict = load_file(weights_path)

        # Strip the 'language_model.' prefix so keys match GPT2LMHeadModel
        new_state_dict = {}
        for k, v in state_dict.items():
            if k.startswith("language_model."):
                new_key = k[len("language_model."):]
                new_state_dict[new_key] = v
        self.language_model.load_state_dict(new_state_dict, strict=False)

    def motion_tokens_to_string(self, tokens: torch.Tensor) -> str:
        """Convert motion token IDs to their string representation."""
        token_list = tokens[0].cpu().tolist()
        motion_str = ''.join([f'<motion_id_{int(t)}>' for t in token_list])
        return '<start_of_motion>' + motion_str + '<end_of_motion>'

    def generate_text(self, motion: np.ndarray, mean: np.ndarray, std: np.ndarray,
                      max_new_tokens: int = 40):
        """
        Generate a text description from a motion sequence.

        Args:
            motion: Raw motion array of shape (T, 263)
            mean: Mean for normalization (263,)
            std: Std for normalization (263,)
            max_new_tokens: Maximum number of tokens to generate

        Returns:
            Tuple of (generated text description, motion token IDs)
        """
        # Normalize motion
        motion_norm = (motion - mean) / std
        motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0).to(self.device)

        # Tokenize motion
        with torch.no_grad():
            motion_tokens = self.motion_tokenizer.encode(motion_tensor)

        # Convert to string and build the prompt
        motion_string = self.motion_tokens_to_string(motion_tokens)
        prompt = f"Generate text: {motion_string} \n "

        # Tokenize the prompt
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.device)

        # Generate with GPT-2's built-in generate
        with torch.no_grad():
            output_ids = self.language_model.generate(
                input_ids,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                eos_token_id=self.tokenizer.eos_token_id,
                pad_token_id=self.tokenizer.pad_token_id,
            )

        # Decode the continuation only
        generated_ids = output_ids[0, input_ids.shape[1]:]

        # Keep text tokens only (drop any motion/special IDs)
        text_ids = [tid.item() for tid in generated_ids if tid.item() < self.text_vocab_size]
        generated_text = self.tokenizer.decode(text_ids, skip_special_tokens=True)
        return generated_text.strip(), motion_tokens[0].tolist()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="GeoMotionGPT Motion-to-Text Generation")
    parser.add_argument("--motion_file", type=str, required=True,
                        help="Path to a HumanML3D motion .npy file")
    parser.add_argument("--mean_file", type=str, default="Mean.npy",
                        help="Path to Mean.npy")
    parser.add_argument("--std_file", type=str, default="Std.npy",
                        help="Path to Std.npy")
    parser.add_argument("--device", type=str, default="cuda",
                        help="Device to use (cuda/cpu)")
    args = parser.parse_args()

    # Initialize the generator
    generator = GeoMotionGPTGenerator(device=args.device)

    # Load data
    motion = np.load(args.motion_file)
    mean = np.load(args.mean_file)
    std = np.load(args.std_file)
    print(f"\nInput motion shape: {motion.shape}")

    # Generate text
    text, tokens = generator.generate_text(motion, mean, std)
    print(f"Motion tokens ({len(tokens)}): {tokens}")
    print(f"\nGenerated text: {text}")
```
**Example Usage:**
```bash
python geomotiongpt_inference.py \
--motion_file datasets/humanml3d/new_joint_vecs/000000.npy \
--mean_file datasets/humanml3d/Mean.npy \
--std_file datasets/humanml3d/Std.npy
```
**Expected Output:**
```
Loading motion tokenizer from HuggingFace...
Loading GPT2 tokenizer...
Building language model...
Loading language model weights...
Model ready!
Input motion shape: (116, 263)
Motion tokens (14): [138, 104, 508, 21, 498, 229, 144, 484, 393, 393, 144, 144, 144, 414]
Generated text: a person kicks something with their left foot.
```
## Model Architecture
```
GeoMotionGPTForCausalLM
βββ motion_tokenizer (MotionTokenizer)
β βββ quantizer (MotionQuantizer)
β β βββ 1D CNN with ResNet blocks
β βββ decoder (MotionDecoder)
β β βββ 1D Transposed CNN with ResNet blocks
β βββ codebook
β βββ 512-entry codebook
βββ language_model (GPT2LMHeadModel)
βββ 12-layer transformer
```
## Training Details
- **Motion Tokenizer**: Trained on HumanML3D dataset with DVQ-GSST quantization
- **Language Model**: Fine-tuned GPT-2 with:
- Orthogonality loss (λ=0.01) for motion token embeddings
- Codebook-initialized motion embeddings
- AdamW optimizer (lr=1e-4)
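The orthogonality regularizer is not spelled out above; one common form (a sketch, not necessarily the paper's exact formulation) penalizes the distance between the Gram matrix of the row-normalized motion embeddings and the identity:

```python
import numpy as np

def orthogonality_loss(E: np.ndarray, lam: float = 0.01) -> float:
    """lam * ||E_n E_n^T - I||_F^2 on row-normalized embeddings E of shape (K, d)."""
    E_n = E / np.linalg.norm(E, axis=1, keepdims=True)
    gram = E_n @ E_n.T
    eye = np.eye(E.shape[0])
    return lam * float(np.sum((gram - eye) ** 2))

print(orthogonality_loss(np.eye(4)))            # orthogonal rows -> 0.0
print(orthogonality_loss(np.ones((4, 4))) > 0)  # identical rows are penalized -> True
```

The loss is zero exactly when distinct embedding rows are mutually orthogonal, pushing motion codes to occupy distinct directions in the embedding space.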
## Citation
If you use this model, please cite:
```bibtex
@misc{ye2026geomotiongpt,
title={GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models},
author={Zhankai Ye and Bofan Li and Yukai Jin and Shuoqiu Li and Wei Wang and Yanfu Zhang and Shangqian Gao and Xin Liu},
year={2026},
eprint={2601.07632},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.07632},
}
```
## License
MIT License