---
license: mit
language:
- en
tags:
- motion
- human-motion
- motion-to-text
- vq-vae
- gpt2
datasets:
- humanml3d
pipeline_tag: text-generation
library_name: transformers
base_model:
- openai-community/gpt2
---

# GeoMotionGPT

GeoMotionGPT is a motion-to-text model that converts human motion sequences into natural language descriptions.

## Model Components

This model integrates two components:

### 1. Motion Tokenizer (DVQ-GSST)
- **Architecture**: Decoder-only Vector Quantizer with Gumbel-Softmax Straight-Through quantization
- **Codebook Size**: 512 tokens
- **Input**: 263-dimensional motion features (HumanML3D format)
- **Temporal Downsampling**: 8x (3 downsampling layers with stride 2)
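The 8x downsampling means a clip of T frames maps to roughly T // 8 motion tokens. A minimal sketch of that arithmetic (`motion_token_count` is a hypothetical helper, assuming each stride-2 layer halves the temporal length with floor division):

```python
def motion_token_count(num_frames: int, num_layers: int = 3, stride: int = 2) -> int:
    """Estimate how many motion tokens a clip of `num_frames` frames yields.

    Assumes each of the 3 downsampling layers halves the temporal length
    (floor division), for an overall 8x reduction.
    """
    length = num_frames
    for _ in range(num_layers):
        length //= stride
    return length

print(motion_token_count(116))  # a 116-frame HumanML3D clip -> 14 tokens
```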

### 2. Language Model (Fine-tuned GPT-2)
- **Base Model**: GPT-2 (124M parameters)
- **Task**: Motion-to-Text generation
- **Training**: Fine-tuned with orthogonality regularization (Ξ»=0.01)
- **Total Vocab**: 50772 tokens (50257 text + 512 motion + 3 special)
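Under this layout, token IDs partition into contiguous ranges. A hedged sketch of the mapping (`token_kind` is illustrative; the exact assignment depends on the order tokens were added, assumed here to be text first, then motion, then special):

```python
TEXT_VOCAB = 50257        # GPT-2 text tokens occupy IDs 0..50256
MOTION_CODEBOOK = 512     # motion tokens follow, IDs 50257..50768

def token_kind(token_id: int) -> str:
    """Classify a token ID by the vocabulary region it falls in."""
    if token_id < TEXT_VOCAB:
        return "text"
    if token_id < TEXT_VOCAB + MOTION_CODEBOOK:
        return "motion"
    return "special"

print(token_kind(42))     # text
print(token_kind(50300))  # motion (codebook index 50300 - 50257 = 43)
print(token_kind(50770))  # special
```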

## Quick Start: Motion Tokenization

```python
from transformers import AutoModelForCausalLM
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "zy22b/GeoMotionGPT", 
    trust_remote_code=True
)

# Access the motion tokenizer
motion_tokenizer = model.motion_tokenizer

# Example: Tokenize motion (batch, time, 263)
motion = torch.randn(1, 100, 263)  # Random motion features
with torch.no_grad():
    tokens = motion_tokenizer.encode(motion)  # -> (batch, time//8)
print(f"Motion tokens shape: {tokens.shape}")

# Example: Decode tokens back to motion
with torch.no_grad():
    reconstructed = motion_tokenizer.decode(tokens)  # -> (batch, time, 263)
print(f"Reconstructed shape: {reconstructed.shape}")
```

## Usage with HumanML3D Data

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "zy22b/GeoMotionGPT", 
    trust_remote_code=True
)
motion_tokenizer = model.motion_tokenizer

# Load HumanML3D motion file
motion = np.load("datasets/humanml3d/new_joint_vecs/000000.npy")  # (T, 263)

# Load normalization parameters
mean = np.load("datasets/humanml3d/Mean.npy")
std = np.load("datasets/humanml3d/Std.npy")

# Normalize
motion_norm = (motion - mean) / std

# Convert to tensor and add batch dimension
motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0)  # (1, T, 263)

# Tokenize
with torch.no_grad():
    tokens = motion_tokenizer.encode(motion_tensor)
    
print(f"Input shape: {motion_tensor.shape}")   # e.g., torch.Size([1, 116, 263])
print(f"Token shape: {tokens.shape}")          # e.g., torch.Size([1, 14])
print(f"Tokens: {tokens[0].tolist()}")         # e.g., [138, 104, 508, ...]
```
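Note that decoded motion comes back in the normalized feature space, so it must be de-normalized before use. A minimal NumPy sketch of the round trip (synthetic arrays stand in for the dataset files and tokenizer output above):

```python
import numpy as np

# Synthetic stand-ins: in practice these come from Mean.npy / Std.npy
mean = np.zeros(263, dtype=np.float32)
std = np.full(263, 2.0, dtype=np.float32)
motion = np.random.randn(116, 263).astype(np.float32)

# Normalize before encoding...
motion_norm = (motion - mean) / std

# ...and invert after decoding. Here we reuse motion_norm directly;
# in practice it would come from motion_tokenizer.decode(tokens).
reconstructed = motion_norm * std + mean

print(reconstructed.shape)  # (116, 263)
```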

## Full Motion-to-Text Generation

The following is a complete, self-contained script for motion-to-text generation. 

**Requirements:**
```bash
pip install torch transformers numpy safetensors huggingface_hub
```

**Complete Code:**

```python
"""
GeoMotionGPT: Complete Motion-to-Text Generation
Self-contained script - requires only: torch, transformers, numpy, safetensors
"""

import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer, GPT2LMHeadModel, GPT2Config
from safetensors.torch import load_file


class GeoMotionGPTGenerator:
    """Complete Motion-to-Text Generator using HuggingFace models."""
    
    def __init__(self, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = torch.device(device)
        self.text_vocab_size = 50257
        self.motion_codebook_size = 512
        
        # Load motion tokenizer from HuggingFace
        print("Loading motion tokenizer from HuggingFace...")
        hf_model = AutoModelForCausalLM.from_pretrained(
            "zy22b/GeoMotionGPT", 
            trust_remote_code=True
        )
        self.motion_tokenizer = hf_model.motion_tokenizer.to(self.device)
        self.motion_tokenizer.eval()
        
        # Load GPT2 tokenizer
        print("Loading GPT2 tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Add motion tokens to tokenizer
        motion_tokens = [f'<motion_id_{i}>' for i in range(self.motion_codebook_size)]
        special_tokens = ['<start_of_motion>', '<end_of_motion>', '<masked_motion>', '<pad_motion>']
        self.tokenizer.add_tokens(motion_tokens)
        self.tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})
        
        # Build language model using transformers GPT2LMHeadModel
        print("Building language model...")
        config = GPT2Config(
            vocab_size=50772,  # 50257 text + 512 motion + 3 special
            n_positions=1024,
            n_embd=768,
            n_layer=12,
            n_head=12,
        )
        self.language_model = GPT2LMHeadModel(config).to(self.device)
        
        # Load weights
        print("Loading language model weights...")
        self._load_weights()
        self.language_model.eval()
        print("Model ready!")
        
    def _load_weights(self):
        """Load language model weights from HuggingFace."""
        from huggingface_hub import hf_hub_download
        
        # Download weights
        weights_path = hf_hub_download(
            repo_id="zy22b/GeoMotionGPT",
            filename="model.safetensors"
        )
        
        state_dict = load_file(weights_path)
        
        # Map weights to model (remove 'language_model.' prefix)
        new_state_dict = {}
        for k, v in state_dict.items():
            if k.startswith("language_model."):
                new_key = k[len("language_model."):]
                new_state_dict[new_key] = v
            
        self.language_model.load_state_dict(new_state_dict, strict=False)
    
    def motion_tokens_to_string(self, tokens: torch.Tensor) -> str:
        """Convert motion token IDs to string format."""
        token_list = tokens[0].cpu().tolist()
        # Use the special tokens registered in __init__; '<motion_id_512>' and
        # '<motion_id_513>' are not in the tokenizer vocabulary.
        mot_start = '<start_of_motion>'
        mot_end = '<end_of_motion>'
        motion_str = ''.join(f'<motion_id_{int(t)}>' for t in token_list)
        return mot_start + motion_str + mot_end
    
    def generate_text(self, motion: np.ndarray, mean: np.ndarray, std: np.ndarray,
                      max_new_tokens: int = 40) -> tuple:
        """
        Generate text description from motion sequence.
        
        Args:
            motion: Raw motion array of shape (T, 263)
            mean: Mean for normalization (263,)
            std: Std for normalization (263,)
            max_new_tokens: Maximum tokens to generate
            
        Returns:
            Tuple of (generated text description, list of motion token IDs)
        """
        # Normalize motion
        motion_norm = (motion - mean) / std
        motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0).to(self.device)
        
        # Tokenize motion
        with torch.no_grad():
            motion_tokens = self.motion_tokenizer.encode(motion_tensor)
        
        # Convert to string and create prompt
        motion_string = self.motion_tokens_to_string(motion_tokens)
        prompt = f"Generate text: {motion_string} \n "
        
        # Tokenize prompt
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.device)
        
        # Generate using GPT2's built-in generate
        with torch.no_grad():
            output_ids = self.language_model.generate(
                input_ids,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                eos_token_id=self.tokenizer.eos_token_id,
                pad_token_id=self.tokenizer.pad_token_id,
            )
        
        # Decode output
        generated_ids = output_ids[0, input_ids.shape[1]:]
        
        # Filter to text tokens only
        text_ids = [tid.item() for tid in generated_ids if tid.item() < self.text_vocab_size]
        generated_text = self.tokenizer.decode(text_ids, skip_special_tokens=True)
        
        return generated_text.strip(), motion_tokens[0].tolist()


if __name__ == "__main__":
    import argparse
    
    parser = argparse.ArgumentParser(description="GeoMotionGPT Motion-to-Text Generation")
    parser.add_argument("--motion_file", type=str, required=True,
                        help="Path to HumanML3D motion .npy file")
    parser.add_argument("--mean_file", type=str, default="Mean.npy",
                        help="Path to Mean.npy")
    parser.add_argument("--std_file", type=str, default="Std.npy", 
                        help="Path to Std.npy")
    parser.add_argument("--device", type=str, default="cuda",
                        help="Device to use (cuda/cpu)")
    args = parser.parse_args()
    
    # Initialize generator
    generator = GeoMotionGPTGenerator(device=args.device)
    
    # Load data
    motion = np.load(args.motion_file)
    mean = np.load(args.mean_file)
    std = np.load(args.std_file)
    
    print(f"\nInput motion shape: {motion.shape}")
    
    # Generate text
    text, tokens = generator.generate_text(motion, mean, std)
    
    print(f"Motion tokens ({len(tokens)}): {tokens}")
    print(f"\nGenerated text: {text}")
```

**Example Usage:**
```bash
python geomotiongpt_inference.py \
    --motion_file datasets/humanml3d/new_joint_vecs/000000.npy \
    --mean_file datasets/humanml3d/Mean.npy \
    --std_file datasets/humanml3d/Std.npy
```

**Expected Output:**
```
Loading motion tokenizer from HuggingFace...
Loading GPT2 tokenizer...
Building language model...
Loading language model weights...
Model ready!

Input motion shape: (116, 263)
Motion tokens (14): [138, 104, 508, 21, 498, 229, 144, 484, 393, 393, 144, 144, 144, 414]

Generated text: a person kicks something with their left foot.
```

## Model Architecture

```
GeoMotionGPTForCausalLM
β”œβ”€β”€ motion_tokenizer (MotionTokenizer)
β”‚   β”œβ”€β”€ quantizer (MotionQuantizer)
β”‚   β”‚   └── 1D CNN with ResNet blocks
β”‚   β”œβ”€β”€ decoder (MotionDecoder) 
β”‚   β”‚   └── 1D Transposed CNN with ResNet blocks
β”‚   └── codebook
β”‚       └── 512-entry codebook
└── language_model (GPT2LMHeadModel)
    └── 12-layer transformer
```

## Training Details

- **Motion Tokenizer**: Trained on HumanML3D dataset with DVQ-GSST quantization
- **Language Model**: Fine-tuned GPT-2 with:
  - Orthogonality loss (Ξ»=0.01) for motion token embeddings
  - Codebook-initialized motion embeddings
  - AdamW optimizer (lr=1e-4)
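The orthogonality regularizer penalizes correlation between motion token embeddings, roughly of the form Ξ»Β·β€–EEα΅€ βˆ’ Iβ€–Β²_F over the row-normalized motion embedding matrix. A NumPy sketch of one plausible formulation (the exact loss is defined by the training code, not this card; `orthogonality_loss` is illustrative only):

```python
import numpy as np

def orthogonality_loss(embeddings: np.ndarray, lam: float = 0.01) -> float:
    """lam * ||E_n @ E_n.T - I||_F^2 for row-normalized embeddings E_n."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    e = embeddings / np.clip(norms, 1e-8, None)
    gram = e @ e.T                      # pairwise cosine similarities
    identity = np.eye(e.shape[0])
    return lam * float(np.sum((gram - identity) ** 2))

# An orthonormal set of embeddings gives zero loss; correlated ones do not
print(orthogonality_loss(np.eye(8)))        # 0.0
print(orthogonality_loss(np.ones((4, 4))))  # > 0
```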

## Citation

If you use this model, please cite:
```bibtex
@misc{ye2026geomotiongpt,
      title={GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models}, 
      author={Zhankai Ye and Bofan Li and Yukai Jin and Shuoqiu Li and Wei Wang and Yanfu Zhang and Shangqian Gao and Xin Liu},
      year={2026},
      eprint={2601.07632},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.07632}, 
}
```

## License

MIT License