zy22b committed (verified)
Commit a73d8fb · Parent(s): 74176b3

Upload folder using huggingface_hub

README.md CHANGED
@@ -11,6 +11,7 @@ tags:
 datasets:
 - humanml3d
 pipeline_tag: text-generation
+library_name: transformers
 ---
 
 # GeoMotionGPT
@@ -19,36 +20,98 @@ GeoMotionGPT is a motion-to-text model that converts human motion sequences into
 
 ## Model Components
 
-This repository contains two model components:
+This model integrates two components:
 
-### 1. Motion Tokenizer (`motion_tokenizer/`)
-- **Architecture**: Decoder-only Vector Quantizer (DVQ) with Gumbel-Softmax Straight-Through (GSST) quantization
+### 1. Motion Tokenizer (DVQ-GSST)
+- **Architecture**: Decoder-only Vector Quantizer with Gumbel-Softmax Straight-Through quantization
 - **Codebook Size**: 512 tokens
 - **Input**: 263-dimensional motion features (HumanML3D format)
-- **Temporal Downsampling**: 4x
+- **Temporal Downsampling**: 8x (3 downsampling layers with stride 2)
 
-### 2. Language Model (`language_model/`)
-- **Base Model**: GPT-2
+### 2. Language Model (Fine-tuned GPT-2)
+- **Base Model**: GPT-2 (124M parameters)
 - **Task**: Motion-to-Text generation
 - **Training**: Fine-tuned with orthogonality regularization (λ=0.01)
-- **Motion Tokens**: 512 additional tokens for motion representation
+- **Total Vocab**: 50772 tokens (50257 text + 512 motion + 3 special)
 
-## Usage
+## Quick Start
 
 ```python
+from transformers import AutoModelForCausalLM
 import torch
-from safetensors.torch import load_file
 
-# Load motion tokenizer
-motion_tokenizer_weights = load_file("motion_tokenizer/model.safetensors")
+# Load the model
+model = AutoModelForCausalLM.from_pretrained(
+    "zy22b/GeoMotionGPT",
+    trust_remote_code=True
+)
 
-# Load language model
-lm_weights = load_file("language_model/model.safetensors")
+# Access the motion tokenizer
+motion_tokenizer = model.motion_tokenizer
+
+# Example: Tokenize motion (batch, time, 263)
+motion = torch.randn(1, 100, 263)  # Random motion features
+tokens = motion_tokenizer.encode(motion)  # -> (batch, time // 8)
+print(f"Motion tokens shape: {tokens.shape}")
+
+# Example: Decode tokens back to motion
+reconstructed = motion_tokenizer.decode(tokens)  # -> (batch, time, 263)
+```
+
+## Usage with HumanML3D Data
+
+```python
+import numpy as np
+import torch
+from transformers import AutoModelForCausalLM
+
+# Load model
+model = AutoModelForCausalLM.from_pretrained(
+    "zy22b/GeoMotionGPT",
+    trust_remote_code=True
+)
+motion_tokenizer = model.motion_tokenizer
+
+# Load HumanML3D motion file
+motion = np.load("path/to/new_joint_vecs/000000.npy")  # (T, 263)
+
+# Load normalization parameters
+mean = np.load("path/to/Mean.npy")
+std = np.load("path/to/Std.npy")
+
+# Normalize
+motion_norm = (motion - mean) / std
+
+# Convert to tensor and add batch dimension
+motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0)  # (1, T, 263)
+
+# Tokenize
+with torch.no_grad():
+    tokens = motion_tokenizer.encode(motion_tensor)
+
+print(f"Input shape: {motion_tensor.shape}")
+print(f"Token shape: {tokens.shape}")
+print(f"Tokens: {tokens[0].tolist()}")
+```
+
+## Model Architecture
+
+```
+GeoMotionGPTForCausalLM
+├── motion_tokenizer (MotionTokenizer)
+│   ├── encoder (MotionEncoder)
+│   │   └── 1D CNN with ResNet blocks
+│   ├── decoder (MotionDecoder)
+│   │   └── 1D CNN with nearest-neighbor upsampling and ResNet blocks
+│   └── quantizer (GumbelSoftmaxQuantizer)
+│       └── 512-entry codebook
+└── language_model (GPT2LMHeadModel)
+    └── 12-layer transformer
+```
 
 ## Training Details
 
-- **Motion Tokenizer**: Trained on HumanML3D dataset with DVQ quantization
+- **Motion Tokenizer**: Trained on the HumanML3D dataset with DVQ-GSST quantization
 - **Language Model**: Fine-tuned GPT-2 with:
   - Orthogonality loss (λ=0.01) for motion token embeddings
   - Codebook-initialized motion embeddings
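
Note that `generate_text` (defined in `modeling_geomotiongpt.py` below) returns raw GPT-2 token IDs rather than strings, and this commit ships no tokenizer files, so decoding is left to the caller. A minimal sketch of the full motion-to-text path, assuming the stock `gpt2` tokenizer matches the model's 50257 text token IDs:

```python
import torch
from transformers import AutoModelForCausalLM, GPT2TokenizerFast

model = AutoModelForCausalLM.from_pretrained("zy22b/GeoMotionGPT", trust_remote_code=True)
model.eval()

# Assumption: the stock GPT-2 tokenizer covers the model's text vocabulary.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

motion = torch.randn(1, 100, 263)  # placeholder for normalized HumanML3D features
with torch.no_grad():
    motion_tokens = model.motion_tokenizer.encode(motion)  # (1, T') with T' ~ T // 8
    generated_ids = model.generate_text(motion_tokens)     # raw token IDs, not strings

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```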
__init__.py ADDED
@@ -0,0 +1,3 @@
+# GeoMotionGPT Model Package
+from .configuration_geomotiongpt import GeoMotionGPTConfig
+from .modeling_geomotiongpt import GeoMotionGPTForCausalLM, GeoMotionGPTPreTrainedModel, MotionTokenizer
__pycache__/configuration_geomotiongpt.cpython-311.pyc ADDED
Binary file (4.96 kB)
 
__pycache__/modeling_geomotiongpt.cpython-311.pyc ADDED
Binary file (26.6 kB)
 
config.json ADDED
@@ -0,0 +1,37 @@
+{
+  "architectures": [
+    "GeoMotionGPTForCausalLM"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_geomotiongpt.GeoMotionGPTConfig",
+    "AutoModelForCausalLM": "modeling_geomotiongpt.GeoMotionGPTForCausalLM"
+  },
+  "model_type": "geomotiongpt",
+  "motion_vocab_size": 512,
+  "motion_input_dim": 263,
+  "motion_hidden_dim": 512,
+  "motion_down_t": 3,
+  "motion_depth": 3,
+  "motion_dilation_growth_rate": 3,
+  "text_vocab_size": 50257,
+  "vocab_size": 50772,
+  "n_positions": 1024,
+  "n_embd": 768,
+  "n_layer": 12,
+  "n_head": 12,
+  "n_inner": null,
+  "activation_function": "gelu_new",
+  "resid_pdrop": 0.1,
+  "embd_pdrop": 0.1,
+  "attn_pdrop": 0.1,
+  "layer_norm_epsilon": 1e-05,
+  "initializer_range": 0.02,
+  "mot_factor": 1.0,
+  "attention_mode": "all",
+  "lambda_ortho": 0.01,
+  "bos_token_id": 50256,
+  "eos_token_id": 50256,
+  "pad_token_id": 50256,
+  "torch_dtype": "float32",
+  "transformers_version": "4.41.0"
+}
configuration_geomotiongpt.py ADDED
@@ -0,0 +1,123 @@
+"""
+GeoMotionGPT Configuration
+
+This module contains the configuration class for GeoMotionGPT, a motion-to-text model
+that combines a VQ-VAE motion tokenizer with a fine-tuned GPT-2 language model.
+"""
+
+from transformers import PretrainedConfig
+
+
+class GeoMotionGPTConfig(PretrainedConfig):
+    """
+    Configuration class for GeoMotionGPT model.
+
+    GeoMotionGPT consists of two components:
+    1. Motion Tokenizer (DVQ-GSST): Converts 263-dim HumanML3D motion features to discrete tokens
+    2. Language Model (GPT-2): Generates text descriptions from motion tokens
+
+    Args:
+        motion_vocab_size (`int`, *optional*, defaults to 512):
+            Size of the motion codebook vocabulary.
+        motion_input_dim (`int`, *optional*, defaults to 263):
+            Input dimension of motion features (HumanML3D format).
+        motion_hidden_dim (`int`, *optional*, defaults to 512):
+            Hidden dimension for motion encoder.
+        motion_down_t (`int`, *optional*, defaults to 3):
+            Number of temporal downsampling layers.
+        motion_depth (`int`, *optional*, defaults to 3):
+            Depth of ResNet blocks in encoder.
+        text_vocab_size (`int`, *optional*, defaults to 50257):
+            Size of the text vocabulary (GPT-2).
+        n_positions (`int`, *optional*, defaults to 1024):
+            Maximum sequence length.
+        n_embd (`int`, *optional*, defaults to 768):
+            Embedding dimension for GPT-2.
+        n_layer (`int`, *optional*, defaults to 12):
+            Number of transformer layers.
+        n_head (`int`, *optional*, defaults to 12):
+            Number of attention heads.
+        mot_factor (`float`, *optional*, defaults to 1.0):
+            Factor for motion embedding dimension.
+        attention_mode (`str`, *optional*, defaults to "all"):
+            Cross-modal attention mode.
+        lambda_ortho (`float`, *optional*, defaults to 0.01):
+            Orthogonality regularization weight.
+
+    Example:
+        ```python
+        from transformers import AutoConfig
+
+        config = AutoConfig.from_pretrained("zy22b/GeoMotionGPT", trust_remote_code=True)
+        print(config.motion_vocab_size)  # 512
+        ```
+    """
+
+    model_type = "geomotiongpt"
+
+    def __init__(
+        self,
+        # Motion tokenizer config
+        motion_vocab_size: int = 512,
+        motion_input_dim: int = 263,
+        motion_hidden_dim: int = 512,
+        motion_down_t: int = 3,
+        motion_depth: int = 3,
+        motion_dilation_growth_rate: int = 3,
+        # Language model config (GPT-2)
+        text_vocab_size: int = 50257,
+        n_positions: int = 1024,
+        n_embd: int = 768,
+        n_layer: int = 12,
+        n_head: int = 12,
+        n_inner: int = None,
+        activation_function: str = "gelu_new",
+        resid_pdrop: float = 0.1,
+        embd_pdrop: float = 0.1,
+        attn_pdrop: float = 0.1,
+        layer_norm_epsilon: float = 1e-5,
+        initializer_range: float = 0.02,
+        # Multi-modal config
+        mot_factor: float = 1.0,
+        attention_mode: str = "all",
+        lambda_ortho: float = 0.01,
+        # Special tokens
+        bos_token_id: int = 50256,
+        eos_token_id: int = 50256,
+        pad_token_id: int = 50256,
+        **kwargs
+    ):
+        # Motion tokenizer parameters
+        self.motion_vocab_size = motion_vocab_size
+        self.motion_input_dim = motion_input_dim
+        self.motion_hidden_dim = motion_hidden_dim
+        self.motion_down_t = motion_down_t
+        self.motion_depth = motion_depth
+        self.motion_dilation_growth_rate = motion_dilation_growth_rate
+
+        # Language model parameters
+        self.text_vocab_size = text_vocab_size
+        self.vocab_size = text_vocab_size + motion_vocab_size + 3  # +3 for special motion tokens (BOT, EOT, PAD)
+        self.n_positions = n_positions
+        self.n_embd = n_embd
+        self.n_layer = n_layer
+        self.n_head = n_head
+        self.n_inner = n_inner
+        self.activation_function = activation_function
+        self.resid_pdrop = resid_pdrop
+        self.embd_pdrop = embd_pdrop
+        self.attn_pdrop = attn_pdrop
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.initializer_range = initializer_range
+
+        # Multi-modal parameters
+        self.mot_factor = mot_factor
+        self.attention_mode = attention_mode
+        self.lambda_ortho = lambda_ortho
+
+        super().__init__(
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            pad_token_id=pad_token_id,
+            **kwargs
+        )
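
The `vocab_size` arithmetic above fixes the unified token ID layout that `generate_text` relies on when it shifts motion tokens by `text_vocab_size`. A small sketch of the implied ranges, run from a local checkout of the repo (the ordering of the three special motion tokens is not pinned down anywhere in this commit, so only the ranges are shown):

```python
from configuration_geomotiongpt import GeoMotionGPTConfig

config = GeoMotionGPTConfig()
assert config.vocab_size == config.text_vocab_size + config.motion_vocab_size + 3  # 50772

# Text tokens occupy IDs 0..50256, motion codes 50257..50768,
# and the three special motion tokens 50769..50771.
text_ids = range(0, config.text_vocab_size)
motion_ids = range(config.text_vocab_size, config.text_vocab_size + config.motion_vocab_size)
special_ids = range(config.text_vocab_size + config.motion_vocab_size, config.vocab_size)

# generate_text applies the same offset when building LM inputs:
#   input_ids = motion_tokens + config.text_vocab_size
```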
model.safetensors ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:95961f9795c0c9620cca77ed684da258cb181970bc1612ff404e28b84ee1e473
+size 766672340
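
Only the Git-LFS pointer is stored in-tree; the roughly 766 MB checkpoint itself is fetched on clone or download. An optional sketch for verifying that a downloaded file matches the pointer's oid:

```python
import hashlib

EXPECTED = "95961f9795c0c9620cca77ed684da258cb181970bc1612ff404e28b84ee1e473"

sha = hashlib.sha256()
with open("model.safetensors", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        sha.update(chunk)

assert sha.hexdigest() == EXPECTED, "checkpoint does not match the LFS pointer"
```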
modeling_geomotiongpt.py ADDED
@@ -0,0 +1,523 @@
+"""
+GeoMotionGPT Model
+
+This module contains the model implementation for GeoMotionGPT, integrating:
+1. Motion Tokenizer (DVQ-GSST VQ-VAE)
+2. Language Model (fine-tuned GPT-2 for motion-to-text)
+
+Usage:
+    ```python
+    from transformers import AutoModelForCausalLM
+
+    model = AutoModelForCausalLM.from_pretrained("zy22b/GeoMotionGPT", trust_remote_code=True)
+    motion_tokenizer = model.motion_tokenizer
+
+    # Tokenize motion
+    motion_tokens = motion_tokenizer.encode(motion_features)
+
+    # Generate text (returns raw token IDs)
+    generated_ids = model.generate_text(motion_tokens)
+    ```
+"""
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from typing import Optional, Tuple, List, Union
+from transformers import PreTrainedModel, GPT2LMHeadModel, GPT2Config
+from transformers.modeling_outputs import CausalLMOutputWithCrossAttentions
+
+# Handle both package and standalone imports
+try:
+    from .configuration_geomotiongpt import GeoMotionGPTConfig
+except ImportError:
+    from configuration_geomotiongpt import GeoMotionGPTConfig
+
+
+# =====================================================
+# Motion Tokenizer Components (DVQ-GSST)
+# =====================================================
+
+class Swish(nn.Module):
+    """Swish activation function."""
+    def forward(self, x):
+        return x * torch.sigmoid(x)
+
+
+class ResConv1DBlock(nn.Module):
+    """Single residual convolution block."""
+
+    def __init__(self, n_in, n_state, dilation=1, activation='relu', norm=None):
+        super().__init__()
+        padding = dilation
+        self.norm = norm
+
+        if norm == "LN":
+            self.norm1 = nn.LayerNorm(n_in)
+            self.norm2 = nn.LayerNorm(n_in)
+        elif norm == "GN":
+            self.norm1 = nn.GroupNorm(num_groups=32, num_channels=n_in, eps=1e-6, affine=True)
+            self.norm2 = nn.GroupNorm(num_groups=32, num_channels=n_in, eps=1e-6, affine=True)
+        elif norm == "BN":
+            self.norm1 = nn.BatchNorm1d(num_features=n_in, eps=1e-6, affine=True)
+            self.norm2 = nn.BatchNorm1d(num_features=n_in, eps=1e-6, affine=True)
+        else:
+            self.norm1 = nn.Identity()
+            self.norm2 = nn.Identity()
+
+        if activation == "relu":
+            self.activation1 = nn.ReLU()
+            self.activation2 = nn.ReLU()
+        elif activation == "silu":
+            self.activation1 = Swish()
+            self.activation2 = Swish()
+        elif activation == "gelu":
+            self.activation1 = nn.GELU()
+            self.activation2 = nn.GELU()
+
+        self.conv1 = nn.Conv1d(n_in, n_state, 3, 1, padding, dilation)
+        self.conv2 = nn.Conv1d(n_state, n_in, 1, 1, 0)
+
+    def forward(self, x):
+        x_orig = x
+        if self.norm == "LN":
+            x = self.norm1(x.transpose(-2, -1))
+            x = self.activation1(x.transpose(-2, -1))
+        else:
+            x = self.norm1(x)
+            x = self.activation1(x)
+        x = self.conv1(x)
+        if self.norm == "LN":
+            x = self.norm2(x.transpose(-2, -1))
+            x = self.activation2(x.transpose(-2, -1))
+        else:
+            x = self.norm2(x)
+            x = self.activation2(x)
+        x = self.conv2(x)
+        return x + x_orig
+
+
+class Resnet1D(nn.Module):
+    """1D ResNet block composed of multiple ResConv1DBlocks."""
+
+    def __init__(self, n_in, n_depth, dilation_growth_rate=1,
+                 reverse_dilation=True, activation='relu', norm=None):
+        super().__init__()
+        blocks = [
+            ResConv1DBlock(n_in, n_in, dilation=dilation_growth_rate ** depth,
+                           activation=activation, norm=norm)
+            for depth in range(n_depth)
+        ]
+        if reverse_dilation:
+            blocks = blocks[::-1]
+        self.model = nn.Sequential(*blocks)
+
+    def forward(self, x):
+        return self.model(x)
+
+
+class MotionEncoder(nn.Module):
+    """Encoder for motion features with temporal downsampling."""
+
+    def __init__(self, input_dim=263, hidden_dim=512, nb_code=512,
+                 down_t=3, stride_t=2, depth=3, dilation_growth_rate=3,
+                 activation='relu', norm=None):
+        super().__init__()
+        blocks = []
+        filter_t, pad_t = stride_t * 2, stride_t // 2
+        blocks.append(nn.Conv1d(input_dim, hidden_dim, 3, 1, 1))
+        blocks.append(nn.ReLU())
+        for _ in range(down_t):
+            block = nn.Sequential(
+                nn.Conv1d(hidden_dim, hidden_dim, filter_t, stride_t, pad_t),
+                Resnet1D(hidden_dim, depth, dilation_growth_rate,
+                         reverse_dilation=False, activation=activation, norm=norm),
+            )
+            blocks.append(block)
+        blocks.append(nn.Conv1d(hidden_dim, nb_code, 3, 1, 1))
+        self.model = nn.Sequential(*blocks)
+
+    def forward(self, x):
+        return self.model(x)
+
+
+class MotionDecoder(nn.Module):
+    """Decoder for reconstructing motion from quantized features."""
+
+    def __init__(self, output_dim=263, hidden_dim=512, code_dim=512,
+                 down_t=3, stride_t=2, depth=3, dilation_growth_rate=3,
+                 activation='relu', norm=None):
+        super().__init__()
+        blocks = []
+        blocks.append(nn.Conv1d(code_dim, hidden_dim, 3, 1, 1))
+        blocks.append(nn.ReLU())
+        for _ in range(down_t):
+            block = nn.Sequential(
+                Resnet1D(hidden_dim, depth, dilation_growth_rate,
+                         reverse_dilation=True, activation=activation, norm=norm),
+                nn.Upsample(scale_factor=2, mode='nearest'),
+                nn.Conv1d(hidden_dim, hidden_dim, 3, 1, 1)
+            )
+            blocks.append(block)
+        blocks.append(nn.Conv1d(hidden_dim, hidden_dim, 3, 1, 1))
+        blocks.append(nn.ReLU())
+        blocks.append(nn.Conv1d(hidden_dim, output_dim, 3, 1, 1))
+        self.model = nn.Sequential(*blocks)
+
+    def forward(self, x):
+        return self.model(x)
+
+
+class GumbelSoftmaxQuantizer(nn.Module):
+    """Gumbel-Softmax Straight-Through quantizer for VQ-VAE."""
+
+    def __init__(self, nb_code=512, code_dim=512):
+        super().__init__()
+        self.nb_code = nb_code
+        self.code_dim = code_dim
+        self.codebook = nn.Embedding(nb_code, code_dim)
+        nn.init.uniform_(self.codebook.weight, -1.0 / nb_code, 1.0 / nb_code)
+        self.tau = 0.4
+
+    def quantize(self, x):
+        """Quantize encoder output logits to discrete indices."""
+        return x.argmax(dim=-1)
+
+    def dequantize(self, indices):
+        """Convert indices back to codebook embeddings."""
+        return self.codebook(indices)
+
+    def forward(self, x_encoder):
+        """Forward pass with Gumbel-Softmax sampling."""
+        N, C, T = x_encoder.shape
+        x = x_encoder.permute(0, 2, 1).contiguous().view(-1, C)
+
+        # Gumbel-Softmax with straight-through estimator
+        y_hard_st = F.gumbel_softmax(x, tau=self.tau, hard=True, dim=-1)
+        x_quantized = torch.matmul(y_hard_st, self.codebook.weight)
+
+        return x_quantized.view(N, T, -1).permute(0, 2, 1).contiguous()
+
+
+class MotionTokenizer(nn.Module):
+    """
+    DVQ-GSST Motion Tokenizer.
+
+    Converts continuous motion features (263-dim HumanML3D format) to discrete tokens.
+
+    Args:
+        config: GeoMotionGPTConfig containing motion tokenizer parameters
+
+    Example:
+        ```python
+        motion = torch.randn(1, 100, 263)  # (batch, time, features)
+        tokens = motion_tokenizer.encode(motion)  # (batch, time // 8)
+        ```
+    """
+
+    def __init__(self, config: GeoMotionGPTConfig):
+        super().__init__()
+        self.config = config
+
+        self.encoder = MotionEncoder(
+            input_dim=config.motion_input_dim,
+            hidden_dim=config.motion_hidden_dim,
+            nb_code=config.motion_vocab_size,
+            down_t=config.motion_down_t,
+            depth=config.motion_depth,
+            dilation_growth_rate=config.motion_dilation_growth_rate,
+        )
+
+        self.decoder = MotionDecoder(
+            output_dim=config.motion_input_dim,
+            hidden_dim=config.motion_hidden_dim,
+            code_dim=config.motion_vocab_size,
+            down_t=config.motion_down_t,
+            depth=config.motion_depth,
+            dilation_growth_rate=config.motion_dilation_growth_rate,
+        )
+
+        self.quantizer = GumbelSoftmaxQuantizer(
+            nb_code=config.motion_vocab_size,
+            code_dim=config.motion_vocab_size,
+        )
+
+    def encode(self, motion: torch.Tensor) -> torch.Tensor:
+        """
+        Encode motion features to discrete tokens.
+
+        Args:
+            motion: Motion features of shape (batch, time, 263)
+
+        Returns:
+            Token indices of shape (batch, time // downsample_ratio)
+        """
+        # (batch, time, 263) -> (batch, 263, time)
+        x = motion.permute(0, 2, 1).float()
+
+        # Encode
+        x_enc = self.encoder(x)  # (batch, nb_code, time')
+
+        # (batch, nb_code, time') -> (batch, time', nb_code)
+        x_enc = x_enc.permute(0, 2, 1).contiguous()
+        N, T, C = x_enc.shape
+
+        # Get token indices
+        indices = self.quantizer.quantize(x_enc.view(-1, C))
+        return indices.view(N, T)
+
+    def decode(self, tokens: torch.Tensor) -> torch.Tensor:
+        """
+        Decode tokens back to motion features.
+
+        Args:
+            tokens: Token indices of shape (batch, time')
+
+        Returns:
+            Motion features of shape (batch, time, 263)
+        """
+        # Get embeddings from tokens
+        x = self.quantizer.dequantize(tokens)  # (batch, time', code_dim)
+
+        # (batch, time', code_dim) -> (batch, code_dim, time')
+        x = x.permute(0, 2, 1).contiguous()
+
+        # Decode
+        x_out = self.decoder(x)  # (batch, 263, time)
+
+        # (batch, 263, time) -> (batch, time, 263)
+        return x_out.permute(0, 2, 1)
+
+    def forward(self, motion: torch.Tensor):
+        """Forward pass for training (encode -> quantize -> decode)."""
+        x = motion.permute(0, 2, 1).float()
+        x_enc = self.encoder(x)
+        x_quant = self.quantizer(x_enc)
+        x_dec = self.decoder(x_quant)
+        return x_dec.permute(0, 2, 1)
+
+
+# =====================================================
+# Main GeoMotionGPT Model
+# =====================================================
+
+class GeoMotionGPTPreTrainedModel(PreTrainedModel):
+    """Base class for GeoMotionGPT models."""
+
+    config_class = GeoMotionGPTConfig
+    base_model_prefix = "geomotiongpt"
+    supports_gradient_checkpointing = True
+
+    def _init_weights(self, module):
+        """Initialize weights."""
+        if isinstance(module, (nn.Linear, nn.Conv1d)):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.bias is not None:
+                module.bias.data.zero_()
+        elif isinstance(module, nn.Embedding):
+            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+            if module.padding_idx is not None:
+                module.weight.data[module.padding_idx].zero_()
+        elif isinstance(module, nn.LayerNorm):
+            module.bias.data.zero_()
+            module.weight.data.fill_(1.0)
+
+
+class GeoMotionGPTForCausalLM(GeoMotionGPTPreTrainedModel):
+    """
+    GeoMotionGPT Model for motion-to-text generation.
+
+    This model combines:
+    1. A VQ-VAE motion tokenizer (DVQ-GSST) for converting motion to discrete tokens
+    2. A fine-tuned GPT-2 model for generating text from motion tokens
+
+    Example:
+        ```python
+        from transformers import AutoModelForCausalLM
+        import torch
+
+        # Load model
+        model = AutoModelForCausalLM.from_pretrained(
+            "zy22b/GeoMotionGPT",
+            trust_remote_code=True
+        )
+
+        # Access motion tokenizer
+        motion_tokenizer = model.motion_tokenizer
+
+        # Tokenize motion (batch, time, 263) -> (batch, tokens)
+        motion = torch.randn(1, 100, 263)
+        motion_tokens = motion_tokenizer.encode(motion)
+
+        # Generate token IDs from motion tokens (decode with a GPT-2 tokenizer)
+        generated_ids = model.generate_text(motion_tokens)
+        ```
+    """
+
+    _tied_weights_keys = ["lm_head.weight"]
+
+    def __init__(self, config: GeoMotionGPTConfig):
+        super().__init__(config)
+
+        # Motion tokenizer
+        self.motion_tokenizer = MotionTokenizer(config)
+
+        # Build GPT-2 config
+        gpt2_config = GPT2Config(
+            vocab_size=config.vocab_size,
+            n_positions=config.n_positions,
+            n_embd=config.n_embd,
+            n_layer=config.n_layer,
+            n_head=config.n_head,
+            n_inner=config.n_inner,
+            activation_function=config.activation_function,
+            resid_pdrop=config.resid_pdrop,
+            embd_pdrop=config.embd_pdrop,
+            attn_pdrop=config.attn_pdrop,
+            layer_norm_epsilon=config.layer_norm_epsilon,
+            initializer_range=config.initializer_range,
+            bos_token_id=config.bos_token_id,
+            eos_token_id=config.eos_token_id,
+        )
+
+        # Language model (GPT-2)
+        self.language_model = GPT2LMHeadModel(gpt2_config)
+
+        # Motion token embeddings (separate from text embeddings)
+        mot_embed_dim = int(config.n_embd // config.n_head * config.mot_factor) * config.n_head
+        self.motion_embed = nn.Embedding(
+            config.motion_vocab_size + 3,  # +3 for special tokens (BOT, EOT, PAD)
+            mot_embed_dim
+        )
+        self.motion_head = nn.Linear(mot_embed_dim, config.motion_vocab_size + 3, bias=False)
+
+        # Projection layers for multi-modal fusion
+        self.motion_to_text_proj = nn.Linear(mot_embed_dim, config.n_embd)
+        self.text_to_motion_proj = nn.Linear(config.n_embd, mot_embed_dim)
+
+        # Initialize weights
+        self.post_init()
+
+    def get_input_embeddings(self):
+        return self.language_model.transformer.wte
+
+    def set_input_embeddings(self, value):
+        self.language_model.transformer.wte = value
+
+    def get_output_embeddings(self):
+        return self.language_model.lm_head
+
+    def set_output_embeddings(self, new_embeddings):
+        self.language_model.lm_head = new_embeddings
+
+    def encode_motion(self, motion: torch.Tensor) -> torch.Tensor:
+        """
+        Encode motion features to discrete tokens.
+
+        Args:
+            motion: Motion features of shape (batch, time, 263)
+
+        Returns:
+            Token indices of shape (batch, time // 8)
+        """
+        return self.motion_tokenizer.encode(motion)
+
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,
+        attention_mask: Optional[torch.FloatTensor] = None,
+        past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
+        labels: Optional[torch.LongTensor] = None,
+        use_cache: Optional[bool] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        **kwargs
+    ):
+        """
+        Forward pass through the language model.
+
+        For motion-to-text generation, use the `generate_text` method instead.
+        """
+        return self.language_model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            past_key_values=past_key_values,
+            labels=labels,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+        )
+
+    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
+        """Prepare inputs for text generation."""
+        return self.language_model.prepare_inputs_for_generation(
+            input_ids, past_key_values=past_key_values, **kwargs
+        )
+
+    @torch.no_grad()
+    def generate_text(
+        self,
+        motion_tokens: torch.Tensor,
+        max_new_tokens: int = 128,
+        num_beams: int = 4,
+        temperature: float = 0.7,
+        top_p: float = 0.9,
+        do_sample: bool = True,
+        **kwargs
+    ) -> torch.LongTensor:
+        """
+        Generate text descriptions from motion tokens.
+
+        Args:
+            motion_tokens: Motion token indices of shape (batch, seq_len)
+            max_new_tokens: Maximum number of new tokens to generate
+            num_beams: Number of beams for beam search
+            temperature: Sampling temperature
+            top_p: Top-p sampling parameter
+            do_sample: Whether to use sampling
+
+        Returns:
+            Generated token IDs of shape (batch, new_tokens); decode them to
+            strings with a GPT-2 text tokenizer.
+        """
+        device = motion_tokens.device
+        batch_size = motion_tokens.shape[0]
+
+        # Offset motion tokens (they come after text tokens in the unified vocab)
+        motion_offset = self.config.text_vocab_size
+        input_ids = motion_tokens + motion_offset
+
+        # Add BOS token at the start
+        bos_tokens = torch.full(
+            (batch_size, 1),
+            self.config.bos_token_id,
+            dtype=torch.long,
+            device=device
+        )
+        input_ids = torch.cat([bos_tokens, input_ids], dim=1)
+
+        # Generate
+        outputs = self.language_model.generate(
+            input_ids=input_ids,
+            max_new_tokens=max_new_tokens,
+            num_beams=num_beams,
+            temperature=temperature,
+            top_p=top_p,
+            do_sample=do_sample,
+            pad_token_id=self.config.pad_token_id,
+            eos_token_id=self.config.eos_token_id,
+            **kwargs
+        )
+
+        # Keep only the generated part
+        generated_ids = outputs[:, input_ids.shape[1]:]
+
+        # Note: actual text decoding requires a tokenizer;
+        # return raw generated IDs for now
+        return generated_ids
+
+
+# Register for AutoClass
+GeoMotionGPTConfig.register_for_auto_class()
+GeoMotionGPTForCausalLM.register_for_auto_class("AutoModelForCausalLM")
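
Note that the orthogonality regularization advertised in the README (λ=0.01) only appears here as the stored `config.lambda_ortho`; this inference-only commit does not include the training loss. A common formulation of such a penalty, shown purely as an illustrative sketch rather than the repository's actual training code:

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(embed_weight: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal overlap between L2-normalized embedding rows:
    || E_n E_n^T - I ||_F^2."""
    e = F.normalize(embed_weight, dim=-1)             # (V_m, D)
    gram = e @ e.t()                                  # (V_m, V_m) cosine similarities
    identity = torch.eye(e.size(0), device=e.device)
    return ((gram - identity) ** 2).sum()

# Hypothetical use during fine-tuning:
#   loss = lm_loss + config.lambda_ortho * orthogonality_loss(model.motion_embed.weight)
```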