---
license: mit
language:
- en
tags:
- motion
- human-motion
- motion-to-text
- vq-vae
- gpt2
datasets:
- humanml3d
pipeline_tag: text-generation
library_name: transformers
base_model:
- openai-community/gpt2
---

# GeoMotionGPT

GeoMotionGPT is a motion-to-text model that converts human motion sequences into natural language descriptions.

## Model Components

This model integrates two components:

### 1. Motion Tokenizer (DVQ-GSST)

- **Architecture**: Decoder-only Vector Quantizer with Gumbel-Softmax Straight-Through (GSST) quantization
- **Codebook Size**: 512 tokens
- **Input**: 263-dimensional motion features (HumanML3D format)
- **Temporal Downsampling**: 8x (3 downsampling layers with stride 2)
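Gumbel-Softmax Straight-Through quantization picks a hard one-hot code in the forward pass while letting gradients flow through the soft distribution in the backward pass. The following is a minimal sketch of that idea using PyTorch's built-in `gumbel_softmax`; the shapes and function name here are illustrative assumptions, not the repository's actual quantizer code:

```python
import torch
import torch.nn.functional as F

def gsst_select(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Hard one-hot code selection forward, soft gradients backward.
    `logits`: (batch, time, codebook_size) scores against the codebook."""
    return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)

logits = torch.randn(1, 14, 512)     # e.g., 14 timesteps over a 512-entry codebook
one_hot = gsst_select(logits)        # exactly one code active per timestep
indices = one_hot.argmax(dim=-1)     # discrete token IDs in [0, 512)
print(one_hot.shape, indices.shape)  # torch.Size([1, 14, 512]) torch.Size([1, 14])
```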

### 2. Language Model (Fine-tuned GPT-2)

- **Base Model**: GPT-2 (124M parameters)
- **Task**: Motion-to-text generation
- **Training**: Fine-tuned with orthogonality regularization (λ = 0.01)
- **Total Vocab**: 50,772 tokens (50,257 text + 512 motion + 3 special)
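In the combined vocabulary, motion tokens occupy the ID range immediately after the GPT-2 text vocabulary. A small helper makes the layout concrete, assuming the motion tokens are appended in codebook order (consistent with how `tokenizer.add_tokens` assigns IDs, though the exact mapping is an assumption here):

```python
TEXT_VOCAB_SIZE = 50257      # GPT-2 text tokens occupy IDs 0..50256
MOTION_CODEBOOK_SIZE = 512   # motion tokens occupy IDs 50257..50768

def motion_index_to_vocab_id(i: int) -> int:
    """Map a codebook index (0..511) to its ID in the combined vocabulary."""
    if not 0 <= i < MOTION_CODEBOOK_SIZE:
        raise ValueError(f"codebook index out of range: {i}")
    return TEXT_VOCAB_SIZE + i

print(motion_index_to_vocab_id(0))    # 50257
print(motion_index_to_vocab_id(511))  # 50768
```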

## Quick Start: Motion Tokenization

```python
import torch
from transformers import AutoModelForCausalLM

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "zy22b/GeoMotionGPT",
    trust_remote_code=True
)

# Access the motion tokenizer
motion_tokenizer = model.motion_tokenizer

# Example: tokenize motion features of shape (batch, time, 263)
motion = torch.randn(1, 100, 263)  # random motion features
with torch.no_grad():
    tokens = motion_tokenizer.encode(motion)  # -> (batch, time // 8)
print(f"Motion tokens shape: {tokens.shape}")

# Example: decode tokens back to motion
with torch.no_grad():
    reconstructed = motion_tokenizer.decode(tokens)  # -> (batch, time, 263)
print(f"Reconstructed shape: {reconstructed.shape}")
```
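As a rule of thumb, the 8x downsampling means a clip of `T` frames yields about `T // 8` motion tokens: each of the three stride-2 layers halves the temporal axis. A quick sanity check of that arithmetic (the exact count can shift by one depending on the tokenizer's convolution padding, so treat this as an approximation):

```python
def approx_token_count(num_frames: int, num_stride2_layers: int = 3) -> int:
    """Approximate motion-token count after repeated stride-2 downsampling."""
    n = num_frames
    for _ in range(num_stride2_layers):
        n //= 2  # each stride-2 layer halves the temporal length
    return n

print(approx_token_count(100))  # 12, matching the (batch, time // 8) shape above
print(approx_token_count(116))  # 14
```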

## Usage with HumanML3D Data

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "zy22b/GeoMotionGPT",
    trust_remote_code=True
)
motion_tokenizer = model.motion_tokenizer

# Load a HumanML3D motion file
motion = np.load("datasets/humanml3d/new_joint_vecs/000000.npy")  # (T, 263)

# Load normalization parameters
mean = np.load("datasets/humanml3d/Mean.npy")
std = np.load("datasets/humanml3d/Std.npy")

# Normalize
motion_norm = (motion - mean) / std

# Convert to tensor and add batch dimension
motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0)  # (1, T, 263)

# Tokenize
with torch.no_grad():
    tokens = motion_tokenizer.encode(motion_tensor)

print(f"Input shape: {motion_tensor.shape}")  # e.g., torch.Size([1, 116, 263])
print(f"Token shape: {tokens.shape}")         # e.g., torch.Size([1, 14])
print(f"Tokens: {tokens[0].tolist()}")        # e.g., [138, 104, 508, ...]
```
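Since `encode` consumes normalized features, the decoder's reconstruction is presumably in the same normalized space, so recovering raw HumanML3D features means inverting the normalization with `motion_norm * std + mean`. Here is a self-contained check of that round trip; the `mean`, `std`, and `motion` arrays are random placeholders standing in for `Mean.npy`, `Std.npy`, and a real clip:

```python
import numpy as np

rng = np.random.default_rng(0)
mean = rng.normal(size=263)             # placeholder for Mean.npy
std = rng.uniform(0.5, 2.0, size=263)   # placeholder for Std.npy (kept away from 0)
motion = rng.normal(size=(116, 263))    # placeholder for a raw (T, 263) clip

motion_norm = (motion - mean) / std     # normalize, as in the snippet above
motion_back = motion_norm * std + mean  # denormalize: the exact inverse

print(np.allclose(motion, motion_back))  # True
```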

## Full Motion-to-Text Generation

The following is a complete, self-contained script for motion-to-text generation.

**Requirements:**
```bash
pip install torch transformers numpy safetensors huggingface_hub
```

**Complete Code:**

```python
"""
GeoMotionGPT: complete motion-to-text generation.

Self-contained script; requires only torch, transformers, numpy, and safetensors.
"""

import numpy as np
import torch
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM, AutoTokenizer, GPT2Config, GPT2LMHeadModel


class GeoMotionGPTGenerator:
    """Complete motion-to-text generator using Hugging Face models."""

    def __init__(self, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = torch.device(device)
        self.text_vocab_size = 50257
        self.motion_codebook_size = 512

        # Load the motion tokenizer from the Hugging Face Hub
        print("Loading motion tokenizer from HuggingFace...")
        hf_model = AutoModelForCausalLM.from_pretrained(
            "zy22b/GeoMotionGPT",
            trust_remote_code=True
        )
        self.motion_tokenizer = hf_model.motion_tokenizer.to(self.device)
        self.motion_tokenizer.eval()

        # Load the GPT-2 tokenizer
        print("Loading GPT2 tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Add motion tokens to the tokenizer
        motion_tokens = [f'<motion_id_{i}>' for i in range(self.motion_codebook_size)]
        special_tokens = ['<start_of_motion>', '<end_of_motion>', '<masked_motion>', '<pad_motion>']
        self.tokenizer.add_tokens(motion_tokens)
        self.tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})

        # Build the language model with transformers' GPT2LMHeadModel
        print("Building language model...")
        config = GPT2Config(
            vocab_size=50772,  # 50257 text + 512 motion + 3 special
            n_positions=1024,
            n_embd=768,
            n_layer=12,
            n_head=12,
        )
        self.language_model = GPT2LMHeadModel(config).to(self.device)

        # Load weights
        print("Loading language model weights...")
        self._load_weights()
        self.language_model.eval()
        print("Model ready!")

    def _load_weights(self):
        """Load language model weights from the Hugging Face Hub."""
        from huggingface_hub import hf_hub_download

        # Download weights
        weights_path = hf_hub_download(
            repo_id="zy22b/GeoMotionGPT",
            filename="model.safetensors"
        )

        state_dict = load_file(weights_path)

        # Map weights to the model (strip the 'language_model.' prefix)
        new_state_dict = {}
        for k, v in state_dict.items():
            if k.startswith("language_model."):
                new_key = k[len("language_model."):]
                new_state_dict[new_key] = v

        self.language_model.load_state_dict(new_state_dict, strict=False)

    def motion_tokens_to_string(self, tokens: torch.Tensor) -> str:
        """Convert motion token IDs to their string form."""
        token_list = tokens[0].cpu().tolist()
        mot_start = f'<motion_id_{self.motion_codebook_size}>'    # <start_of_motion>
        mot_end = f'<motion_id_{self.motion_codebook_size + 1}>'  # <end_of_motion>
        motion_str = ''.join([f'<motion_id_{int(t)}>' for t in token_list])
        return mot_start + motion_str + mot_end

    def generate_text(self, motion: np.ndarray, mean: np.ndarray, std: np.ndarray,
                      max_new_tokens: int = 40) -> tuple:
        """
        Generate a text description from a motion sequence.

        Args:
            motion: Raw motion array of shape (T, 263)
            mean: Mean for normalization, shape (263,)
            std: Std for normalization, shape (263,)
            max_new_tokens: Maximum number of tokens to generate

        Returns:
            Tuple of (generated text description, list of motion token IDs)
        """
        # Normalize the motion
        motion_norm = (motion - mean) / std
        motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0).to(self.device)

        # Tokenize the motion
        with torch.no_grad():
            motion_tokens = self.motion_tokenizer.encode(motion_tensor)

        # Convert to string and build the prompt
        motion_string = self.motion_tokens_to_string(motion_tokens)
        prompt = f"Generate text: {motion_string} \n "

        # Tokenize the prompt
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.device)

        # Generate with GPT-2's built-in generate
        with torch.no_grad():
            output_ids = self.language_model.generate(
                input_ids,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                eos_token_id=self.tokenizer.eos_token_id,
                pad_token_id=self.tokenizer.pad_token_id,
            )

        # Keep only the newly generated tokens
        generated_ids = output_ids[0, input_ids.shape[1]:]

        # Filter to text tokens only
        text_ids = [tid.item() for tid in generated_ids if tid.item() < self.text_vocab_size]
        generated_text = self.tokenizer.decode(text_ids, skip_special_tokens=True)

        return generated_text.strip(), motion_tokens[0].tolist()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="GeoMotionGPT Motion-to-Text Generation")
    parser.add_argument("--motion_file", type=str, required=True,
                        help="Path to a HumanML3D motion .npy file")
    parser.add_argument("--mean_file", type=str, default="Mean.npy",
                        help="Path to Mean.npy")
    parser.add_argument("--std_file", type=str, default="Std.npy",
                        help="Path to Std.npy")
    parser.add_argument("--device", type=str, default="cuda",
                        help="Device to use (cuda/cpu)")
    args = parser.parse_args()

    # Initialize the generator
    generator = GeoMotionGPTGenerator(device=args.device)

    # Load the data
    motion = np.load(args.motion_file)
    mean = np.load(args.mean_file)
    std = np.load(args.std_file)

    print(f"\nInput motion shape: {motion.shape}")

    # Generate text
    text, tokens = generator.generate_text(motion, mean, std)

    print(f"Motion tokens ({len(tokens)}): {tokens}")
    print(f"\nGenerated text: {text}")
```

**Example Usage:**
```bash
python geomotiongpt_inference.py \
    --motion_file datasets/humanml3d/new_joint_vecs/000000.npy \
    --mean_file datasets/humanml3d/Mean.npy \
    --std_file datasets/humanml3d/Std.npy
```

**Expected Output:**
```
Loading motion tokenizer from HuggingFace...
Loading GPT2 tokenizer...
Building language model...
Loading language model weights...
Model ready!

Input motion shape: (116, 263)
Motion tokens (14): [138, 104, 508, 21, 498, 229, 144, 484, 393, 393, 144, 144, 144, 414]

Generated text: a person kicks something with their left foot.
```

## Model Architecture

```
GeoMotionGPTForCausalLM
├── motion_tokenizer (MotionTokenizer)
│   ├── quantizer (MotionQuantizer)
│   │   └── 1D CNN with ResNet blocks
│   ├── decoder (MotionDecoder)
│   │   └── 1D transposed CNN with ResNet blocks
│   └── codebook
│       └── 512-entry codebook
└── language_model (GPT2LMHeadModel)
    └── 12-layer transformer
```

## Training Details

- **Motion Tokenizer**: Trained on the HumanML3D dataset with DVQ-GSST quantization
- **Language Model**: Fine-tuned GPT-2 with:
  - Orthogonality loss (λ = 0.01) on the motion token embeddings
  - Motion embeddings initialized from the tokenizer's codebook
  - AdamW optimizer (lr = 1e-4)
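The orthogonality regularizer pushes the motion token embeddings toward mutual orthogonality, keeping the 512 codes well separated in embedding space. The sketch below shows one common formulation, λ‖EₙEₙᵀ − I‖²_F over row-normalized embeddings; the exact loss used for this model may differ:

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(motion_emb: torch.Tensor, lam: float = 0.01) -> torch.Tensor:
    """Penalize pairwise similarity between motion embeddings:
    lam * ||E_n @ E_n.T - I||_F^2 with L2-normalized rows E_n."""
    e = F.normalize(motion_emb, dim=-1)                 # unit-norm rows
    gram = e @ e.t()                                    # pairwise cosine similarities
    identity = torch.eye(e.size(0), device=e.device)
    return lam * (gram - identity).pow(2).sum()

emb = torch.randn(512, 768)        # 512 motion embeddings of GPT-2 width 768
loss = orthogonality_loss(emb)
print(loss)                        # positive scalar; exactly 0 for orthonormal rows
```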

## Citation

If you use this model, please cite:

```bibtex
@misc{ye2026geomotiongpt,
  title={GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models},
  author={Zhankai Ye and Bofan Li and Yukai Jin and Shuoqiu Li and Wei Wang and Yanfu Zhang and Shangqian Gao and Xin Liu},
  year={2026},
  eprint={2601.07632},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.07632},
}
```

## License

MIT License