---
license: mit
language:
- en
tags:
- motion
- human-motion
- motion-to-text
- vq-vae
- gpt2
datasets:
- humanml3d
pipeline_tag: text-generation
library_name: transformers
base_model:
- openai-community/gpt2
---

# GeoMotionGPT

GeoMotionGPT is a motion-to-text model that converts human motion sequences into natural language descriptions.

## Model Components

This model integrates two components:

### 1. Motion Tokenizer (DVQ-GSST)

- **Architecture**: Decoder-only Vector Quantizer with Gumbel-Softmax Straight-Through quantization
- **Codebook Size**: 512 tokens
- **Input**: 263-dimensional motion features (HumanML3D format)
- **Temporal Downsampling**: 8x (3 downsampling layers with stride 2)

### 2. Language Model (Fine-tuned GPT-2)

- **Base Model**: GPT-2 (124M parameters)
- **Task**: Motion-to-Text generation
- **Training**: Fine-tuned with orthogonality regularization (λ=0.01)
- **Total Vocab**: 50772 tokens (50257 text + 512 motion + 3 special)

## Quick Start: Motion Tokenization

```python
from transformers import AutoModelForCausalLM
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "zy22b/GeoMotionGPT",
    trust_remote_code=True
)

# Access the motion tokenizer
motion_tokenizer = model.motion_tokenizer

# Example: Tokenize motion (batch, time, 263)
motion = torch.randn(1, 100, 263)  # Random motion features
with torch.no_grad():
    tokens = motion_tokenizer.encode(motion)  # -> (batch, time//8)
print(f"Motion tokens shape: {tokens.shape}")

# Example: Decode tokens back to motion
with torch.no_grad():
    reconstructed = motion_tokenizer.decode(tokens)  # -> (batch, time, 263)
print(f"Reconstructed shape: {reconstructed.shape}")
```

## Usage with HumanML3D Data

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "zy22b/GeoMotionGPT",
    trust_remote_code=True
)
motion_tokenizer = model.motion_tokenizer

# Load HumanML3D motion file
motion = np.load("datasets/humanml3d/new_joint_vecs/000000.npy")  # (T, 263)

# Load normalization parameters
mean = np.load("datasets/humanml3d/Mean.npy")
std = np.load("datasets/humanml3d/Std.npy")

# Normalize
motion_norm = (motion - mean) / std

# Convert to tensor and add batch dimension
motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0)  # (1, T, 263)

# Tokenize
with torch.no_grad():
    tokens = motion_tokenizer.encode(motion_tensor)

print(f"Input shape: {motion_tensor.shape}")  # e.g., torch.Size([1, 116, 263])
print(f"Token shape: {tokens.shape}")         # e.g., torch.Size([1, 14])
print(f"Tokens: {tokens[0].tolist()}")        # e.g., [138, 104, 508, ...]
```

## Full Motion-to-Text Generation

The following is a complete, self-contained script for motion-to-text generation.

**Requirements:**

```bash
pip install torch transformers numpy safetensors huggingface_hub
```

**Complete Code:**

```python
"""
GeoMotionGPT: Complete Motion-to-Text Generation

Self-contained script - requires only:
torch, transformers, numpy, safetensors, huggingface_hub
"""
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer, GPT2LMHeadModel, GPT2Config
from safetensors.torch import load_file


class GeoMotionGPTGenerator:
    """Complete Motion-to-Text Generator using HuggingFace models."""

    def __init__(self, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = torch.device(device)
        self.text_vocab_size = 50257
        self.motion_codebook_size = 512

        # Load motion tokenizer from HuggingFace
        print("Loading motion tokenizer from HuggingFace...")
        hf_model = AutoModelForCausalLM.from_pretrained(
            "zy22b/GeoMotionGPT",
            trust_remote_code=True
        )
        self.motion_tokenizer = hf_model.motion_tokenizer.to(self.device)
        self.motion_tokenizer.eval()

        # Load GPT2 tokenizer
        print("Loading GPT2 tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Add motion tokens to tokenizer
        # (Token strings here are illustrative placeholders; match them to the
        #  strings actually added to the tokenizer's vocabulary.)
        motion_tokens = [f'<motion_{i}>' for i in range(self.motion_codebook_size)]
        special_tokens = ['<motion_start>', '<motion_end>', '<motion_pad>']
        self.tokenizer.add_tokens(motion_tokens)
        self.tokenizer.add_special_tokens({'additional_special_tokens': special_tokens})

        # Build language model using transformers GPT2LMHeadModel
        print("Building language model...")
        config = GPT2Config(
            vocab_size=50772,  # 50257 text + 512 motion + 3 special
            n_positions=1024,
            n_embd=768,
            n_layer=12,
            n_head=12,
        )
        self.language_model = GPT2LMHeadModel(config).to(self.device)

        # Load weights
        print("Loading language model weights...")
        self._load_weights()
        self.language_model.eval()
        print("Model ready!")

    def _load_weights(self):
        """Load language model weights from HuggingFace."""
        from huggingface_hub import hf_hub_download

        # Download weights
        weights_path = hf_hub_download(
            repo_id="zy22b/GeoMotionGPT",
            filename="model.safetensors"
        )
        state_dict = load_file(weights_path)

        # Map weights to model (remove 'language_model.' prefix)
        new_state_dict = {}
        for k, v in state_dict.items():
            if k.startswith("language_model."):
                new_key = k[len("language_model."):]
                new_state_dict[new_key] = v

        self.language_model.load_state_dict(new_state_dict, strict=False)

    def motion_tokens_to_string(self, tokens: torch.Tensor) -> str:
        """Convert motion token IDs to string format."""
        token_list = tokens[0].cpu().tolist()
        mot_start = '<motion_start>'  # placeholder special-token strings, as above
        mot_end = '<motion_end>'
        motion_str = ''.join([f'<motion_{t}>' for t in token_list])
        return mot_start + motion_str + mot_end

    def generate_text(self, motion: np.ndarray, mean: np.ndarray,
                      std: np.ndarray, max_new_tokens: int = 40):
        """
        Generate text description from motion sequence.

        Args:
            motion: Raw motion array of shape (T, 263)
            mean: Mean for normalization (263,)
            std: Std for normalization (263,)
            max_new_tokens: Maximum tokens to generate

        Returns:
            Tuple of (generated text description, list of motion token IDs)
        """
        # Normalize motion
        motion_norm = (motion - mean) / std
        motion_tensor = torch.FloatTensor(motion_norm).unsqueeze(0).to(self.device)

        # Tokenize motion
        with torch.no_grad():
            motion_tokens = self.motion_tokenizer.encode(motion_tensor)

        # Convert to string and create prompt
        motion_string = self.motion_tokens_to_string(motion_tokens)
        prompt = f"Generate text: {motion_string} \n "

        # Tokenize prompt
        input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.device)

        # Generate using GPT2's built-in generate
        with torch.no_grad():
            output_ids = self.language_model.generate(
                input_ids,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                eos_token_id=self.tokenizer.eos_token_id,
                pad_token_id=self.tokenizer.pad_token_id,
            )

        # Decode output
        generated_ids = output_ids[0, input_ids.shape[1]:]

        # Filter to text tokens only
        text_ids = [tid.item() for tid in generated_ids if tid.item() < self.text_vocab_size]
        generated_text = self.tokenizer.decode(text_ids, skip_special_tokens=True)

        return generated_text.strip(), motion_tokens[0].tolist()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="GeoMotionGPT Motion-to-Text Generation")
    parser.add_argument("--motion_file", type=str, required=True,
                        help="Path to HumanML3D motion .npy file")
    parser.add_argument("--mean_file", type=str, default="Mean.npy",
                        help="Path to Mean.npy")
    parser.add_argument("--std_file", type=str, default="Std.npy",
                        help="Path to Std.npy")
    parser.add_argument("--device", type=str, default="cuda",
                        help="Device to use (cuda/cpu)")
    args = parser.parse_args()

    # Initialize generator
    generator = GeoMotionGPTGenerator(device=args.device)

    # Load data
    motion = np.load(args.motion_file)
    mean = np.load(args.mean_file)
    std = np.load(args.std_file)
    print(f"\nInput motion shape: {motion.shape}")

    # Generate text
    text, tokens = generator.generate_text(motion, mean, std)
    print(f"Motion tokens ({len(tokens)}): {tokens}")
    print(f"\nGenerated text: {text}")
```

**Example Usage:**

```bash
python geomotiongpt_inference.py \
    --motion_file datasets/humanml3d/new_joint_vecs/000000.npy \
    --mean_file datasets/humanml3d/Mean.npy \
    --std_file datasets/humanml3d/Std.npy
```

**Expected Output:**

```
Loading motion tokenizer from HuggingFace...
Loading GPT2 tokenizer...
Building language model...
Loading language model weights...
Model ready!

Input motion shape: (116, 263)
Motion tokens (14): [138, 104, 508, 21, 498, 229, 144, 484, 393, 393, 144, 144, 144, 414]

Generated text: a person kicks something with their left foot.
```

## Model Architecture

```
GeoMotionGPTForCausalLM
├── motion_tokenizer (MotionTokenizer)
│   ├── quantizer (MotionQuantizer)
│   │   └── 1D CNN with ResNet blocks
│   ├── decoder (MotionDecoder)
│   │   └── 1D Transposed CNN with ResNet blocks
│   └── codebook
│       └── 512-entry codebook
└── language_model (GPT2LMHeadModel)
    └── 12-layer transformer
```

## Training Details

- **Motion Tokenizer**: Trained on the HumanML3D dataset with DVQ-GSST quantization
- **Language Model**: Fine-tuned GPT-2 with:
  - Orthogonality loss (λ=0.01) for motion token embeddings
  - Codebook-initialized motion embeddings
  - AdamW optimizer (lr=1e-4)

## Citation

If you use this model, please cite:

```bibtex
@misc{ye2026geomotiongpt,
      title={GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models},
      author={Zhankai Ye and Bofan Li and Yukai Jin and Shuoqiu Li and Wei Wang and Yanfu Zhang and Shangqian Gao and Xin Liu},
      year={2026},
      eprint={2601.07632},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.07632},
}
```

## License

MIT License
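The orthogonality regularization mentioned in Training Details can be illustrated as a Gram-matrix penalty that discourages motion token embeddings from overlapping. This is a minimal sketch under assumed shapes (512 motion embeddings of dimension 768, matching the codebook and GPT-2 sizes above), not the model's actual training code:

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(motion_embeddings: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal cosine similarity between motion embeddings.

    motion_embeddings: (num_motion_tokens, embed_dim), e.g. (512, 768).
    Returns the mean squared deviation of the Gram matrix from identity.
    """
    e = F.normalize(motion_embeddings, dim=-1)     # row-normalize
    gram = e @ e.t()                               # pairwise cosine similarities
    identity = torch.eye(e.shape[0], device=e.device)
    return ((gram - identity) ** 2).mean()

# Weighted by lambda = 0.01, as stated in Training Details
emb = torch.randn(512, 768, requires_grad=True)
loss = 0.01 * orthogonality_loss(emb)
```

The penalty is zero exactly when the embeddings are mutually orthogonal unit vectors, which keeps the 512 motion tokens spread apart in the shared text/motion embedding space.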
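Note that `motion_tokenizer.decode` returns motion in the same normalized feature space the encoder consumes; to recover HumanML3D-scale features you must invert the `(motion - mean) / std` normalization with the same `Mean.npy`/`Std.npy` statistics. A small sketch of the round trip (using dummy statistics in place of the real files):

```python
import numpy as np

def denormalize(motion_norm: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Invert the (motion - mean) / std normalization applied before tokenization."""
    return motion_norm * std + mean

# Round-trip check with dummy statistics (in practice, load Mean.npy / Std.npy)
rng = np.random.default_rng(0)
mean = rng.normal(size=263)
std = rng.uniform(0.5, 2.0, size=263)           # std must be nonzero
motion = rng.normal(size=(116, 263))            # (T, 263) HumanML3D features

normed = (motion - mean) / std                  # what the tokenizer sees
restored = denormalize(normed, mean, std)       # back to HumanML3D scale
assert np.allclose(restored, motion)
```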