nvidia/GEM-X

@@ -6,6 +6,7 @@ tags:
   - pose-estimation
   - human-motion
   - soma-body-model
   - video
   - monocular-video
   - 3d-pose
@@ -15,9 +16,12 @@ library_name: gem
 # GEM: A Generalist Model for Human Motion
-GEM is a monocular video 3D human body pose estimation model developed by NVIDIA. It reconstructs full-body motion from video sequences with dynamic cameras, producing accurate 3D body pose in [SOMA](https://research.nvidia.com/labs/dair/gem/) format.
-The model outputs full-body **77-joint pose** using the SOMA parametric body model, recovering both local body kinematics and global motion trajectories from unconstrained monocular video.
 - **Paper:** [arXiv 2505.01425](https://arxiv.org/abs/2505.01425)
 - **Project page:** https://research.nvidia.com/labs/dair/gem/
@@ -25,23 +29,19 @@ The model outputs full-body **77-joint pose** using the SOMA parametric body mod
 ---
-## Model Details
-| Property | Value |
-|---|---|
-| Architecture | 16-layer Transformer encoder (RoPE, 1024 latent dim, 8 heads) |
-| Body model | SOMA (77 joints, full body + hands) |
-| Feature space | soma_v2, 585-dim |
-| Parameters | ~520M |
-| Input | RGB video + 2D keypoints + bounding box + camera intrinsics |
-| Output | Per-frame SOMA body parameters (pose, shape, translation) |
 ---
-## Usage
 ```bash
-# Clone the GEM repository
 git clone --recursive https://github.com/NVlabs/GEM-X.git
 cd GEM-X
@@ -68,6 +68,59 @@ weights = torch.load(path, weights_only=False)
 ---
 ## Training Data
 GEM was trained on an internal NVIDIA synthetic dataset (MetroSim) composed of:

   - pose-estimation
   - human-motion
   - soma-body-model
+  - smpl-body-model
   - video
   - monocular-video
   - 3d-pose
 # GEM: A Generalist Model for Human Motion
+GEM is a family of Generalist Human Motion models developed by NVIDIA. This repository hosts two model variants:
+- **GEM-SOMA** — Full-body 77-joint pose (body + hands + face) using the [SOMA](https://research.nvidia.com/labs/dair/gem/) body model
+- **GEM-SMPL** — 17-joint body pose using the SMPLx body model, with support for text/audio/music conditioning
+Both models reconstruct 3D human motion from monocular video with dynamic cameras, recovering both camera-space and global motion trajectories.
 - **Paper:** [arXiv 2505.01425](https://arxiv.org/abs/2505.01425)
 - **Project page:** https://research.nvidia.com/labs/dair/gem/
 ---
+## Available Models
+| Model | Checkpoint | Body Model | Joints | Config | Code |
+|---|---|---|---|---|---|
+| GEM-SOMA | `gem_soma.ckpt` | SOMA | 77 (body + hands + face) | `config.json` | [GEM-X](https://github.com/NVlabs/GEM-X) |
+| GEM-SMPL | `gem_smpl.ckpt` | SMPLx | 17 (body) | `gem_smpl_config.json` | [GEM-SMPL](https://github.com/NVlabs/GEM-SMPL) |
 ---
+## Usage — GEM-SOMA
 ```bash
+# Clone the GEM-X repository
 git clone --recursive https://github.com/NVlabs/GEM-X.git
 cd GEM-X
 ---
+## Usage — GEM-SMPL
+```bash
+# Clone the GEM-SMPL repository
+git clone https://github.com/NVlabs/GEM-SMPL.git
+cd GEM-SMPL
+# Install dependencies (see README for full setup)
+bash scripts/install_env.sh
+# Run demo (video + text conditioning)
+python scripts/demo/demo_smpl.py \
+    --input_list input.mp4 "text:a person walks forward" \
+    --ckpt_path inputs/pretrained/gem_smpl.ckpt
+```
+Loading the weights manually:
+```python
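+import torch
+from huggingface_hub import hf_hub_download
+# Download the checkpoint from the nvidia/GEM-X Hub repo into inputs/pretrained
+path = hf_hub_download(repo_id="nvidia/GEM-X", filename="gem_smpl.ckpt", local_dir="inputs/pretrained")
+# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust
+weights = torch.load(path, weights_only=False)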
+```
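After loading, it can help to sanity-check what the checkpoint actually contains. The sketch below is illustrative only: `summarize_checkpoint` is a hypothetical helper, not part of the GEM code, and it assumes the tensors sit either at the top level or under a `state_dict` key (the PyTorch Lightning convention). The card does not document the real layout of `gem_smpl.ckpt`, so treat that as an assumption.

```python
from collections.abc import Mapping


def summarize_checkpoint(ckpt):
    """Return (tensor_count, parameter_count) for a loaded checkpoint mapping.

    Assumption: tensors live either at the top level or under a
    "state_dict" key. The actual layout of the GEM checkpoints is not
    documented in the model card.
    """
    if not isinstance(ckpt, Mapping):
        raise TypeError("expected a mapping, got " + type(ckpt).__name__)

    # Prefer the nested state dict when present, otherwise use the mapping itself.
    state = ckpt.get("state_dict", ckpt)
    if not isinstance(state, Mapping):
        state = ckpt

    # Anything exposing a callable numel() is treated as a tensor.
    tensors = [v for v in state.values() if callable(getattr(v, "numel", None))]
    return len(tensors), sum(int(t.numel()) for t in tensors)
```

With the `weights` object from the snippet above, `n_tensors, n_params = summarize_checkpoint(weights)` would give a quick count to compare against the figures in the Model Details tables.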
+---
+## Model Details
+### GEM-SOMA
+| Property | Value |
+|---|---|
+| Architecture | 16-layer Transformer encoder (RoPE, 1024 latent dim, 8 heads) |
+| Body model | SOMA (77 joints, full body + hands) |
+| Feature space | soma_v2, 585-dim |
+| Parameters | ~520M |
+| Input | RGB video + 2D keypoints + bounding box + camera intrinsics |
+| Output | Per-frame SOMA body parameters (pose, shape, translation) |
+### GEM-SMPL
+| Property | Value |
+|---|---|
+| Architecture | 12-layer Transformer encoder (RoPE, 512 latent dim, 8 heads) |
+| Body model | SMPLx (17 joints, body only) |
+| Feature space | gvhmr, 151-dim |
+| Input | RGB video + 2D keypoints + bounding box + camera intrinsics (+ optional text/audio) |
+| Output | Per-frame SMPL body parameters (pose, shape, translation) |
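To compare the two variants side by side, the table values can be collected in a small container. This is a minimal sketch: `VariantSpec` and its field names are hypothetical and do not mirror the schema of `config.json` or `gem_smpl_config.json`. The per-head width assumes the usual multi-head attention convention of splitting the latent dimension evenly across heads, which the card does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSpec:
    """Hyperparameters copied from the Model Details tables (illustrative container)."""

    name: str
    layers: int
    latent_dim: int
    heads: int
    feature_dim: int
    joints: int

    @property
    def head_dim(self) -> int:
        # Assumption: the latent dimension is split evenly across attention heads.
        if self.latent_dim % self.heads:
            raise ValueError("latent_dim must be divisible by heads")
        return self.latent_dim // self.heads


GEM_SOMA = VariantSpec("GEM-SOMA", layers=16, latent_dim=1024, heads=8, feature_dim=585, joints=77)
GEM_SMPL = VariantSpec("GEM-SMPL", layers=12, latent_dim=512, heads=8, feature_dim=151, joints=17)

for spec in (GEM_SOMA, GEM_SMPL):
    print(f"{spec.name}: {spec.layers} layers, {spec.head_dim}-dim heads, {spec.feature_dim}-dim features")
```

Under that assumption, GEM-SOMA uses 128-dim heads and GEM-SMPL uses 64-dim heads, alongside the difference in depth (16 vs. 12 layers) and feature size (585 vs. 151).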
+---
 ## Training Data
 GEM was trained on an internal NVIDIA synthetic dataset (MetroSim) composed of: