jeffffffli committed · Commit 398a787 · verified · 1 Parent(s): 1d2c383

Upload README.md with huggingface_hub

  1. README.md +66 -13
README.md CHANGED
@@ -6,6 +6,7 @@ tags:
6
  - pose-estimation
7
  - human-motion
8
  - soma-body-model
 
9
  - video
10
  - monocular-video
11
  - 3d-pose
@@ -15,9 +16,12 @@ library_name: gem
15
 
16
  # GEM: A Generalist Model for Human Motion
17
 
18
- GEM is a monocular video 3D human body pose estimation model developed by NVIDIA. It reconstructs full-body motion from video sequences with dynamic cameras, producing accurate 3D body pose in [SOMA](https://research.nvidia.com/labs/dair/gem/) format.
19
 
20
- The model outputs full-body **77-joint pose** using the SOMA parametric body model, recovering both local body kinematics and global motion trajectories from unconstrained monocular video.
21
 
22
  - **Paper:** [arXiv 2505.01425](https://arxiv.org/abs/2505.01425)
23
  - **Project page:** https://research.nvidia.com/labs/dair/gem/
@@ -25,23 +29,19 @@ The model outputs full-body **77-joint pose** using the SOMA parametric body mod
25
 
26
  ---
27
 
28
- ## Model Details
29
 
- | Property | Value |
- |---|---|
- | Architecture | 16-layer Transformer encoder (RoPE, 1024 latent dim, 8 heads) |
- | Body model | SOMA (77 joints, full body + hands) |
- | Feature space | soma_v2, 585-dim |
- | Parameters | ~520M |
- | Input | RGB video + 2D keypoints + bounding box + camera intrinsics |
- | Output | Per-frame SOMA body parameters (pose, shape, translation) |
38
 
39
  ---
40
 
41
- ## Usage
42
 
43
  ```bash
44
- # Clone the GEM repository
45
  git clone --recursive https://github.com/NVlabs/GEM-X.git
46
  cd GEM-X
47
 
@@ -68,6 +68,59 @@ weights = torch.load(path, weights_only=False)
68
 
69
  ---
70
 
71
  ## Training Data
72
 
73
  GEM was trained on an internal NVIDIA synthetic dataset (MetroSim) composed of:
 
6
  - pose-estimation
7
  - human-motion
8
  - soma-body-model
9
+ - smpl-body-model
10
  - video
11
  - monocular-video
12
  - 3d-pose
 
16
 
17
  # GEM: A Generalist Model for Human Motion
18
 
19
+ GEM is a family of generalist human motion models developed by NVIDIA. This repository hosts two model variants:
20
 
21
+ - **GEM-SOMA**: full-body 77-joint pose (body + hands + face) using the [SOMA](https://research.nvidia.com/labs/dair/gem/) body model
22
+ - **GEM-SMPL**: 17-joint body pose using the SMPLx body model, with support for text/audio/music conditioning
23
+
24
+ Both models reconstruct 3D human motion from monocular video with dynamic cameras, recovering both camera-space and global motion trajectories.
25
 
26
  - **Paper:** [arXiv 2505.01425](https://arxiv.org/abs/2505.01425)
27
  - **Project page:** https://research.nvidia.com/labs/dair/gem/
 
29
 
30
  ---
31
 
32
+ ## Available Models
33
 
+ | Model | Checkpoint | Body Model | Joints | Config | Code |
+ |---|---|---|---|---|---|
+ | GEM-SOMA | `gem_soma.ckpt` | SOMA | 77 (body + hands + face) | `config.json` | [GEM-X](https://github.com/NVlabs/GEM-X) |
+ | GEM-SMPL | `gem_smpl.ckpt` | SMPLx | 17 (body) | `gem_smpl_config.json` | [GEM-SMPL](https://github.com/NVlabs/GEM-SMPL) |
38
 
39
  ---
40
 
41
+ ## Usage — GEM-SOMA
42
 
43
  ```bash
44
+ # Clone the GEM-X repository
45
  git clone --recursive https://github.com/NVlabs/GEM-X.git
46
  cd GEM-X
47
 
 
68
 
69
  ---
70
 
71
+ ## Usage — GEM-SMPL
72
+
+ ```bash
+ # Clone the GEM-SMPL repository
+ git clone https://github.com/NVlabs/GEM-SMPL.git
+ cd GEM-SMPL
+
+ # Install dependencies (see README for full setup)
+ bash scripts/install_env.sh
+
+ # Run demo (video + text conditioning)
+ python scripts/demo/demo_smpl.py \
+     --input_list input.mp4 "text:a person walks forward" \
+     --ckpt_path inputs/pretrained/gem_smpl.ckpt
+ ```
86
+
87
+ Loading the weights manually:
88
+
+ ```python
+ import torch
+ from huggingface_hub import hf_hub_download
+
+ path = hf_hub_download(repo_id="nvidia/GEM-X", filename="gem_smpl.ckpt", local_dir="inputs/pretrained")
+ # weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust
+ weights = torch.load(path, weights_only=False)
+ ```
96
+
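Once the checkpoint is loaded, a quick summary can confirm that the file contains what you expect. The sketch below is illustrative only: the helper names (`unwrap_state_dict`, `count_parameters`, `summarize`) are hypothetical, and the assumption that the weights may sit under a `state_dict` key (as in PyTorch Lightning checkpoints) has not been verified against `gem_smpl.ckpt`.

```python
from collections.abc import Mapping


def unwrap_state_dict(ckpt):
    """Return the weight mapping, looking under 'state_dict' if that key holds one.

    Assumption: the checkpoint is either a flat name -> tensor mapping or a
    Lightning-style dict that nests the weights under 'state_dict'.
    """
    if isinstance(ckpt, Mapping) and isinstance(ckpt.get("state_dict"), Mapping):
        return ckpt["state_dict"]
    return ckpt


def count_parameters(state):
    """Sum numel() over every entry that behaves like a tensor."""
    return sum(v.numel() for v in state.values() if hasattr(v, "numel"))


def summarize(ckpt, limit=5):
    """Collect the entry count, total element count, and the first few key names."""
    state = unwrap_state_dict(ckpt)
    return {
        "num_entries": len(state),
        "num_parameters": count_parameters(state),
        "first_keys": list(state)[:limit],
    }
```

Calling `summarize(weights)` on the object returned by `torch.load` gives a small dictionary that is easy to print or log before the weights are handed to the model.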
97
+ ---
98
+
99
+ ## Model Details
100
+
101
+ ### GEM-SOMA
102
+
+ | Property | Value |
+ |---|---|
+ | Architecture | 16-layer Transformer encoder (RoPE, 1024 latent dim, 8 heads) |
+ | Body model | SOMA (77 joints, full body + hands) |
+ | Feature space | soma_v2, 585-dim |
+ | Parameters | ~520M |
+ | Input | RGB video + 2D keypoints + bounding box + camera intrinsics |
+ | Output | Per-frame SOMA body parameters (pose, shape, translation) |
111
+
112
+ ### GEM-SMPL
113
+
+ | Property | Value |
+ |---|---|
+ | Architecture | 12-layer Transformer encoder (RoPE, 512 latent dim, 8 heads) |
+ | Body model | SMPLx (17 joints, body only) |
+ | Feature space | gvhmr, 151-dim |
+ | Input | RGB video + 2D keypoints + bounding box + camera intrinsics (+ optional text/audio) |
+ | Output | Per-frame SMPL body parameters (pose, shape, translation) |
121
+
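To compare the two variants side by side, the architecture rows from the tables above can be written down as a small data structure. This is a sketch, not part of either repository: `EncoderSpec` is a hypothetical container, and the per-head dimension assumes the usual multi-head attention convention of splitting the latent dimension evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical holder for the table values; not a class from GEM-X or GEM-SMPL."""

    name: str
    layers: int
    latent_dim: int
    heads: int
    feature_dim: int

    @property
    def head_dim(self) -> int:
        # Assumes the latent dimension is split evenly across attention heads.
        if self.latent_dim % self.heads:
            raise ValueError("latent_dim must be divisible by heads")
        return self.latent_dim // self.heads


# Values copied from the Model Details tables.
GEM_SOMA = EncoderSpec("GEM-SOMA", layers=16, latent_dim=1024, heads=8, feature_dim=585)
GEM_SMPL = EncoderSpec("GEM-SMPL", layers=12, latent_dim=512, heads=8, feature_dim=151)

for spec in (GEM_SOMA, GEM_SMPL):
    print(f"{spec.name}: {spec.layers} layers, {spec.head_dim} dims per head, "
          f"{spec.feature_dim}-dim features")
```

Under that assumption, GEM-SOMA works out to 128 dimensions per head and GEM-SMPL to 64.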
122
+ ---
123
+
124
  ## Training Data
125
 
126
  GEM was trained on an internal NVIDIA synthetic dataset (MetroSim) composed of: