Kimodo: Controllable Kinematic Motion Diffusion at Scale
Description:
Kimodo (Kinematic Motion Diffusion) generates three-dimensional (3D) skeletal body animations from a text prompt and/or motion constraints such as full-body poses, end-effector joint positions, paths, and waypoints to follow.
The Kimodo model family includes models trained on different skeletons and datasets:
- Kimodo-SOMA-RP
- Trained on the 30-joint SOMA skeleton with the proprietary Bones Rigplay dataset.
- Kimodo-SOMA-SEED
- Trained on the 30-joint SOMA skeleton with the open Bones-SEED dataset.
- Kimodo-G1-RP
- Trained on the proprietary Bones Rigplay dataset retargeted to the 34-joint Unitree G1 robot skeleton.
- Kimodo-G1-SEED
- Trained on the open Bones-SEED dataset retargeted to the 34-joint Unitree G1 robot skeleton.
- Kimodo-SMPLX-RP
- Trained on the proprietary Bones Rigplay dataset retargeted to the 22-joint SMPLX-body skeleton.
This release pertains to Kimodo-G1-RP. This model is ready for commercial use.
License:
This model is released under the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
The model is intended for users with any level of animation experience to create 3D human motion data for their application. This may include:
- Demonstrations for humanoid robots
- Digital human motion for digital twin and industrial simulations
- Digital human motion for synthetic data
- Animations for game and media development
Release Date:
GitHub [03/16/2026] via link
HuggingFace [03/16/2026] via link
References:
- Technical report: Kimodo: Scaling Controllable Human Motion Generation
- Webpage: link
Model Architecture:
Architecture Type: Diffusion Model
Network Architecture: Novel Two-Stage Transformer
Model Size: 282M parameters
Inputs:
Input Types: Text, Duration (Num Frames), Pose Constraints
Input Formats:
- Text: String
- Duration: Integer
- Pose Constraints: Matrix
Input Parameters:
- Text: One-Dimensional (1D)
- Duration: One-Dimensional (1D)
- Pose Constraints:
- One-Dimensional (1D) frame index of each constraint
- Features to constrain may include Three-Dimensional (3D) joint positions, (3x3) joint rotation matrices, Two-Dimensional (2D) heading direction, and/or Two-Dimensional (2D) root position
Other Properties Related to Input: Maximum duration is 10 sec (300 frames at 30 frames per second).
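The input specification above can be made concrete with a small sketch. The dict layout, key names, and `build_inputs` helper below are illustrative assumptions for packaging the described inputs, not the model's actual API:

```python
import numpy as np

def build_inputs(prompt: str, num_frames: int,
                 constraint_frames: list[int],
                 joint_positions: np.ndarray) -> dict:
    """Package a text prompt, duration, and sparse pose constraints.

    Hypothetical structure mirroring the input types listed above:
    a 1D text string, a 1D integer duration, and per-constraint
    frame indices plus the features to constrain.
    """
    assert 0 < num_frames <= 300, "maximum duration is 300 frames (10 s at 30 fps)"
    assert all(0 <= f < num_frames for f in constraint_frames)
    # one (34, 3) array of 3D joint positions per constrained frame
    assert joint_positions.shape == (len(constraint_frames), 34, 3)
    return {
        "text": prompt,                # 1D: a single string
        "num_frames": num_frames,      # 1D: integer duration
        "constraints": {
            "frame_index": np.asarray(constraint_frames),  # 1D frame indices
            "joint_positions": joint_positions,            # 3D positions to hit
        },
    }

# e.g. pin the pose at the first and last frame of a 5-second clip
inputs = build_inputs("a person walks forward", 150, [0, 149],
                      np.zeros((2, 34, 3)))
```

Other constrainable features listed above (joint rotation matrices, heading direction, root position) would be additional entries in the same constraints structure.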
Outputs:
Output Type: Skeleton Motion (Root Translation and Joint Rotations)
Output Formats:
- Root Translation: Matrix
- Joint Rotations: Matrix
Output Parameters:
- Root Translation: Two-Dimensional (num_frames x 3)
- Joint Rotations: Four-Dimensional (num_frames x 34 x 3 x 3)
Other Properties Related to Output:
- Motions are at 30 frames per second (30 fps)
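As a sketch of what a consumer of these outputs would receive, the arrays below use the shapes listed above (with placeholder values; the variable names are assumptions). Each per-joint 3x3 should be a valid rotation matrix, which downstream code can sanity-check:

```python
import numpy as np

num_frames, num_joints = 90, 34  # 3 seconds at 30 fps on the 34-joint G1 skeleton
root_translation = np.zeros((num_frames, 3))                    # (num_frames, 3)
joint_rotations = np.tile(np.eye(3), (num_frames, num_joints, 1, 1))  # (num_frames, 34, 3, 3)

# A rotation matrix satisfies R @ R.T = I and det(R) = +1.
# Batched check over all frames and joints:
rrt = np.einsum('fjab,fjcb->fjac', joint_rotations, joint_rotations)
assert np.allclose(rrt, np.eye(3))
assert np.allclose(np.linalg.det(joint_rotations), 1.0)
```

Timestamps follow from the fixed frame rate: frame `i` occurs at `i / 30.0` seconds.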
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Ada Lovelace
Supported Operating Systems:
- Linux
- Windows
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version:
Kimodo-G1-RP-v1
Training and Testing Datasets:
Name: Proprietary Bones Rigplay Dataset
Data Modalities:
- Text
- Human Motion Capture
Data Size:
- Less than 1 Billion tokens of text
- Roughly 560 hours of human motion capture
Data Collection Method:
Automatic/Sensors
Labeling Method:
Hybrid: Automatic/Sensors, Human
Properties: Roughly 560 hours of captured human body motions with corresponding text descriptions. Split into 90%/10% train/test splits. Various augmentations were employed to expand text and motion variety. Motions were retargeted to G1 robot skeleton for training.
Quantitative Evaluation:
For test-set evaluation, please refer to the technical report.
Inference:
Acceleration Engine: N/A
Test Hardware:
- GeForce RTX 3090
- GeForce RTX 4090
- GeForce RTX 5090
- NVIDIA A100
- NVIDIA L40S
- NVIDIA L4
- NVIDIA RTX 6000 Ada
- NVIDIA RTX A6000
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Bias, Explainability, Safety & Security, and Privacy Subcards below.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing: | Gender |
| Measures taken to mitigate against unwanted bias: | Our training data contains motion captured from a roughly equal number of male and female actors |
Explainability
| Field | Response |
|---|---|
| Intended Task/Domain: | Robotics |
| Model Type: | Diffusion Transformer |
| Intended Users: | The model is intended for users with any level of animation experience to create 3D humanoid robot motion data for their application. This may include demonstrations for humanoid robots or robot motions for simulations and synthetic data. |
| Output: | 3D skeletal animation (root translation and joint rotations) |
| Describe how the model works: | Text input and pose constraints are processed and given to a transformer-based model that iteratively denoises a sequence of body poses. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Gender |
| Technical Limitations & Mitigation: | Generated motions may include artifacts like foot skating where feet slide unnaturally when they should be in static contact with the ground. The motion does not always follow the given text prompt, and the model does not know how to perform certain types of actions (e.g., the model is best at locomotion, gestures, dancing, and everyday activities). Each trained model currently outputs motion for a single character skeleton. The model is designed to output realistic motions, so it cannot create cartoon motions or non-physically plausible motions. The model is not aware of objects in the scene around a character. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Pose Constraint Accuracy (joint distance error), Motion Quality (foot-skating error, FID, latent similarity), Text-Following Accuracy (R-precision, latent similarity) |
| Potential Known Risks: | The model may output body motions that inadvertently reflect stereotypes related to age, gender, or physical characteristics. To mitigate this, prompts should describe actions in neutral, physical terms (e.g., “A person walks slowly with shuffled steps”) rather than relying on demographic adjectives. |
| Licensing: | This model is released under the NVIDIA Open Model License |
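The iterative denoising described in the table above can be sketched abstractly. The `denoise_motion` function and the `denoiser` callable below are stand-ins invented for illustration; they are not the model's real interface or sampler:

```python
import numpy as np

def denoise_motion(denoiser, text_emb, constraints, num_frames,
                   steps=50, num_joints=34, pose_dim=6, seed=0):
    """Toy diffusion-style sampling loop.

    `denoiser` stands in for the actual two-stage transformer; at each
    step it receives the current noisy pose sequence, the step index,
    and the conditioning (text embedding, pose constraints), and
    returns a progressively cleaner sequence.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames, num_joints, pose_dim))  # start from pure noise
    for t in reversed(range(steps)):
        x = denoiser(x, t, text_emb, constraints)
    return x

# toy "denoiser" that simply shrinks the noise each step
toy = lambda x, t, text, cons: 0.9 * x
motion = denoise_motion(toy, text_emb=None, constraints=None, num_frames=60)
```

In the real model, the conditioning steers each denoising step so the final sequence both matches the text prompt and satisfies the pose constraints.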
Privacy
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | No |
| How often is dataset reviewed? | During dataset creation, model training, evaluation and before release |
| Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? | No |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Not Applicable |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |
Safety
| Field | Response |
|---|---|
| Model Application Field(s): | Media & Entertainment, Industrial/Machinery and Robotics, Autonomous Vehicles |
| Describe the life critical impact (if present). | Not Applicable |
| Use Case Restrictions: | Abide by the NVIDIA Open Model License |
| Model and dataset restrictions: | The principle of least privilege (PoLP) is applied to limit access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to. |