Cosmos-H-Surgical-Simulator

Description

Cosmos-H-Surgical-Simulator is a surgical world foundation model fine-tuned on the Open-H Embodiment dataset, which includes clinical surgical procedures, for evaluating physically grounded surgical robotics policies in simulation and for synthetic data generation. The model supports evaluating surgical robotics policies in simulation before transitioning to a physical system, primarily for CMR Surgical Versius clinical procedures (cholecystectomy, prostatectomy, inguinal hernia, and hysterectomy), as well as for the dVRK, MITIC, and other surgical platforms on tasks such as suturing, tissue manipulation, and peg transfer.

The released model is based on the public NVIDIA Cosmos-predict2.5 world foundation model for physical AI.

This model is for commercial/non-commercial use.

License/Terms of Use

Use of this model is governed by the NVIDIA Open Model License Agreement.

Deployment Geography

Global

Use Case

Primarily intended for surgical robotics researchers, healthcare AI developers, academic institutions, and surgical robotics companies exploring surgical robotics policy evaluation and synthetic data generation.

Release Date

Reference(s)

Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.-W., Chattopadhyay, P., Chen, M., Chen, Y., Cheng, S., Cui, Y., Diamond, J., Ding, Y., Fan, J., Fan, L., Feng, L., Ferroni, F., Fidler, S., Fu, X., Gao, R., Ge, Y., Gu, J., … Zhu, Y. (2025). World Simulation with Video Foundation Models for Physical AI (arXiv:2511.00062) [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2511.00062

Link to Cosmos’ nvidia/Cosmos-Predict2.5-2B-Video2World Model Card

Model Architecture

Architecture Type: Diffusion Transformer

Network Architecture: Latent video diffusion transformer (DiT-style denoiser) with cross-attention conditioning.

This model was developed based on Cosmos-Predict2.5-2B-Video2World.

The Cosmos-H-Surgical-Simulator model extends Cosmos-Predict2.5-2B-Video2World, a diffusion transformer for video generation in latent space. It incorporates an MLP to condition the model on kinematic actions. The model accepts a 44-dimensional action vector (22 dimensions per arm) alongside the current video frame, and predicts the subsequent 12 frames. Through autoregressive rollout, it can generate videos of complete surgical trajectories from either learned policies or manually designed action sequences.
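The chunked, autoregressive rollout described above can be sketched as follows. This is a shape-level illustration only: `predict_next_frames` is a hypothetical placeholder standing in for the actual latent diffusion denoiser, which is not reproduced here.

```python
import numpy as np

FRAME_SHAPE = (288, 512, 3)   # model's native 512x288 RGB resolution
ACTION_DIM = 44               # 22 dimensions per arm, two arms
CHUNK = 12                    # frames predicted per forward pass

def predict_next_frames(frame: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Placeholder for the denoiser: one frame + 12 actions -> 12 frames."""
    assert frame.shape == FRAME_SHAPE and actions.shape == (CHUNK, ACTION_DIM)
    # The real model runs latent diffusion conditioned on the action MLP embedding.
    return np.repeat(frame[None], CHUNK, axis=0)  # dummy output with correct shape

def rollout(initial_frame: np.ndarray, trajectory: np.ndarray) -> np.ndarray:
    """Autoregressive rollout: the last generated frame seeds the next chunk."""
    assert trajectory.shape[0] % CHUNK == 0, "trajectory length must be a multiple of 12"
    frames, current = [], initial_frame
    for start in range(0, trajectory.shape[0], CHUNK):
        chunk = predict_next_frames(current, trajectory[start:start + CHUNK])
        frames.append(chunk)
        current = chunk[-1]
    return np.concatenate(frames, axis=0)
```

With a 72-step action trajectory this produces 6 chunks of 12 frames, matching the 72-frame generation used in the quantitative evaluation below.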

Input

  • Input Type(s): Image (camera frame), a sequence of twelve 44-dimensional numerical vectors
  • Input Format(s): Red, Green, Blue (RGB) frame; numeric vector
  • Input Parameters: Image: Two-Dimensional (2D) image frame; Vector: sequence of twelve 44-dimensional (44D) vectors
  • Other Properties Related to Input:
    • Recommended resolution: The model operates at 512x288 resolution. Input frames are automatically resized; a 16:9 aspect ratio is recommended.
    • Pre-processing: The Versius kinematic action is a hybrid relative action as defined here.
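As a shape-level illustration of the image input (assumed, not taken from the released code), the sketch below resizes an RGB frame to the model's 512x288 operating resolution with nearest-neighbor sampling. The released pipeline resizes frames automatically; this only shows the shapes involved.

```python
import numpy as np

def resize_nearest(frame: np.ndarray, out_h: int = 288, out_w: int = 512) -> np.ndarray:
    """Nearest-neighbor resize of an HxWx3 RGB frame to the model resolution."""
    in_h, in_w = frame.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row index per output row
    cols = np.arange(out_w) * in_w // out_w   # source column index per output column
    return frame[rows[:, None], cols[None, :]]

# A 16:9 source (e.g. 1920x1080) maps onto 512x288 without distortion.
resized = resize_nearest(np.zeros((1080, 1920, 3), dtype=np.uint8))
```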

Output

  • Output Type(s): A sequence of 12 video frames
  • Output Format: mp4
  • Output Parameters: Three-Dimensional (3D)
  • Other Properties Related to Output:
    • By default, the model generates twelve video frames as output, representing the next world states incorporating the input kinematic actions into the current video frame. Through an autoregressive loop, a complete video sequence, representing a complete surgical trajectory, can be generated.
    • No additional post-processing is strictly required.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s):

Cosmos-Predict2.5

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Hopper

Note: Only BF16 precision is tested. Other precisions like FP16 or FP32 are not officially supported.

Preferred/Supported Operating System(s): Linux (We have not tested on other operating systems.)

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s)

v1.0 (Finetuned on the Open-H Embodiment dataset, which includes clinical procedures data such as cholecystectomy, prostatectomy, inguinal hernia, and hysterectomy)

Developers may integrate the model into an AI evaluation system by providing video frames as input along with corresponding kinematic actions to evaluate a surgical policy model, such as one for the CMR Surgical Versius robotic system.
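Such a closed-loop evaluation setup, where a policy acts on the simulator's output instead of a physical robot, could look roughly like the following sketch. Both `policy` and `world_model` are hypothetical placeholders for the actual policy model (e.g. a Gr00t-H-style model) and for Cosmos-H-Surgical-Simulator.

```python
import numpy as np

ACTION_DIM, CHUNK = 44, 12

def policy(frame: np.ndarray) -> np.ndarray:
    """Placeholder surgical policy: current frame -> next 12 actions."""
    return np.zeros((CHUNK, ACTION_DIM), dtype=np.float32)

def world_model(frame: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Placeholder for Cosmos-H-Surgical-Simulator: frame + actions -> 12 frames."""
    return np.repeat(frame[None], CHUNK, axis=0)

def evaluate(initial_frame: np.ndarray, n_chunks: int = 6):
    """Closed loop: the policy acts on the simulator's output each step."""
    frame, video, actions = initial_frame, [], []
    for _ in range(n_chunks):
        a = policy(frame)
        chunk = world_model(frame, a)
        video.append(chunk)
        actions.append(a)
        frame = chunk[-1]   # feed the last generated frame back to the policy
    return np.concatenate(video), np.concatenate(actions)
```

The generated video and action log can then be scored against ground truth or inspected by domain experts before any deployment on a physical system.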

Training, Testing, and Evaluation Datasets

Dataset: Open-H Embodiment community-generated dataset.

  • Total size: ~26,500+ surgical task demonstrations (plus CMR Versius clinical procedures) spanning 32 datasets, 9 robot embodiments, and 10+ institutions, totaling ~4.9 million synchronized video-kinematics frames.
  • Dataset partition: 95% training and validation, 5% test (approximate).

Training Dataset

  • Link: Open-H Embodiment dataset
  • Data Collection Method: Human (kinematic action trajectories).
  • Labeling Method: Human
  • Properties:
    • ~26,500+ surgical task demonstrations (plus CMR Versius clinical procedures) spanning 32 datasets, 9 robot embodiments, and 10+ institutions, totaling ~4.9 million synchronized video-kinematics frames.
    • Human experts teleoperated the surgical robotic arms, which were recorded as video and kinematics pairs.

Testing Dataset

  • Link: We tested on the holdout portion of the Open-H Embodiment dataset.
  • Data Collection Method: Human
  • Labeling Method: Human
  • Properties:
    • ~1K frames for testing.

Evaluation Dataset

  • Link: We evaluated on the Open-H Embodiment dataset, using frames never seen during training.
  • Data Collection Method: Human
  • Labeling Method: Human
  • Properties:
    • ~1K frames for final evaluation.

Key performance: The model can generate physically accurate surgical robotic videos given an initial frame and a kinematic action trajectory, either recorded from a surgeon or a surgical robotic policy model such as Gr00t-H.

Quantitative Evaluation

Evaluated on 4 CMR Versius clinical surgery procedures (prostatectomy, inguinal hernia, hysterectomy, cholecystectomy) at 360p resolution, 2 episodes per procedure, 2 seeds each, 72-frame autoregressive generation (6 chunks × 12 frames).

Aggregate Metrics

| Checkpoint | FDS (L1) ↓ | GATC ↑ | TCD (px) ↓ |
|------------|------------|--------|------------|
| 4k         | 0.2286     | 0.3517 | 84.42      |
| 8k         | 0.2298     | 0.3414 | 98.21      |
| 12k        | 0.2253     | 0.3814 | 138.54     |
| 16k        | 0.2227     | 0.4167 | 83.68      |
| 20k        | 0.2219     | 0.4058 | 124.96     |

Metrics:

  • FDS (L1): Frame Decay Score - mean L1 distance between generated and ground-truth frames normalized to [-1, 1], averaged across all generated frames (lower is better)
  • GATC: Ground-truth Anchored Tool Consistency - median zero-mean normalized cross-correlation (ZNCC) of grayscale pixels within SAM3-segmented tool regions between generated and ground-truth frames, weighted by a gradient-based tool presence penalty (higher is better)
  • TCD: Tool Centroid Distance - median per-frame average Euclidean distance (in pixels) between Hungarian-matched tool instance centroids in generated vs ground-truth frames, with a half-diagonal penalty for unmatched tools (lower is better)
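Of the three metrics, FDS (L1) is straightforward to reproduce; the sketch below assumes both videos are already normalized to [-1, 1]. GATC and TCD additionally require SAM3 tool segmentation and Hungarian matching, which are omitted here.

```python
import numpy as np

def fds_l1(generated: np.ndarray, ground_truth: np.ndarray) -> float:
    """Frame Decay Score (L1): mean absolute error over all frames and pixels,
    with both videos normalized to [-1, 1]. Lower is better."""
    assert generated.shape == ground_truth.shape
    return float(np.mean(np.abs(generated - ground_truth)))

# Identical videos score 0.0; maximally different videos score 2.0.
```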

Per-Procedure Metrics (16k checkpoint)

| Procedure       | FDS (L1) ↓ | GATC ↑ | TCD (px) ↓ |
|-----------------|------------|--------|------------|
| Prostatectomy   | 0.229      | 0.429  | 130.7      |
| Inguinal Hernia | 0.259      | 0.261  | 211.3      |
| Hysterectomy    | 0.173      | 0.593  | 153.0      |
| Cholecystectomy | 0.248      | 0.188  | 60.8       |

Inference

Acceleration Engine: PyTorch, Transformer Engine

Test Hardware: Tested on A100.

Usage:

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have the proper rights and permissions for all input image and video content. If an input image or video includes people, personal health information, or intellectual property, the generated image or video will not blur or maintain the proportions of the subjects included.

For more detailed information on ethical considerations for this model, please see the Model Card++ for Bias, Explainability, Safety & Security, and Privacy Subcards below.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
