---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
pipeline_tag: image-to-video
tags:
- simulation
- surgical-robotics
- video-generation
- policy-evaluation
- vla-evaluation
- healthcare-robotics
- world-model
---
# Cosmos-H-Surgical-Simulator

## Description
Cosmos-H-Surgical-Simulator is a surgical world foundation model fine-tuned on the Open-H Embodiment dataset, which includes clinical surgical procedures, for evaluating physically grounded surgical robotics policies in simulation and for synthetic data generation.
The model assists in evaluating surgical robotics policies in simulation before they are transferred to a physical system, primarily for CMR Surgical Versius clinical procedures (cholecystectomy, prostatectomy, inguinal hernia, and hysterectomy), as well as for dVRK, MITIC, and other surgical platforms across tasks such as suturing, tissue manipulation, and peg transfer.

The released model is based on the public NVIDIA Cosmos-Predict2.5 world foundation model for physical AI.

This model is ready for commercial and non-commercial use.

## License/Terms of Use
Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

## Deployment Geography
Global

## Use Case
Primarily intended for surgical robotics researchers, healthcare AI developers, academic institutions, and surgical robotics companies exploring surgical robotics policy evaluation and synthetic data generation.

## Release Date
- **GitHub:** 3/13/2026 via https://github.com/NVIDIA-Medtech/Cosmos-H-Surgical-Simulator
- **Hugging Face:** 3/15/2026 via https://huggingface.co/NVIDIA

## Reference(s)
Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.-W., Chattopadhyay, P., Chen, M., Chen, Y., Cheng, S., Cui, Y., Diamond, J., Ding, Y., Fan, J., Fan, L., Feng, L., Ferroni, F., Fidler, S., Fu, X., Gao, R., Ge, Y., Gu, J., … Zhu, Y. (2025). World Simulation with Video Foundation Models for Physical AI (arXiv:2511.00062) [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2511.00062

Link to the [nvidia/Cosmos-Predict2.5-2B-Video2World Model Card](https://huggingface.co/nvidia/Cosmos-Predict2.5-2B)

## Model Architecture
- **Architecture Type:** Diffusion Transformer
- **Network Architecture:** Latent video diffusion transformer (DiT-style denoiser) with cross-attention conditioning.

**This model was developed based on Cosmos-Predict2.5-2B-Video2World.**

The Cosmos-H-Surgical-Simulator model extends Cosmos-Predict2.5-2B-Video2World, a diffusion transformer for video generation in latent space.
It incorporates an MLP to condition the model on kinematic actions.
The model accepts a 44-dimensional action vector (22 dimensions per arm) alongside the current video frame and predicts the subsequent 12 frames.
Through autoregressive rollout, it can generate videos of complete surgical trajectories from either learned policies or manually designed action sequences.

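The autoregressive rollout can be sketched as follows. `predict_next_frames` is a hypothetical stand-in for the model's forward pass (the real interface lives in the Cosmos-Predict2.5 repository); its dummy body simply repeats the input frame so the shapes can be checked:

```python
import numpy as np

def predict_next_frames(frame, actions):
    """Stand-in for the model forward pass: one RGB frame (H, W, 3)
    plus 12 action vectors of 44 dims (22 per arm) -> 12 predicted frames."""
    assert actions.shape == (12, 44)
    return np.repeat(frame[None], 12, axis=0)  # dummy: repeat the current frame

def rollout(initial_frame, action_trajectory):
    """Autoregressive rollout: consume the action trajectory 12 actions at a
    time, feeding the last predicted frame back in as the new current frame."""
    frames = [initial_frame]
    for start in range(0, len(action_trajectory), 12):
        chunk = action_trajectory[start:start + 12]
        if len(chunk) < 12:
            break  # the model consumes exactly 12 actions per step
        frames.extend(predict_next_frames(frames[-1], chunk))
    return np.stack(frames)

video = rollout(np.zeros((288, 512, 3), dtype=np.uint8),
                np.zeros((36, 44), dtype=np.float32))
print(video.shape)  # (37, 288, 512, 3): initial frame + 3 x 12 predictions
```
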
## Input
- **Input Type(s):** Image (camera frame); a sequence of twelve 44-dimensional numerical vectors
- **Input Format(s):** Red, Green, Blue (RGB) frame; numeric vectors
- **Input Parameters:** Image: Two-Dimensional (2D) image frame; Vector: sequence of twelve forty-four-dimensional (44D) vectors
- **Other Properties Related to Input:**
  - **Recommended resolution:** The model operates at 512x288 resolution. Input frames are automatically resized; a 16:9 aspect ratio is recommended.
  - **Pre-processing:** The Versius kinematic action is a hybrid relative action as defined [here](https://arxiv.org/pdf/2407.12998).

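A minimal pre-processing sketch for the frame input, assuming a plain nearest-neighbor resize stands in for the model's automatic resizing (`prepare_frame` is illustrative, not part of the released code):

```python
import numpy as np

TARGET_W, TARGET_H = 512, 288  # model working resolution, exactly 16:9

def prepare_frame(frame: np.ndarray) -> np.ndarray:
    """Nearest-neighbor resize of an (H, W, 3) RGB frame to 512x288.
    A 16:9 source avoids distortion, since the target is exactly 16:9."""
    h, w = frame.shape[:2]
    rows = np.arange(TARGET_H) * h // TARGET_H  # source row per target row
    cols = np.arange(TARGET_W) * w // TARGET_W  # source col per target col
    return frame[rows][:, cols]

resized = prepare_frame(np.zeros((1080, 1920, 3), dtype=np.uint8))
print(resized.shape)  # (288, 512, 3)
```
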
## Output
- **Output Type(s):** A sequence of 12 video frames
- **Output Format:** mp4
- **Output Parameters:** Three-Dimensional (3D)
- **Other Properties Related to Output:**
  - By default, the model generates twelve video frames as output, representing the next world states after applying the input kinematic actions to the current video frame. Through an autoregressive loop, a complete video sequence representing a complete surgical trajectory can be generated.
  - No additional post-processing is strictly required.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration
**Runtime Engine(s):**

[Cosmos-Predict2.5](https://github.com/nvidia-cosmos/cosmos-predict2.5)

**Supported Hardware Microarchitecture Compatibility:**
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper

Note: Only BF16 precision is tested. Other precisions, such as FP16 or FP32, are not officially supported.

**Preferred/Supported Operating System(s):**
Linux (we have not tested other operating systems).

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s)
**v1.0 (fine-tuned on the Open-H Embodiment dataset, which includes clinical procedure data such as cholecystectomy, prostatectomy, inguinal hernia, and hysterectomy)**

Developers may integrate the model into an AI evaluation system by providing video frames as input along with corresponding kinematic actions to evaluate a surgical policy model, such as one for the CMR Surgical Versius robotic system.

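Such an evaluation loop can be sketched as follows, with `policy` and `world_model` as hypothetical stand-ins (not the released APIs) showing how a policy under test and the simulator alternate in closed loop:

```python
import numpy as np

def policy(frame):
    """Hypothetical surgical policy under evaluation: maps the current
    camera frame to the next 12 kinematic actions, shape (12, 44)."""
    return np.zeros((12, 44), dtype=np.float32)

def world_model(frame, actions):
    """Stand-in for the simulator's forward pass:
    current frame + 12 actions -> 12 predicted frames."""
    return np.repeat(frame[None], 12, axis=0)

def evaluate(initial_frame, num_steps=5):
    """Closed-loop evaluation: the policy acts on simulated frames,
    and the world model renders the consequences of its actions."""
    frame, frames = initial_frame, [initial_frame]
    for _ in range(num_steps):
        actions = policy(frame)               # policy proposes actions
        preds = world_model(frame, actions)   # simulator predicts outcomes
        frames.extend(preds)
        frame = preds[-1]                     # feed the last frame back in
    return np.stack(frames)

video = evaluate(np.zeros((288, 512, 3), dtype=np.uint8))
print(video.shape)  # (61, 288, 512, 3): initial frame + 5 x 12 predictions
```
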
## Training, Testing, and Evaluation Datasets

Dataset: the community-generated Open-H Embodiment dataset.

- **Total size:** Over 26,500 surgical task demonstrations (plus CMR Versius clinical procedures) spanning 32 datasets, 9 robot embodiments, and 10+ institutions, totaling ~4.9 million synchronized video-kinematics frames.
- **Dataset partition:** approximately 95% training and validation, 5% test.

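A minimal sketch of such an approximate 95/5 partition at the demonstration level (the split logic, counts, and seed are illustrative, not the released recipe):

```python
import random

def split_demonstrations(demo_ids, test_frac=0.05, seed=0):
    """Shuffle demonstration IDs and hold out ~test_frac of them for
    testing; the remainder is used for training and validation."""
    rng = random.Random(seed)
    ids = list(demo_ids)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_frac))
    return ids[n_test:], ids[:n_test]  # (train_val, test)

train_val, test = split_demonstrations(range(26500))
print(len(train_val), len(test))  # 25175 1325
```
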
### Training Dataset
- **Link:** Open-H Embodiment dataset
- **Data Collection Method:** Human (kinematic action trajectories)
- **Labeling Method:** Human
- **Properties:**
  - Over 26,500 surgical task demonstrations (plus CMR Versius clinical procedures) spanning 32 datasets, 9 robot embodiments, and 10+ institutions, totaling ~4.9 million synchronized video-kinematics frames.
  - Human experts teleoperated the surgical robotic arms; the demonstrations were recorded as paired video and kinematics.

### Testing Dataset
- **Link:** We tested on the holdout portion of the Open-H Embodiment dataset.
- **Data Collection Method:** Human
- **Labeling Method:** Human
- **Properties:**
  - ~1K frames for testing.

### Evaluation Dataset
- **Link:** We evaluated on the Open-H Embodiment dataset, using frames never seen during training.
- **Data Collection Method:** Human
- **Labeling Method:** Human
- **Properties:**
  - ~1K frames for final evaluation.

**Key performance:** The model can generate physically accurate surgical robotic videos given an initial frame and a kinematic action trajectory, recorded either from a surgeon or from a surgical robotic policy model such as Gr00t-H.

## Inference
**Acceleration Engine:** [PyTorch](https://pytorch.org/), [Transformer Engine](https://github.com/NVIDIA/TransformerEngine)

**Test Hardware:** Tested on NVIDIA A100.

**Usage:**
- See [Cosmos-Predict2.5](https://github.com/nvidia-cosmos/cosmos-predict2.5) for further details.

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When the model is downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have the proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated image or video will not blur the subjects or preserve their proportions.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy subcards below.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).