---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
pipeline_tag: image-to-video
tags:
- simulation
- surgical-robotics
- video-generation
- policy-evaluation
- vla-evaluation
- healthcare-robotics
- world-model
---
# Cosmos-H-Surgical-Simulator

## Description
Cosmos-H-Surgical-Simulator is a surgical world foundation model fine-tuned on the Open-H Embodiment dataset, which includes clinical surgical procedures, for evaluating physically grounded surgical robotics policies in simulation and for synthetic data generation.
The model assists in evaluating surgical robotics policies in simulation before they are transferred to a physical system, primarily for CMR Surgical Versius clinical procedures (cholecystectomy, prostatectomy, inguinal hernia, and hysterectomy), as well as for dVRK, MITIC, and other surgical platforms across tasks such as suturing, tissue manipulation, and peg transfer.

The released model is based on the public NVIDIA Cosmos-Predict2.5 world foundation model for physical AI.

This model is ready for commercial and non-commercial use.

## License/Terms of Use
Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

## Deployment Geography
Global

## Use Case
Primarily intended for surgical robotics researchers, healthcare AI developers, academic institutions, and surgical robotics companies exploring surgical robotics policy evaluation and synthetic data generation.

## Release Date
- **GitHub:** 3/13/2026 via https://github.com/NVIDIA-Medtech/Cosmos-H-Surgical-Simulator
- **Hugging Face:** 3/15/2026 via https://huggingface.co/NVIDIA

## Reference(s)
Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.-W., Chattopadhyay, P., Chen, M., Chen, Y., Cheng, S., Cui, Y., Diamond, J., Ding, Y., Fan, J., Fan, L., Feng, L., Ferroni, F., Fidler, S., Fu, X., Gao, R., Ge, Y., Gu, J., … Zhu, Y. (2025). World Simulation with Video Foundation Models for Physical AI (arXiv:2511.00062) [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2511.00062

Link to the [nvidia/Cosmos-Predict2.5-2B-Video2World Model Card](https://huggingface.co/nvidia/Cosmos-Predict2.5-2B)

## Model Architecture
- **Architecture Type:** Diffusion Transformer
- **Network Architecture:** Latent video diffusion transformer (DiT-style denoiser) with cross-attention conditioning.

**This model was developed based on Cosmos-Predict2.5-2B-Video2World.**

The Cosmos-H-Surgical-Simulator model extends Cosmos-Predict2.5-2B-Video2World, a diffusion transformer for video generation in latent space.
It incorporates an MLP to condition the model on kinematic actions.
The model accepts a 44-dimensional action vector (22 dimensions per arm) alongside the current video frame and predicts the subsequent 12 frames.
Through autoregressive rollout, it can generate videos of complete surgical trajectories from either learned policies or manually designed action sequences.

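The autoregressive rollout can be sketched as follows. `predict_next_frames` is a hypothetical stand-in for the model's forward pass (the real interface lives in the Cosmos-Predict2.5 repository); its dummy body simply repeats the input frame so the shapes can be checked:

```python
import numpy as np

def predict_next_frames(frame, actions):
    """Stand-in for the model forward pass: one RGB frame (H, W, 3)
    plus 12 action vectors of 44 dims (22 per arm) -> 12 predicted frames."""
    assert actions.shape == (12, 44)
    return np.repeat(frame[None], 12, axis=0)  # dummy: repeat the current frame

def rollout(initial_frame, action_trajectory):
    """Autoregressive rollout: consume the action trajectory 12 actions at a
    time, feeding the last predicted frame back in as the new current frame."""
    frames = [initial_frame]
    for start in range(0, len(action_trajectory), 12):
        chunk = action_trajectory[start:start + 12]
        if len(chunk) < 12:
            break  # the model consumes exactly 12 actions per step
        frames.extend(predict_next_frames(frames[-1], chunk))
    return np.stack(frames)

video = rollout(np.zeros((288, 512, 3), dtype=np.uint8),
                np.zeros((36, 44), dtype=np.float32))
print(video.shape)  # (37, 288, 512, 3): initial frame + 3 x 12 predictions
```
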
## Input
- **Input Type(s):** Image (camera frame); a sequence of twelve 44-dimensional numerical vectors
- **Input Format(s):** Red, Green, Blue (RGB) frame; numeric vectors
- **Input Parameters:** Image: Two-Dimensional (2D) image frame; Vector: sequence of twelve forty-four-dimensional (44D) vectors
- **Other Properties Related to Input:**
  - **Recommended resolution:** The model operates at 512x288 resolution. Input frames are automatically resized; a 16:9 aspect ratio is recommended.
  - **Pre-processing:** The Versius kinematic action is a hybrid relative action as defined [here](https://arxiv.org/pdf/2407.12998).

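A minimal pre-processing sketch for the frame input, assuming a plain nearest-neighbor resize stands in for the model's automatic resizing (`prepare_frame` is illustrative, not part of the released code):

```python
import numpy as np

TARGET_W, TARGET_H = 512, 288  # model working resolution, exactly 16:9

def prepare_frame(frame: np.ndarray) -> np.ndarray:
    """Nearest-neighbor resize of an (H, W, 3) RGB frame to 512x288.
    A 16:9 source avoids distortion, since the target is exactly 16:9."""
    h, w = frame.shape[:2]
    rows = np.arange(TARGET_H) * h // TARGET_H  # source row per target row
    cols = np.arange(TARGET_W) * w // TARGET_W  # source col per target col
    return frame[rows][:, cols]

resized = prepare_frame(np.zeros((1080, 1920, 3), dtype=np.uint8))
print(resized.shape)  # (288, 512, 3)
```
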
## Output
- **Output Type(s):** A sequence of 12 video frames
- **Output Format:** mp4
- **Output Parameters:** Three-Dimensional (3D)
- **Other Properties Related to Output:**
  - By default, the model generates twelve video frames as output, representing the next world states after applying the input kinematic actions to the current video frame. Through an autoregressive loop, a complete video sequence representing a complete surgical trajectory can be generated.
  - No additional post-processing is strictly required.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration
**Runtime Engine(s):**

[Cosmos-Predict2.5](https://github.com/nvidia-cosmos/cosmos-predict2.5)

**Supported Hardware Microarchitecture Compatibility:**
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper

Note: Only BF16 precision is tested. Other precisions, such as FP16 or FP32, are not officially supported.

**Preferred/Supported Operating System(s):**
Linux (we have not tested other operating systems).

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s)
**v1.0 (fine-tuned on the Open-H Embodiment dataset, which includes clinical procedure data such as cholecystectomy, prostatectomy, inguinal hernia, and hysterectomy)**

Developers may integrate the model into an AI evaluation system by providing video frames as input along with corresponding kinematic actions to evaluate a surgical policy model, such as one for the CMR Surgical Versius robotic system.

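Such an evaluation loop can be sketched as follows, with `policy` and `world_model` as hypothetical stand-ins (not the released APIs) showing how a policy under test and the simulator alternate in closed loop:

```python
import numpy as np

def policy(frame):
    """Hypothetical surgical policy under evaluation: maps the current
    camera frame to the next 12 kinematic actions, shape (12, 44)."""
    return np.zeros((12, 44), dtype=np.float32)

def world_model(frame, actions):
    """Stand-in for the simulator's forward pass:
    current frame + 12 actions -> 12 predicted frames."""
    return np.repeat(frame[None], 12, axis=0)

def evaluate(initial_frame, num_steps=5):
    """Closed-loop evaluation: the policy acts on simulated frames,
    and the world model renders the consequences of its actions."""
    frame, frames = initial_frame, [initial_frame]
    for _ in range(num_steps):
        actions = policy(frame)               # policy proposes actions
        preds = world_model(frame, actions)   # simulator predicts outcomes
        frames.extend(preds)
        frame = preds[-1]                     # feed the last frame back in
    return np.stack(frames)

video = evaluate(np.zeros((288, 512, 3), dtype=np.uint8))
print(video.shape)  # (61, 288, 512, 3): initial frame + 5 x 12 predictions
```
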
## Training, Testing, and Evaluation Datasets

Dataset: the community-generated Open-H Embodiment dataset.

- **Total size:** Over 26,500 surgical task demonstrations (plus CMR Versius clinical procedures) spanning 32 datasets, 9 robot embodiments, and 10+ institutions, totaling ~4.9 million synchronized video-kinematics frames.
- **Dataset partition:** approximately 95% training and validation, 5% test.

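A minimal sketch of such an approximate 95/5 partition at the demonstration level (the split logic, counts, and seed are illustrative, not the released recipe):

```python
import random

def split_demonstrations(demo_ids, test_frac=0.05, seed=0):
    """Shuffle demonstration IDs and hold out ~test_frac of them for
    testing; the remainder is used for training and validation."""
    rng = random.Random(seed)
    ids = list(demo_ids)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_frac))
    return ids[n_test:], ids[:n_test]  # (train_val, test)

train_val, test = split_demonstrations(range(26500))
print(len(train_val), len(test))  # 25175 1325
```
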
### Training Dataset
- **Link:** Open-H Embodiment dataset
- **Data Collection Method:** Human (kinematic action trajectories)
- **Labeling Method:** Human
- **Properties:**
  - Over 26,500 surgical task demonstrations (plus CMR Versius clinical procedures) spanning 32 datasets, 9 robot embodiments, and 10+ institutions, totaling ~4.9 million synchronized video-kinematics frames.
  - Human experts teleoperated the surgical robotic arms; the demonstrations were recorded as paired video and kinematics.

### Testing Dataset
- **Link:** We tested on the holdout portion of the Open-H Embodiment dataset.
- **Data Collection Method:** Human
- **Labeling Method:** Human
- **Properties:**
  - ~1K frames for testing.

### Evaluation Dataset
- **Link:** We evaluated on the Open-H Embodiment dataset, using frames never seen during training.
- **Data Collection Method:** Human
- **Labeling Method:** Human
- **Properties:**
  - ~1K frames for final evaluation.

**Key performance:** The model can generate physically accurate surgical robotic videos given an initial frame and a kinematic action trajectory, recorded either from a surgeon or from a surgical robotic policy model such as Gr00t-H.

## Inference
**Acceleration Engine:** [PyTorch](https://pytorch.org/), [Transformer Engine](https://github.com/NVIDIA/TransformerEngine)

**Test Hardware:** Tested on NVIDIA A100.

**Usage:**
- See [Cosmos-Predict2.5](https://github.com/nvidia-cosmos/cosmos-predict2.5) for further details.

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When the model is downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure it meets the requirements of the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have the proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated image or video will not blur the subjects or preserve their proportions.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, Safety & Security, and Privacy subcards below.

Please report model quality, risk, security vulnerabilities, or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).