Cosmos-H-Surgical Overview

Description:

Cosmos-H-Surgical-Predict is a fine-tuned world foundation model for surgical robotics applications, built on NVIDIA's Cosmos platform. Fine-tuned from Cosmos-Predict2.5-2B and adapted to the surgical domain, the model takes a first-frame image and a text description as input and predicts the next 92 frames of surgical video. This enables synthetic data generation (SDG) for training downstream policy models for surgical robotics. Functionality matches the original Cosmos-Predict2.5-2B with one key difference: Cosmos-H-Surgical-Predict drops text-only video generation and requires a first-frame image as input alongside the text description.

Cosmos-H-Surgical-Transfer is a fine-tuned world foundation model for surgical robotics applications, built on NVIDIA's Cosmos platform. Fine-tuned from Cosmos-Transfer2.5-2B, the model transfers control input videos (depth maps, segmentation masks, edge maps, or blurred RGB) into photorealistic surgical videos. This bridges the simulation-to-real (sim2real) gap by converting synthetic/CG-rendered videos into photorealistic equivalents.

This model is for research and development only.
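The two I/O contracts described above can be sketched as minimal request types. This is a hypothetical illustration (the names `PredictRequest`, `TransferRequest`, and `is_valid_predict_request` are not part of the Cosmos API); it only encodes the one behavioral difference the card calls out: Predict requires a first-frame image, with the text prompt optional.

```python
from dataclasses import dataclass, field

@dataclass
class PredictRequest:
    """Cosmos-H-Surgical-Predict input: first-frame image is REQUIRED."""
    first_frame_path: str            # JPG/PNG/JPEG/WebP first frame
    prompt: str = ""                 # optional text description
    num_predicted_frames: int = 92   # frames generated after the first frame

@dataclass
class TransferRequest:
    """Cosmos-H-Surgical-Transfer input: control video(s) plus text."""
    control_video_paths: list[str] = field(default_factory=list)  # depth / seg / edge / blurred RGB
    prompt: str = ""

def is_valid_predict_request(req: PredictRequest) -> bool:
    # Unlike the base Cosmos-Predict2.5-2B, text-only generation is not
    # supported, so a request without a first frame is invalid.
    return bool(req.first_frame_path)

req = PredictRequest(first_frame_path="frame0.png",
                     prompt="laparoscopic cholecystectomy, grasper retracting gallbladder")
print(is_valid_predict_request(req))  # True
```

A real integration would pass such a request to the Cosmos-Predict2.5 runtime; the sketch only captures the input contract.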

License/Terms of Use:

Use of this model is governed by the NVIDIA License.

Deployment Geography:

Global

Use Case:

Medical researchers, surgical robotics developers, AI developers, and healthcare institutions would be expected to use this model for:

  • Synthetic Data Generation (SDG): Generating synthetic surgical video data from a single observation frame to train policy models for surgical robotics
  • Physical AI Development: Advancing physical AI systems for surgical robotics by providing realistic training data
  • Simulation-to-Real Transfer: Converting simulated or CG-rendered surgical videos into photorealistic videos, minimizing the domain gap between simulated and real environments

This model is intended to serve as a research tool and is not intended to be used for clinical diagnostic purposes.

Release Date:

Huggingface: 03/16/2026 (GTC San Jose 2026) via https://huggingface.co/nvidia/cosmos-h-surgical-predict

Reference(s):

[1] "Cosmos-H-Surgical: Learning Surgical Robot Policies from Videos via World Modeling." arXiv preprint arXiv:2512.23162, 2025. https://arxiv.org/abs/2512.23162

[2] NVIDIA Cosmos: "World Foundation Models for Physical AI." arXiv preprint arXiv:2511.00062, 2025. https://arxiv.org/abs/2511.00062

[3] Cosmos-Predict2.5 Model Card: https://huggingface.co/nvidia/Cosmos-Predict2.5-2B

[4] Cosmos-Transfer2.5 Model Card: https://huggingface.co/nvidia/Cosmos-Transfer2.5-2B

Model Architecture:

Architecture Type: Diffusion Transformer
Network Architecture (Predict): Latent space video denoising model with interleaved self-attention, cross-attention, and feedforward layers
Network Architecture (Transfer): Latent space video denoising model with control branch injection
Task: Generation (Video Prediction / Video Transfer / Sim-to-Real)
Base Model (Predict): Cosmos-Predict2.5-2B (pre-trained)
Base Model (Transfer): Cosmos-Transfer2.5-2B
Number of model parameters: 2B

Input (Predict):

Input Type(s): Text+Image
Input Format(s): JPG/PNG/JPEG/WebP (image), String (text)
Input Parameters: Text: One-dimensional (1D), Image: Two-dimensional (2D)
Other Properties Related to Input:

  • Requires a first frame image and optionally a text description. Unlike the base Cosmos-Predict2.5-2B, text-only video generation (without a first frame) is not supported.
  • The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration.
  • For the 720P model, the input image should be 1280x704.
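The Predict input constraints above lend themselves to a pre-flight check before invoking the model. The helper below is a hypothetical sketch (not part of the Cosmos API) that encodes just the two constraints stated in this card: a prompt under 300 words and a 1280x704 first frame for the 720P model.

```python
def check_predict_inputs(prompt: str, image_size: tuple[int, int]) -> list[str]:
    """Return a list of problems with proposed Predict inputs (empty = OK)."""
    problems = []
    # The card asks for fewer than 300 words of descriptive content.
    if len(prompt.split()) >= 300:
        problems.append("prompt should contain fewer than 300 words")
    # The 720P model expects a first frame of exactly 1280x704.
    if image_size != (1280, 704):
        problems.append("720P model expects a 1280x704 first-frame image")
    return problems

print(check_predict_inputs("grasper dissecting the hepatocystic triangle", (1280, 704)))  # []
```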

Output (Predict):

Output Type(s): Video
Output Format: MP4
Output Parameters: Three-Dimensional (3D) — 92 frames at 1280x704, 16 FPS (5.75 seconds of video)
Other Properties Related to Output: The generated video is an approximately 5-second clip, with resolution and frame rate determined by the model variant used. For example, the 720P 16 FPS model produces a video with a resolution of 1280x704 and a frame rate of 16 FPS.
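As a quick sanity check on the numbers above (a hypothetical helper, not a Cosmos API): 92 frames at 16 FPS is 5.75 seconds, and the predicted frames plus the one conditioning first frame total 93 frames, which lines up with the 93-frame multiples the Transfer variant prefers (an observation from this card's figures, not a documented guarantee).

```python
def clip_duration_s(num_frames: int, fps: int) -> float:
    """Wall-clock duration of a clip, in seconds."""
    return num_frames / fps

print(clip_duration_s(92, 16))       # 5.75  (predicted frames only)
print(clip_duration_s(1 + 92, 16))   # 5.8125 (conditioning frame + predictions = 93 frames)
```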

Input (Transfer):

Input Type(s): Text+Video
Input Format(s): MP4 (video), String (text)
Input Parameters: Control video(s) (3D) + Text description (1D)
Other Properties Related to Input:

  • The input text string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration.
  • The model supports control input videos of varying lengths, but lengths that are multiples of 93 frames (e.g., 93, 186, or 279 frames) perform best.
  • The model supports four types of control input videos: blurred video, Canny edge video, depth map video, and segmentation mask video. When multiple control inputs are provided, they must be derived from the same source video, representing different modalities of the same content while maintaining identical spatio-temporal dimensions.
  • The control input video should have a spatial resolution of 1280x720 for the 720P model.
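The Transfer input constraints in the bullets above can likewise be verified before inference. The function below is a hypothetical pre-flight check (not part of the Cosmos API) covering the three stated constraints: frame counts in multiples of 93 work best, all control modalities must share identical spatio-temporal dimensions, and the 720P model expects 1280x720 input.

```python
def check_control_inputs(frame_counts: list[int],
                         resolutions: list[tuple[int, int]]) -> list[str]:
    """Return a list of problems with Transfer control videos (empty = OK).

    Each list position corresponds to one control modality (depth,
    segmentation, edge, or blurred RGB) derived from the same source video.
    """
    problems = []
    # All modalities must describe the same content at the same dimensions.
    if len(set(frame_counts)) > 1 or len(set(resolutions)) > 1:
        problems.append("all control videos must share identical spatio-temporal dimensions")
    # Multiples of 93 frames (93, 186, 279, ...) perform best.
    if any(n % 93 != 0 for n in frame_counts):
        problems.append("frame counts that are multiples of 93 perform best")
    # 720P model expects 1280x720 control video.
    if any(res != (1280, 720) for res in resolutions):
        problems.append("720P model expects 1280x720 control videos")
    return problems

print(check_control_inputs([186, 186], [(1280, 720), (1280, 720)]))  # []
```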

Output (Transfer):

Output Type(s): Video
Output Format: MP4
Output Parameters: Three-Dimensional (3D) — matching input duration at 1280x720, 16 FPS
Other Properties Related to Output: Photorealistic surgical video aligned with the input control conditions. The output maintains the spatial and temporal structure of the control input while adding realistic visual appearance.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • Cosmos-Predict2.5
  • Cosmos-Transfer2.5

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper
  • NVIDIA Blackwell

Note: Only BF16 precision is tested. Other precisions like FP16 or FP32 are not officially supported.

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

0.1 - Initial release version for surgical robotics synthetic data generation

Training, Testing, and Evaluation Datasets:

Dataset Overview:

The models were trained on a collection of public surgical video datasets spanning in-vivo clinical endoscopic footage captured via standard laparoscope or da Vinci robotic stereo endoscope cameras, with the exception of SutureBot which uses ex-vivo tissue phantom data collected via multi-view RGB cameras and robot kinematics sensors. The collection covers thousands of surgical videos and hundreds of thousands of annotated frames across multiple procedure types — laparoscopic cholecystectomy (Cholec80, CholecT50, HeiChole), robot-assisted radical prostatectomy (GraSP, SAR-RARP50), laparoscopic hysterectomy (AutoLaparo), laparoscopic Roux-en-Y gastric bypass (MultiBypass140), gynecologic laparoscopy (SurgicalActions160), and robotic suturing (SutureBot).

Training Dataset:

Data Modality:

  • Video (surgical procedures)

Video Training Data Size:

  • Less than 10,000 Hours

Data Collection Method by dataset:

  • Hybrid: Human, Automatic/Sensors

Labeling Method by dataset:

  • Hybrid: Human, Automated

Properties:

  • Approximately 280 full-length surgical videos comprising over 280,000 annotated frames (sampled at 1 fps) and several million raw frames at full capture rates, plus ~1,323 robotic suturing demonstrations.
  • Primary modality is video/image. Content consists predominantly of real human patient data (personal/clinical) captured in-vivo under institutional oversight across laparoscopic and robotic surgical procedures; one dataset (SutureBot) consists of ex-vivo tissue phantom demonstrations with no patient content.
  • No synthetic or machine-generated content; no natural language content.
  • Sensor types include standard laparoscope cameras and da Vinci robotic stereo endoscopes (854x480 to 1920x1080, 25-60 fps), with one dataset additionally providing synchronized robot kinematics from a da Vinci Research Kit (dVRK).

Testing Dataset:

Data Modality:

  • Video (surgical procedures)

Video Testing Data Size:

  • Less than 10,000 Hours

Data Collection Method by dataset:

  • Hybrid: Human, Automatic/Sensors

Labeling Method by dataset:

  • Hybrid: Human, Automated

Properties: Approximately 80 full-length surgical videos comprising ~80,000 annotated frames (sampled at 1 fps), plus ~378 robotic suturing demonstrations

Evaluation Dataset:

Data Modality:

  • Video (surgical procedures)

Video Evaluation Data Size:

  • Less than 10,000 Hours

Data Collection Method by dataset:

  • Hybrid: Human, Automatic/Sensors

Labeling Method by dataset:

  • Hybrid: Human, Automated

Properties: Approximately 40 full-length surgical videos comprising ~40,000 annotated frames (sampled at 1 fps), plus ~189 robotic suturing demonstrations

Inference:

Acceleration Engine: TensorRT
Test Hardware:

  • H100
  • Requires at least 32 GB of GPU VRAM.
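To put the VRAM requirement in context: in BF16 (the only tested precision), the 2B-parameter weights alone occupy only a few GiB; the 32 GB guidance also covers activations, attention buffers, video latents, and auxiliary components such as the text encoder. The arithmetic below is a back-of-the-envelope estimate, not an official memory breakdown.

```python
def weights_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory footprint of the raw model weights.

    BF16 stores 2 bytes per parameter. This deliberately ignores
    activations, latents, and other runtime buffers.
    """
    return num_params * bytes_per_param / 2**30

print(round(weights_memory_gib(2e9), 2))  # 3.73  (GiB for 2B params in BF16)
```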

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please ensure you have the proper rights and permissions for all input image and video content. If an input image or video includes people, personal health information, or intellectual property, be aware that the generated output will not blur the subjects it contains or preserve their proportions.

Please report model quality, risk, security vulnerabilities or concerns here.
