SurgMotion Access Agreement

Access to SurgMotion is provided under the same license as the GitHub repository: Apache License 2.0. Please review the license at https://github.com/CAIR-HKISI/SurgMotion/blob/main/LICENSE and provide your institution information before requesting access.

SurgMotion

A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Model Summary

SurgMotion is a video-native surgical foundation model that learns spatiotemporal representations by predicting latent motion rather than reconstructing pixels. Built upon the Video Joint Embedding Predictive Architecture (V-JEPA 2), SurgMotion captures the complex temporal dynamics of surgical procedures without the computational overhead of generative decoding.

Key innovations:

Latent motion prediction — shifts from pixel-level reconstruction to abstract motion forecasting in latent space
Flow-Guided Latent Prediction — a novel objective that prevents feature collapse in homogeneous surgical tissue regions
Pre-trained on SurgMotion-15M — the largest multi-modal surgical video dataset to date (15M frames, 3,658 hours, 13+ anatomical regions)

Model Variants

Variant	Backbone	Parameters	Pre-training Data
SurgMotion-L	ViT-Large	300M	SurgMotion-15M
SurgMotion-G	ViT-Giant-xformer	1B	SurgMotion-15M

Pre-training Data: SurgMotion-15M

Statistic	Value
Total Frames	15M+
Total Duration	3,658 hours
Anatomical Regions	13+
Supported Tasks	6+ (workflow, action, segmentation, triplet, skill, depth)

SurgMotion-15M spans a diverse range of surgical procedures including laparoscopic, endoscopic, robotic, open, endonasal, neurosurgical, and ophthalmic surgeries.
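As a rough sanity check on the statistics above, the frame count and total duration imply an average of a little over one counted frame per second of footage. The sketch below is only arithmetic on the two reported numbers. It treats "15M+" as exactly 15,000,000 frames and assumes the frames are spread uniformly over the footage; the card states neither.

```python
# Back-of-the-envelope check on the SurgMotion-15M statistics.
# Assumptions (not stated in the card): "15M+" is taken as exactly
# 15,000,000 frames, spread uniformly over the reported duration.
TOTAL_FRAMES = 15_000_000
TOTAL_HOURS = 3_658


def average_fps(total_frames: int, total_hours: float) -> float:
    """Average number of counted frames per second of footage."""
    return total_frames / (total_hours * 3600)


fps = average_fps(TOTAL_FRAMES, TOTAL_HOURS)
print(f"{fps:.2f} frames per second")  # 1.14 frames per second
```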

Performance Highlights

SurgMotion achieves state-of-the-art results across multiple surgical understanding tasks.

Best overall on all representative surgical tasks

Workflow Phase Recognition (Avg F1)

Dataset	Domain	SurgMotion
AutoLaparo	Laparoscopic hysterectomy	SOTA
Cholec80	Laparoscopic cholecystectomy	SOTA
EgoSurgery	Egocentric open surgery	SOTA
M2CAI2016	Laparoscopic cholecystectomy	SOTA
OphNet2024	Ophthalmic surgery	SOTA
PitVis	Pituitary neurosurgery	SOTA
PmLR50	Laparoscopic liver resection	SOTA
PolypDiag	GI endoscopy	SOTA
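The table header refers to an average F1 score, but the card does not say how the average is taken. One common reading is a macro average: compute F1 separately for each phase and take the unweighted mean. The sketch below illustrates that reading only. The function name is hypothetical, and the authors' evaluation code may average differently (for example per video).

```python
from collections import Counter
from typing import Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label sequences must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p where it was wrong
            fn[t] += 1  # missed the true class t
    scores = []
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy frame-level labels for three phases (made-up data for illustration).
truth = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 1, 2, 2, 2]
print(f"{macro_f1(truth, preds):.3f}")  # 0.822
```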

Multi-Task Performance

SurgMotion demonstrates strong generalization across diverse surgical understanding tasks beyond phase recognition:

Task	Description	Result
Workflow Recognition	Surgical phase identification	Best
Action Recognition	Fine-grained action classification	Best
Segmentation	Instrument/tissue segmentation	Best
Triplet Recognition	Instrument-verb-target triplets	Best
Skill Assessment	Surgical skill scoring	Best
Depth Estimation	Monocular depth prediction	Best

Intended Use

SurgMotion is designed for:

Surgical workflow analysis — automated phase and step recognition
Downstream fine-tuning — feature extraction backbone for surgical vision tasks
Research benchmarking — standardized evaluation of surgical video foundation models

How to Use

Quick Start

# Clone the repository
git clone https://github.com/CAIR-HKISI/SurgMotion.git
cd SurgMotion
pip install -e .

# Run probing evaluation on a dataset
python -m evals.main \
    --fname configs/foundation_model_probing/dinov3/AutoLaparo/dinov3_vitl_64f_autolaparo.yaml \
    --devices cuda:0

For detailed setup, data preparation, and usage instructions, please refer to the GitHub repository.
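To script several probing runs, the command from the Quick Start can be assembled in Python. This is a minimal sketch that uses only the module name and the two flags shown above (`evals.main`, `--fname`, `--devices`). The helper names are hypothetical, and the script is assumed to run from the repository root after installation.

```python
import subprocess
import sys
from typing import List


def build_eval_command(config_path: str, device: str = "cuda:0") -> List[str]:
    """Return the argv list for one probing run, mirroring the Quick Start."""
    return [
        sys.executable, "-m", "evals.main",
        "--fname", config_path,
        "--devices", device,
    ]


def run_eval(config_path: str, device: str = "cuda:0") -> None:
    """Launch the evaluation and raise CalledProcessError if it fails."""
    subprocess.run(build_eval_command(config_path, device), check=True)


# The config path is the one given in the Quick Start.
cmd = build_eval_command(
    "configs/foundation_model_probing/dinov3/AutoLaparo/"
    "dinov3_vitl_64f_autolaparo.yaml"
)
print(" ".join(cmd))
```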

Download from Hugging Face Hub

The recommended way to access released checkpoints is through the dedicated single-model repositories:

The original combined repository still hosts the raw PyTorch .pt files:

config.json
SurgMotion-vitl.pt
SurgMotion-vitg-xformer.pt

For better Hugging Face Hub compatibility, fetch the repository config together with the checkpoint:

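from huggingface_hub import hf_hub_download

# Download the repository config (or reuse the locally cached copy).
config_path = hf_hub_download(
    repo_id="CAIR-HKISI/SurgMotion",
    filename="config.json",
)

# Download the ViT-Large checkpoint from the same repository.
checkpoint_path = hf_hub_download(
    repo_id="CAIR-HKISI/SurgMotion",
    filename="SurgMotion-vitl.pt",
)

# Both calls return the local file path of the downloaded file.
print(config_path)
print(checkpoint_path)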
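The same call can be wrapped so that the checkpoint is chosen by variant name. In the sketch below, the mapping from variant to file name is an assumption inferred from the backbone names in the Model Variants table (ViT-Large to `SurgMotion-vitl.pt`, ViT-Giant-xformer to `SurgMotion-vitg-xformer.pt`), and the helper names are hypothetical. Because access is granted on request, an authenticated session or an access token may be needed.

```python
from typing import Optional

REPO_ID = "CAIR-HKISI/SurgMotion"

# Assumption: this mapping is inferred from the backbone names in the
# Model Variants table; the card does not state it explicitly.
CHECKPOINT_FILES = {
    "SurgMotion-L": "SurgMotion-vitl.pt",
    "SurgMotion-G": "SurgMotion-vitg-xformer.pt",
}


def checkpoint_filename(variant: str) -> str:
    """Look up the checkpoint file name for a variant."""
    try:
        return CHECKPOINT_FILES[variant]
    except KeyError:
        known = ", ".join(sorted(CHECKPOINT_FILES))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}") from None


def download_variant(variant: str, token: Optional[str] = None) -> str:
    """Download one checkpoint and return its local path."""
    # Imported lazily so the name lookup works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(variant),
        token=token,
    )
```

For example, `download_variant("SurgMotion-G")` would fetch the larger checkpoint, provided the mapping above is right.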

Or download them directly:

The original .pt checkpoints remain the primary release artifacts and are unchanged.
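The card does not describe the internal layout of the `.pt` files, so it can help to list the top-level entries before wiring a checkpoint into a model. The sketch below assumes only that the file can be read with `torch.load`. The helper names are hypothetical. Recent PyTorch releases may refuse to unpickle non-tensor objects by default, in which case the repository's own loading code is the safer route.

```python
from typing import Any, Dict


def describe_checkpoint(obj: Any) -> Dict[str, str]:
    """Map each top-level key to a tensor shape or a Python type name."""
    if not isinstance(obj, dict):
        return {"<root>": type(obj).__name__}
    summary = {}
    for key, value in obj.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            summary[str(key)] = f"shape={tuple(shape)}"
        else:
            summary[str(key)] = type(value).__name__
    return summary


def load_and_describe(path: str) -> Dict[str, str]:
    """Load a checkpoint on CPU and summarize its top-level entries."""
    # Imported lazily so describe_checkpoint can be used without PyTorch.
    import torch

    return describe_checkpoint(torch.load(path, map_location="cpu"))
```

Calling `load_and_describe(checkpoint_path)` on a downloaded file shows whether it holds a bare state dict or a wrapper with nested entries.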

Architecture

SurgMotion builds on the V-JEPA 2 framework with the following pipeline:

Video Encoder (ViT) — processes 64-frame surgical video clips into spatiotemporal token sequences
Latent Predictor — predicts masked region representations in latent space guided by optical flow
Probing Head — lightweight temporal classifier for downstream phase recognition

The model learns without pixel-level reconstruction, relying entirely on latent-space self-supervised objectives. The Flow-Guided Latent Prediction mechanism specifically addresses homogeneous tissue regions, which are common in surgical videos and where representations learned with standard masking strategies tend to collapse.
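To make the idea of a latent-space objective concrete, the sketch below computes a prediction loss over masked tokens only, comparing predicted and target embeddings rather than pixels. The card does not specify how optical flow enters the objective. Weighting each token by its flow magnitude is purely a hypothetical illustration, not the authors' formulation, and the L1 distance and all names are likewise assumptions. Plain Python lists stand in for tensors to keep the example dependency-free.

```python
from typing import Sequence

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two embeddings of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def flow_weighted_latent_loss(
    predicted: Sequence[Vector],
    target: Sequence[Vector],
    masked: Sequence[bool],
    flow_magnitude: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Flow-weighted mean L1 distance over masked tokens.

    Hypothetical weighting: tokens with more motion contribute more.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pred, tgt, is_masked, flow in zip(predicted, target, masked, flow_magnitude):
        if not is_masked:
            continue  # visible tokens are context, not prediction targets
        weighted_sum += flow * l1_distance(pred, tgt)
        weight_total += flow
    return weighted_sum / (weight_total + eps)


# Three toy tokens with 2-dimensional embeddings (made-up values).
loss = flow_weighted_latent_loss(
    predicted=[[1.0, 0.0], [0.0, 0.0], [2.0, 2.0]],
    target=[[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]],
    masked=[True, False, True],
    flow_magnitude=[3.0, 5.0, 1.0],
)
print(round(loss, 3))  # 0.625
```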

Training Details

Aspect	Detail
Framework	V-JEPA 2 (Video Joint Embedding Predictive Architecture)
Objective	Flow-Guided Latent Prediction
Input	64 frames per clip
Pre-training Scale	3,658 hours of surgical video
Hardware	Multi-GPU (8x NVIDIA A100/H100)
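The 64-frame input determines how many spatiotemporal tokens the encoder processes per clip. The card gives neither the input resolution nor the patch configuration, so the sketch below is a worked example under assumed values: 16×16 spatial patches, a temporal tubelet of 2 frames, and a 224×224 input. All three are assumptions chosen for illustration.

```python
def num_tokens(frames: int, height: int, width: int,
               tubelet: int = 2, patch: int = 16) -> int:
    """Number of spatiotemporal tokens for a clip under a tubelet embedding."""
    if frames % tubelet or height % patch or width % patch:
        raise ValueError("clip dimensions must be divisible by tubelet/patch size")
    return (frames // tubelet) * (height // patch) * (width // patch)


# Assumed values: 224x224 input, patch 16, tubelet 2 (not stated in the card).
print(num_tokens(64, 224, 224))  # 6272
```

Under these assumptions each clip yields 32 temporal positions times 196 spatial patches.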

Evaluated Benchmarks

SurgMotion has been evaluated on the following public surgical video datasets:

Dataset	Task	Domain
AutoLaparo	Phase recognition	Laparoscopic hysterectomy
Cholec80	Phase recognition	Laparoscopic cholecystectomy
EgoSurgery	Phase recognition	Egocentric open surgery
M2CAI2016	Phase recognition	Laparoscopic cholecystectomy
OphNet2024	Phase recognition	Ophthalmic surgery
PitVis	Phase recognition	Pituitary neurosurgery
PmLR50	Phase recognition	Laparoscopic liver resection
PolypDiag	Binary classification	GI endoscopy
SurgicalActions160	Action recognition	Multi-procedure

Limitations

Performance may vary on surgical domains not represented in SurgMotion-15M
Frame-level predictions may exhibit temporal fragmentation without post-processing smoothing
Requires GPU for inference; not optimized for edge deployment
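For the temporal fragmentation noted above, one simple post-processing option is a sliding-window majority vote over frame-level labels. This is a generic smoothing sketch, not a step taken from the SurgMotion repository. The function name and window size are assumptions, and because the window looks at future frames it only suits offline analysis.

```python
from collections import Counter
from typing import List, Sequence


def smooth_predictions(labels: Sequence[int], window: int = 5) -> List[int]:
    """Replace each label by the majority label in a centered window."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i, label in enumerate(labels):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        counts = Counter(labels[lo:hi])
        best = max(counts.values())
        if counts[label] == best:
            smoothed.append(label)  # keep the original label on ties
        else:
            # Ties among other labels resolve to the one seen first in the window.
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


# Toy per-frame phase predictions with two isolated flickers.
noisy = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
print(smooth_predictions(noisy, window=3))
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```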

Citation

If you find SurgMotion useful, please cite:

@misc{wu2026surgmotionvideonativefoundationmodel,
      title={SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos}, 
      author={Jinlin Wu and Felix Holm and Chuxi Chen and An Wang and Yaxin Hu and Xiaofan Ye and Zelin Zang and Miao Xu and Lihua Zhou and Huai Liao and Danny T. M. Chan and Ming Feng and Wai S. Poon and Hongliang Ren and Dong Yi and Nassir Navab and Gaofeng Meng and Jiebo Luo and Hongbin Liu and Zhen Lei},
      year={2026},
      eprint={2602.05638},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.05638}, 
}

Acknowledgement

SurgMotion is built on top of V-JEPA 2 by Meta. We thank the authors of the following works whose open-source models were used in our benchmark comparison:

Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, CAS

Project Page | Paper | GitHub


Paper: UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos (arXiv 2602.05638, published Feb 5)