## Model Summary
SurgMotion is a video-native surgical foundation model that learns spatiotemporal representations by predicting latent motion rather than reconstructing pixels. Built upon the Video Joint Embedding Predictive Architecture (V-JEPA 2), SurgMotion captures the complex temporal dynamics of surgical procedures without the computational overhead of generative decoding.
**Key innovations:**

- **Latent motion prediction** – shifts from pixel-level reconstruction to abstract motion forecasting in latent space
- **Flow-Guided Latent Prediction** – a novel objective that prevents feature collapse in homogeneous surgical tissue regions
- **Pre-trained on SurgMotion-15M** – the largest multi-modal surgical video dataset to date (15M frames, 3,658 hours, 13+ anatomical regions)
## Model Variants
| Variant | Backbone | Parameters | Pre-training Data |
|---|---|---|---|
| SurgMotion-L | ViT-Large | 300M | SurgMotion-15M |
| SurgMotion-G | ViT-Giant-xformer | 1B | SurgMotion-15M |
## Pre-training Data: SurgMotion-15M
| Statistic | Value |
|---|---|
| Total Frames | 15M+ |
| Total Duration | 3,658 hours |
| Anatomical Regions | 13+ |
| Supported Tasks | 6+ (workflow, action, segmentation, triplet, skill, depth) |
SurgMotion-15M spans a diverse range of surgical procedures including laparoscopic, endoscopic, robotic, open, endonasal, neurosurgical, and ophthalmic surgeries.
## Performance Highlights

SurgMotion achieves state-of-the-art results across multiple surgical understanding tasks, and is the best-performing model overall on all representative surgical tasks evaluated.
### Workflow Phase Recognition (Avg F1)
| Dataset | Domain | SurgMotion |
|---|---|---|
| AutoLaparo | Laparoscopic hysterectomy | SOTA |
| Cholec80 | Laparoscopic cholecystectomy | SOTA |
| EgoSurgery | Egocentric open surgery | SOTA |
| M2CAI2016 | Laparoscopic cholecystectomy | SOTA |
| OphNet2024 | Ophthalmic surgery | SOTA |
| PitVis | Pituitary neurosurgery | SOTA |
| PmLR50 | Laparoscopic liver resection | SOTA |
| PolypDiag | GI endoscopy | SOTA |
### Multi-Task Performance
SurgMotion demonstrates strong generalization across diverse surgical understanding tasks beyond phase recognition:
| Task | Description | Result |
|---|---|---|
| Workflow Recognition | Surgical phase identification | Best |
| Action Recognition | Fine-grained action classification | Best |
| Segmentation | Instrument/tissue segmentation | Best |
| Triplet Recognition | Instrument-verb-target triplets | Best |
| Skill Assessment | Surgical skill scoring | Best |
| Depth Estimation | Monocular depth prediction | Best |
## Intended Use

SurgMotion is designed for:

- **Surgical workflow analysis** – automated phase and step recognition
- **Downstream fine-tuning** – feature extraction backbone for surgical vision tasks
- **Research benchmarking** – standardized evaluation of surgical video foundation models
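As a rough illustration of the fine-tuning use case, the sketch below freezes a stand-in backbone and trains only a linear probing head. The backbone, the 1024-dimensional feature size, and the 7-class phase head are illustrative assumptions, not the released SurgMotion interface.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a frozen SurgMotion encoder; the real backbone is a ViT.
backbone = nn.Linear(32, 1024)
for p in backbone.parameters():
    p.requires_grad = False          # freeze backbone weights

probe = nn.Linear(1024, 7)           # linear probing head (7 phases, hypothetical)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

x = torch.randn(4, 32)               # dummy clip features
y = torch.randint(0, 7, (4,))        # dummy phase labels

logits = probe(backbone(x))
loss = nn.functional.cross_entropy(logits, y)
loss.backward()                      # gradients flow only into the probe
optimizer.step()
```

Only the probing head receives gradient updates, which mirrors the lightweight-probing evaluation protocol used in the Quick Start above.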
## How to Use

### Quick Start

```bash
# Clone the repository and install
git clone https://github.com/CAIR-HKISI/SurgMotion.git
cd SurgMotion
pip install -e .

# Run probing evaluation on a dataset
python -m evals.main \
  --fname configs/foundation_model_probing/dinov3/AutoLaparo/dinov3_vitl_64f_autolaparo.yaml \
  --devices cuda:0
```
For detailed setup, data preparation, and usage instructions, please refer to the GitHub repository.
### Download from Hugging Face Hub

Released checkpoints are available through dedicated single-model repositories on the Hugging Face Hub. The original combined repository still hosts the raw PyTorch `.pt` files:

- `config.json`
- `SurgMotion-vitl.pt`
- `SurgMotion-vitg-xformer.pt`
For better Hugging Face Hub compatibility, fetch the repository config together with the checkpoint:
```python
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="CAIR-HKISI/SurgMotion",
    filename="config.json",
)
checkpoint_path = hf_hub_download(
    repo_id="CAIR-HKISI/SurgMotion",
    filename="SurgMotion-vitl.pt",
)
print(config_path)
print(checkpoint_path)
```
Alternatively, the files can be downloaded directly from the repository's file listing. The original `.pt` checkpoints remain the primary release artifacts and are unchanged.
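Once downloaded, a checkpoint can be inspected with `torch.load`. The snippet below uses a dummy checkpoint so it runs self-contained; the actual key layout of `SurgMotion-vitl.pt` is not documented in this card, so the key shown is a placeholder.

```python
import torch

# Create a dummy checkpoint to keep the example self-contained;
# in practice, pass the path returned by hf_hub_download.
dummy = {"encoder.blocks.0.weight": torch.zeros(4, 4)}
torch.save(dummy, "SurgMotion-dummy.pt")

# map_location="cpu" lets the file load on a machine without a GPU.
state_dict = torch.load("SurgMotion-dummy.pt", map_location="cpu")
print(sorted(state_dict.keys()))
```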
## Architecture

SurgMotion builds on the V-JEPA 2 framework with the following pipeline:

- **Video Encoder (ViT)** – processes 64-frame surgical video clips into spatiotemporal token sequences
- **Latent Predictor** – predicts masked-region representations in latent space, guided by optical flow
- **Probing Head** – lightweight temporal classifier for downstream phase recognition

The model learns without pixel-level reconstruction, relying entirely on latent-space self-supervised objectives. The Flow-Guided Latent Prediction mechanism specifically targets the homogeneous tissue regions common in surgical video, where representations learned with standard masking strategies tend to collapse.
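The flow-guided idea can be caricatured in a few lines: bias the masking distribution toward high-motion patches so that moving content, rather than flat tissue, drives the latent prediction target. All shapes, the temperature, and the hypothetical `flow_mag` input below are illustrative assumptions, not the paper's exact objective.

```python
import torch

torch.manual_seed(0)
num_patches, mask_ratio = 16, 0.5

# Hypothetical per-patch optical-flow magnitudes (precomputed offline).
flow_mag = torch.rand(num_patches)

# Sample masked patches with probability increasing in flow magnitude,
# so homogeneous (low-motion) tissue regions are masked less often.
probs = torch.softmax(flow_mag / 0.1, dim=0)
masked = torch.multinomial(probs, int(num_patches * mask_ratio), replacement=False)

# Latent targets from a (stand-in) target encoder; the predictor matches
# them on masked tokens only, with no pixel reconstruction involved.
target_latents = torch.randn(num_patches, 8)
pred_latents = torch.randn(num_patches, 8, requires_grad=True)
loss = torch.nn.functional.smooth_l1_loss(pred_latents[masked], target_latents[masked])
loss.backward()
```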
## Training Details
| Aspect | Detail |
|---|---|
| Framework | V-JEPA 2 (Video Joint Embedding Predictive Architecture) |
| Objective | Flow-Guided Latent Prediction |
| Input | 64 frames per clip |
| Pre-training Scale | 3,658 hours of surgical video |
| Hardware | Multi-GPU (8x NVIDIA A100/H100) |
## Evaluated Benchmarks
SurgMotion has been evaluated on the following public surgical phase datasets:
| Dataset | Task | Domain |
|---|---|---|
| AutoLaparo | Phase recognition | Laparoscopic hysterectomy |
| Cholec80 | Phase recognition | Laparoscopic cholecystectomy |
| EgoSurgery | Phase recognition | Egocentric open surgery |
| M2CAI2016 | Phase recognition | Laparoscopic cholecystectomy |
| OphNet2024 | Phase recognition | Ophthalmic surgery |
| PitVis | Phase recognition | Pituitary neurosurgery |
| PmLR50 | Phase recognition | Laparoscopic liver resection |
| PolypDiag | Binary classification | GI endoscopy |
| SurgicalActions160 | Action recognition | Multi-procedure |
## Limitations
- Performance may vary on surgical domains not represented in SurgMotion-15M
- Frame-level predictions may exhibit temporal fragmentation without post-processing smoothing
- Requires GPU for inference; not optimized for edge deployment
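The temporal-fragmentation limitation above is commonly mitigated with a simple post-hoc smoother. A sliding-window majority vote over frame-level phase predictions is one minimal option; this is a generic technique, not part of the SurgMotion release.

```python
from collections import Counter

def majority_smooth(preds, window=5):
    """Smooth frame-level labels with a centered sliding-window majority vote."""
    half = window // 2
    out = []
    for i in range(len(preds)):
        lo, hi = max(0, i - half), min(len(preds), i + half + 1)
        # Most frequent label in the window wins, absorbing isolated flickers.
        out.append(Counter(preds[lo:hi]).most_common(1)[0][0])
    return out

print(majority_smooth([0, 0, 1, 0, 0, 0, 2, 2, 0, 2, 2]))
```

Larger windows suppress more flicker at the cost of blurring true phase boundaries, so the window size should be tuned per dataset.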
## Citation
If you find SurgMotion useful, please cite:
```bibtex
@misc{wu2026surgmotionvideonativefoundationmodel,
  title={SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos},
  author={Jinlin Wu and Felix Holm and Chuxi Chen and An Wang and Yaxin Hu and Xiaofan Ye and Zelin Zang and Miao Xu and Lihua Zhou and Huai Liao and Danny T. M. Chan and Ming Feng and Wai S. Poon and Hongliang Ren and Dong Yi and Nassir Navab and Gaofeng Meng and Jiebo Luo and Hongbin Liu and Zhen Lei},
  year={2026},
  eprint={2602.05638},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.05638},
}
```
## Acknowledgement
SurgMotion is built on top of V-JEPA 2 by Meta. We thank the authors of the following works whose open-source models were used in our benchmark comparison:
DINOv2 | Endo-FM | EndoMamba | EndoSSL | EndoViT | GastroNet | GSViT | SelfSupSurg | SurgeNet | SurgVISTA | SurgVLP | VideoMAEv2
Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, CAS
Project Page | Paper | GitHub