| --- |
| license: apache-2.0 |
| pipeline_tag: image-classification |
| tags: |
| - medical |
| - surgical |
| - endoscopy |
| --- |
| |
| <div align="center"> |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/67d9504a41d31cc626fcecc8/cE7UgFfJJ2gUHJr0SSEhc.png"> |
| </div> |
|
|
| [📚 Paper](https://arxiv.org/abs/2503.19740) - [🤖 GitHub](https://github.com/visurg-ai/LEMON) |
|
|
| This repository provides the models used in the data curation pipeline for the paper [LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings](https://arxiv.org/abs/2503.19740). These models assist in constructing the LEMON dataset by filtering and processing surgical video content. |
|
|
| For more details about the LEMON dataset and our LemonFM foundation model, please visit our [GitHub repository](https://github.com/visurg-ai/LEMON). |
|
|
| ## Citation |
|
|
| If you use our dataset, model, or code in your research, please cite our paper: |
|
|
| ```bibtex |
| @misc{che2025lemonlargeendoscopicmonocular, |
| title={LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings}, |
| author={Chengan Che and Chao Wang and Tom Vercauteren and Sophia Tsoka and Luis C. Garcia-Peraza-Herrera}, |
| year={2025}, |
| eprint={2503.19740}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV}, |
| url={https://arxiv.org/abs/2503.19740}, |
| } |
| ``` |
|
|
| ## Model Overview |
|
|
| This Hugging Face repository hosts three groups of models: video storyboard classification models, frame classification models, and non-surgical object detection models. The `build_model` function used to load them is defined in [model_loader.py](https://huggingface.co/visurg/Surg3M_curation_models/blob/main/model_loader.py). |
|
|
| <div align="center"> |
| <table style="margin-left: auto; margin-right: auto;"> |
| <tr> |
| <th>Model</th> |
| <th>Architecture</th> |
| <th>Download</th> |
| </tr> |
| <tr> |
| <td>Video storyboard classification models</td> |
| <td>ResNet-18</td> |
| <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/video_storyboard_classification">Full ckpt</a></td> |
| </tr> |
| <tr> |
| <td>Frame classification models</td> |
| <td>ResNet-18</td> |
| <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/frame_classification">Full ckpt</a></td> |
| </tr> |
| <tr> |
| <td>Non-surgical object detection models</td> |
| <td>YOLOv8-Nano</td> |
| <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/nonsurgical_object_detection">Full ckpt</a></td> |
| </tr> |
| </table> |
| </div> |
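
The loader and the checkpoint folders can also be fetched programmatically. The sketch below assumes the `huggingface_hub` package is installed and relies on its `hf_hub_download` and `snapshot_download` functions. The folder names come from the download links in the table, while `fetch_loader` and `fetch_checkpoints` are hypothetical helper names used only for illustration.

```python
REPO_ID = "visurg/Surg3M_curation_models"

# Folder names as they appear in the download links of this model card
MODEL_FOLDERS = (
    "video_storyboard_classification",
    "frame_classification",
    "nonsurgical_object_detection",
)


def fetch_loader() -> str:
    """Download model_loader.py and return its local path (hypothetical helper)."""
    # Imported lazily so this module can be loaded without the package installed
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=REPO_ID, filename="model_loader.py")


def fetch_checkpoints(folder: str) -> str:
    """Download one model folder and return the local snapshot directory."""
    if folder not in MODEL_FOLDERS:
        raise ValueError(f"unknown folder: {folder!r}")
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=REPO_ID, allow_patterns=[f"{folder}/*"])
```

Calling `fetch_checkpoints("frame_classification")` returns a local directory. The checkpoint file names inside each folder are not listed in this card, so inspect the directory before setting `model_path`.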
| |
| The data curation pipeline leading to the clean videos in the LEMON dataset is as follows: |
| <div align="center"> |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/67d9504a41d31cc626fcecc8/jzw36jlPT-V_I-Vm01OzO.png"> |
| </div> |
|
|
| ## Usage |
|
|
| ### Video storyboard classification models |
| **Video storyboard classification models** are used in step **2** of the data curation pipeline to classify a video storyboard as either surgical or non-surgical: |
|
|
| ```python |
| import torch |
| import torchvision |
| from PIL import Image |
| from model_loader import build_model |
| |
| # Use the GPU when available, otherwise fall back to the CPU |
| device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
| |
| # Build the classifier; DataParallel also enables multi-GPU inference |
| net = build_model(mode='classify') |
| net = torch.nn.DataParallel(net) |
| torch.backends.cudnn.benchmark = True |
| |
| # Load a checkpoint downloaded from video_storyboard_classification/ |
| model_path = 'path/to/your/checkpoint' |
| state = torch.load(model_path, map_location=torch.device('cpu')) |
| net.load_state_dict(state['net']) |
| net.to(device) |
| net.eval() |
| |
| # Load the video storyboard and convert it to a normalized PyTorch tensor |
| img_path = 'path/to/your/image.jpg' |
| img = Image.open(img_path).convert('RGB') |
| img = img.resize((224, 224)) |
| transform = torchvision.transforms.Compose([ |
|     torchvision.transforms.ToTensor(), |
|     torchvision.transforms.Normalize( |
|         (0.4299694, 0.29676908, 0.27707579), |
|         (0.24373249, 0.20208984, 0.19319402) |
|     ) |
| ]) |
| img_tensor = transform(img).unsqueeze(0).to(device) |
| |
| # Classify the storyboard; gradients are not needed at inference time |
| with torch.no_grad(): |
|     outputs = net(img_tensor) |
| ``` |
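
Assuming `outputs` holds one raw score (logit) per class, a label can be derived with a softmax followed by an argmax. The sketch below is written in plain Python so it runs without a GPU. The class order `("non-surgical", "surgical")` is an assumption made for illustration and should be checked against the training code in the GitHub repository; `scores_to_label` is a hypothetical helper, not part of `model_loader.py`.

```python
import math

# Assumed class order; verify it against the training code before relying on it
CLASS_NAMES = ("non-surgical", "surgical")


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scores_to_label(scores, class_names=CLASS_NAMES):
    """Return the most likely class name and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Example with made-up scores; with the model, pass outputs[0].tolist()
label, prob = scores_to_label([-1.2, 2.3])
print(label, round(prob, 3))
```

The same helper applies to the frame classification models, under the same assumption about class order.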
|
|
| ### Frame classification models |
| **Frame classification models** are used in step **3** of the data curation pipeline to classify a frame as either surgical or non-surgical: |
|
|
| ```python |
| import torch |
| import torchvision |
| from PIL import Image |
| from model_loader import build_model |
| |
| # Use the GPU when available, otherwise fall back to the CPU |
| device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
| |
| # Build the classifier; DataParallel also enables multi-GPU inference |
| net = build_model(mode='classify') |
| net = torch.nn.DataParallel(net) |
| torch.backends.cudnn.benchmark = True |
| |
| # Load a checkpoint downloaded from frame_classification/ |
| model_path = 'path/to/your/checkpoint' |
| state = torch.load(model_path, map_location=torch.device('cpu')) |
| net.load_state_dict(state['net']) |
| net.to(device) |
| net.eval() |
| |
| # Load the frame and convert it to a normalized PyTorch tensor |
| img_path = 'path/to/your/image.jpg' |
| img = Image.open(img_path).convert('RGB') |
| img = img.resize((224, 224)) |
| transform = torchvision.transforms.Compose([ |
|     torchvision.transforms.ToTensor(), |
|     torchvision.transforms.Normalize( |
|         (0.4299694, 0.29676908, 0.27707579), |
|         (0.24373249, 0.20208984, 0.19319402) |
|     ) |
| ]) |
| img_tensor = transform(img).unsqueeze(0).to(device) |
| |
| # Classify the frame; gradients are not needed at inference time |
| with torch.no_grad(): |
|     outputs = net(img_tensor) |
| ``` |
|
|
| ### Non-surgical object detection models |
| **Non-surgical object detection models** are used to mask out non-surgical regions in surgical frames (e.g. user interface information): |
|
|
| ```python |
| import torch |
| import torchvision |
| from PIL import Image |
| from model_loader import build_model |
| |
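| # Use the GPU when available, otherwise fall back to the CPU |
| device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
| |
| # Build the detector; DataParallel also enables multi-GPU inference |
| net = build_model(mode='mask') |
| net = torch.nn.DataParallel(net) |
| torch.backends.cudnn.benchmark = True |
| |
| # Load a checkpoint downloaded from nonsurgical_object_detection/ |
| model_path = 'path/to/your/checkpoint' |
| state = torch.load(model_path, map_location=torch.device('cpu')) |
| net.load_state_dict(state['net']) |
| net.to(device) |
| net.eval() |
| |
| # Load the surgical frame and convert it to a normalized PyTorch tensor |
| img_path = 'path/to/your/image.jpg' |
| img = Image.open(img_path).convert('RGB') |
| img = img.resize((224, 224)) |
| transform = torchvision.transforms.Compose([ |
|     torchvision.transforms.ToTensor(), |
|     torchvision.transforms.Normalize( |
|         (0.4299694, 0.29676908, 0.27707579), |
|         (0.24373249, 0.20208984, 0.19319402) |
|     ) |
| ]) |
| img_tensor = transform(img).unsqueeze(0).to(device) |
| |
| # Run the detector on the frame; gradients are not needed at inference time |
| with torch.no_grad(): |
|     outputs = net(img_tensor) |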
| ``` |
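
Once non-surgical regions have been located, they can be blacked out in the original frame. The sketch below assumes the detector output has already been converted to pixel boxes `(x1, y1, x2, y2)` in the 224×224 input space; the actual output format of the model is not documented in this card, so that conversion is left out. `mask_regions` is a hypothetical helper that uses NumPy only.

```python
import numpy as np


def mask_regions(frame, boxes, input_size=224):
    """Black out boxes given in the resized input space on the full-size frame.

    frame: H x W x 3 uint8 array; boxes: iterable of (x1, y1, x2, y2).
    """
    height, width = frame.shape[:2]
    scale_x = width / input_size   # map box coordinates back to frame pixels
    scale_y = height / input_size
    masked = frame.copy()          # leave the caller's array untouched
    for x1, y1, x2, y2 in boxes:
        left = max(0, int(np.floor(x1 * scale_x)))
        top = max(0, int(np.floor(y1 * scale_y)))
        right = min(width, int(np.ceil(x2 * scale_x)))
        bottom = min(height, int(np.ceil(y2 * scale_y)))
        masked[top:bottom, left:right] = 0
    return masked


# Example: a white 448x448 frame with one assumed box in 224x224 coordinates
frame = np.full((448, 448, 3), 255, dtype=np.uint8)
masked = mask_regions(frame, [(0, 0, 112, 56)])
```

Scaling the boxes back to the original resolution keeps the frame at full size, so only the detector input is resized.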