---
license: apache-2.0
pipeline_tag: image-classification
tags:
- medical
- surgical
- endoscopy
---

<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/67d9504a41d31cc626fcecc8/cE7UgFfJJ2gUHJr0SSEhc.png">
</div>

[📚 Paper](https://arxiv.org/abs/2503.19740) - [🤖 GitHub](https://github.com/visurg-ai/LEMON) 

This repository provides the models used in the data curation pipeline for the paper [LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings](https://arxiv.org/abs/2503.19740). These models assist in constructing the LEMON dataset by filtering and processing surgical video content. 

For more details about the LEMON dataset and our LemonFM foundation model, please visit our [GitHub repository](https://github.com/visurg-ai/LEMON).

## Citation

If you use our dataset, model, or code in your research, please cite our paper:

```bibtex
@misc{che2025lemonlargeendoscopicmonocular,
      title={LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings}, 
      author={Chengan Che and Chao Wang and Tom Vercauteren and Sophia Tsoka and Luis C. Garcia-Peraza-Herrera},
      year={2025},
      eprint={2503.19740},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.19740}, 
}
```

## Model Overview

This Hugging Face repository includes video storyboard classification models, frame classification models, and non-surgical object detection models. The model loader file can be found at [model_loader.py](https://huggingface.co/visurg/Surg3M_curation_models/blob/main/model_loader.py).

<div align="center">
<table style="margin-left: auto; margin-right: auto;">
  <tr>
    <th>Model</th>
    <th>Architecture</th>
    <th>Download</th>
  </tr>
  <tr>
    <td>Video storyboard classification models</td>
    <td>ResNet-18</td>
    <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/video_storyboard_classification">Full ckpt</a></td>
  </tr>
  <tr>
    <td>Frame classification models</td>
    <td>ResNet-18</td>
    <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/frame_classification">Full ckpt</a></td>
  </tr>
  <tr>
    <td>Non-surgical object detection models</td>
    <td>YOLOv8-Nano</td>
    <td><a href="https://huggingface.co/visurg/Surg3M_curation_models/tree/main/nonsurgical_object_detection">Full ckpt</a></td>
  </tr>
</table>
</div>
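
The checkpoint folders linked in the table can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. `checkpoint_patterns` and `download_checkpoints` are hypothetical helper names, and the folder names are taken from the download links in the table.

```python
# Repository and folder names as they appear in the download links of the table.
REPO_ID = "visurg/Surg3M_curation_models"
CHECKPOINT_FOLDERS = {
    "video_storyboard": "video_storyboard_classification",
    "frame": "frame_classification",
    "detection": "nonsurgical_object_detection",
}


def checkpoint_patterns(kind):
    """Return file patterns selecting one checkpoint folder plus the model loader."""
    return [f"{CHECKPOINT_FOLDERS[kind]}/*", "model_loader.py"]


def download_checkpoints(kind, local_dir=None):
    """Download one checkpoint folder and model_loader.py; return the local path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(kind),
        local_dir=local_dir,
    )
```

For example, `download_checkpoints("frame")` would return the local directory that contains the `frame_classification` folder. The checkpoint file names inside each folder are not listed here, so inspect the downloaded folder to find them.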

The data curation pipeline leading to the clean videos in the LEMON dataset is as follows:
<div align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/67d9504a41d31cc626fcecc8/jzw36jlPT-V_I-Vm01OzO.png">
</div>

## Usage

### Video storyboard classification models
**Video storyboard classification models** are used in step **2** of the data curation pipeline to classify a video storyboard as either surgical or non-surgical:

   ```python
   import torch
   import torchvision
   from PIL import Image
   from model_loader import build_model

   # Use the GPU when one is available, otherwise fall back to the CPU
   device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

   # Load the model (replace model_path with the path to the downloaded checkpoint)
   net = build_model(mode='classify')
   model_path = 'Video storyboard classification models'

   # Enable multi-GPU support
   net = torch.nn.DataParallel(net)
   torch.backends.cudnn.benchmark = True
   state = torch.load(model_path, map_location=torch.device('cpu'))
   net.load_state_dict(state['net'])
   net.to(device)
   net.eval()

   # Load the video storyboard and convert it to a normalized PyTorch tensor
   img_path = 'path/to/your/image.jpg'
   img = Image.open(img_path).convert('RGB')
   img = img.resize((224, 224))
   transform = torchvision.transforms.Compose([
       torchvision.transforms.ToTensor(),
       torchvision.transforms.Normalize(
           (0.4299694, 0.29676908, 0.27707579),
           (0.24373249, 0.20208984, 0.19319402)
       )
   ])
   img_tensor = transform(img).unsqueeze(0).to(device)

   # Run inference to obtain the classification output
   with torch.no_grad():
       outputs = net(img_tensor)
   ```
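
To turn `outputs` into a decision, the raw scores are typically converted to probabilities. The sketch below assumes the classifier returns one raw score (logit) per class. The label order in `CLASS_NAMES` is hypothetical and should be checked against the GitHub repository, and `softmax` and `predict` are illustrative helpers rather than part of `model_loader.py`.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtract the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, class_names):
    """Return the most likely class name and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical label order; verify the actual mapping before relying on it.
CLASS_NAMES = ["non-surgical", "surgical"]

# Hypothetical scores, e.g. obtained with outputs[0].tolist()
label, prob = predict([-1.2, 2.3], CLASS_NAMES)
print(label, round(prob, 3))
```

With PyTorch tensors, `torch.softmax(outputs, dim=1)` computes the same probabilities directly.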

### Frame classification models
**Frame classification models** are used in step **3** of the data curation pipeline to classify a frame as either surgical or non-surgical:

   ```python
   import torch
   import torchvision
   from PIL import Image
   from model_loader import build_model

   # Use the GPU when one is available, otherwise fall back to the CPU
   device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

   # Load the model (replace model_path with the path to the downloaded checkpoint)
   net = build_model(mode='classify')
   model_path = 'Frame classification models'

   # Enable multi-GPU support
   net = torch.nn.DataParallel(net)
   torch.backends.cudnn.benchmark = True
   state = torch.load(model_path, map_location=torch.device('cpu'))
   net.load_state_dict(state['net'])
   net.to(device)
   net.eval()

   # Load the frame and convert it to a normalized PyTorch tensor
   img_path = 'path/to/your/image.jpg'
   img = Image.open(img_path).convert('RGB')
   img = img.resize((224, 224))
   transform = torchvision.transforms.Compose([
       torchvision.transforms.ToTensor(),
       torchvision.transforms.Normalize(
           (0.4299694, 0.29676908, 0.27707579),
           (0.24373249, 0.20208984, 0.19319402)
       )
   ])
   img_tensor = transform(img).unsqueeze(0).to(device)

   # Run inference to obtain the classification output
   with torch.no_grad():
       outputs = net(img_tensor)
   ```
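
The usage snippets wrap the network in `torch.nn.DataParallel` before calling `load_state_dict`. `DataParallel` stores the wrapped network under the attribute `module`, so its state dict keys carry a `module.` prefix. If the checkpoint keys follow that convention (an assumption, not verified here), the weights could also be loaded into the unwrapped network after removing the prefix. `strip_module_prefix` is a hypothetical helper:

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from its keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy example with placeholder values instead of tensors.
wrapped = {"module.conv1.weight": 1, "module.fc.bias": 2}
print(strip_module_prefix(wrapped))
```

With a real checkpoint this would be used as `net.load_state_dict(strip_module_prefix(state['net']))` on the network returned by `build_model`, without the `DataParallel` wrapper.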

### Non-surgical object detection models
**Non-surgical object detection models** are used to mask out non-surgical regions in surgical frames (e.g. user interface information):

   ```python
   import torch
   import torchvision
   from PIL import Image
   from model_loader import build_model

   # Use the GPU when one is available, otherwise fall back to the CPU
   device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

   # Load the model (replace model_path with the path to the downloaded checkpoint)
   net = build_model(mode='mask')
   model_path = 'Non-surgical object detection models'

   # Enable multi-GPU support
   net = torch.nn.DataParallel(net)
   torch.backends.cudnn.benchmark = True
   state = torch.load(model_path, map_location=torch.device('cpu'))
   net.load_state_dict(state['net'])
   net.to(device)
   net.eval()

   # Load the frame and convert it to a normalized PyTorch tensor
   img_path = 'path/to/your/image.jpg'
   img = Image.open(img_path).convert('RGB')
   img = img.resize((224, 224))
   transform = torchvision.transforms.Compose([
       torchvision.transforms.ToTensor(),
       torchvision.transforms.Normalize(
           (0.4299694, 0.29676908, 0.27707579),
           (0.24373249, 0.20208984, 0.19319402)
       )
   ])
   img_tensor = transform(img).unsqueeze(0).to(device)

   # Run inference to obtain the non-surgical object predictions
   with torch.no_grad():
       outputs = net(img_tensor)
   ```
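
Because the frame is resized to 224×224 before inference, a box predicted on the resized image has to be mapped back to the original frame size before the region is masked. The output format of the detection model is not documented here, so the sketch below assumes boxes given as `(x1, y1, x2, y2)` pixel coordinates in the resized image. `scale_box` is a hypothetical helper.

```python
def scale_box(box, resized_size, original_size):
    """Map an (x1, y1, x2, y2) box from the resized image to the original frame."""
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    x1, y1, x2, y2 = box
    return (
        round(x1 * original_w / resized_w),
        round(y1 * original_h / resized_h),
        round(x2 * original_w / resized_w),
        round(y2 * original_h / resized_h),
    )


# Hypothetical box detected in the 224x224 input, mapped to a 1920x1080 frame.
print(scale_box((0, 0, 112, 56), (224, 224), (1920, 1080)))
```

The scaled box can then be filled with a constant color on the original frame, for example with `PIL.ImageDraw.Draw(img).rectangle(box, fill=(0, 0, 0))`.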