---
license: apache-2.0
datasets:
- ardamamur/EgoExOR
language:
- en
metrics:
- f1
base_model:
- liuhaotian/llava-v1.5-7b
---
# EgoExOR Scene Graph Foundation Model
<table>
  <tr>
    <td style="padding: 0;">
      <a href="https://huggingface.co/datasets/ardamamur/EgoExOR">
        <img src="https://img.shields.io/badge/Data-4d5eff?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor" alt="Data">
      </a>
    </td>
    <td style="padding: 0;">
      <a href="https://github.com/ardamamur/EgoExOR">
        <img src="https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white" alt="Code">
      </a>
    </td>
  </tr>
</table>

This repository hosts the foundation model for **surgical scene graph generation** trained on the [EgoExOR](https://huggingface.co/datasets/ardamamur/EgoExOR) dataset, a multimodal, multi-perspective dataset collected in a simulated operating room (OR) environment.
|
> Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating
> advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context,
> but do not explore the comprehensive combination of both. We introduce EgoExOR,
> the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives.
> Spanning 94 minutes (84,553 frames at 15 FPS) of two simulated spine procedures,
> Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery,
> EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses,
> exocentric RGB and depth from RGB-D cameras, and ultrasound imagery.
> Its detailed scene graph annotations, covering 36 entities and 22 relations (~573,000 triplets), enable robust modeling of clinical interactions,
> supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of
> two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR’s multimodal and multi-perspective signals.
> Our new dataset and benchmark set a new foundation for OR perception, offering a rich,
> multimodal resource for next-generation clinical perception. Our code is available at [EgoExOR GitHub](https://github.com/ardamamur/EgoExOR) and the dataset at [EgoExOR Hugging Face Dataset](https://huggingface.co/datasets/ardamamur/EgoExOR).
|
## 🧠 Model Overview
|
<p align="center">
  <img src="https://github.com/ardamamur/EgoExOR/blob/main/figures/model_overview.png?raw=true" alt="EgoExOR Overview" width="80%"/>
</p>
<p align="center">
  <em>Figure: Overview of the proposed EgoExOR model for surgical scene graph generation. The model
  employs a dual-branch architecture to separately process egocentric and exocentric modalities. Fused
  embeddings are passed to a large language model (LLM) to autoregressively generate scene graph
  triplets representing entities and their interactions.</em>
</p>
|
**EgoExOR model.** To fully exploit EgoExOR’s rich multi-perspective data, we introduce a new baseline model featuring a dual-branch architecture.
The egocentric branch processes first-person RGB, hand pose, and gaze data, while the exocentric branch handles third-person RGB-D,
ultrasound recordings, audio, and point clouds. Each branch uses a 2-layer transformer to fuse its inputs into N feature embeddings.
These are concatenated and fed into the LLM for triplet prediction.
By explicitly separating and fusing perspective-specific features,
our model better captures actions and staff interactions, outperforming single-stream baselines in modeling complex OR dynamics.
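As a rough illustration of this design, the sketch below wires two perspective branches together in PyTorch and concatenates their outputs into an LLM prefix; the module names, embedding width, head count, and token counts are illustrative assumptions for this card, not the released implementation.

```python
# Illustrative sketch of the dual-branch fusion described above (not the released code).
# Each branch fuses its per-modality tokens with a small transformer encoder and
# summarizes them into N learnable query tokens; ego and exo outputs are concatenated
# and would be prepended to the LLM's input embeddings for triplet generation.
import torch
import torch.nn as nn

class PerspectiveBranch(nn.Module):
    def __init__(self, dim=1024, num_layers=2, num_out_tokens=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.queries = nn.Parameter(torch.randn(num_out_tokens, dim))  # N summary tokens

    def forward(self, modality_tokens):
        # modality_tokens: (batch, seq, dim), already projected to a shared width.
        queries = self.queries.unsqueeze(0).expand(modality_tokens.size(0), -1, -1)
        fused = self.encoder(torch.cat([queries, modality_tokens], dim=1))
        return fused[:, : self.queries.size(0)]  # keep only the N summary tokens

ego_branch, exo_branch = PerspectiveBranch(), PerspectiveBranch()
ego_tokens = torch.randn(2, 64, 1024)   # e.g. first-person RGB + gaze + hand-pose tokens
exo_tokens = torch.randn(2, 128, 1024)  # e.g. RGB-D + ultrasound + audio + point-cloud tokens
llm_prefix = torch.cat([ego_branch(ego_tokens), exo_branch(exo_tokens)], dim=1)
print(llm_prefix.shape)  # torch.Size([2, 64, 1024])
```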
|
## 📊 Benchmark Results
|
This model outperforms prior single-stream baselines like [ORacle](https://arxiv.org/pdf/2404.07031) and [MM2SG](https://arxiv.org/pdf/2503.02579)
by effectively leveraging perspective-specific signals.
|
| Model | UI F1 | MISS F1 | Overall F1 |
|--------------------|----------|----------|------------|
| ORacle (Baseline)  | 0.70     | 0.71     | 0.69       |
| MM2SG (Baseline)   | 0.77     | 0.68     | 0.72       |
| **EgoExOR (Ours)** | **0.86** | **0.70** | **0.79**   |

*UI: Ultrasound-Guided Needle Insertion; MISS: Minimally Invasive Spine Surgery.*
|
Overall, as the table above shows, the dual-branch EgoExOR model achieves
the highest macro F1. Several predicates in EgoExOR depend on transient tool-hand
trajectories and fine-grained action cues that are difficult to recover from any single viewpoint. This emphasizes the importance of explicitly modeling
multiple viewpoints and leveraging all available modalities to improve OR scene understanding.
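For readers reproducing a metric of this kind, the sketch below computes a macro-averaged triplet F1; the exact label set, matching rules, and averaging used for the reported numbers are defined in the EgoExOR GitHub code, and the example predicates are made up, so treat this as illustrative only.

```python
# Illustrative macro F1 over scene graph triplets (not the official benchmark script).
# Per frame, predictions and ground truth are sets of (subject, predicate, object)
# triplets; F1 is computed per predicate and then macro-averaged.
from collections import defaultdict

def macro_triplet_f1(pred_frames, gt_frames):
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, gt in zip(pred_frames, gt_frames):
        pred, gt = set(pred), set(gt)
        for _, p, _ in pred & gt:
            tp[p] += 1
        for _, p, _ in pred - gt:
            fp[p] += 1
        for _, p, _ in gt - pred:
            fn[p] += 1
    f1s = []
    for p in set(tp) | set(fp) | set(fn):
        prec = tp[p] / (tp[p] + fp[p]) if tp[p] + fp[p] else 0.0
        rec = tp[p] / (tp[p] + fn[p]) if tp[p] + fn[p] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

# Toy example: one correct triplet, one spurious prediction, one missed annotation.
pred = [{("surgeon", "holding", "needle"), ("nurse", "touching", "patient")}]
gt   = [{("surgeon", "holding", "needle"), ("surgeon", "scanning", "patient")}]
print(round(macro_triplet_f1(pred, gt), 3))  # 0.333
```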
|
## 🗃️ Dataset
|
EgoExOR provides:
- 84,553 frames (94 minutes at 15 FPS)
- 2 simulated surgical procedures: Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery (MISS)
- 36 entities and 22 predicates
- Over 573,000 scene graph triplets
- Multimodal signals: RGB, depth, gaze, audio, ultrasound, point cloud, hand tracking
|
You can find the dataset processing tools in the [EgoExOR GitHub repo](https://github.com/ardamamur/EgoExOR).
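If you only need the raw files locally, a minimal sketch using `huggingface_hub` follows; the repo id comes from this card, while the on-disk layout and parsing utilities are documented in the GitHub repo.

```python
# Minimal sketch: fetch the EgoExOR dataset files from the Hugging Face Hub.
# This only downloads the raw files; see the EgoExOR GitHub repo for the tools
# that parse and process them.
from huggingface_hub import snapshot_download

data_dir = snapshot_download(repo_id="ardamamur/EgoExOR", repo_type="dataset")
print(data_dir)  # local path containing the downloaded dataset files
```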
|
## 🔗 Links
|
- 🖥️ Code: [EgoExOR GitHub](https://github.com/ardamamur/EgoExOR)
- 🤗 Dataset: [EgoExOR Hugging Face Dataset](https://huggingface.co/datasets/ardamamur/EgoExOR)
- 🤗 Model Card & Weights: [EgoExOR Hugging Face Model](https://huggingface.co/ardamamur/EgoExOR)
|
|