---
license: apache-2.0
datasets:
- ardamamur/EgoExOR
language:
- en
metrics:
- f1
base_model:
- liuhaotian/llava-v1.5-7b
---
# EgoExOR Scene Graph Foundation Model
<table>
  <tr>
    <td style="padding: 0;">
      <a href="https://huggingface.co/datasets/ardamamur/EgoExOR">
        <img src="https://img.shields.io/badge/Data-4d5eff?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor" alt="Data">
      </a>
    </td>
    <td style="padding: 0;">
      <a href="https://github.com/ardamamur/EgoExOR">
        <img src="https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white" alt="Code">
      </a>
    </td>
  </tr>
</table>

This repository hosts the foundation model for **surgical scene graph generation** trained on the [EgoExOR](https://huggingface.co/datasets/ardamamur/EgoExOR) dataset, a multimodal, multi-perspective dataset collected in a simulated operating room (OR) environment.
|
> Operating rooms (ORs) demand precise coordination among surgeons, nurses, and equipment in a fast-paced, occlusion-heavy environment, necessitating
> advanced perception models to enhance safety and efficiency. Existing datasets either provide partial egocentric views or sparse exocentric multi-view context,
> but do not explore the comprehensive combination of both. We introduce EgoExOR,
> the first OR dataset and accompanying benchmark to fuse first-person and third-person perspectives.
> Spanning 94 minutes (84,553 frames at 15 FPS) of two simulated spine procedures,
> Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery,
> EgoExOR integrates egocentric data (RGB, gaze, hand tracking, audio) from wearable glasses,
> exocentric RGB and depth from RGB-D cameras, and ultrasound imagery.
> Its detailed scene graph annotations, covering 36 entities and 22 relations (~573,000 triplets), enable robust modeling of clinical interactions,
> supporting tasks like action recognition and human-centric perception. We evaluate the surgical scene graph generation performance of
> two adapted state-of-the-art models and offer a new baseline that explicitly leverages EgoExOR’s multimodal and multi-perspective signals.
> Our new dataset and benchmark set a new foundation for OR perception, offering a rich,
> multimodal resource for next-generation clinical perception. Our code is available at [EgoExOR GitHub](https://github.com/ardamamur/EgoExOR) and the dataset at [EgoExOR Hugging Face Dataset](https://huggingface.co/datasets/ardamamur/EgoExOR).
|
## 🧠 Model Overview
|
<p align="center">
  <img src="https://github.com/ardamamur/EgoExOR/blob/main/figures/model_overview.png?raw=true" alt="EgoExOR Overview" width="80%"/>
</p>
<p align="center">
  <em>Figure: Overview of the proposed EgoExOR model for surgical scene graph generation. The model
  employs a dual-branch architecture to separately process egocentric and exocentric modalities. Fused
  embeddings are passed to a large language model (LLM) to autoregressively generate scene graph
  triplets representing entities and their interactions.</em>
</p>
|
**EgoExOR model.** To fully exploit EgoExOR’s rich multi-perspective data, we introduce a new baseline model featuring a dual-branch architecture.
The egocentric branch processes first-person RGB, hand pose, and gaze data, while the exocentric branch handles third-person RGB-D,
ultrasound recordings, audio, and point clouds. Each branch uses a 2-layer transformer to fuse its inputs into N feature embeddings.
These are concatenated and fed into the LLM for triplet prediction.
By explicitly separating and fusing perspective-specific features,
our model better captures actions and staff interactions, outperforming single-stream baselines in modeling complex OR dynamics.
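As a rough illustration of this design, the sketch below wires two perspective branches together in PyTorch and concatenates their outputs into an LLM prefix; the module names, embedding width, head count, and token counts are illustrative assumptions for this card, not the released implementation.

```python
# Illustrative sketch of the dual-branch fusion described above (not the released code).
# Each branch fuses its per-modality tokens with a small transformer encoder and
# summarizes them into N learnable query tokens; ego and exo outputs are concatenated
# and would be prepended to the LLM's input embeddings for triplet generation.
import torch
import torch.nn as nn

class PerspectiveBranch(nn.Module):
    def __init__(self, dim=1024, num_layers=2, num_out_tokens=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.queries = nn.Parameter(torch.randn(num_out_tokens, dim))  # N summary tokens

    def forward(self, modality_tokens):
        # modality_tokens: (batch, seq, dim), already projected to a shared width.
        queries = self.queries.unsqueeze(0).expand(modality_tokens.size(0), -1, -1)
        fused = self.encoder(torch.cat([queries, modality_tokens], dim=1))
        return fused[:, : self.queries.size(0)]  # keep only the N summary tokens

ego_branch, exo_branch = PerspectiveBranch(), PerspectiveBranch()
ego_tokens = torch.randn(2, 64, 1024)   # e.g. first-person RGB + gaze + hand-pose tokens
exo_tokens = torch.randn(2, 128, 1024)  # e.g. RGB-D + ultrasound + audio + point-cloud tokens
llm_prefix = torch.cat([ego_branch(ego_tokens), exo_branch(exo_tokens)], dim=1)
print(llm_prefix.shape)  # torch.Size([2, 64, 1024])
```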
|
## 📊 Benchmark Results
|
This model outperforms prior single-stream baselines like [ORacle](https://arxiv.org/pdf/2404.07031) and [MM2SG](https://arxiv.org/pdf/2503.02579)
by effectively leveraging perspective-specific signals.
|
| Model | UI F1 | MISS F1 | Overall F1 |
|--------------------|----------|----------|------------|
| ORacle (Baseline)  | 0.70     | 0.71     | 0.69       |
| MM2SG (Baseline)   | 0.77     | 0.68     | 0.72       |
| **EgoExOR (Ours)** | **0.86** | **0.70** | **0.79**   |

*UI: Ultrasound-Guided Needle Insertion; MISS: Minimally Invasive Spine Surgery.*
|
Overall, as the table above shows, the dual-branch EgoExOR model achieves
the highest macro F1. Several predicates in EgoExOR depend on transient tool-hand
trajectories and fine-grained action cues that are difficult to recover from any single viewpoint. This emphasizes the importance of explicitly modeling
multiple viewpoints and leveraging all available modalities to improve OR scene understanding.
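For readers reproducing a metric of this kind, the sketch below computes a macro-averaged triplet F1; the exact label set, matching rules, and averaging used for the reported numbers are defined in the EgoExOR GitHub code, and the example predicates are made up, so treat this as illustrative only.

```python
# Illustrative macro F1 over scene graph triplets (not the official benchmark script).
# Per frame, predictions and ground truth are sets of (subject, predicate, object)
# triplets; F1 is computed per predicate and then macro-averaged.
from collections import defaultdict

def macro_triplet_f1(pred_frames, gt_frames):
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, gt in zip(pred_frames, gt_frames):
        pred, gt = set(pred), set(gt)
        for _, p, _ in pred & gt:
            tp[p] += 1
        for _, p, _ in pred - gt:
            fp[p] += 1
        for _, p, _ in gt - pred:
            fn[p] += 1
    f1s = []
    for p in set(tp) | set(fp) | set(fn):
        prec = tp[p] / (tp[p] + fp[p]) if tp[p] + fp[p] else 0.0
        rec = tp[p] / (tp[p] + fn[p]) if tp[p] + fn[p] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

# Toy example: one correct triplet, one spurious prediction, one missed annotation.
pred = [{("surgeon", "holding", "needle"), ("nurse", "touching", "patient")}]
gt   = [{("surgeon", "holding", "needle"), ("surgeon", "scanning", "patient")}]
print(round(macro_triplet_f1(pred, gt), 3))  # 0.333
```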
|
## 🗃️ Dataset
|
EgoExOR provides:
- 84,553 frames (94 minutes at 15 FPS)
- 2 simulated surgical procedures: Ultrasound-Guided Needle Insertion and Minimally Invasive Spine Surgery (MISS)
- 36 entities and 22 predicates
- Over 573,000 scene graph triplets
- Multimodal signals: RGB, depth, gaze, audio, ultrasound, point cloud, hand tracking
|
You can find the dataset processing tools in the [EgoExOR GitHub repo](https://github.com/ardamamur/EgoExOR).
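If you only need the raw files locally, a minimal sketch using `huggingface_hub` follows; the repo id comes from this card, while the on-disk layout and parsing utilities are documented in the GitHub repo.

```python
# Minimal sketch: fetch the EgoExOR dataset files from the Hugging Face Hub.
# This only downloads the raw files; see the EgoExOR GitHub repo for the tools
# that parse and process them.
from huggingface_hub import snapshot_download

data_dir = snapshot_download(repo_id="ardamamur/EgoExOR", repo_type="dataset")
print(data_dir)  # local path containing the downloaded dataset files
```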
|
## 🔗 Links
|
- 🖥️ Code: [EgoExOR GitHub](https://github.com/ardamamur/EgoExOR)
- 🤗 Dataset: [EgoExOR Hugging Face Dataset](https://huggingface.co/datasets/ardamamur/EgoExOR)
- 🤗 Model Card & Weights: [EgoExOR Hugging Face Model](https://huggingface.co/ardamamur/EgoExOR)
|
|