---
license: cc-by-4.0
datasets:
- UWMadAbility/DISCOVR
language:
- en
base_model:
- Ultralytics/YOLOv8
pipeline_tag: object-detection
tags:
- yolo
- yolov8
- object-detection
- accessibility
- vr
- virtual-reality
- social-vr
- screen-reader
library_name: ultralytics
---

# VRSight Object Detection Model

Fine-tuned YOLOv8n model for detecting UI elements and interactive objects in virtual reality environments. This model powers the [VRSight system](https://github.com/MadisonAbilityLab/VRSight), a post hoc 3D screen reader for blind and low vision VR users.

**Model Weights:** `best.pt` (available in the Files tab)

**Full System:** [github.com/MadisonAbilityLab/VRSight](https://github.com/MadisonAbilityLab/VRSight)

**Paper:** [VRSight (UIST 2025)](https://dl.acm.org/doi/full/10.1145/3746059.3747641)

**Training Dataset:** [UWMadAbility/DISCOVR](https://huggingface.co/datasets/UWMadAbility/DISCOVR)

**Developed by:** Daniel Killough, Justin Feng, Zheng Xue Ching, Daniel Wang, Rithvik Dyava, Yapeng Tian*, Yuhang Zhao

**Affiliations:** University of Wisconsin-Madison, *University of Texas at Dallas

## Quick Start

### Installation & Download

```bash
pip install ultralytics

# Download model weights
wget -O best.pt https://huggingface.co/UWMadAbility/VRSight/resolve/main/best.pt
```

### Basic Usage

```python
from ultralytics import YOLO

# Load model
model = YOLO('best.pt')

# Run inference on a VR screenshot
results = model('vr_screenshot.jpg')

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()
        print(f"Class: {model.names[class_id]}")
        print(f"Confidence: {confidence:.2f}")
        print(f"BBox: {bbox}")
```

### Batch Processing

```python
results = model.predict(
    source='vr_screenshots/',
    save=True,
    conf=0.25,
    device='0'  # GPU 0, or 'cpu'
)
```

## Model Details

### Architecture

- **Base:** YOLOv8n (Nano variant, optimized for real-time performance)
- **Input:** 640×640 pixels
- **Output:** Bounding boxes with class predictions and confidence scores
- **Classes:** 30 VR object types across 6 categories

### Performance

| Metric | Test Set |
|--------|----------|
| **mAP@50** | **67.3%** |
| **mAP@75** | 49.5% |
| **mAP** | 46.3% |
| **Inference Speed** | ~20-30+ FPS |

**Key Finding:** Base YOLOv8n trained on COCO rarely detected VR objects, demonstrating the necessity of VR-specific training data. See Table 1 in the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for per-class metrics.

### Object Classes (30 Total)

The model detects 30 object classes grouped into 6 categories:

- **Avatars:** avatar, avatar-nonhuman, chat-bubble, chat-box
- **Informational:** sign-text, ui-text, sign-graphic, menu, ui-graphic, progress-bar, hud, indicator-mute
- **Interactables:** interactable, button, target, portal, writing-utensil, watch, writing-surface, spawner
- **Safety:** guardian, out-of-bounds
- **Seating:** seat-single, table, seat-multiple, campfire
- **VR System:** hand, controller, dashboard, locomotion-target

See the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) (Table 1) for detailed descriptions and per-class performance.
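The snippet below is a minimal sketch of how detections could be grouped into these six categories for downstream handling (for example, surfacing Safety and Interactable objects before decorative ones). The `CATEGORY_MAP` dictionary and the image path are illustrative assumptions rather than part of the released model; the class names are copied from the list above and should be checked against `model.names` in your copy of `best.pt`.

```python
from collections import defaultdict
from ultralytics import YOLO

# Illustrative mapping from class name to category, mirroring the list above.
# Verify the exact spellings against model.names before relying on it.
CATEGORY_MAP = {
    'avatar': 'Avatars', 'avatar-nonhuman': 'Avatars',
    'chat-bubble': 'Avatars', 'chat-box': 'Avatars',
    'sign-text': 'Informational', 'ui-text': 'Informational',
    'sign-graphic': 'Informational', 'menu': 'Informational',
    'ui-graphic': 'Informational', 'progress-bar': 'Informational',
    'hud': 'Informational', 'indicator-mute': 'Informational',
    'interactable': 'Interactables', 'button': 'Interactables',
    'target': 'Interactables', 'portal': 'Interactables',
    'writing-utensil': 'Interactables', 'watch': 'Interactables',
    'writing-surface': 'Interactables', 'spawner': 'Interactables',
    'guardian': 'Safety', 'out-of-bounds': 'Safety',
    'seat-single': 'Seating', 'table': 'Seating',
    'seat-multiple': 'Seating', 'campfire': 'Seating',
    'hand': 'VR System', 'controller': 'VR System',
    'dashboard': 'VR System', 'locomotion-target': 'VR System',
}

model = YOLO('best.pt')
result = model('vr_screenshot.jpg')[0]  # single image -> single Results object

# Group detections by category so, e.g., Safety objects can be announced first.
by_category = defaultdict(list)
for box in result.boxes:
    name = model.names[int(box.cls[0])]
    category = CATEGORY_MAP.get(name, 'Other')
    by_category[category].append((name, float(box.conf[0])))

for category, detections in by_category.items():
    print(f"{category}: {detections}")
```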
## Training Details

### Dataset

- **DISCOVR:** 17,691 labeled images from 17 social VR apps
- **Train:** 15,207 images | **Val:** 1,645 images | **Test:** 839 images
- **Augmentation:** Horizontal/vertical flips, rotation, scaling, shearing, HSV jittering

### Training Configuration

- **GPU:** NVIDIA A100
- **Epochs:** 250
- **Image Size:** 640×640
- **Method:** Fine-tuned from YOLOv8n pretrained weights (see the fine-tuning sketch at the end of this card)

## VRSight System Integration

This model is one component of the complete VRSight system, which combines:

- **This object detection model** (detects VR objects)
- Depth estimation (DepthAnythingV2)
- GPT-4o (scene atmosphere and detailed descriptions)
- OCR (text reading)
- Spatial audio (TTS -> WebVR app, e.g., PlayCanvas)

**To use the full VRSight system**, see the [GitHub repository](https://github.com/MadisonAbilityLab/VRSight).

## Limitations

- **VR-specific:** Trained on social VR apps; performance varies on other VR types
- **Lighting:** Reduced accuracy in dark environments
- **Coverage:** 30 classes cover common social VR objects but not all possible VR elements
- **Application types:** Best performance in social VR; may struggle with faster-paced games

See Section 7.2 of the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for detailed discussion.

## Citation

Please cite the VRSight paper when using this model or the DISCOVR dataset:

```bibtex
@inproceedings{killough2025vrsight,
  title={VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People},
  author={Killough, Daniel and Feng, Justin and Ching, Zheng Xue and Wang, Daniel and Dyava, Rithvik and Tian, Yapeng and Zhao, Yuhang},
  booktitle={Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology},
  pages={1--17},
  year={2025},
  publisher={ACM},
  address={Busan, Republic of Korea},
  doi={10.1145/3746059.3747641}
}
```

## License

CC BY 4.0 - Free to use with attribution.

## Contact

- **GitHub Issues:** [github.com/MadisonAbilityLab/VRSight/issues](https://github.com/MadisonAbilityLab/VRSight/issues)
- **Paper:** [dl.acm.org/doi/full/10.1145/3746059.3747641](https://dl.acm.org/doi/full/10.1145/3746059.3747641)
- **Lead Author:** Daniel Killough (UW-Madison MadAbility Lab)

## Related Resources

- **[VRSight GitHub](https://github.com/MadisonAbilityLab/VRSight)** - Complete system implementation
- **[DISCOVR Dataset](https://huggingface.co/datasets/UWMadAbility/DISCOVR)** - Training data
- **[UIST 2025 Paper](https://dl.acm.org/doi/full/10.1145/3746059.3747641)** - Research paper
- **[Video Demo](https://x.com/i/status/1969153746337665262)** - System in action
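## Fine-Tuning Sketch

For reference, below is a minimal sketch of how a comparable fine-tune could be launched with Ultralytics, following the Training Configuration above (YOLOv8n pretrained weights, 250 epochs, 640×640 input, single A100 GPU). The `discovr.yaml` data file, the batch size left at default, and the specific augmentation values are illustrative assumptions, not published settings from the paper.

```python
from ultralytics import YOLO

# Start from COCO-pretrained YOLOv8n weights, as described above.
model = YOLO('yolov8n.pt')

# Assumed data config: a YOLO-format export of DISCOVR with the 30 classes.
model.train(
    data='discovr.yaml',   # hypothetical dataset config, not shipped with this repo
    epochs=250,            # from the Training Configuration section
    imgsz=640,             # 640x640 input, as above
    device=0,              # single GPU (the paper used an NVIDIA A100)
    fliplr=0.5,            # horizontal flips (illustrative value)
    flipud=0.5,            # vertical flips (illustrative value)
    degrees=10.0,          # rotation (illustrative value)
    scale=0.5,             # scaling (illustrative value)
    shear=2.0,             # shearing (illustrative value)
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV jittering (Ultralytics defaults)
)
```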