---
license: cc-by-4.0
datasets:
- UWMadAbility/DISCOVR
language:
- en
base_model:
- Ultralytics/YOLOv8
pipeline_tag: object-detection
tags:
- yolo
- yolov8
- object-detection
- accessibility
- vr
- virtual-reality
- social-vr
- screen-reader
library_name: ultralytics
---

# VRSight Object Detection Model

Fine-tuned YOLOv8n model for detecting UI elements and interactive objects in virtual reality environments. This model powers the [VRSight system](https://github.com/MadisonAbilityLab/VRSight), a post hoc 3D screen reader for blind and low vision VR users.

**Model Weights:** `best.pt` (available in the Files tab)

**Full System:** [github.com/MadisonAbilityLab/VRSight](https://github.com/MadisonAbilityLab/VRSight)

**Paper:** [VRSight (UIST 2025)](https://dl.acm.org/doi/full/10.1145/3746059.3747641)

**Training Dataset:** [UWMadAbility/DISCOVR](https://huggingface.co/datasets/UWMadAbility/DISCOVR)

**Developed by:** Daniel Killough, Justin Feng, Zheng Xue Ching, Daniel Wang, Rithvik Dyava, Yapeng Tian*, Yuhang Zhao

**Affiliations:** University of Wisconsin-Madison, *University of Texas at Dallas

## Quick Start

### Installation & Download

```bash
pip install ultralytics

# Download model weights
wget -O best.pt https://huggingface.co/UWMadAbility/VRSight/resolve/main/best.pt
```

### Basic Usage

```python
from ultralytics import YOLO

# Load model
model = YOLO('best.pt')

# Run inference on a VR screenshot
results = model('vr_screenshot.jpg')

# Process results
for result in results:
    boxes = result.boxes
    for box in boxes:
        class_id = int(box.cls[0])
        confidence = float(box.conf[0])
        bbox = box.xyxy[0].tolist()
        print(f"Class: {model.names[class_id]}")
        print(f"Confidence: {confidence:.2f}")
        print(f"BBox: {bbox}")
```

### Batch Processing

```python
results = model.predict(
    source='vr_screenshots/',
    save=True,
    conf=0.25,
    device='0'  # GPU 0, or 'cpu'
)
```

## Model Details

### Architecture

- **Base:** YOLOv8n (Nano variant, optimized for real-time performance)
- **Input:** 640×640 pixels
- **Output:** Bounding boxes with class predictions and confidence scores
- **Classes:** 30 VR object types across 6 categories

### Performance

| Metric | Test Set |
|--------|----------|
| **mAP@50** | **67.3%** |
| **mAP@75** | 49.5% |
| **mAP** | 46.3% |
| **Inference Speed** | ~20-30+ FPS |

**Key Finding:** Base YOLOv8n trained on COCO rarely detected VR objects, demonstrating the necessity of VR-specific training data. See Table 1 in the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for per-class metrics.

### Object Classes (30 Total)

The model detects 30 object classes grouped into 6 categories:

- **Avatars:** avatar, avatar-nonhuman, chat-bubble, chat-box
- **Informational:** sign-text, ui-text, sign-graphic, menu, ui-graphic, progress-bar, hud, indicator-mute
- **Interactables:** interactable, button, target, portal, writing-utensil, watch, writing-surface, spawner
- **Safety:** guardian, out-of-bounds
- **Seating:** seat-single, table, seat-multiple, campfire
- **VR System:** hand, controller, dashboard, locomotion-target

See the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) (Table 1) for detailed descriptions and per-class performance.
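The snippet below is a minimal sketch of how detections could be grouped into these six categories for downstream handling (for example, surfacing Safety and Interactable objects before decorative ones). The `CATEGORY_MAP` dictionary and the image path are illustrative assumptions rather than part of the released model; the class names are copied from the list above and should be checked against `model.names` in your copy of `best.pt`.

```python
from collections import defaultdict
from ultralytics import YOLO

# Illustrative mapping from class name to category, mirroring the list above.
# Verify the exact spellings against model.names before relying on it.
CATEGORY_MAP = {
    'avatar': 'Avatars', 'avatar-nonhuman': 'Avatars',
    'chat-bubble': 'Avatars', 'chat-box': 'Avatars',
    'sign-text': 'Informational', 'ui-text': 'Informational',
    'sign-graphic': 'Informational', 'menu': 'Informational',
    'ui-graphic': 'Informational', 'progress-bar': 'Informational',
    'hud': 'Informational', 'indicator-mute': 'Informational',
    'interactable': 'Interactables', 'button': 'Interactables',
    'target': 'Interactables', 'portal': 'Interactables',
    'writing-utensil': 'Interactables', 'watch': 'Interactables',
    'writing-surface': 'Interactables', 'spawner': 'Interactables',
    'guardian': 'Safety', 'out-of-bounds': 'Safety',
    'seat-single': 'Seating', 'table': 'Seating',
    'seat-multiple': 'Seating', 'campfire': 'Seating',
    'hand': 'VR System', 'controller': 'VR System',
    'dashboard': 'VR System', 'locomotion-target': 'VR System',
}

model = YOLO('best.pt')
result = model('vr_screenshot.jpg')[0]  # single image -> single Results object

# Group detections by category so, e.g., Safety objects can be announced first.
by_category = defaultdict(list)
for box in result.boxes:
    name = model.names[int(box.cls[0])]
    category = CATEGORY_MAP.get(name, 'Other')
    by_category[category].append((name, float(box.conf[0])))

for category, detections in by_category.items():
    print(f"{category}: {detections}")
```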
## Training Details

### Dataset

- **DISCOVR:** 17,691 labeled images from 17 social VR apps
- **Train:** 15,207 images | **Val:** 1,645 images | **Test:** 839 images
- **Augmentation:** Horizontal/vertical flips, rotation, scaling, shearing, HSV jittering

### Training Configuration

- **GPU:** NVIDIA A100
- **Epochs:** 250
- **Image Size:** 640×640
- **Method:** Fine-tuned from YOLOv8n pretrained weights (see the fine-tuning sketch at the end of this card)

## VRSight System Integration

This model is one component of the complete VRSight system, which combines:

- **This object detection model** (detects VR objects)
- Depth estimation (DepthAnythingV2)
- GPT-4o (scene atmosphere and detailed descriptions)
- OCR (text reading)
- Spatial audio (TTS -> WebVR app, e.g., PlayCanvas)

**To use the full VRSight system**, see the [GitHub repository](https://github.com/MadisonAbilityLab/VRSight).

## Limitations

- **VR-specific:** Trained on social VR apps; performance varies on other VR types
- **Lighting:** Reduced accuracy in dark environments
- **Coverage:** 30 classes cover common social VR objects but not all possible VR elements
- **Application types:** Best performance in social VR; may struggle with faster-paced games

See Section 7.2 of the [paper](https://dl.acm.org/doi/pdf/10.1145/3746059.3747641) for detailed discussion.

## Citation

Please cite the VRSight paper when using this model or the DISCOVR dataset:

```bibtex
@inproceedings{killough2025vrsight,
  title={VRSight: An AI-Driven Scene Description System to Improve Virtual Reality Accessibility for Blind People},
  author={Killough, Daniel and Feng, Justin and Ching, Zheng Xue and Wang, Daniel and Dyava, Rithvik and Tian, Yapeng and Zhao, Yuhang},
  booktitle={Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology},
  pages={1--17},
  year={2025},
  publisher={ACM},
  address={Busan, Republic of Korea},
  doi={10.1145/3746059.3747641}
}
```

## License

CC BY 4.0 - Free to use with attribution.

## Contact

- **GitHub Issues:** [github.com/MadisonAbilityLab/VRSight/issues](https://github.com/MadisonAbilityLab/VRSight/issues)
- **Paper:** [dl.acm.org/doi/full/10.1145/3746059.3747641](https://dl.acm.org/doi/full/10.1145/3746059.3747641)
- **Lead Author:** Daniel Killough (UW-Madison MadAbility Lab)

## Related Resources

- **[VRSight GitHub](https://github.com/MadisonAbilityLab/VRSight)** - Complete system implementation
- **[DISCOVR Dataset](https://huggingface.co/datasets/UWMadAbility/DISCOVR)** - Training data
- **[UIST 2025 Paper](https://dl.acm.org/doi/full/10.1145/3746059.3747641)** - Research paper
- **[Video Demo](https://x.com/i/status/1969153746337665262)** - System in action
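## Fine-Tuning Sketch

For reference, below is a minimal sketch of how a comparable fine-tune could be launched with Ultralytics, following the Training Configuration above (YOLOv8n pretrained weights, 250 epochs, 640×640 input, single A100 GPU). The `discovr.yaml` data file, the batch size left at default, and the specific augmentation values are illustrative assumptions, not published settings from the paper.

```python
from ultralytics import YOLO

# Start from COCO-pretrained YOLOv8n weights, as described above.
model = YOLO('yolov8n.pt')

# Assumed data config: a YOLO-format export of DISCOVR with the 30 classes.
model.train(
    data='discovr.yaml',   # hypothetical dataset config, not shipped with this repo
    epochs=250,            # from the Training Configuration section
    imgsz=640,             # 640x640 input, as above
    device=0,              # single GPU (the paper used an NVIDIA A100)
    fliplr=0.5,            # horizontal flips (illustrative value)
    flipud=0.5,            # vertical flips (illustrative value)
    degrees=10.0,          # rotation (illustrative value)
    scale=0.5,             # scaling (illustrative value)
    shear=2.0,             # shearing (illustrative value)
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV jittering (Ultralytics defaults)
)
```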