---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---

# HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

[![arXiv](https://img.shields.io/badge/arXiv-2510.05609-b31b1b.svg)](https://arxiv.org/abs/2510.05609)

HOI-R1 is inspired by recent advances in reinforcement learning for large language models and investigates how vision-language models can reason about and detect human-object interactions more effectively.

---

## 🔍 Overview

- **Task**: Human-Object Interaction Detection (HOID)
- **Motivation**: Leverage the reasoning capability of multimodal LLMs and reinforcement learning–style optimization to improve HOI detection performance.

---

![hoi-r1-arch](https://cdn-uploads.huggingface.co/production/uploads/63119ce2fb65b9a3e2f75e3c/tHYWwrnqBAHsoo8lIOtnM.jpeg)

## 📌 Citation

If you find this work useful, please consider citing:

```bibtex
@article{chen2025hoi,
  title={HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection},
  author={Chen, Junwen and Xiong, Peilin and Yanai, Keiji},
  journal={arXiv preprint arXiv:2510.05609},
  year={2025}
}
```
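
---

## 🚀 Inference Example

Since HOI-R1 is built on Qwen2.5-VL-3B-Instruct, a checkpoint should load with the standard Qwen2.5-VL pipeline in Hugging Face Transformers. The sketch below shows that generic pipeline; the model id, image path, and HOI query are placeholders (the exact prompt format used by HOI-R1 is described in the paper, not here), so treat this as a starting point rather than the authors' exact inference code.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder: swap in the HOI-R1 checkpoint id from this repository.
MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"

# Load the Qwen2.5-VL-style model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Build a chat-style request with one image and an HOI query
# (illustrative prompt, not the exact one used in training).
image = Image.open("example.jpg")  # any local image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "List the human-object interactions in this image."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```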