---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---

# HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

[![arXiv](https://img.shields.io/badge/arXiv-2510.05609-b31b1b.svg)](https://arxiv.org/abs/2510.05609)

HOI-R1 is inspired by recent advances in reinforcement learning for large language models and investigates how vision-language models can reason about and detect human-object interactions more effectively.

---

## 🔍 Overview

- **Task**: Human-Object Interaction Detection (HOID)
- **Motivation**: Leverage the reasoning capability of multimodal LLMs and reinforcement learning–style optimization to improve HOI detection performance.

---

![hoi-r1-arch](https://cdn-uploads.huggingface.co/production/uploads/63119ce2fb65b9a3e2f75e3c/tHYWwrnqBAHsoo8lIOtnM.jpeg)

## 📌 Citation

If you find this work useful, please consider citing:

```bibtex
@article{chen2025hoi,
  title={HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection},
  author={Chen, Junwen and Xiong, Peilin and Yanai, Keiji},
  journal={arXiv preprint arXiv:2510.05609},
  year={2025}
}
```
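
---

## 🚀 Inference Example

Since HOI-R1 is built on Qwen2.5-VL-3B-Instruct, a checkpoint should load with the standard Qwen2.5-VL pipeline in Hugging Face Transformers. The sketch below shows that generic pipeline; the model id, image path, and HOI query are placeholders (the exact prompt format used by HOI-R1 is described in the paper, not here), so treat this as a starting point rather than the authors' exact inference code.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder: swap in the HOI-R1 checkpoint id from this repository.
MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"

# Load the Qwen2.5-VL-style model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Build a chat-style request with one image and an HOI query
# (illustrative prompt, not the exact one used in training).
image = Image.open("example.jpg")  # any local image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "List the human-object interactions in this image."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```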