---
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: image-text-to-text
---

# HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

[arXiv:2510.05609](https://arxiv.org/abs/2510.05609)

HOI-R1 is inspired by recent advances in reinforcement learning for large language models and investigates how vision-language models can reason about and detect human-object interactions more effectively.

---

## 🔍 Overview

- **Task**: Human-Object Interaction Detection (HOID)
- **Our Motivation**: Leverage the reasoning capability of Multimodal LLMs and reinforcement learning–style optimization to explore HOI detection performance.

---

## 📌 Citation

If you find this work useful, please consider citing:

```bibtex
@article{chen2025hoi,
  title={HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection},
  author={Chen, Junwen and Xiong, Peilin and Yanai, Keiji},
  journal={arXiv preprint arXiv:2510.05609},
  year={2025}
}
```