---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---

# HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

[arXiv:2510.05609](https://arxiv.org/abs/2510.05609)

HOI-R1 is inspired by recent advances in reinforcement learning for large language models and investigates how vision-language models can reason about and detect human-object interactions more effectively.

---

## 🔍 Overview

- **Task**: Human-Object Interaction Detection (HOID)
- **Motivation**: Leverage the reasoning capability of multimodal LLMs and reinforcement learning–style optimization to improve HOI detection performance (see the quick-start sketch below).

---

![HOI-R1 teaser](img/teaser.png)

---

## 📌 Citation

If you find this work useful, please consider citing:

```bibtex
@article{chen2025hoi,
  title={HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection},
  author={Chen, Junwen and Xiong, Peilin and Yanai, Keiji},
  journal={arXiv preprint arXiv:2510.05609},
  year={2025}
}
```