HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

Paper: arXiv:2510.05609

HOI-R1 is inspired by recent advances in reinforcement learning for large language models and investigates how vision-language models can reason about and detect human-object interactions more effectively.


πŸ” Overview

  • Task: Human-Object Interaction Detection (HOID)
  • Motivation:
    Leverage the reasoning capability of multimodal LLMs and reinforcement learning-style optimization to improve HOI detection performance.

Figure: HOI-R1 architecture overview.
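πŸš€ Usage

Below is a minimal, hedged sketch of how this checkpoint might be queried, assuming it keeps the standard Qwen2.5-VL chat interface from Hugging Face `transformers` (the base model is Qwen2.5-VL-3B-Instruct, per the model tree below). The prompt wording and the `<human, action, object>` output format are illustrative assumptions, not the authors' documented interface.

```python
def build_hoi_messages(image_path: str) -> list:
    """Build a Qwen2.5-VL style chat message asking for HOI triplets.

    The instruction text is an assumed prompt, not the authors' official one.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text",
             "text": ("Detect all human-object interactions in this image. "
                      "List each as a <human, action, object> triplet.")},
        ],
    }]


def main():
    # Heavy part: requires a GPU and downloading the ~4B checkpoint.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info  # Qwen's recommended helper

    model_id = "thxplz/HOI-R1_Qwen2.5-VL-3B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_hoi_messages("example.jpg")
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens before decoding the model's answer.
    answer = processor.batch_decode(
        generated_ids[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=True)[0]
    print(answer)


if __name__ == "__main__":
    main()
```

The `build_hoi_messages` helper is a hypothetical name introduced here for clarity; only the `transformers`/`qwen_vl_utils` calls follow the standard Qwen2.5-VL usage pattern.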

πŸ“Œ Citation

If you find this work useful, please consider citing:

@article{chen2025hoi,
  title={HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection},
  author={Chen, Junwen and Xiong, Peilin and Yanai, Keiji},
  journal={arXiv preprint arXiv:2510.05609},
  year={2025}
}
Model details

  • Format: Safetensors
  • Model size: 4B params
  • Tensor type: BF16

Model tree for thxplz/HOI-R1_Qwen2.5-VL-3B-Instruct

  • Fine-tuned from: Qwen2.5-VL-3B-Instruct