---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
---

# HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

[arXiv:2510.05609](https://arxiv.org/abs/2510.05609)

HOI-R1 is inspired by recent advances in reinforcement learning for large language models and investigates how vision-language models can reason about and detect human-object interactions more effectively.

---

## 🔍 Overview

- **Task**: Human-Object Interaction Detection (HOID)
- **Motivation**: Leverage the reasoning capability of multimodal LLMs and reinforcement learning–style optimization to improve HOI detection performance (see the quick-start sketch below).

---

![HOI-R1 teaser](img/teaser.png)

---

## 📌 Citation

If you find this work useful, please consider citing:

```bibtex
@article{chen2025hoi,
  title={HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection},
  author={Chen, Junwen and Xiong, Peilin and Yanai, Keiji},
  journal={arXiv preprint arXiv:2510.05609},
  year={2025}
}
```