---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---
# MomaGraph-R1
This repository contains **MomaGraph-R1**, a 7B vision-language model presented in the paper [MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning](https://huggingface.co/papers/2512.16909).
MomaGraph-R1 introduces a unified scene representation for embodied agents, integrating spatial-functional relationships and part-level interactive elements to address the needs of mobile manipulators in household environments. Trained with reinforcement learning on the MomaGraph-Scenes dataset, MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. It achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on MomaGraph-Bench (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
* **Project Page:** https://hybridrobotics.github.io/MomaGraph/
* **Code:** https://github.com/HybridRobotics/MomaGraph
## Usage

This model is compatible with the Hugging Face `transformers` library. For detailed usage instructions and code examples, please refer to the official [GitHub repository](https://github.com/HybridRobotics/MomaGraph).
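As a quick start, the snippet below sketches how a checkpoint with the `image-text-to-text` pipeline tag can typically be loaded via the generic `transformers` pipeline. The repo id, example image URL, and prompt wording here are illustrative assumptions, not the official interface; consult the GitHub repository for the exact scene-graph prompt format used by MomaGraph-R1.

```python
from transformers import pipeline

# Illustrative repo id -- replace with the actual Hub id of this checkpoint.
pipe = pipeline("image-text-to-text", model="HybridRobotics/MomaGraph-R1")

# Hypothetical prompt: the real scene-graph prompt schema is documented in the GitHub repo.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/kitchen.jpg"},
            {"type": "text", "text": "Predict a task-oriented scene graph for the task: put the mug into the sink."},
        ],
    }
]

# Generate the scene-graph prediction from the image + instruction.
outputs = pipe(text=messages, max_new_tokens=512)
print(outputs[0]["generated_text"])
```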
## Citation

If you find our work helpful or inspiring, please consider citing the paper:
```bibtex
@misc{ju2025momagraph,
  title={MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning},
  author={Yuanchen Ju and Yongyuan Liang and Yen-Jen Wang and Nandiraju Gireesh and Yuanliang Ju and Seungjae Lee and Qiao Gu and Elvis Hsieh and Furong Huang and Koushil Sreenath},
  year={2025},
  eprint={2512.16909},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2512.16909},
}
```