--- license: mit pipeline_tag: image-text-to-text library_name: transformers --- # MomaGraph-R1 This repository contains **MomaGraph-R1**, a 7B vision-language model presented in the paper [MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning](https://huggingface.co/papers/2512.16909). MomaGraph-R1 introduces a unified scene representation for embodied agents, integrating spatial-functional relationships and part-level interactive elements to address the needs of mobile manipulators in household environments. Trained with reinforcement learning on the MomaGraph-Scenes dataset, MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. It achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the MomaGraph-Bench benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments. * **Project Page:** https://hybridrobotics.github.io/MomaGraph/ * **Code:** https://github.com/HybridRobotics/MomaGraph ## Usage This model is compatible with the Hugging Face `transformers` library. For detailed usage instructions and code examples, please refer to the official [GitHub repository](https://github.com/HybridRobotics/MomaGraph). ## Citation If you find our work helpful or inspiring, please consider citing the paper: ```bibtex @misc{ju2025momagraph, title={MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning}, author={Yuanchen Ju and Yongyuan Liang and Yen-Jen Wang and Nandiraju Gireesh and Yuanliang Ju and Seungjae Lee and Qiao Gu and Elvis Hsieh and Furong Huang and Koushil Sreenath}, year={2025}, eprint={2512.16909}, archivePrefix={arXiv}, primaryClass={cs.RO}, url={https://arxiv.org/abs/2512.16909}, } ```