---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---
# MomaGraph-R1
This repository contains MomaGraph-R1, a 7B vision-language model presented in the paper [MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning](https://arxiv.org/abs/2512.16909).
MomaGraph-R1 introduces a unified scene representation for embodied agents, integrating spatial-functional relationships and part-level interactive elements to address the needs of mobile manipulators in household environments. Trained with reinforcement learning on the MomaGraph-Scenes dataset, MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework.
It achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on MomaGraph-Bench (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
- Project Page: https://hybridrobotics.github.io/MomaGraph/
- Code: https://github.com/HybridRobotics/MomaGraph
## Usage
This model is compatible with the Hugging Face `transformers` library; a minimal loading sketch is shown below. For detailed usage instructions and full code examples, please refer to the official GitHub repository.
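The following is a minimal inference sketch using the standard `transformers` image-text-to-text interface. The repository id, example image path, and prompt are illustrative assumptions (the exact Hub id, chat template, and scene-graph prompt format are specified in the GitHub repository), so treat this as a starting point rather than the official usage.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed Hub repository id; replace with the actual id for this model.
model_id = "HybridRobotics/MomaGraph-R1"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# An RGB image of the household scene (placeholder path).
image = Image.open("scene.jpg")

# Illustrative prompt; the paper's Graph-then-Plan setup likely uses a
# specific instruction format for task-oriented scene-graph prediction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Generate a task-oriented scene graph for: put the mug in the sink."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```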
## Citation
If you find our work helpful or inspiring, please consider citing the paper:
```bibtex
@misc{ju2025momagraph,
      title={MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning},
      author={Yuanchen Ju and Yongyuan Liang and Yen-Jen Wang and Nandiraju Gireesh and Yuanliang Ju and Seungjae Lee and Qiao Gu and Elvis Hsieh and Furong Huang and Koushil Sreenath},
      year={2025},
      eprint={2512.16909},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.16909}
}
```