---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---

# MomaGraph-R1

This repository contains **MomaGraph-R1**, a 7B vision-language model presented in the paper [MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning](https://huggingface.co/papers/2512.16909).

MomaGraph-R1 introduces a unified scene representation for embodied agents, integrating spatial-functional relationships and part-level interactive elements to address the needs of mobile manipulators in household environments. Trained with reinforcement learning on the MomaGraph-Scenes dataset, MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework.

It achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the MomaGraph-Bench benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.

*   **Project Page:** https://hybridrobotics.github.io/MomaGraph/
*   **Code:** https://github.com/HybridRobotics/MomaGraph

## Usage

This model is compatible with the Hugging Face `transformers` library. For detailed usage instructions and code examples, please refer to the official [GitHub repository](https://github.com/HybridRobotics/MomaGraph).

## Citation

If you find our work helpful or inspiring, please consider citing the paper:

```bibtex
@misc{ju2025momagraph,
      title={MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning},
      author={Yuanchen Ju and Yongyuan Liang and Yen-Jen Wang and Nandiraju Gireesh and Yuanliang Ju and Seungjae Lee and Qiao Gu and Elvis Hsieh and Furong Huang and Koushil Sreenath},
      year={2025},
      eprint={2512.16909},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.16909},
}
```