---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
---
# MomaGraph-R1
This repository contains MomaGraph-R1, a 7B vision-language model presented in the paper [MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning](https://arxiv.org/abs/2512.16909).
MomaGraph-R1 introduces a unified scene representation for embodied agents, integrating spatial-functional relationships and part-level interactive elements to address the needs of mobile manipulators in household environments. Trained with reinforcement learning on the MomaGraph-Scenes dataset, MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework.
It achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on MomaGraph-Bench (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
- Project Page: https://hybridrobotics.github.io/MomaGraph/
- Code: https://github.com/HybridRobotics/MomaGraph
## Usage
This model is compatible with the Hugging Face `transformers` library; a minimal loading sketch is shown below. For detailed usage instructions and full code examples, please refer to the official GitHub repository.
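The following is a minimal inference sketch using the standard `transformers` image-text-to-text interface. The repository id, example image path, and prompt are illustrative assumptions (the exact Hub id, chat template, and scene-graph prompt format are specified in the GitHub repository), so treat this as a starting point rather than the official usage.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed Hub repository id; replace with the actual id for this model.
model_id = "HybridRobotics/MomaGraph-R1"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# An RGB image of the household scene (placeholder path).
image = Image.open("scene.jpg")

# Illustrative prompt; the paper's Graph-then-Plan setup likely uses a
# specific instruction format for task-oriented scene-graph prediction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Generate a task-oriented scene graph for: put the mug in the sink."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```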
## Citation
If you find our work helpful or inspiring, please consider citing the paper:
```bibtex
@misc{ju2025momagraph,
      title={MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning},
      author={Yuanchen Ju and Yongyuan Liang and Yen-Jen Wang and Nandiraju Gireesh and Yuanliang Ju and Seungjae Lee and Qiao Gu and Elvis Hsieh and Furong Huang and Koushil Sreenath},
      year={2025},
      eprint={2512.16909},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2512.16909}
}
```