---
license: mit
datasets:
- chrisyrniu/human2locoman
pipeline_tag: robotics
tags:
- Human
- Pretraining
- Manipulation
library_name: transformers
---
<h1 align="center">
Modularized Cross-Embodiment Transformer (MXT) – Pretrained Models from Human2LocoMan
</h1>
<p align="center">
<img src="https://raw.githubusercontent.com/chrisyrniu/Human2LocoMan/main/doc/figures/logo.png" width="100" />
</p>
<p align="center">
<a href="https://chrisyrniu.github.io/" target="_blank">Yaru Niu</a><sup>1,*</sup>&nbsp;&nbsp;&nbsp;
<a href="https://human2bots.github.io/" target="_blank">Yunzhe Zhang</a><sup>1,*</sup>&nbsp;&nbsp;&nbsp;
<a href="https://human2bots.github.io/" target="_blank">Mingyang Yu</a><sup>1</sup>&nbsp;&nbsp;&nbsp;
<a href="https://linchangyi1.github.io/" target="_blank">Changyi Lin</a><sup>1</sup>&nbsp;&nbsp;&nbsp;
<a href="https://human2bots.github.io/" target="_blank">Chenhao Li</a><sup>1</sup>&nbsp;&nbsp;&nbsp;
<a href="https://scholar.google.com/citations?user=7ZZ9fOIAAAAJ&hl=zh-CN" target="_blank">Yikai Wang</a><sup>1</sup>
<br />
<a href="https://yxyang.github.io/" target="_blank">Yuxiang Yang</a><sup>2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://wenhaoyu.weebly.com/" target="_blank">Wenhao Yu</a><sup>2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://research.google/people/tingnanzhang/?&type=google" target="_blank">Tingnan Zhang</a><sup>2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://scholar.google.com/citations?user=6LYI6uUAAAAJ&hl=en" target="_blank">Zhenzhen Li</a><sup>3</sup>&nbsp;&nbsp;&nbsp;
<a href="https://jonfranc.com/" target="_blank">Jonathan Francis</a><sup>1,3</sup>&nbsp;&nbsp;&nbsp;
<a href="https://scholar.google.com/citations?user=LYt_2MgAAAAJ&hl=en" target="_blank">Bingqing Chen</a><sup>3</sup>&nbsp;&nbsp;&nbsp;
<br />
<a href="https://www.jie-tan.net/" target="_blank">Jie Tan</a><sup>2</sup>&nbsp;&nbsp;&nbsp;
<a href="https://www.meche.engineering.cmu.edu/directory/bios/zhao-ding.html" target="_blank">Ding Zhao</a><sup>1</sup>&nbsp;&nbsp;&nbsp;
<br />
<sup>1</sup>Carnegie Mellon University&nbsp;&nbsp;&nbsp;
<sup>2</sup>Google DeepMind&nbsp;&nbsp;&nbsp;
<sup>3</sup>Bosch Center for AI&nbsp;&nbsp;&nbsp;
<br />
<sup>*</sup>Equal contributions
</p>
<p align="center">
Robotics: Science and Systems (RSS) 2025<br />
<a href="https://human2bots.github.io/">Website</a> |
<a href="https://www.arxiv.org/pdf/2506.16475">Paper</a> |
<a href="https://github.com/chrisyrniu/Human2LocoMan">Code</a>
</p>
<p align="center">
<img src="https://raw.githubusercontent.com/chrisyrniu/Human2LocoMan/main/doc/figures/system_overview.png" alt="Human2LocoMan system overview" width="70%"/>
</p>
---
# Model Description
<p align="center">
<img src="https://human2bots.github.io/static/images/model_arch.jpg" alt="MXT model architecture" width="70%"/>
</p>
<p>
Our learning framework is designed to efficiently use data from both human and robot sources while accounting for the modality-specific distributions unique to each embodiment. To this end, we propose the Modularized Cross-Embodiment Transformer (MXT). MXT consists of three groups of modules: tokenizers, a Transformer trunk, and detokenizers. The tokenizers act as encoders, mapping embodiment-specific observation modalities to tokens in a latent space, and the detokenizers translate the output tokens from the trunk into action modalities in each embodiment's action space. The tokenizers and detokenizers are specific to a single embodiment and are reinitialized for each new embodiment, while the trunk is shared across all embodiments and reused when transferring the policy between embodiments.
</p>
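The modular layout above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact architecture: the module choices, dimensions, and hyperparameters below are assumptions, and the real tokenizers/detokenizers handle multiple observation and action modalities (see the Human2LocoMan code for the actual implementation).

```python
import torch
import torch.nn as nn

class MXTSketch(nn.Module):
    """Illustrative sketch of the MXT layout: embodiment-specific
    tokenizer/detokenizer around a shared Transformer trunk.
    All dimensions and layer choices here are placeholders."""

    def __init__(self, obs_dim, act_dim, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        # Embodiment-specific modules: reinitialized for each new embodiment.
        self.tokenizer = nn.Linear(obs_dim, d_model)
        self.detokenizer = nn.Linear(d_model, act_dim)
        # Shared trunk: reused when transferring across embodiments.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)

    def forward(self, obs):
        # obs: (batch, seq, obs_dim) -> actions: (batch, seq, act_dim)
        z = self.tokenizer(obs)   # encode observations into latent tokens
        z = self.trunk(z)         # shared cross-embodiment processing
        return self.detokenizer(z)  # decode tokens into actions

# Transfer: keep the pretrained trunk, swap in fresh tokenizer/detokenizer
# for an embodiment with different observation/action spaces.
human = MXTSketch(obs_dim=32, act_dim=8)
robot = MXTSketch(obs_dim=48, act_dim=12)
robot.trunk.load_state_dict(human.trunk.state_dict())
```

The key point the sketch captures is that only the trunk's weights carry over between embodiments; the input/output modules are always trained from scratch for the target embodiment.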
<p>
We provide MXT checkpoints pretrained on human data, along with their corresponding config files. Specifically, <code>pour.ckpt</code> is pretrained on the human pouring dataset; <code>scoop.ckpt</code> on the human scooping dataset; <code>shoe_org.ckpt</code> on the human unimanual and bimanual shoe organization dataset; and <code>toy_collect.ckpt</code> on the human unimanual and bimanual toy collection dataset.
</p>
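To illustrate how a pretrained trunk might be reused during finetuning, here is a self-contained save/load round trip. The actual checkpoint format and key layout of `pour.ckpt` and the other files are defined by the Human2LocoMan training code, so everything below (the `"trunk"` key, the toy module) is an assumption for demonstration only.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy stand-in for a pretrained trunk; the real checkpoints in this repo
# (e.g. pour.ckpt) are produced by the Human2LocoMan training pipeline.
trunk = nn.Linear(128, 128)
path = os.path.join(tempfile.mkdtemp(), "demo.ckpt")
torch.save({"trunk": trunk.state_dict()}, path)

# Finetuning on a new embodiment: restore the shared trunk weights while
# the embodiment-specific tokenizers/detokenizers start freshly initialized.
new_trunk = nn.Linear(128, 128)
new_trunk.load_state_dict(torch.load(path, map_location="cpu")["trunk"])
```

For the actual finetuning entry points and config usage, refer to the code repository linked above.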
# Citation
If you find this work helpful, please consider citing the paper:
```bibtex
@inproceedings{niu2025human2locoman,
title={Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining},
author={Niu, Yaru and Zhang, Yunzhe and Yu, Mingyang and Lin, Changyi and Li, Chenhao and Wang, Yikai and Yang, Yuxiang and Yu, Wenhao and Zhang, Tingnan and Li, Zhenzhen and Francis, Jonathan and Chen, Bingqing and Tan, Jie and Zhao, Ding},
booktitle={Robotics: Science and Systems (RSS)},
year={2025}
}
```