---
license: mit
---

# MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation

![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)
![PyTorch](https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge&logo=PyTorch&logoColor=white)

[🌐Project Page](https://sites.google.com/view/open-mla) \| [✍️Paper (arXiv)](http://arxiv.org/abs/2509.26642) \| [🎥Demo](https://sites.google.com/view/open-mla)

Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang

We introduce a multisensory language–action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling.
Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that repurposes the large language model itself as a perception module: it interprets multimodal cues directly by aligning 2D image, 3D point cloud, and tactile tokens through positional correspondence.
To further enhance MLA’s understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation.
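
To make "positional correspondence" more concrete, here is a minimal sketch of one way 3D points could be tied to 2D image patches so that tokens from both modalities share a position index. It is an illustration under stated assumptions, not the MLA implementation: the pinhole camera model, the `Camera` fields, the patch size, and the function names are all hypothetical and do not come from the paper or this repository.

```python
"""Illustrative sketch only: sharing position indices between 3D points and
2D image patches. The camera model, patch size, and all names here are
assumptions for illustration, not part of the MLA codebase."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Camera:
    # Hypothetical pinhole intrinsics and image size in pixels.
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def patch_index(u, v, cam, patch):
    """Row-major index of the square image patch containing pixel (u, v)."""
    cols = cam.width // patch
    return int(v // patch) * cols + int(u // patch)


def point_to_patch(point, cam, patch):
    """Project a 3D point (camera frame, z forward) to a patch index.

    Returns None when the point is behind the camera or outside the image.
    """
    x, y, z = point
    if z <= 0:
        return None
    u = cam.fx * x / z + cam.cx
    v = cam.fy * y / z + cam.cy
    if not (0 <= u < cam.width and 0 <= v < cam.height):
        return None
    return patch_index(u, v, cam, patch)


cam = Camera(fx=100.0, fy=100.0, cx=32.0, cy=32.0, width=64, height=64)
points = [(0.0, 0.0, 1.0), (-0.2, -0.2, 1.0), (0.0, 0.0, -1.0)]

# Each visible point receives the position index of the patch it lands on,
# so a point token and an image-patch token can carry the same position id.
position_ids = [point_to_patch(p, cam, patch=16) for p in points]
print(position_ids)  # [10, 0, None]
```

With a 64x64 image and 16-pixel patches the grid is 4x4, so the point on the optical axis lands in patch 10, the point up and to the left lands in patch 0, and the point behind the camera receives no index.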