MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence

MLLM-4D is a comprehensive framework designed to bridge the gaps in training data curation and model post-training for spatiotemporal understanding and reasoning. It enables multimodal large language models (MLLMs) to perceive and reason about the evolution of 3D space over time from purely visual inputs.

Model Description

MLLM-4D targets spatiotemporal intelligence by focusing on the relationships between objects and the camera within 3D space, and the authors report state-of-the-art results on this task. The model first establishes foundational 4D understanding via Supervised Fine-Tuning (SFT), then strengthens its 4D reasoning capabilities with Group Relative Policy Optimization (GRPO) combined with specialized Spatiotemporal Chain of Thought (ST-CoT) prompting. It does this from purely 2D RGB inputs, without architectural modifications.
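
For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value network is needed. The snippet below is a minimal sketch of that group-relative normalization only, not the authors' training code. The function name, the `eps` constant, and the example rewards are assumptions made for illustration, and implementations differ on whether they use the population or sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four answers sampled for one video question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and those below receive negative ones. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.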

Usage

To run the inference demo for MLLM-4D, please refer to the setup instructions in the official repository and use the following commands:

# for MLLM-4D-SFT
python scripts/inference.py --model_type "MLLM-4D-SFT" --model_path PATH-to-MLLM-4D-SFT

# for MLLM-4D-RFT
python scripts/inference.py --model_type "MLLM-4D-RFT" --model_path PATH-to-MLLM-4D-RFT
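
If you want to launch the demo from another Python program, the two commands above can be assembled programmatically. This is a hedged sketch: it uses only the script path and the two flags shown above, while the helper name `build_inference_command` and the `VARIANTS` tuple are hypothetical and not part of the repository. The model path is left as the same placeholder.

```python
import shlex
import sys

# The two model types named in the commands above.
VARIANTS = ("MLLM-4D-SFT", "MLLM-4D-RFT")


def build_inference_command(model_type, model_path):
    """Return the argv list for scripts/inference.py (hypothetical helper)."""
    if model_type not in VARIANTS:
        raise ValueError(f"model_type must be one of {VARIANTS}, got {model_type!r}")
    return [
        sys.executable,
        "scripts/inference.py",
        "--model_type", model_type,
        "--model_path", model_path,
    ]


cmd = build_inference_command("MLLM-4D-SFT", "PATH-to-MLLM-4D-SFT")
print(shlex.join(cmd))

# To actually run it, call this from the repository root after setup:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the checkpoint path contains spaces.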

Citation

If you find this work useful, please consider citing:

@article{yin2026mllm4d,
    title={MLLM-4D: Towards Visual-based Spatial-Temporal Intelligence},
    author={Yin, Xingyilang and Li, Chengzhengxu and Chang, Jiahao and Pun, Chi-Man and Cun, Xiaodong},
    journal={arXiv preprint arXiv:2603.00515},
    year={2026}
}