---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
datasets:
  - H-EmbodVis/ORS3D-60K
base_model:
  - Jiayi-Pan/Tiny-Vicuna-1B
---

# GRANT: Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

This repository contains **GRANT**, an embodied multi-modal large language model for Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), presented in the paper *Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution*.

**Authors:** Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai

*(GRANT teaser image)*

## Abstract

Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency.
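The core idea behind ORS3D (overlapping parallelizable subtasks, such as cleaning the sink while the microwave operates, to reduce total completion time) can be illustrated with a minimal sketch. The subtask names, durations, and the greedy `parallel_makespan` heuristic below are illustrative assumptions, not the paper's scheduling algorithm or dataset values.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    duration: int   # minutes
    passive: bool   # runs on its own once started (the agent is free meanwhile)

def sequential_makespan(tasks):
    """Naive plan: execute every subtask one after another."""
    return sum(t.duration for t in tasks)

def parallel_makespan(tasks):
    """Greedy plan: start all passive subtasks first, then perform the active
    subtasks while the appliances run in the background.  (Simplification:
    the time to start a passive task is treated as negligible.)"""
    passive = [t for t in tasks if t.passive]
    active = [t for t in tasks if not t.passive]
    agent_time = sum(t.duration for t in active)
    longest_passive = max((t.duration for t in passive), default=0)
    # The schedule ends when both the agent's work and the slowest
    # background appliance are done.
    return max(agent_time, longest_passive)

tasks = [
    Subtask("run microwave", 5, passive=True),
    Subtask("run washing machine", 30, passive=True),
    Subtask("clean the sink", 10, passive=False),
    Subtask("wipe the table", 5, passive=False),
]

print(sequential_makespan(tasks))  # 50
print(parallel_makespan(tasks))    # 30
```

Exploiting parallelism cuts the total completion time from 50 to 30 minutes in this toy example; GRANT's scheduling token mechanism learns to produce such efficient schedules from language and 3D context.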

## Installation & Data Preparation

This project is built upon Grounded 3D-LLM, and its environment and data preparation roughly follow that repository's instructions.

### Environment Setup

- Python: 3.10.16
- PyTorch: 1.12.1+cu116
- CUDA: 11.6

```shell
conda create -n GRANT python=3.10.16
conda activate GRANT

conda install openblas-devel -c anaconda
conda install openjdk=11

pip install -r requirements.txt

# Note: the path below is for a specific cluster environment.
# Update it according to your system configuration.
export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc/mpc-0.8.1/lib:/mnt/petrelfs/share/gcc/mpfr-2.4.2/lib:/mnt/petrelfs/share/gcc/gmp-4.3.2/lib:/mnt/petrelfs/share/gcc/gcc-9.4.0/lib64:$LD_LIBRARY_PATH

pip3 install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip3 install torch-scatter -f https://data.pyg.org/whl/torch-1.12.1+cu116.html
pip install peft==0.8.2 --no-deps  # --no-deps skips the PyTorch version conflict

mkdir -p third_party
cd third_party
git clone --recursive "https://github.com/NVIDIA/MinkowskiEngine"
cd MinkowskiEngine
git checkout 02fc608bea4c0549b0a7b00ca1bf15dee4a0b228
python setup.py install --blas_include_dirs=${CONDA_PREFIX}/include --blas=openblas

cd ../pointnet2
python setup.py install
```

### Data Preparation

1. Download the ORS3D-60K dataset and dataset splits from HuggingFace.
2. Download the 3D scenes from SceneVerse.

Organize the data as follows:

```
GRANT
├── data
│   ├── langdata
│   │   └── ORS3D.json        # ORS3D-60K dataset
│   └── SceneVerse
│       ├── 3RScan
│       ├── ARKitScenes
│       ├── HM3D
│       ├── MultiScan
│       ├── ScanNet
│       └── splits            # ORS3D-60K dataset splits
```
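To catch path mistakes early, a small script can verify the expected layout before training. The paths follow the directory tree above; the helper itself is a convenience sketch, not part of the official codebase.

```python
from pathlib import Path

# Expected layout, taken from the directory tree above.
EXPECTED = [
    "data/langdata/ORS3D.json",
    "data/SceneVerse/3RScan",
    "data/SceneVerse/ARKitScenes",
    "data/SceneVerse/HM3D",
    "data/SceneVerse/MultiScan",
    "data/SceneVerse/ScanNet",
    "data/SceneVerse/splits",
]

def missing_paths(root):
    """Return the expected data paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("Data layout looks complete.")
```

Run it from the repository root after downloading the data; it lists any missing directory or file instead of failing mid-training.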

### Pretrained weights

**1. Download the pretrained LLM weights**

Please download the pretrained LLM weights ([Jiayi-Pan/Tiny-Vicuna-1B](https://huggingface.co/Jiayi-Pan/Tiny-Vicuna-1B)) and store them in `$ROOT_PATH/pretrained/llm_weight/Tiny-Vicuna-1B/`.
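One possible way to fetch the weights (not prescribed by the official repo) is the Hugging Face CLI, which requires `pip install -U "huggingface_hub[cli]"`:

```shell
# Download Tiny-Vicuna-1B into the directory the training scripts expect.
huggingface-cli download Jiayi-Pan/Tiny-Vicuna-1B \
  --local-dir pretrained/llm_weight/Tiny-Vicuna-1B
```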

**2. Download the model weights**

Download the point cloud encoder weights and pretrained GRANT weights from HuggingFace.

## Citation

If you find this repository useful in your research, please consider giving a star ⭐ and a citation.

```bibtex
@inproceedings{liang2026cook,
  title={Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution},
  author={Liang, Dingkang and Zhang, Cheng and Xu, Xiaopeng and Ju, Jianzhong and Luo, Zhenbo and Bai, Xiang},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}
```