pipeline_tag: robotics
license: apache-2.0
Large Video Planner Enables Generalizable Robot Control
This repository contains the trained Large Video Planner checkpoints (14B parameters) for the model presented in the paper Large Video Planner Enables Generalizable Robot Control.
The Large Video Planner explores an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. It produces zero-shot video plans for novel scenes and tasks, which are then post-processed to extract executable robot actions.
- Project Page: https://www.boyuan.space/large-video-planner/
- GitHub Repository: https://github.com/buoyancy99/large-video-planner
- Hugging Face Demo: https://huggingface.co/spaces/KempnerInstituteAI/LVP
This folder contains the trained Large Video Planner checkpoints (14B parameters) and all the metadata for eight dataset sources: AgiBot World, DROID, Bridge, Language Table, Panda-70M (filtered), Something-Something V2, Ego4D, and EpicKitchens. We release a metadata_merged.csv for each source and, for some, a cleaned_metadata.csv.
We also release our test set in data/ours_test/, with images and text instructions gathered from third parties.
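As an illustration only (the exact layout of data/ours_test/ is an assumption here; inspect the folder for the real structure), iterating over paired images and instructions might look like:

```python
from pathlib import Path

# Assumed layout (hypothetical): one image per example, with a
# same-named .txt file holding the text instruction.
root = Path("data/ours_test")
for image_path in sorted(root.rglob("*.jpg")):
    instruction_path = image_path.with_suffix(".txt")  # assumed pairing
    if instruction_path.exists():
        print(image_path.name, "->", instruction_path.read_text().strip())
```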
Trained Checkpoints for our Large Video Planner
checkpoints/lvp_14B.ckpt contains the trained weights for the transformer backbone.
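To sanity-check the download, here is a minimal inspection sketch with PyTorch. The nesting of weights under "state_dict" is an assumption (common for Lightning-style checkpoints); the repository's own loading code is authoritative:

```python
import torch

# Load on CPU for inspection only; running the full 14B model
# requires substantial GPU memory.
ckpt = torch.load("checkpoints/lvp_14B.ckpt", map_location="cpu")

# Assumption: weights may be nested under "state_dict";
# otherwise treat the file as a raw state dict.
state_dict = ckpt.get("state_dict", ckpt)

n_params = sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))
print(f"{len(state_dict)} tensors, {n_params / 1e9:.1f}B parameters")
```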
Dataset format
We train on a mixture of datasets, so we define a unified dataset format for consistency and ease of management.
Each dataset includes a global metadata file, typically named metadata_merged.csv, which contains key information for each video clip.
The file is named metadata_merged.csv because each video clip may have multiple recaptions. Instead of saving all captions for a video as a list within a single CSV row, we create an additional row in metadata_merged.csv, so the file may contain multiple rows referring to the same video with different captions. For some datasets, we also provide a cleaned_metadata.csv, a deduplicated version of the metadata (one entry per video) that excludes the additional recaptions.
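To make the row semantics concrete, here is a hedged pandas sketch (file paths are placeholders; which caption is kept per video in cleaned_metadata.csv is our assumption) relating the merged metadata to a one-row-per-video view:

```python
import pandas as pd

merged = pd.read_csv("metadata_merged.csv")  # placeholder path

# One row per (video, caption) pair: a video with three recaptions
# contributes three rows, so row count >= unique video count.
print(len(merged), "rows for", merged["video_path"].nunique(), "unique videos")

# A deduplicated, one-entry-per-video view, analogous in spirit to
# cleaned_metadata.csv (keeping the first caption is an assumption).
cleaned = merged.drop_duplicates(subset="video_path", keep="first")
```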
Important fields of the global metadata include:
- `video_path`: Relative path (from the metadata file) to the video clip.
- `trim_start` and `trim_end` (optional): Specify the trimmed segment of the clip. If absent, the full video is used.
- `gemini_caption`: Action-focused caption generated by Gemini Flash 2.0.
- `original_caption`: Original caption from the source dataset; used when no Gemini caption is available.
- `prompt_embed_path`: Path to precomputed T5 prompt embeddings (not released due to their large size).
- `stable_brightess` (optional): 1.0 if brightness is stable, 0.0 otherwise. We recommend removing videos with `stable_brightess == 0.0`.
- `stable_background` (optional): Either 1.0 or 0.0. We recommend removing videos with `stable_background == 0.0`; this indicates large average optical flow magnitudes, which very likely means large background motion.
- `detected_hand_in_first_frame` (optional): 1.0 if a human hand is detected in the first frame, 0.0 otherwise. Videos with 0.0 often cause embodiment ambiguity and should be filtered out.
- Other fields, such as `n_frames`, `n_fps`, `height`, `width`, etc., provide additional information about each clip.
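Putting these fields together, a hedged sketch (not from the released code) of loading one dataset's metadata and applying the recommended filters:

```python
import pandas as pd

df = pd.read_csv("metadata_merged.csv")  # placeholder path

# The quality columns are optional, so filter only where present.
for col in ("stable_brightess", "stable_background", "detected_hand_in_first_frame"):
    if col in df.columns:
        df = df[df[col] != 0.0]

# Prefer the Gemini caption, falling back to the original caption
# when no Gemini caption is available.
for row in df.itertuples():
    caption = row.gemini_caption if isinstance(row.gemini_caption, str) else row.original_caption
    print(row.video_path, "->", caption)
```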
Downloading the dataset
We provide dataset-specific download scripts for AgiBot World, DROID, Ego4D, EpicKitchens, and Something-Something in their respective dataset.py files within the datasets/ folder of the released code.
For the filtered Panda-70M subset, we provide the unique youtube_key_segment for each video clip, along with its trim_start and trim_end. To download this subset, first download the official metadata from Panda-70M, then use the youtube_key_segment to find the URL of each video clip, and download it with your own online video downloader.
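As a rough sketch of that workflow (the official metadata's column names, the key matching, and the yt-dlp/ffmpeg tooling are all assumptions, not part of our release):

```python
import subprocess
import pandas as pd

ours = pd.read_csv("metadata_merged.csv")        # our released Panda-70M (filtered) metadata
official = pd.read_csv("panda70m_official.csv")  # hypothetical filename for the official metadata

# Hypothetical column names: assume the official metadata maps a video id
# (matching our youtube_key_segment) to a URL.
url_by_key = dict(zip(official["videoID"], official["url"]))

for row in ours.itertuples():
    url = url_by_key.get(row.youtube_key_segment)
    if url is None:
        continue
    # Download with your preferred tool (yt-dlp shown as one option) ...
    subprocess.run(["yt-dlp", "-o", "full.mp4", url], check=True)
    # ... then trim to the [trim_start, trim_end] segment with ffmpeg.
    subprocess.run(["ffmpeg", "-y", "-i", "full.mp4",
                    "-ss", str(row.trim_start), "-to", str(row.trim_end),
                    "-c", "copy", f"{row.youtube_key_segment}.mp4"], check=True)
```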
Sample Usage (Inference)
To run inference with the Large Video Planner, first ensure your environment is set up and checkpoints are downloaded as described in the GitHub repository's instructions.
Then, use the following command for basic inference:
```bash
mkdir -p <your-output-folder>
python -m main \
  +name=<your_exp_name> \
  experiment=exp_video \
  algorithm=wan_i2v \
  dataset=ours_test \
  experiment.tasks=[validation] \
  algorithm.logging.video_type=single \
  experiment.num_nodes=1 \
  experiment.validation.limit_batch=null \
  algorithm.hist_guidance=1.5 \
  algorithm.lang_guidance=2.5
```
Replace <your-output-folder> and <your_exp_name> with your desired values. Refer to the GitHub repository for detailed explanations of arguments and further instructions, including how to download the checkpoints.