This folder contains the trained Large Video Planner checkpoints (14B parameters) and the metadata for eight dataset sources: Agibot-world, droid, bridge, language-tables, Pandas (filtered), SomethingSomethingV2, ego4d, and epic_kitchens. We release a merged_metadata.csv and a cleaned_metadata.csv for each.
We also release our test set in data/ours_test/, with images and text instructions gathered from third parties.
Trained Checkpoints for our Large Video Planner
checkpoints/lvp_14B.ckpt contains the trained weights for the transformer backbone.
Dataset format
We train on a mixture of datasets, so we define a unified dataset format for consistency and ease of management.
Each dataset includes a global metadata file, typically named metadata_merged.csv, which contains key information for each video clip.
The file is named metadata_merged.csv because each video clip may have multiple recaptions. Instead of storing all captions for a video as a list within a single CSV row, we add one row per caption, so metadata_merged.csv may contain multiple rows that refer to the same video with different captions. For some datasets, we also provide a cleaned_metadata.csv, which contains a deduplicated version of the metadata (one entry per video) but excludes the additional recaptions.
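Because metadata_merged.csv stores one row per caption, it can be convenient to regroup the rows by video. The following is a minimal sketch using only the Python standard library. It assumes the captions of interest are in the gemini_caption column, and the sample rows are made up for illustration; they are not taken from the released files.

```python
import csv
import io
from collections import defaultdict


def group_captions(rows, caption_field="gemini_caption"):
    """Collect every distinct caption listed for each video_path, in row order."""
    captions = defaultdict(list)
    for row in rows:
        caption = (row.get(caption_field) or "").strip()
        # Skip empty captions and exact repeats for the same video.
        if caption and caption not in captions[row["video_path"]]:
            captions[row["video_path"]].append(caption)
    return dict(captions)


# Tiny in-memory stand-in for metadata_merged.csv (values are made up).
sample = io.StringIO(
    "video_path,gemini_caption\n"
    "videos/a.mp4,pick up the cup\n"
    "videos/a.mp4,grasp the cup and lift it\n"
    "videos/b.mp4,open the drawer\n"
)
captions_by_video = group_captions(csv.DictReader(sample))
```

To use it on a real file, replace the in-memory sample with an opened metadata_merged.csv and pass the resulting csv.DictReader to group_captions.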
Important fields of the global metadata include:
- video_path: Relative path (from the metadata file) to the video clip.
- trim_start and trim_end (optional): Specify the trimmed segment of the clip. If absent, the full video is used.
- gemini_caption: Action-focused caption generated by Gemini Flash 2.0.
- original_caption: Original caption from the source dataset; used when no Gemini caption is available.
- prompt_embed_path: Path to precomputed T5 prompt embeddings (not released due to their large size).
- stable_brightess (optional): 1.0 if brightness is stable, 0.0 otherwise. We recommend removing videos with stable_brightess == 0.0.
- stable_background (optional): Either 1.0 or 0.0. We recommend removing videos with stable_background == 0.0; this value indicates a large average optical flow magnitude, which very likely means large background motion.
- detected_hand_in_first_frame (optional): 1.0 if a human hand is detected in the first frame, 0.0 otherwise. Videos with 0.0 often cause embodiment ambiguity and should be filtered out.
- Other fields, such as n_frames, n_fps, height, and width, provide additional information about each clip.
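The filtering recommendations and the caption fallback in the field list can be sketched as follows. This is an illustration under stated assumptions, not the released training pipeline: it assumes an empty or missing optional flag means "no information" and keeps the row, and the metadata location data/droid/metadata_merged.csv and all sample values are hypothetical.

```python
import csv
import io
from pathlib import Path

# Optional quality flags; column names are spelled as in the field list.
QUALITY_FLAGS = ("stable_brightess", "stable_background", "detected_hand_in_first_frame")


def passes_filters(row):
    """Reject a row if any quality flag that is present equals 0.0."""
    for flag in QUALITY_FLAGS:
        value = (row.get(flag) or "").strip()
        # Assumption: an empty or missing flag is treated as "not evaluated".
        if value and float(value) == 0.0:
            return False
    return True


def pick_caption(row):
    """Prefer gemini_caption; fall back to original_caption."""
    gemini = (row.get("gemini_caption") or "").strip()
    return gemini or (row.get("original_caption") or "").strip()


def resolve_video(metadata_file, row):
    """video_path is relative to the directory that holds the metadata file."""
    return Path(metadata_file).parent / row["video_path"]


# Made-up rows standing in for a real metadata file.
sample = io.StringIO(
    "video_path,gemini_caption,original_caption,"
    "stable_brightess,stable_background,detected_hand_in_first_frame\n"
    "videos/a.mp4,pick up the cup,take cup,1.0,1.0,1.0\n"
    "videos/b.mp4,,open drawer,1.0,,1.0\n"
    "videos/c.mp4,wave hand,wave,1.0,0.0,1.0\n"
)
metadata_file = "data/droid/metadata_merged.csv"  # hypothetical location
kept = [
    (resolve_video(metadata_file, row), pick_caption(row))
    for row in csv.DictReader(sample)
    if passes_filters(row)
]
```

Here the third row is dropped because stable_background is 0.0, and the second row falls back to original_caption because its gemini_caption is empty.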
Downloading the dataset
We provide dataset-specific download scripts for AgiBot World, DROID, Ego4D, EpicKitchens, and Something-Something in their respective dataset.py files within the datasets/ folder of the released code.
For the filtered Pandas subset, we provide the unique youtube_key_segment for each video clip, along with the trim_start and trim_end for each clip. To obtain this subset, download the official metadata from Pandas-70M, use the youtube_key_segment to find the URL of each video clip, and then download the clips with your own online video downloader.
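The lookup step for the Pandas subset amounts to a join between the released rows and the official metadata. The sketch below shows one way to do it, under loud assumptions: the column names clip_id and url for the official metadata are hypothetical placeholders, and it assumes youtube_key_segment matches the official identifier exactly. Check the header of the official file and adjust both before relying on it. All sample values are made up.

```python
def attach_urls(clips, official_rows, official_key="clip_id", official_url="url"):
    """Pair each released clip with its source URL from the official metadata.

    official_key and official_url are ASSUMED column names (hypothetical);
    replace them with the real column names of the official metadata file.
    """
    url_by_key = {row[official_key]: row[official_url] for row in official_rows}
    matched, missing = [], []
    for clip in clips:
        key = clip["youtube_key_segment"]
        url = url_by_key.get(key)
        if url is None:
            # Keep track of clips that could not be found upstream.
            missing.append(key)
        else:
            matched.append(
                {
                    "url": url,
                    "trim_start": clip.get("trim_start"),
                    "trim_end": clip.get("trim_end"),
                }
            )
    return matched, missing


# Made-up rows; real rows would come from csv.DictReader on both files.
clips = [
    {"youtube_key_segment": "key_000", "trim_start": "0.0", "trim_end": "4.5"},
    {"youtube_key_segment": "key_999", "trim_start": "1.0", "trim_end": "3.0"},
]
official_rows = [{"clip_id": "key_000", "url": "https://example.com/watch?v=key"}]
matched, missing = attach_urls(clips, official_rows)
```

The matched list carries the URL plus the trim bounds needed to cut each clip after download, while missing lists keys that had no counterpart in the official metadata.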
For Bridge, please download from [Bridge](https://rail-berkeley.github.io/bridgedata/).
For Language Table, please download from [Language Table](https://github.com/google-research/language-table).