This folder contains the trained Large Video Planner checkpoints (14B parameters) and the metadata for eight dataset sources: Agibot-world, droid, bridge, language-tables, Pandas(filtered), SomethingSomethingV2, ego4d, epic_kitchens. We release a metadata_merged.csv and a cleaned_metadata.csv for each. We also release our test set in data/ours_test/, with images and text instructions gathered from third parties.

Trained Checkpoints for our Large Video Planner

checkpoints/lvp_14B.ckpt contains the trained weights of the transformer backbone.

Dataset format

We train on a mixture of datasets, so we define a unified dataset format for consistency and ease of management.

Each dataset includes a global metadata file, typically named metadata_merged.csv, which contains key information for each video clip.

The file is named metadata_merged.csv because each video clip may have multiple recaptions. Instead of storing all captions for a video as a list within a single CSV row, we add one row per caption, so metadata_merged.csv may contain multiple rows that refer to the same video with different captions. For some datasets, we also provide a cleaned_metadata.csv, which contains a deduplicated version of the metadata (one entry per video) but excludes the additional recaptions.
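The one-row-per-caption layout can be folded back into one entry per video when needed. The following is a minimal sketch using only the Python standard library. It assumes the recaptions live in the gemini_caption column; the sample rows are made up for illustration and do not come from the released files.

```python
import csv
import io


def group_captions(rows, caption_field="gemini_caption"):
    """Collect all distinct captions for each video_path, in first-seen order."""
    grouped = {}
    for row in rows:
        captions = grouped.setdefault(row["video_path"], [])
        caption = (row.get(caption_field) or "").strip()
        if caption and caption not in captions:
            captions.append(caption)
    return grouped


# Made-up stand-in for a metadata_merged.csv file: video a.mp4 has two recaptions.
SAMPLE = """video_path,gemini_caption
videos/a.mp4,pick up the cup
videos/a.mp4,grasp the red cup
videos/b.mp4,open the drawer
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
grouped = group_captions(rows)
for video_path, captions in grouped.items():
    print(video_path, captions)
```

To read a real file, replace the io.StringIO(SAMPLE) argument with an open file handle, for example open(path, newline="").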

Important fields of the global metadata include:

video_path: Relative path (from the metadata file) to the video clip.
trim_start and trim_end (optional): Specify the trimmed segment of the clip. If absent, the full video is used.
gemini_caption: Action-focused caption generated by Gemini Flash 2.0.
original_caption: Original caption from the source dataset; used when no Gemini caption is available.
prompt_embed_path: Path to precomputed T5 prompt embeddings (not released due to large size).
stable_brightess (optional): 1.0 if brightness is stable, 0.0 otherwise. We recommend removing videos with stable_brightess == 0.0.
stable_background (optional): Either 1.0 or 0.0. We recommend removing videos with stable_background == 0.0, which indicates a large average optical flow magnitude and therefore very likely large background motion.
detected_hand_in_first_frame (optional): 1.0 if a human hand is detected in the first frame, 0.0 otherwise. Videos with 0.0 often cause embodiment ambiguity and should be filtered out.
Other fields, such as n_frames, n_fps, height, and width, provide additional information about each clip.
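The filtering recommendations and the caption fallback described above can be sketched as follows. This is an illustrative sketch, not the released data loader. It assumes that an optional flag that is missing or empty means "no information" and keeps the row, and that flag values are stored as numeric strings. The column names are copied verbatim from the field list; the sample rows and the data/ego4d/ location are hypothetical.

```python
import csv
import io
from pathlib import Path

# Optional quality flags, spelled exactly as in the field list above.
QUALITY_FLAGS = ("stable_brightess", "stable_background", "detected_hand_in_first_frame")


def passes_filters(row, flags=QUALITY_FLAGS):
    """Drop a row only if a flag is present and equals 0.0 (assumption: empty means keep)."""
    for flag in flags:
        value = (row.get(flag) or "").strip()
        if value and float(value) == 0.0:
            return False
    return True


def pick_caption(row):
    """Prefer gemini_caption; fall back to original_caption when it is empty."""
    gemini = (row.get("gemini_caption") or "").strip()
    return gemini or (row.get("original_caption") or "").strip()


def resolve_video(metadata_file, row):
    """video_path is relative to the directory that holds the metadata file."""
    return Path(metadata_file).parent / row["video_path"]


# Made-up rows for illustration only.
SAMPLE = """video_path,gemini_caption,original_caption,stable_brightess,stable_background,detected_hand_in_first_frame
videos/a.mp4,pick up the cup,cup,1.0,1.0,1.0
videos/b.mp4,,open drawer,1.0,0.0,1.0
videos/c.mp4,,close the lid,,,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept = [row for row in rows if passes_filters(row)]
for row in kept:
    # "data/ego4d/metadata_merged.csv" is a hypothetical location.
    print(resolve_video("data/ego4d/metadata_merged.csv", row), "|", pick_caption(row))
```

Which flags to apply depends on the dataset; pass a shorter flags tuple to passes_filters to skip a check that is not relevant to a given source.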

Downloading the dataset

We provide dataset-specific download scripts for AgiBot World, DROID, Ego4D, EpicKitchens, and Something-Something in their respective dataset.py files within the datasets/ folder of the released code.

To download the filtered Pandas subset, we provide the unique youtube_key_segment for each video clip, along with its trim_start and trim_end. Please download the official metadata from Panda-70M, use the youtube_key_segment to look up the URL of each video clip, and then download the clips with a video downloader of your choice.
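The lookup step can be sketched as a simple key join between our metadata and the official metadata. This sketch makes two assumptions that should be checked against the files you actually download: that the official metadata has a key column and a URL column (named videoID and url here as placeholders), and that youtube_key_segment matches that key directly. All rows below are made up.

```python
def attach_urls(clips, official_rows, key_field="youtube_key_segment",
                official_key="videoID", url_field="url"):
    """Match each clip to a URL in the official metadata by key.

    official_key and url_field are assumed column names; adjust them to the
    header of the official metadata file.
    """
    url_by_key = {row[official_key]: row[url_field] for row in official_rows}
    matched, missing = [], []
    for clip in clips:
        url = url_by_key.get(clip[key_field])
        if url is None:
            missing.append(clip[key_field])  # report keys without a match
        else:
            matched.append({**clip, "url": url})
    return matched, missing


# Made-up rows standing in for our metadata and the official metadata.
clips = [
    {"youtube_key_segment": "abc123", "trim_start": "0", "trim_end": "48"},
    {"youtube_key_segment": "zzz999", "trim_start": "10", "trim_end": "60"},
]
official = [
    {"videoID": "abc123", "url": "https://example.com/watch?v=abc123"},
]

matched, missing = attach_urls(clips, official)
print(len(matched), "matched;", len(missing), "missing:", missing)
```

Each matched entry keeps its trim_start and trim_end, so the downloaded video can be cut to the released segment afterwards.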

For Bridge, please download from [Bridge](https://rail-berkeley.github.io/bridgedata/).

For Language Table, please download from [Language Table](https://github.com/google-research/language-table).