This folder contains the trained Large Video Planner checkpoints (14B parameters) and the metadata for eight dataset sources: Agibot-world, droid, bridge, language-tables, Pandas (filtered), SomethingSomethingV2, ego4d, epic_kitchens. We release a `merged_metadata.csv` and a `cleaned_metadata.csv` for each.
We also release our test set in `data/ours_test/`, with images and text instructions gathered from third parties.
## Trained Checkpoints for our Large Video Planner
`checkpoints/lvp_14B.ckpt` contains the trained weights for the transformer backbone.
## Dataset format
We train on a mixture of datasets, so we define a unified dataset format for consistency and ease of management.
Each dataset includes a global metadata file, typically named `metadata_merged.csv`, which contains key information for each video clip.
The file is named `metadata_merged.csv` because each video clip may have multiple recaptions. Instead of storing all captions for a video as a list within a single CSV row, we add one row per caption. As a result, `metadata_merged.csv` may contain multiple rows that refer to the same video with different captions. For some datasets, we also provide a `cleaned_metadata.csv`, which contains a deduplicated version of the metadata (one entry per video) but excludes the additional recaptions.
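Because the merged file holds one row per caption, the captions can be regrouped per video by keying on `video_path`. The following is a minimal sketch using only the Python standard library. The helper name `group_captions` and the sample rows are illustrative assumptions, not part of the released code.

```python
import csv
import io
from collections import defaultdict


def group_captions(csv_file, caption_field="gemini_caption"):
    """Map each video_path to the list of non-empty captions across its rows."""
    captions = defaultdict(list)
    for row in csv.DictReader(csv_file):
        caption = (row.get(caption_field) or "").strip()
        if caption:
            captions[row["video_path"]].append(caption)
    return dict(captions)


# Toy stand-in for the merged metadata file (illustrative rows only).
sample = io.StringIO(
    "video_path,gemini_caption\n"
    "clips/a.mp4,pick up the cup\n"
    "clips/a.mp4,grasp the cup and lift it\n"
    "clips/b.mp4,open the drawer\n"
)
grouped = group_captions(sample)
print(len(grouped), "unique videos")
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` sample.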
Important fields in the global metadata include:
1. `video_path`: Relative path (from the metadata file) to the video clip.
2. `trim_start` and `trim_end` (optional): Specify the trimmed segment of the clip. If absent, the full video is used.
3. `gemini_caption`: Action-focused caption generated by Gemini Flash 2.0.
4. `original_caption`: Original caption from the source dataset; used when no Gemini caption is available.
5. `prompt_embed_path`: Path to precomputed T5 prompt embeddings (not released due to large size).
6. `stable_brightess` (optional): 1.0 if brightness is stable, 0.0 otherwise. We recommend removing videos with `stable_brightess == 0.0`.
7. `stable_background` (optional): Either 1.0 or 0.0. We recommend removing videos with `stable_background == 0.0`; this value indicates a large average optical flow magnitude, which very likely means the video contains large background motion.
8. `detected_hand_in_first_frame` (optional): 1.0 if a human hand is detected in the first frame, 0.0 otherwise. Videos with 0.0 often cause embodiment ambiguity and should be filtered out.
9. Other fields, such as `n_frames`, `n_fps`, `height`, and `width`, provide further information about each clip.
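The filtering recommendations and the caption fallback described in the list above can be sketched as follows. This is a hedged example, not the released training pipeline: the helper names and sample rows are hypothetical, a missing optional flag is treated as "keep", and the trim values are parsed as plain floats because their unit is not specified here.

```python
import csv
import io

# Optional quality flags from the field list; a clip is dropped only when a
# flag is present and equal to 0.0.
QUALITY_FLAGS = ("stable_brightess", "stable_background", "detected_hand_in_first_frame")


def passes_quality_flags(row):
    """Return False if any quality flag that is present parses to 0.0."""
    for flag in QUALITY_FLAGS:
        value = (row.get(flag) or "").strip()
        if value and float(value) == 0.0:
            return False
    return True


def pick_caption(row):
    """Prefer gemini_caption; fall back to original_caption when it is empty."""
    gemini = (row.get("gemini_caption") or "").strip()
    return gemini or (row.get("original_caption") or "").strip()


def trim_range(row):
    """Return (trim_start, trim_end) as floats, or None to use the full video."""
    start = (row.get("trim_start") or "").strip()
    end = (row.get("trim_end") or "").strip()
    if not start or not end:
        return None
    return float(start), float(end)


# Toy rows for illustration only.
sample = io.StringIO(
    "video_path,gemini_caption,original_caption,stable_brightess,stable_background,trim_start,trim_end\n"
    "clips/a.mp4,pick up the cup,,1.0,1.0,0.5,3.0\n"
    "clips/b.mp4,,open the drawer,1.0,0.0,,\n"
    "clips/c.mp4,,close the door,,,,\n"
)
kept = [
    (row["video_path"], pick_caption(row), trim_range(row))
    for row in csv.DictReader(sample)
    if passes_quality_flags(row)
]
for item in kept:
    print(item)
```

Here `clips/b.mp4` is dropped because `stable_background` is 0.0, while `clips/c.mp4` is kept because none of its optional flags are set.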
## Downloading the dataset
We provide dataset-specific download scripts for AgiBot World, DROID, Ego4D, EpicKitchens, and Something-Something in their respective dataset.py files within the `datasets/` folder of the released code.
For the filtered Pandas subset, we provide the unique `youtube_key_segment` for each video clip, along with its `trim_start` and `trim_end`. To download this subset, first download the official metadata from [Panda-70M](https://snap-research.github.io/Panda-70M/), then use the `youtube_key_segment` to find the URL of each video clip, and download the clips with a video downloader of your choice.
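The lookup step for the Pandas subset amounts to a join between our filtered metadata and the official metadata. The sketch below shows one way to do it with the standard library. The column names `clip_key` and `url`, the sample rows, and the example URLs are placeholders; how `youtube_key_segment` maps onto the official metadata columns is an assumption that should be checked against the actual files.

```python
import csv
import io


def build_url_lookup(official_csv, key_field, url_field):
    """Index the official metadata by clip key so URLs can be looked up."""
    return {row[key_field]: row[url_field] for row in csv.DictReader(official_csv)}


def resolve_clips(filtered_csv, lookup, key_field="youtube_key_segment"):
    """Split filtered clips into those with a known URL and those without."""
    resolved, missing = [], []
    for row in csv.DictReader(filtered_csv):
        key = row[key_field]
        if key in lookup:
            resolved.append((key, lookup[key], row.get("trim_start"), row.get("trim_end")))
        else:
            missing.append(key)
    return resolved, missing


# Placeholder data; real column names and key formats may differ.
official = io.StringIO(
    "clip_key,url\n"
    "abc_0,https://example.com/watch?v=abc\n"
    "def_3,https://example.com/watch?v=def\n"
)
filtered = io.StringIO(
    "youtube_key_segment,trim_start,trim_end\n"
    "abc_0,1.0,4.0\n"
    "zzz_9,0.0,2.0\n"
)
lookup = build_url_lookup(official, key_field="clip_key", url_field="url")
resolved, missing = resolve_clips(filtered, lookup)
print(len(resolved), "resolved;", len(missing), "missing")
```

The resolved URLs and trim values can then be handed to whichever downloader you use; keys listed in `missing` had no match in the official metadata.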
For Bridge, please download from [Bridge](https://rail-berkeley.github.io/bridgedata/).
For Language Table, please download from [Language Table](https://github.com/google-research/language-table).