Enhance model card for Large Video Planner Enables Generalizable Robot Control (#1)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md CHANGED
---
pipeline_tag: robotics
license: apache-2.0
---

# Large Video Planner Enables Generalizable Robot Control

This repository contains the trained large video planner checkpoints (14B parameters) for the model presented in the paper [Large Video Planner Enables Generalizable Robot Control](https://huggingface.co/papers/2512.15840).

The Large Video Planner explores an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. It produces zero-shot video plans for novel scenes and tasks, which are then post-processed to extract executable robot actions.

- **Project Page:** [https://www.boyuan.space/large-video-planner/](https://www.boyuan.space/large-video-planner/)
- **GitHub Repository:** [https://github.com/buoyancy99/large-video-planner](https://github.com/buoyancy99/large-video-planner)
- **Hugging Face Demo:** [https://huggingface.co/spaces/KempnerInstituteAI/LVP](https://huggingface.co/spaces/KempnerInstituteAI/LVP)

---

This folder contains the trained large video planner checkpoints (14B parameters) and all the metadata for eight dataset sources: AgiBot World, DROID, Bridge, Language-Table, Pandas (filtered), Something-Something V2, Ego4D, and EpicKitchens. We release a `metadata_merged.csv` and a `cleaned_metadata.csv` for each.
We also release our test set in `data/ours_test/`, with images and text instructions gathered from third parties.

## Trained Checkpoints for our Large Video Planner
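If helpful, the checkpoints can also be fetched programmatically with `huggingface_hub`. The sketch below is only an illustration: the repo id is a placeholder, not a confirmed name from this release.

```python
# Hypothetical sketch: fetch the released checkpoints with huggingface_hub.
# The repo id below is a placeholder -- substitute the actual id of this
# repository before running.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="<org>/<this-repo>",  # placeholder: this repository's id
    local_dir="checkpoints",      # where to place the downloaded files
)
print(f"Checkpoints available under: {ckpt_dir}")
```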
|
|
|
|
| 24 |
## Dataset format
We train on a mixture of datasets, so we define a unified dataset format for consistency and ease of management.

Each dataset includes a global metadata file, typically named `metadata_merged.csv`, which contains key information for each video clip.

The file is named metadata_**merged**.csv because each video clip may have multiple recaptions. Instead of saving the captions for each video as a list within a single CSV row, we simply add another row to `metadata_merged.csv`, so `metadata_merged.csv` may contain multiple rows referring to the same video with different captions. For some datasets, we also provide a `cleaned_metadata.csv`, which contains a deduplicated version of the metadata (one entry per video) but excludes the additional recaptions.
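As a minimal sketch of how the two files relate (assuming `pandas` and the `video_path` field documented below; this is not an official loader):

```python
# Minimal sketch, not an official loader: inspect the row-per-caption layout.
# Assumes pandas and the `video_path` field documented in this README.
import pandas as pd

merged = pd.read_csv("metadata_merged.csv")

# Several rows may refer to the same video, one row per (re)caption.
captions_per_video = merged.groupby("video_path").size()
print(captions_per_video.value_counts())

# Keeping one row per video roughly emulates cleaned_metadata.csv,
# except that cleaned_metadata.csv also drops the extra recaptions.
one_per_video = merged.drop_duplicates(subset="video_path")
```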
Important fields of the global metadata include:
1. `video_path`: Relative path (from the metadata file) to the video clip.
2. `trim_start` and `trim_end` (optional): Specify the trimmed segment of the clip. If absent, the full video is used.
3. `gemini_caption`: Action-focused caption generated by Gemini Flash 2.0.
4. `original_caption`: Original caption from the source dataset; used when no Gemini caption is available.
5. `prompt_embed_path`: Path to precomputed T5 prompt embeddings (not released due to their large size).
6. `stable_brightess` (optional): 1.0 if brightness is stable, 0.0 otherwise. We recommend removing videos with `stable_brightess == 0.0`.
7. `stable_background` (optional): Either 1.0 or 0.0. We recommend removing videos with `stable_background == 0.0`; it indicates large average optical flow magnitudes, which very likely means large background motion.
8. `detected_hand_in_first_frame` (optional): 1.0 if a human hand is detected in the first frame, 0.0 otherwise. Videos with 0.0 often cause embodiment ambiguity and should be filtered out (a filtering sketch follows this list).
9. There are some other fields that can help you understand more about each clip: `n_frames`, `n_fps`, `height`, `width`, etc.
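Putting the filtering recommendations above together, here is a minimal sketch (assuming `pandas`; the flag columns are optional per dataset, so missing columns or values are treated as passing):

```python
# Minimal sketch of the recommended quality filtering described above.
# The flag columns are optional per dataset, so missing columns or NaN
# values are treated as passing. `stable_brightess` is spelled as in
# the metadata fields documented in this README.
import pandas as pd

df = pd.read_csv("metadata_merged.csv")

for flag in ["stable_brightess", "stable_background", "detected_hand_in_first_frame"]:
    if flag in df.columns:
        # Keep clips flagged 1.0 (stable / hand visible) or left unannotated.
        df = df[df[flag].fillna(1.0) == 1.0]

print(f"{len(df)} clips remain after filtering")
```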
## Downloading the dataset
We provide dataset-specific download scripts for AgiBot World, DROID, Ego4D, EpicKitchens, and Something-Something in their respective `dataset.py` files within the `datasets/` folder of the released code.

For downloading the filtered Pandas subset, we provide the unique `youtube_key_segment` for each video clip, along with the `trim_start` and `trim_end` of each clip. To download this subset, please download the official metadata from [Panda-70M](https://snap-research.github.io/Panda-70M/), then use the `youtube_key_segment` to find the URL of each video clip and download it with your own video downloader.
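As a rough sketch of that lookup (the Panda-70M column names and the exact format of `youtube_key_segment` are assumptions here; verify them against the official metadata before relying on this):

```python
# Rough sketch: recover clip URLs by joining our keys against the official
# Panda-70M metadata. The official column names ("videoID", "url") and the
# "<youtube_id>_<segment>" key format are assumptions -- check the actual
# Panda-70M release.
import pandas as pd

ours = pd.read_csv("metadata_merged.csv")        # our filtered Pandas subset
official = pd.read_csv("panda70m_metadata.csv")  # local copy of official metadata

# Assumption: youtube_key_segment looks like "<youtube_id>_<segment_index>".
ours["youtube_id"] = ours["youtube_key_segment"].str.rsplit("_", n=1).str[0]

joined = ours.merge(official, left_on="youtube_id", right_on="videoID", how="left")
for clip in joined.itertuples():
    # Hand clip.url, clip.trim_start, clip.trim_end to your video downloader.
    print(clip.url, clip.trim_start, clip.trim_end)
```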
## Sample Usage (Inference)
To run inference with the Large Video Planner, first ensure your environment is set up and checkpoints are downloaded as described in the [GitHub repository's instructions](https://github.com/buoyancy99/large-video-planner#instructions-for-running-the-code).

Then, use the following command for basic inference:

```bash
mkdir -p <your-output-folder>
python -m main \
  +name=<your_exp_name> \
  experiment=exp_video \
  algorithm=wan_i2v \
  dataset=ours_test \
  experiment.tasks=[validation] \
  algorithm.logging.video_type=single \
  experiment.num_nodes=1 \
  experiment.validation.limit_batch=null \
  algorithm.hist_guidance=1.5 \
  algorithm.lang_guidance=2.5
```

Replace `<your-output-folder>` and `<your_exp_name>` with your desired values. Refer to the [GitHub repository](https://github.com/buoyancy99/large-video-planner) for detailed explanations of arguments and further instructions, including how to download the checkpoints.