# InfoSeek Data Download

This document collects ready-to-run scripts for downloading the InfoSeek dataset into:

`/workspace/xiaobin/RL_dataset/data`

It covers:

- InfoSeek annotations
- InfoSeek KB mapping files
- InfoSeek human set
- Wiki6M text files
- OVEN image snapshot on Hugging Face
- OVEN original-source image download workflow

InfoSeek images are derived from OVEN, so image download is handled through the OVEN release pipeline.

## 1. Recommended Directory Layout

```bash
mkdir -p /workspace/xiaobin/RL_dataset/data/infoseek
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_hf
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_source
```

Suggested usage:

- `/workspace/xiaobin/RL_dataset/data/infoseek`: InfoSeek jsonl files
- `/workspace/xiaobin/RL_dataset/data/oven_hf`: Hugging Face image snapshot files
- `/workspace/xiaobin/RL_dataset/data/oven_source`: upstream OVEN repo for original-source image download

## 2. Proxy Workaround

If your shell exports a local proxy that is not actually running, such as `127.0.0.1:7890`, downloads will fail to connect. Use one of these patterns.

Temporarily disable the proxy for a single command (replace `URL` with the file to download):

```bash
env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY wget -c URL
```

Or disable the proxy for the current shell session:

```bash
unset http_proxy https_proxy HTTP_PROXY HTTPS_PROXY
```
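
To see whether a proxy is configured at all before applying either workaround, a small helper can list the variables that are set. This is a minimal sketch; `list_proxy_vars` is a hypothetical name introduced here, and it only inspects the four variables named above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print every proxy-related variable that is currently set, one NAME=value per line.
# Prints nothing when none of the four variables is set.
list_proxy_vars() {
  local name
  for name in http_proxy https_proxy HTTP_PROXY HTTPS_PROXY; do
    # ${!name-} is bash indirect expansion; the trailing "-" keeps `set -u` from failing.
    if [[ -n "${!name-}" ]]; then
      printf '%s=%s\n' "${name}" "${!name}"
    fi
  done
}

list_proxy_vars
```

If the output names a proxy you do not expect, the `unset` command above clears it for the current session.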

## 3. Download All InfoSeek Text Data With `wget`

This is the simplest full download for the released InfoSeek jsonl files.

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
```

## 4. Download All InfoSeek Text Data With `curl`

Use this if `wget` is not available. The `-f` flag makes `curl` exit with an error on an HTTP failure instead of saving the error page, so `set -e` can stop the script. Unlike `wget -c`, these commands do not resume a partial download.

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

curl -f -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
curl -f -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
curl -f -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
curl -f -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
curl -f -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
curl -f -L -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
curl -f -L -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
curl -f -L -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
```

## 5. Download Only Core InfoSeek Splits

If you only need the standard train/val/test annotations:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
```

## 6. Download Only KB Mapping Files

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
```

## 7. Download Only Human Eval Set

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
```

## 8. Download Only Wiki6M Files

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
```

Optional decompression (`-k` keeps the original `.gz` file):

```bash
gunzip -k /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
```
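
Before decompressing, it can be worth confirming that the archive downloaded completely and looking at a record without writing anything to disk. The following is a sketch using only standard `gzip` options; `gz_is_intact` and `gz_head` are hypothetical helper names introduced here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Return 0 if the gzip archive passes its built-in integrity check, non-zero otherwise.
gz_is_intact() {
  gzip -t "$1" 2>/dev/null
}

# Print the first N decompressed lines (default 1) to stdout without creating a file.
gz_head() {
  local file="$1" lines="${2:-1}"
  # head closes the pipe early, so gzip may exit on SIGPIPE; `|| true` tolerates that.
  gzip -dc "${file}" 2>/dev/null | head -n "${lines}" || true
}
```

For example, `gz_is_intact Wiki6M_ver_1_0.jsonl.gz && gz_head Wiki6M_ver_1_0.jsonl.gz 1` prints the first record only if the archive is intact. A truncated download fails the first check, and `wget -c` can then resume it.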

## 9. Download OVEN Image Snapshot From Hugging Face

Upstream OVEN now points image snapshot downloads to the gated dataset `ychenNLP/oven` on Hugging Face. Before downloading:

1. Open `https://huggingface.co/datasets/ychenNLP/oven`
2. Accept the dataset access conditions
3. Log in with the Hugging Face CLI

Install the CLI if needed:

```bash
python -m pip install -U "huggingface_hub[cli]"
```

Login:

```bash
hf auth login
```

Download the image snapshot and mapping file into `/workspace/xiaobin/RL_dataset/data/oven_hf`:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "shard*.tar" \
  --include "all_wikipedia_images.tar" \
  --include "ovenid2impath.csv"
```

Extract the snapshot tar files:

```bash
#!/usr/bin/env bash
set -euo pipefail

HF_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
IMG_DIR="/workspace/xiaobin/RL_dataset/data/infoseek/images"
mkdir -p "${IMG_DIR}"

for f in "${HF_DIR}"/shard*.tar; do
  tar -xf "${f}" -C "${IMG_DIR}"
done

tar -xf "${HF_DIR}/all_wikipedia_images.tar" -C "${IMG_DIR}"
```

Notes:

- The Hugging Face file listing shows `shard01.tar` to `shard08.tar` plus `all_wikipedia_images.tar`
- The download is very large, roughly 293 GB based on the published file sizes
- Extraction needs additional free space on top of the downloaded tar files, because the archives stay on disk unless you delete them
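
Given those sizes, checking free space before starting can save a failed multi-hour download. The sketch below uses POSIX `df` output; `free_gib` and `has_free_space` are hypothetical helper names, and the parsing assumes the filesystem name contains no spaces.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the free space, in whole GiB, of the filesystem that holds the given directory.
free_gib() {
  # -P forces one line per filesystem; -k reports 1024-byte blocks. Column 4 is "Available".
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Return 0 if the directory's filesystem has at least the requested number of GiB free,
# 1 if it has less, and 2 if the free space could not be determined.
has_free_space() {
  local dir="$1" required_gib="$2" free
  free="$(free_gib "${dir}")" || return 2
  [[ -n "${free}" ]] || return 2
  [[ "${free}" -ge "${required_gib}" ]]
}
```

For example, `has_free_space /workspace/xiaobin/RL_dataset/data 600 || echo "not enough space"` checks for roughly twice the download size. The factor of two is a guess to leave room for extraction, not a published requirement.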

## 10. Download OVEN Images From Original Sources

This follows the upstream `oven_eval/image_downloads` workflow.

### 10.1 Clone the Upstream Repo

```bash
git clone https://github.com/edchengg/oven_eval /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval
```

### 10.2 Run All Source Download Scripts

The upstream image download directory contains these scripts:

- `download_aircraft.sh`
- `download_car196.sh`
- `download_coco.sh`
- `download_food101.sh`
- `download_gldv2.sh`
- `download_imagenet.sh`
- `download_inat.sh`
- `download_oxfordflower.sh`
- `download_sports100.sh`
- `download_sun397.sh`
- `download_textvqa.sh`
- `download_v7w.sh`
- `download_vg.sh`

Run them one by one:

```bash
#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

bash download_aircraft.sh
bash download_car196.sh
bash download_coco.sh
bash download_food101.sh
bash download_gldv2.sh
bash download_imagenet.sh
bash download_inat.sh
bash download_oxfordflower.sh
bash download_sports100.sh
bash download_sun397.sh
bash download_textvqa.sh
bash download_v7w.sh
bash download_vg.sh
```

Or run them in a loop:

```bash
#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

for script in download_*.sh; do
  bash "${script}"
done
```
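
Because both variants use `set -e`, the first failing download script stops the whole run. If you would rather attempt every source and review the failures afterwards, a loop along these lines is one option. It is a sketch, not part of the upstream repo, and `run_all` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Run every script passed as an argument, even if an earlier one fails.
# Prints one "FAILED: <script>" line per failure and returns non-zero if any script failed.
run_all() {
  local script
  local failed=()
  for script in "$@"; do
    # Testing the exit status inside `if` keeps `set -e` from aborting the loop.
    if ! bash "${script}"; then
      failed+=("${script}")
    fi
  done
  if (( ${#failed[@]} > 0 )); then
    printf 'FAILED: %s\n' "${failed[@]}"
    return 1
  fi
}
```

From inside the `image_downloads` directory, `run_all download_*.sh` runs all thirteen scripts and lists any that need to be retried.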

### 10.3 Download `ovenid2impath.csv`

You need `ovenid2impath.csv` for the merge step. The current recommended source is the Hugging Face dataset:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "ovenid2impath.csv"
```

### 10.4 Merge Into the Final OVEN Image Layout

Run the upstream merge script after all downloads finish:

```bash
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
python merge_oven_images.py
```

The upstream documentation states that `merge_oven_images.py` should be run after all image download scripts complete and after `ovenid2impath.csv` is available.

## 11. Verify the Downloaded Files

Check text files:

```bash
ls -lh /workspace/xiaobin/RL_dataset/data/infoseek
```

Check Hugging Face snapshot files:

```bash
ls -lh /workspace/xiaobin/RL_dataset/data/oven_hf
```

Check extracted images:

```bash
find /workspace/xiaobin/RL_dataset/data/infoseek/images -type f | wc -l
```
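
`ls -lh` shows what is present but does not flag what is absent. A small check against the file names used in the download scripts above can make gaps obvious. This is a sketch: `check_text_files` is a hypothetical helper name, and it only tests that each file exists and is non-empty, not that its contents are valid.

```bash
#!/usr/bin/env bash
set -euo pipefail

# File names taken from the download commands in this document.
EXPECTED_FILES=(
  infoseek_train.jsonl
  infoseek_val.jsonl
  infoseek_test.jsonl
  infoseek_train_withkb.jsonl
  infoseek_val_withkb.jsonl
  infoseek_human.jsonl
  Wiki6M_ver_1_0.jsonl.gz
  Wiki6M_ver_1_0_title_only.jsonl
)

# Print one status line per expected file: "OK" if it exists and is non-empty,
# "MISSING" otherwise. Returns non-zero if at least one file is missing or empty.
check_text_files() {
  local dir="$1" name status=0
  for name in "${EXPECTED_FILES[@]}"; do
    if [[ -s "${dir}/${name}" ]]; then
      echo "OK       ${name}"
    else
      echo "MISSING  ${name}"
      status=1
    fi
  done
  return "${status}"
}
```

Running `check_text_files /workspace/xiaobin/RL_dataset/data/infoseek` after a full download should print eight `OK` lines. If you used one of the partial download scripts, the files you skipped are expected to show as `MISSING`.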

## 12. Upstream References

- InfoSeek release page: `https://github.com/open-vision-language/infoseek`
- OVEN image download page: `https://github.com/edchengg/oven_eval/tree/main/image_downloads`
- Hugging Face OVEN dataset: `https://huggingface.co/datasets/ychenNLP/oven`
- Hugging Face CLI download docs: `https://huggingface.co/docs/huggingface_hub/guides/cli`