# InfoSeek Data Download

This document collects ready-to-run scripts for downloading the InfoSeek dataset into:

`/workspace/xiaobin/RL_dataset/data`

It covers:

- InfoSeek annotations
- InfoSeek KB mapping files
- InfoSeek human set
- Wiki6M text files
- OVEN image snapshot on Hugging Face
- OVEN original-source image download workflow

InfoSeek images are derived from OVEN, so image download is handled through the OVEN release pipeline.

## 1. Recommended Directory Layout

```bash
mkdir -p /workspace/xiaobin/RL_dataset/data/infoseek
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_hf
mkdir -p /workspace/xiaobin/RL_dataset/data/oven_source
```

Suggested usage:

- `/workspace/xiaobin/RL_dataset/data/infoseek`: InfoSeek jsonl files, Wiki6M files, and the extracted images (under `images/`)
- `/workspace/xiaobin/RL_dataset/data/oven_hf`: Hugging Face image snapshot files
- `/workspace/xiaobin/RL_dataset/data/oven_source`: upstream OVEN repo for original-source image download

## 2. Proxy Workaround

If your shell is configured with an invalid local proxy such as `127.0.0.1:7890`, use one of these patterns.

Temporarily disable proxy for a single command:

```bash
env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY wget -c URL
```

Or disable proxy for the current shell session:

```bash
unset http_proxy https_proxy all_proxy HTTP_PROXY HTTPS_PROXY ALL_PROXY
```
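To see which proxy variables are actually set before disabling them, a small helper along these lines can help. This is only a sketch: `show_proxy_vars` is a hypothetical name, and the list includes `all_proxy`/`ALL_PROXY`, which curl also honors.

```bash
#!/usr/bin/env bash
# Print every proxy-related environment variable that is currently set.
# show_proxy_vars is a hypothetical helper name, not part of any tool.
show_proxy_vars() {
  local name
  for name in http_proxy https_proxy all_proxy HTTP_PROXY HTTPS_PROXY ALL_PROXY; do
    # ${!name-} is bash indirect expansion: the value of the variable named by $name,
    # or an empty string when that variable is unset.
    if [ -n "${!name-}" ]; then
      printf '%s=%s\n' "${name}" "${!name}"
    fi
  done
}
```

Calling `show_proxy_vars` with no output means none of these variables are set in the current shell.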

## 3. Download All InfoSeek Text Data With `wget`

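This script downloads every released InfoSeek jsonl file plus the two Wiki6M files. `wget -c` resumes partially downloaded files, so the script can be rerun after an interruption.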

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
```
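After a download script finishes, it can be useful to confirm that every expected file is present and non-empty. The following is a minimal sketch; `missing_files` is a hypothetical helper, and the file names passed to it are up to the caller.

```bash
#!/usr/bin/env bash
# Print the name of each expected file that is missing or empty in a directory.
# missing_files is a hypothetical helper; it returns non-zero if anything is missing.
missing_files() {
  local dir="$1"
  shift
  local name status=0
  for name in "$@"; do
    # -s is true only when the file exists and has a size greater than zero.
    if [ ! -s "${dir}/${name}" ]; then
      printf '%s\n' "${name}"
      status=1
    fi
  done
  return "${status}"
}
```

For example, `missing_files "${TARGET_DIR}" infoseek_train.jsonl infoseek_val.jsonl infoseek_test.jsonl` prints nothing and returns 0 when all three core splits are in place.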

## 4. Download All InfoSeek Text Data With `curl`

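Use this if `wget` is not available. Unlike `wget -c`, `curl -O` as used here does not resume a partial file; rerunning the script downloads each file again from the start.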

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

# -f makes curl exit non-zero on an HTTP error instead of saving the error page as the file.
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
curl -fL -O http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl

ls -lh "${TARGET_DIR}"
```
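The same eight URLs appear in both the `wget` and `curl` scripts. As an alternative sketch, the list can be generated once and piped into either tool. `infoseek_urls` is a hypothetical helper name; the URLs it prints are the ones used in the scripts above.

```bash
#!/usr/bin/env bash
# Print one download URL per line for the eight files used in this document.
# infoseek_urls is a hypothetical helper name.
infoseek_urls() {
  local base="http://storage.googleapis.com/gresearch/open-vision-language"
  local name
  # InfoSeek annotation, KB mapping, and human-set files live under infoseek/.
  for name in train val test train_withkb val_withkb human; do
    printf '%s/infoseek/infoseek_%s.jsonl\n' "${base}" "${name}"
  done
  # The two Wiki6M files sit one level up.
  printf '%s/Wiki6M_ver_1_0.jsonl.gz\n' "${base}"
  printf '%s/Wiki6M_ver_1_0_title_only.jsonl\n' "${base}"
}
```

From inside the target directory, `infoseek_urls | wget -c -i -` feeds the list to `wget` on standard input, and `infoseek_urls | xargs -n 1 curl -fL -O` does the same for `curl`, one URL per invocation.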

## 5. Download Only Core InfoSeek Splits

If you only need the standard train/val/test annotations:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_test.jsonl
```

## 6. Download Only KB Mapping Files

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_train_withkb.jsonl
wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_val_withkb.jsonl
```

## 7. Download Only Human Eval Set

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/infoseek/infoseek_human.jsonl
```

## 8. Download Only Wiki6M Files

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/infoseek"
mkdir -p "${TARGET_DIR}"
cd "${TARGET_DIR}"

wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0.jsonl.gz
wget -c http://storage.googleapis.com/gresearch/open-vision-language/Wiki6M_ver_1_0_title_only.jsonl
```

Optional decompression:

```bash
gunzip -k /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz
```
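Before decompressing a multi-gigabyte archive, it may be worth testing its integrity and looking at one record. The sketch below does both without writing the decompressed file to disk; `peek_gz` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Test a .gz archive and print its first line without fully decompressing it to disk.
# peek_gz is a hypothetical helper name.
peek_gz() {
  local archive="$1"
  # gzip -t reads the whole archive and fails on truncated or corrupt data.
  gzip -t "${archive}" || return 1
  # Process substitution keeps head's exit status as the result, so an early
  # exit of head does not trip `set -o pipefail` in the calling script.
  head -n 1 < <(gzip -dc "${archive}")
}
```

For example, `peek_gz /workspace/xiaobin/RL_dataset/data/infoseek/Wiki6M_ver_1_0.jsonl.gz` prints the first Wiki6M record if the archive is intact.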

## 9. Download OVEN Image Snapshot From Hugging Face

Upstream OVEN now points image snapshot downloads to the gated dataset `ychenNLP/oven` on Hugging Face. Before downloading:

1. Open `https://huggingface.co/datasets/ychenNLP/oven`
2. Accept the dataset access conditions
3. Log in with the Hugging Face CLI

Install the CLI if needed:

```bash
python -m pip install -U "huggingface_hub[cli]"
```

Log in:

```bash
hf auth login
```

Download the image snapshot and mapping file into `/workspace/xiaobin/RL_dataset/data/oven_hf`:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "shard*.tar" \
  --include "all_wikipedia_images.tar" \
  --include "ovenid2impath.csv"
```

Extract the snapshot tar files:

```bash
#!/usr/bin/env bash
set -euo pipefail

HF_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
IMG_DIR="/workspace/xiaobin/RL_dataset/data/infoseek/images"
mkdir -p "${IMG_DIR}"

for f in "${HF_DIR}"/shard*.tar; do
  tar -xf "${f}" -C "${IMG_DIR}"
done

tar -xf "${HF_DIR}/all_wikipedia_images.tar" -C "${IMG_DIR}"
```
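One way to sanity-check the extraction is to compare the number of file entries in each archive with the number of files that end up on disk. The sketch below counts non-directory entries in a tar file; `tar_file_count` is a hypothetical helper name, and listing an archive reads it end to end, so this is slow on large shards.

```bash
#!/usr/bin/env bash
# Print the number of non-directory entries in a tar archive.
# tar_file_count is a hypothetical helper name.
tar_file_count() {
  local archive="$1"
  # tar -t lists entry names; directory entries end with a slash,
  # so grep -vc '/$' counts every line that is not a directory.
  tar -tf "${archive}" | grep -vc '/$'
}
```

Summing `tar_file_count` over the shard archives gives an upper bound to compare with `find "${IMG_DIR}" -type f | wc -l`; the two can differ if archives contain overlapping paths.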

Notes:

- The Hugging Face file listing shows `shard01.tar` to `shard08.tar` plus `all_wikipedia_images.tar`.
- The download is very large, roughly 293 GB based on the published file sizes.
- Extraction needs additional free space on top of the download. If the tar files are kept, plan for roughly the same amount again.
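Given those sizes, checking free space before starting the download can save a failed run. A minimal sketch, with `free_gib` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Print the free space, in whole GiB, on the filesystem that holds a path.
# free_gib is a hypothetical helper name.
free_gib() {
  local path="$1"
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # the fourth column of the second line is the available space.
  df -Pk "${path}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}
```

For example, `free_gib /workspace/xiaobin/RL_dataset/data` should report comfortably more than the download size before the snapshot and its extracted images will fit on the same filesystem.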

## 10. Download OVEN Images From Original Sources

This follows the upstream `oven_eval/image_downloads` workflow.

### 10.1 Clone the Upstream Repo

```bash
git clone https://github.com/edchengg/oven_eval /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval
```

### 10.2 Run All Source Download Scripts

The upstream image download directory contains these scripts:

- `download_aircraft.sh`
- `download_car196.sh`
- `download_coco.sh`
- `download_food101.sh`
- `download_gldv2.sh`
- `download_imagenet.sh`
- `download_inat.sh`
- `download_oxfordflower.sh`
- `download_sports100.sh`
- `download_sun397.sh`
- `download_textvqa.sh`
- `download_v7w.sh`
- `download_vg.sh`

Run them one by one:

```bash
#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

bash download_aircraft.sh
bash download_car196.sh
bash download_coco.sh
bash download_food101.sh
bash download_gldv2.sh
bash download_imagenet.sh
bash download_inat.sh
bash download_oxfordflower.sh
bash download_sports100.sh
bash download_sun397.sh
bash download_textvqa.sh
bash download_v7w.sh
bash download_vg.sh
```

Or run them in a loop:

```bash
#!/usr/bin/env bash
set -euo pipefail

cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads

for script in download_*.sh; do
  bash "${script}"
done
```
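Both variants above stop at the first failing script because of `set -e`. When some sources are temporarily unreachable, it can be more practical to keep going and collect the failures. The following sketch does that; `run_all_downloads` and the per-script `.log` files are assumptions of this example, not part of the upstream repo.

```bash
#!/usr/bin/env bash
# Run every download_*.sh in a directory, continue past failures,
# and print the names of the scripts that exited non-zero.
# run_all_downloads is a hypothetical helper name.
run_all_downloads() {
  local dir="$1"
  local script name
  local failed=()
  for script in "${dir}"/download_*.sh; do
    # Skip the unexpanded pattern when no script matches.
    [ -e "${script}" ] || continue
    name="$(basename "${script}")"
    # Run from the script's directory in a subshell and keep its output
    # in a log file next to it, e.g. download_coco.log.
    if ! (cd "${dir}" && bash "${name}") > "${dir}/${name%.sh}.log" 2>&1; then
      failed+=("${name}")
    fi
  done
  if [ "${#failed[@]}" -gt 0 ]; then
    printf '%s\n' "${failed[@]}"
    return 1
  fi
}
```

Calling `run_all_downloads /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads` prints the scripts that need a rerun and returns non-zero if there are any.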

### 10.3 Download `ovenid2impath.csv`

You need `ovenid2impath.csv` for the merge step. The currently recommended source is the gated Hugging Face dataset, so the access and login steps from section 9 apply here too:

```bash
#!/usr/bin/env bash
set -euo pipefail

TARGET_DIR="/workspace/xiaobin/RL_dataset/data/oven_hf"
mkdir -p "${TARGET_DIR}"

hf download ychenNLP/oven \
  --repo-type dataset \
  --local-dir "${TARGET_DIR}" \
  --include "ovenid2impath.csv"
```

### 10.4 Merge Into the Final OVEN Image Layout

Run the upstream merge script after all downloads finish:

```bash
cd /workspace/xiaobin/RL_dataset/data/oven_source/oven_eval/image_downloads
python merge_oven_images.py
```

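The upstream documentation states that `merge_oven_images.py` should be run after all image download scripts complete and after `ovenid2impath.csv` is available. Step 10.3 saves the CSV under `/workspace/xiaobin/RL_dataset/data/oven_hf`, so check which path the script expects and copy or link the file there if needed.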

## 11. Verify the Downloaded Files

Check text files:

```bash
ls -lh /workspace/xiaobin/RL_dataset/data/infoseek
```

Check Hugging Face snapshot files:

```bash
ls -lh /workspace/xiaobin/RL_dataset/data/oven_hf
```

Check extracted images:

```bash
find /workspace/xiaobin/RL_dataset/data/infoseek/images -type f | wc -l
```
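A jsonl file holds one record per line, so line counts give a quick per-file record count beyond what `ls -lh` shows. A minimal sketch, with `count_jsonl` as a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Print "<line count> <file name>" for every .jsonl file in a directory.
# count_jsonl is a hypothetical helper name.
count_jsonl() {
  local dir="$1"
  local f
  for f in "${dir}"/*.jsonl; do
    # Skip the unexpanded pattern when the directory has no .jsonl files.
    [ -e "${f}" ] || continue
    # Reading from stdin makes wc print only the number.
    printf '%s %s\n' "$(wc -l < "${f}" | tr -d ' ')" "$(basename "${f}")"
  done
}
```

Running `count_jsonl /workspace/xiaobin/RL_dataset/data/infoseek` lists each split with its record count; a count of 0 or a very small number usually points to a failed or truncated download.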

## 12. Upstream References

- InfoSeek release page: `https://github.com/open-vision-language/infoseek`
- OVEN image download page: `https://github.com/edchengg/oven_eval/tree/main/image_downloads`
- Hugging Face OVEN dataset: `https://huggingface.co/datasets/ychenNLP/oven`
- Hugging Face CLI download docs: `https://huggingface.co/docs/huggingface_hub/guides/cli`