# Contribute Your Models and Datasets

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/EvolvingLMMs-Lab/lmms-eval/blob/main/tools/make_image_hf_dataset.ipynb)

This notebook shows how to build a Hugging Face dataset in the correct format: stored as Parquet and viewable on the Hugging Face dataset hub.

We use the dataset [`lmms-lab/VQAv2_TOY`](https://huggingface.co/datasets/lmms-lab/VQAv2_TOY) as an example and convert it to the proper format.

## Preparation

We need to install the `datasets` library to create the dataset and `Pillow` to handle images.

In [None]:
!pip install datasets Pillow

We also need to log in to Hugging Face to upload the dataset. Go to the [Hugging Face website](https://huggingface.co/settings/tokens) to get your API token.

In [None]:
!huggingface-cli login --token hf_YOUR_HF_TOKEN # replace hf_YOUR_HF_TOKEN with your own Hugging Face token

## Download Dataset

We have uploaded the zip file of the dataset to [Hugging Face](https://huggingface.co/datasets/pufanyi/VQAv2_TOY/tree/main/source_data) for download. This dataset is a subset of the [VQAv2](https://visualqa.org/) dataset, with $20$ entries each from the `val`, `test`, and `test-dev` splits, for easier downloading.

In [None]:
!wget https://huggingface.co/datasets/lmms-lab/VQAv2_TOY/resolve/main/source_data/sample_data.zip -P data
!unzip data/sample_data.zip -d data

We can open the files in `data/questions` to see how the dataset is organized. Each question file of the toy `VQAv2` dataset has the following structure:

```json
{
    "info": { /* some information */ },
    "task_type": "TASK_TYPE", "data_type": "mscoco",
    "license": { /* some license */ },
    "questions": [
        {
            "image_id": 262144, // integer id of the image
            "question": "Is the ball flying towards the batter?",
            "question_id": 262144000
        },
        /* ... */
    ]
}
```
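
Before writing a generator, it can help to confirm this layout with the standard `json` module. The sketch below is only illustrative: the helper name `summarize_questions` is hypothetical, and it runs on a small stand-in file written to a temporary directory rather than on the real download. For actual use, point it at a file such as `data/questions/vqav2_toy_questions_val2014.json`.

```python
import json
import os
import tempfile


def summarize_questions(path):
    """Return (number of questions, sorted keys of the first question)."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    questions = data["questions"]
    first_keys = sorted(questions[0].keys()) if questions else []
    return len(questions), first_keys


# Small stand-in that mirrors the structure shown above (not the real file).
sample = {
    "info": {},
    "task_type": "TASK_TYPE",
    "data_type": "mscoco",
    "license": {},
    "questions": [
        {
            "image_id": 262144,
            "question": "Is the ball flying towards the batter?",
            "question_id": 262144000,
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_questions.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    num_questions, keys = summarize_questions(sample_path)

print(num_questions, keys)
```

The keys reported for the first question are the ones the generator later relies on, in particular `image_id`, which is used to locate the image file.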

## Define Dataset Features _(Optional<sup>*</sup>)_

You can define the features of the dataset. For more details, please refer to the [official documentation](https://huggingface.co/docs/datasets/en/about_dataset_features).

<sup>*</sup> _Note that if the dataset features are consistent and all entries in your dataset table are non-null **for all splits of data**, you can skip this step._

In [None]:
import datasets

features = datasets.Features(
    {
        "question_id": datasets.Value("int64"),
        "question": datasets.Value("string"),
        "image_id": datasets.Value("string"),
        "image": datasets.Image(),
    }
)
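
Whether this step can be skipped depends on the condition in the note above: identical columns and no null cells across all splits. A rough pre-check in plain Python might look like the following. The function name `can_skip_features` and the sample records are hypothetical, and the check only compares key names and looks for `None`; it does not compare value types.

```python
def can_skip_features(splits):
    """Return True if every record in every split has the same keys and no None values.

    `splits` maps a split name to a list of record dictionaries.
    """
    expected_keys = None
    for records in splits.values():
        for record in records:
            keys = set(record)
            if expected_keys is None:
                expected_keys = keys
            if keys != expected_keys:
                return False  # columns differ between records or splits
            if any(value is None for value in record.values()):
                return False  # the note above asks for non-null cells
    return True


# Hypothetical records: same columns everywhere, no nulls.
consistent = {
    "val": [{"question_id": 1, "question": "Q1", "image_id": 10}],
    "test": [{"question_id": 2, "question": "Q2", "image_id": 20}],
}
# Hypothetical records: the test split has a null "answer" cell.
with_nulls = {
    "val": [{"question_id": 1, "question": "Q1", "answer": "yes"}],
    "test": [{"question_id": 2, "question": "Q2", "answer": None}],
}

print(can_skip_features(consistent))
print(can_skip_features(with_nulls))
```

If the check fails, defining the features explicitly as shown in this section is the safer route.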

## Define Data Generator

We use [`datasets.Dataset.from_generator`](https://huggingface.co/docs/datasets/v2.20.0/en/package_reference/main_classes#datasets.Dataset.from_generator) to create the dataset.

The generator function should `yield` dictionaries whose keys correspond to the dataset features. Yielding entries one at a time saves memory when loading large datasets.

For the image data, we can convert each image to a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html) object.

Note that if some columns are missing in some splits of the dataset (for example, the `answer` column is usually missing in the `test` split), we need to set these columns to null to ensure that all splits have the same features.

In [None]:
import os
import json
from PIL import Image


def generator(qa_file, image_folder, image_prefix):
    with open(qa_file, "r") as f:
        data = json.load(f)
        qa = data["questions"]

    for q in qa:
        image_id = q["image_id"]
        image_path = os.path.join(image_folder, f"{image_prefix}_{image_id:012}.jpg")
        q["image"] = Image.open(image_path)
        yield q
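
The toy dataset has the same columns in every split, so the generator above does not need to fill anything in. For a dataset where a column is missing in some splits, one possible approach is to wrap the records so that every expected column is present, with `None` for the missing ones. This is a minimal sketch: the names `ALL_COLUMNS` and `with_null_columns` and the extra `answer` column are assumptions for illustration, not part of the toy dataset.

```python
# "answer" is a hypothetical column that exists in some splits only.
ALL_COLUMNS = ("question_id", "question", "image_id", "answer")


def with_null_columns(records, columns=ALL_COLUMNS):
    """Yield each record with every expected column present, using None for missing ones.

    Keys that are not listed in `columns` are dropped.
    """
    for record in records:
        yield {column: record.get(column) for column in columns}


# A test-style record without an "answer" entry.
test_records = [{"question_id": 2, "question": "What is this?", "image_id": 20}]
filled = list(with_null_columns(test_records))
print(filled)
```

When a split ends up with only null values in a column, the features defined in the previous section should be passed explicitly, since the column type can no longer be read from the data of that split.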

## Generate Dataset

We generate the dataset using the generator function.

Note that if you skipped the step of defining dataset features, there is no need to pass the `features` argument. The features are then inferred from the data automatically.

In [None]:
NUM_PROC = 32  # number of processes to use for multiprocessing, set to 1 for no multiprocessing

data_val = datasets.Dataset.from_generator(
    generator,
    gen_kwargs={
        "qa_file": "data/questions/vqav2_toy_questions_val2014.json",
        "image_folder": "data/images",
        "image_prefix": "COCO_val2014",
    },
    # For this dataset, there is no need to specify the features, as all cells are non-null and all splits have the same schema
    # features=features,
    num_proc=NUM_PROC,
)

data_test = datasets.Dataset.from_generator(
    generator,
    gen_kwargs={
        "qa_file": "data/questions/vqav2_toy_questions_test2015.json",
        "image_folder": "data/images",
        "image_prefix": "COCO_test2015",
    },
    # features=features,
    num_proc=NUM_PROC,
)

data_test_dev = datasets.Dataset.from_generator(
    generator,
    gen_kwargs={
        "qa_file": "data/questions/vqav2_toy_questions_test-dev2015.json",
        "image_folder": "data/images",
        "image_prefix": "COCO_test2015",
    },
    # features=features,
    num_proc=NUM_PROC,
)
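
The three calls above differ only in the question file and the image filename prefix. As an optional sketch, the same arguments can be kept in one table so that the mapping from split to files stays in a single place. The names `SPLITS`, `QUESTION_DIR`, and `build_gen_kwargs` are hypothetical; the file names and prefixes are copied from the calls above.

```python
QUESTION_DIR = "data/questions"

# split name -> (question file, image filename prefix)
# Note that test-dev images use the same COCO_test2015 prefix as the test split.
SPLITS = {
    "val": ("vqav2_toy_questions_val2014.json", "COCO_val2014"),
    "test": ("vqav2_toy_questions_test2015.json", "COCO_test2015"),
    "test_dev": ("vqav2_toy_questions_test-dev2015.json", "COCO_test2015"),
}


def build_gen_kwargs(split, image_folder="data/images"):
    """Build the keyword arguments that the generator expects for one split."""
    question_file, image_prefix = SPLITS[split]
    return {
        "qa_file": f"{QUESTION_DIR}/{question_file}",
        "image_folder": image_folder,
        "image_prefix": image_prefix,
    }


all_gen_kwargs = {split: build_gen_kwargs(split) for split in SPLITS}
print(all_gen_kwargs["test_dev"])
```

Each entry of `all_gen_kwargs` could then be passed as the `gen_kwargs` argument of `datasets.Dataset.from_generator`, in the same way as in the cell above.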

## Dataset Upload

Finally, we group the splits into a single dataset and upload it to the Hugging Face dataset hub.

In [None]:
data = datasets.DatasetDict({"val": data_val, "test": data_test, "test_dev": data_test_dev})

In [None]:
data.push_to_hub("lmms-lab/VQAv2_TOY")  # replace lmms-lab with your username

Now, you can check the dataset on the [Hugging Face dataset hub](https://huggingface.co/datasets/lmms-lab/VQAv2_TOY).