Structure of data in a HuggingFace dataset
Datasets, sub-datasets, splits
- A HuggingFace dataset repository can contain multiple sub-datasets (configurations), e.g. at different filter/stringency levels.
- Each sub-dataset has one or more splits, typically 'train', 'validation' (often shortened to 'valid' or 'val'), and 'test'. If the data has no explicit splits, everything is placed in a single 'train' split.
- The data in different splits of a single sub-dataset should be non-overlapping.
- Example:
- The RosettaCommons/MegaScale repository contains several sub-datasets, including:
- dataset1 # all stability measurements
- dataset2 # high-quality folding stabilities
- dataset3 # ΔG measurements
- dataset3_single # ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
- dataset3_single_cv # 5-fold cross-validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
- To load a specific sub-dataset (a sketch for listing the available sub-datasets follows this list):
- datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
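Before loading, the available sub-datasets and their splits can be discovered with the datasets library; a minimal sketch against the MegaScale repository (the printed names are illustrative):

import datasets

# List the sub-dataset (config) names in the repository
print(datasets.get_dataset_config_names("RosettaCommons/MegaScale"))
# e.g. ['dataset1', 'dataset2', 'dataset3', ...]

# List the splits defined for one sub-dataset
print(datasets.get_dataset_split_names("RosettaCommons/MegaScale", "dataset3_single"))
# e.g. ['train', 'val', 'test']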
Example: One .csv file dataset
One table, outcomes.csv, is to be pushed to the HuggingFace dataset repository maomlab/example_dataset.
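For context, the dataset_info header shown below describes an outcomes.csv with two int64 columns (id, value) and four rows; a hypothetical file of that shape could be generated with:

import csv

# Write a hypothetical outcomes.csv with two int64 columns and four rows,
# matching the dataset_info header shown below
with open("outcomes.csv", "w", newline = "") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(4):
        writer.writerow([i, i * 10])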
First load the dataset locally, then push it to the hub:
import datasets
dataset = datasets.load_dataset(
"csv",
data_files = "outcomes.csv",
keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
This will create the following files in the repo:
data/
train-00000-of-00001.parquet
and add the following metadata to the YAML header of README.md:
dataset_info:
features:
- name: id
dtype: int64
- name: value
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
download_size: 1332
dataset_size: 64
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
To load the data from HuggingFace:
dataset = datasets.load_dataset("maomlab/example_dataset")
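Once loaded, the result is a DatasetDict mapping split names to Dataset objects; a minimal sketch for inspecting it (the printed row is illustrative):

# The loaded object is a DatasetDict keyed by split name
print(dataset)              # summary of splits, features, and row counts
train = dataset["train"]    # the single 'train' split
print(train[0])             # first row as a dict, e.g. {'id': 0, 'value': 0}
df = train.to_pandas()      # convert to a pandas DataFrame for analysis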
Example: train/valid/test split .csv files
Three tables, train.csv, valid.csv, and test.csv, are to be pushed to the HuggingFace dataset repository maomlab/example_dataset.
Load the three splits into one dataset and push it to the hub:
import datasets
dataset = datasets.load_dataset(
'csv',
data_dir = "/tmp",
data_files = {
'train': 'train.csv',
'valid': 'valid.csv',
'test': 'test.csv'},
keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
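Because the splits of a single sub-dataset should be non-overlapping (see above), it is worth a sanity check on the loaded DatasetDict before pushing; a minimal sketch, assuming each row carries a unique id column:

# Check that no 'id' value appears in more than one split
# (assumes each table has a unique 'id' column)
ids = {split: set(dataset[split]["id"]) for split in dataset}
assert not ids["train"] & ids["valid"], "train/valid overlap"
assert not ids["train"] & ids["test"], "train/test overlap"
assert not ids["valid"] & ids["test"], "valid/test overlap"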
This will create the following files in the repo:
data/
train-00000-of-00001.parquet
valid-00000-of-00001.parquet
test-00000-of-00001.parquet
and add the following metadata to the YAML header of README.md:
dataset_info:
features:
- name: id
dtype: int64
- name: value
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
- name: valid
num_bytes: 64
num_examples: 4
- name: test
num_bytes: 64
num_examples: 4
download_size: 3996
dataset_size: 192
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: valid
path: data/valid-*
- split: test
path: data/test-*
To load the data from HuggingFace:
dataset = datasets.load_dataset("maomlab/example_dataset")
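load_dataset also accepts a split argument to fetch a single split as a Dataset rather than the whole DatasetDict:

# Load only the test split as a Dataset (not a DatasetDict)
test = datasets.load_dataset("maomlab/example_dataset", split = "test")

# Split slicing is also supported, e.g. the first half of the training split
train_half = datasets.load_dataset("maomlab/example_dataset", split = "train[:50%]")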
Example: sub-datasets
If you have related tables (dataset1.csv, dataset2.csv, dataset3.csv) that belong in a single repository but contain different types of data, so that they are not just splits of the same dataset, then load each table separately and push it to the hub under its own config name.
import datasets
dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)
dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')
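Since the three pushes differ only in the dataset name, the same can be written as a loop (a minimal sketch, assuming the .csv files live in /tmp as above):

import datasets

# Push each sub-dataset under its own config name and data directory
for name in ["dataset1", "dataset2", "dataset3"]:
    dataset = datasets.load_dataset("csv", data_files = f"/tmp/{name}.csv", keep_in_memory = True)
    dataset.push_to_hub(
        repo_id = "maomlab/example_dataset",
        config_name = name,
        data_dir = f"{name}/data")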
This will create the following files in the repo:
dataset1/
data/
train-00000-of-00001.parquet
dataset2/
data/
train-00000-of-00001.parquet
dataset3/
data/
train-00000-of-00001.parquet
and add the following metadata to the YAML header of README.md:
dataset_info:
- config_name: dataset1
features:
- name: id
dtype: int64
- name: value1
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
download_size: 1344
dataset_size: 64
- config_name: dataset2
features:
- name: id
dtype: int64
- name: value2
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
download_size: 1344
dataset_size: 64
- config_name: dataset3
features:
- name: id
dtype: int64
- name: value3
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
download_size: 1344
dataset_size: 64
configs:
- config_name: dataset1
data_files:
- split: train
path: dataset1/data/train-*
- config_name: dataset2
data_files:
- split: train
path: dataset2/data/train-*
- config_name: dataset3
data_files:
- split: train
path: dataset3/data/train-*
To load these datasets from HuggingFace:
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')
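Equivalently, the config names can be discovered from the repository and the sub-datasets loaded in a loop:

# Discover the available config names, then load each sub-dataset
names = datasets.get_dataset_config_names("maomlab/example_dataset")
subdatasets = {
    name: datasets.load_dataset("maomlab/example_dataset", name = name, data_dir = name)
    for name in names}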