Practical Recommendations
Structure of data in a HuggingFace dataset
Datasets, sub-datasets, splits
- A HuggingFace dataset can contain multiple sub-datasets (configs), e.g. at different filter/stringency levels.
- Each sub-dataset has one or more splits, typically ('train', 'valid', 'test'). If the data has no splits, everything is placed in the 'train' split.
- The data in different splits of a single sub-dataset should be non-overlapping.
- Example:
- The MegaScale repository contains several sub-datasets, including:
- dataset1 # all stability measurements
- dataset2 # high-quality folding stabilities
- dataset3 # ΔG measurements
- dataset3_single # ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
- dataset3_single_cv # 5-fold cross validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
- To load a specific sub-dataset:
datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
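The non-overlap requirement between splits can be checked mechanically before publishing. A minimal sketch using only the standard library, assuming each split is represented by a set of primary-key identifiers (the split names and ids here are hypothetical):

```python
from itertools import combinations

# Hypothetical primary-key ids for each split of one sub-dataset
splits = {
    "train": {"1abc", "2def", "3ghi"},
    "valid": {"4jkl", "5mno"},
    "test": {"6pqr"},
}

def splits_are_disjoint(splits):
    """Return True if no identifier appears in more than one split."""
    return all(
        a_ids.isdisjoint(b_ids)
        for (_, a_ids), (_, b_ids) in combinations(splits.items(), 2)
    )

print(splits_are_disjoint(splits))  # True for the example above
```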
Example: One .csv file dataset
One table named outcomes.csv to be pushed to the HuggingFace dataset repository maomlab/example_dataset.
First load the dataset locally, then push it to the hub:
import datasets
dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
This will create the following files in the repo
data/
train-00000-of-00001.parquet
and add the following to the header of README.md
dataset_info:
features:
- name: id
dtype: int64
- name: value
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
download_size: 1332
dataset_size: 64
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
To load these data from HuggingFace:
dataset = datasets.load_dataset("maomlab/example_dataset")
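The outcomes.csv file itself is not shown above; a minimal file matching the reported schema (integer id and value columns, 4 examples) could be generated with the standard library. The values here are hypothetical:

```python
import csv

# Write a hypothetical outcomes.csv matching the README schema above
rows = [{"id": i, "value": v} for i, v in enumerate([10, 20, 30, 40], start=1)]
with open("outcomes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "value"])
    writer.writeheader()
    writer.writerows(rows)
```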
Example: train/valid/test split .csv files
Three tables train.csv, valid.csv, test.csv to be pushed to HuggingFace dataset repository maomlab/example_dataset
Load the three splits into one dataset and push it to the hub:
import datasets
dataset = datasets.load_dataset(
'csv',
data_dir = "/tmp",
data_files = {
'train': 'train.csv',
'valid': 'valid.csv',
'test': 'test.csv'},
keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
This will create the following files in the repo
data/
train-00000-of-00001.parquet
valid-00000-of-00001.parquet
test-00000-of-00001.parquet
and add the following to the header of the README.md
dataset_info:
features:
- name: id
dtype: int64
- name: value
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
- name: valid
num_bytes: 64
num_examples: 4
- name: test
num_bytes: 64
num_examples: 4
download_size: 3996
dataset_size: 192
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: valid
path: data/valid-*
- split: test
path: data/test-*
To load these data from HuggingFace:
dataset = datasets.load_dataset("maomlab/example_dataset")
Example: sub-datasets
If you have related datasets (dataset1.csv, dataset2.csv, dataset3.csv) that belong in a single repository but contain different types of data, so they are not just splits of the same dataset, load each one separately and push it to the hub with its own config name.
import datasets
dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)
dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')
This will create the following files in the repo
dataset1/
data/
train-00000-of-00001.parquet
dataset2/
data/
train-00000-of-00001.parquet
dataset3/
data/
train-00000-of-00001.parquet
and add the following to the header of the README.md
dataset_info:
- config_name: dataset1
features:
- name: id
dtype: int64
- name: value1
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
download_size: 1344
dataset_size: 64
- config_name: dataset2
features:
- name: id
dtype: int64
- name: value2
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
download_size: 1344
dataset_size: 64
- config_name: dataset3
features:
- name: id
dtype: int64
- name: value3
dtype: int64
splits:
- name: train
num_bytes: 64
num_examples: 4
download_size: 1344
dataset_size: 64
configs:
- config_name: dataset1
data_files:
- split: train
path: dataset1/data/train-*
- config_name: dataset2
data_files:
- split: train
path: dataset2/data/train-*
- config_name: dataset3
data_files:
- split: train
path: dataset3/data/train-*
To load these datasets from HuggingFace:
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')
Format of a dataset
A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:
- Identifier columns
  - sequential key
    - For example: [1, 2, 3, ...]
  - primary key
    - A single column that uniquely identifies each row
    - Distinct for every row
    - No missing values
    - For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  - composite key
    - A set of columns that together uniquely identify each row
    - Either hierarchical or complementary ids that characterize the observation
    - For example, for an observation of mutations, (structure_id, residue_id, mutation_aa) is a unique identifier
  - additional/foreign key identifiers
    - Identifiers to link the observation with other data
    - For example, for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key, as could the FDA drug name or the IUPAC substance name
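Whether a candidate (primary or composite) key actually identifies every row uniquely is easy to verify before publishing. A standard-library sketch; the column names mirror the mutation example above and the values are hypothetical:

```python
# Hypothetical mutation observations
rows = [
    {"structure_id": "1abc", "residue_id": 42, "mutation_aa": "A", "ddG": -0.5},
    {"structure_id": "1abc", "residue_id": 42, "mutation_aa": "G", "ddG": 1.2},
    {"structure_id": "2def", "residue_id": 7,  "mutation_aa": "A", "ddG": 0.3},
]

def is_candidate_key(rows, columns):
    """True if the columns take a distinct, non-missing value combination in every row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    no_missing = all(None not in key for key in keys)
    return no_missing and len(set(keys)) == len(keys)

print(is_candidate_key(rows, ["structure_id", "residue_id", "mutation_aa"]))  # True
print(is_candidate_key(rows, ["structure_id", "residue_id"]))                 # False: rows 1 and 2 collide
```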
- Tidy key/value columns
  - Tidy vs array data
    - Tidy data (sometimes called "long") has one measurement per row
      - Multiple columns can be used to give details for each measurement, including type, units, and metadata
      - Often good for data science workflows (e.g. tidyverse/dplyr)
      - Can handle a variable number of measurements per object
      - Duplicates the object identifier columns for each measurement
    - Array data (sometimes called "wide") has one object per row and multiple measurements as different columns
      - Typically each measurement is a single column
      - More compact, i.e. no duplication of identifier columns
      - Good for ML/matrix-based computational workflows
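The two layouts carry the same information and can be converted mechanically (in pandas this is melt/pivot; in tidyverse, pivot_longer/pivot_wider). A standard-library sketch of the wide-to-tidy direction, with hypothetical column names:

```python
# One wide row per object, with two measurement columns
wide = [
    {"id": 1, "tm": 55.2, "ddG": -0.5},
    {"id": 2, "tm": 61.0, "ddG": 1.2},
]

def wide_to_tidy(rows, id_column):
    """Unpivot: emit one (object, measurement, value) row per measurement."""
    return [
        {id_column: row[id_column], "measurement": name, "value": value}
        for row in rows
        for name, value in row.items()
        if name != id_column
    ]

tidy = wide_to_tidy(wide, "id")
print(len(tidy))  # 4 rows: 2 objects x 2 measurements
```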
Molecular formats
- Store molecular structure in standard text formats
  - protein structure: PDB, mmCIF, ModelCIF
  - small molecule: SMILES, InChI
- Use an uncompressed, plaintext format
  - Easier to computationally analyze
  - The whole dataset will be compressed for data serialization anyway
- Filtering / standardization / sanitization
  - Be clear about the methods used to process the molecular data
  - Be especially careful about inferred aspects of the data:
    - protonation states
    - salt form and stereochemistry for small molecules
    - data missingness, including unstructured loops for proteins
- Tools
  - MolVS is useful for small-molecule sanitization
Computational data formats
- On disk formats
  - parquet disk format
    - Column-oriented, so only the needed columns have to be loaded, and it is easier to compress
    - Robust reader/writer libraries from Apache Arrow for Python, R, etc.
  - Arrow Table
    - In-memory format closely aligned with the on-disk parquet format
    - Native format for datasets stored with the datasets Python package
  - tab/comma-separated table (.tsv/.csv)
    - Prefer tab-separated: more consistent parsing without needing to escape values
    - Widely used row-oriented text format for storing tabular data on disk
    - Does not store data types, so it often needs custom conversion/QC code when loading into Python/R
    - Can be compressed on disk, but being row-oriented it is less compressible than .parquet
  - .pickle / .RData
    - Language-specific serialization of complex data structures
    - Often very fast to read/write, but may not be robust across language/OS versions
    - Not easily interoperable across programming languages
- In memory formats
  - R data.frame / dplyr::tibble
    - Widely used format for R data science
    - Fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  - Python pandas DataFrame
    - Widely used for Python data science
    - Not especially fast for data science out of the box
  - Python numpy array / R matrix
    - Uses a single data type for all data
    - Useful for efficient matrix manipulation
  - Python PyTorch Dataset
    - Format specifically geared toward loading data for PyTorch deep learning
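The portability caveat for .pickle can be illustrated with the standard library alone: pickle round-trips arbitrary Python objects quickly but produces a Python-only binary blob, whereas a text format like JSON stays readable from any language (a minimal sketch for simple tabular data):

```python
import json
import pickle

table = {"id": [1, 2, 3], "value": [10, 20, 30]}

# pickle: fast, Python-specific binary serialization
blob = pickle.dumps(table)
assert pickle.loads(blob) == table

# JSON: limited to simple types, but language-agnostic plain text
text = json.dumps(table)
assert json.loads(text) == table

print(type(blob).__name__, type(text).__name__)  # bytes str
```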
Recommendations
- On disk
  - For small, config-level tables use .tsv
  - For large data use .parquet
    - Smaller than .csv/.tsv
    - Robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
- In memory
  - Use dplyr::tibble / pandas DataFrame for data science tables
  - Use numpy array / PyTorch Dataset for machine learning
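Writing and reading the recommended .tsv format needs nothing beyond the standard library's csv module with the delimiter set to a tab. A minimal sketch with hypothetical config-level columns:

```python
import csv
import io

rows = [{"name": "cutoff", "value": "0.5"}, {"name": "seed", "value": "42"}]

# Write a small config-level table as tab-separated text
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "value"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read it back and confirm the round trip
buf.seek(0)
assert list(csv.DictReader(buf, delimiter="\t")) == rows
```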