
Practical Recommendations

Structure of data in a HuggingFace dataset

Datasets, sub-datasets, splits

  • A HuggingFace dataset repository can contain multiple sub-datasets, e.g. at different filter/stringency levels.
  • Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data has no splits, the single split is named 'train'.
  • The data in different splits of a single sub-dataset should be non-overlapping.
  • Example:
    • The MegaScale repository contains multiple sub-datasets, including:
      • dataset1 # all stability measurements
      • dataset2 # high-quality folding stabilities
      • dataset3 # ΔG measurements
      • dataset3_single # ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
      • dataset3_single_cv # 5-fold cross-validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
    • To load a specific sub-dataset:
      datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
      

Example: One .csv file dataset

Suppose a single table outcomes.csv is to be pushed to the HuggingFace dataset repository maomlab/example_dataset.
First load the dataset locally, then push it to the hub:

import datasets  
dataset = datasets.load_dataset(  
    "csv",  
    data_files = "outcomes.csv",  
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
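
For concreteness, here is a minimal standard-library sketch that generates an outcomes.csv matching the dataset_info example in this section (the integer id/value columns and four rows are illustrative assumptions):

```python
import csv

# Illustrative contents for outcomes.csv: integer "id" and "value"
# columns with four rows (hypothetical values).
rows = [{"id": i, "value": i * 10} for i in range(1, 5)]

with open("outcomes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "value"])
    writer.writeheader()
    writer.writerows(rows)
```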

This will create the following files in the repo

data/  
    train-00000-of-00001.parquet

and add the following to the header of README.md

dataset_info:  
  features:  
    - name: id  
      dtype: int64  
    - name: value  
      dtype: int64  
  splits:  
    - name: train  
      num_bytes: 64  
      num_examples: 4  
  download_size: 1332  
  dataset_size: 64  
configs:  
  - config_name: default  
    data_files:  
      - split: train  
        path: data/train-*

To load the data from HuggingFace:

dataset = datasets.load_dataset("maomlab/example_dataset")

Example: train/valid/test split .csv files

Suppose three tables train.csv, valid.csv, and test.csv are to be pushed to the HuggingFace dataset repository maomlab/example_dataset.
Load the three splits into one dataset and push it to the hub:

import datasets  
dataset = datasets.load_dataset(  
    'csv',  
    data_dir = "/tmp",  
    data_files = {  
      'train': 'train.csv',  
      'valid': 'valid.csv',  
      'test': 'test.csv'},  
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")

This will create the following files in the repo

data/  
    train-00000-of-00001.parquet  
    valid-00000-of-00001.parquet  
    test-00000-of-00001.parquet

and add the following to the header of the README.md

dataset_info:  
  features:  
    - name: id  
      dtype: int64  
    - name: value  
      dtype: int64  
  splits:  
    - name: train  
      num_bytes: 64  
      num_examples: 4  
    - name: valid  
      num_bytes: 64  
      num_examples: 4  
    - name: test  
      num_bytes: 64  
      num_examples: 4  
  download_size: 3996  
  dataset_size: 192  
configs:  
  - config_name: default  
    data_files:  
      - split: train  
        path: data/train-*  
      - split: valid  
        path: data/valid-*  
      - split: test  
        path: data/test-*

To load the data from HuggingFace:

dataset = datasets.load_dataset("maomlab/example_dataset")
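
The splits above must be non-overlapping partitions of one table. A minimal standard-library sketch of such a partition (the 80/10/10 ratios, the source rows, and the output filenames are illustrative assumptions):

```python
import csv
import random

# Hypothetical source table: one row per observation.
rows = [{"id": i, "value": i * 10} for i in range(100)]

# Shuffle deterministically, then cut into non-overlapping 80/10/10 splits.
random.seed(0)
random.shuffle(rows)
n = len(rows)
splits = {
    "train": rows[:int(0.8 * n)],
    "valid": rows[int(0.8 * n):int(0.9 * n)],
    "test": rows[int(0.9 * n):],
}

for name, split_rows in splits.items():
    with open(f"{name}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(split_rows)
```

Note that for molecular data a purely random split can overestimate model performance when similar molecules land in different splits; similarity-aware splitting (e.g. scaffold- or sequence-based) is often preferable.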

Example: sub-datasets

If you have related datasets (dataset1.csv, dataset2.csv, dataset3.csv) that should go into a single repository, but they contain different types of data and so are not just splits of the same dataset, then load each dataset separately and push it to the hub under its own config name.

import datasets  
dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)  
dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)  
dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')  
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')  
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')

This will create the following files in the repo

dataset1/  
    data/  
        train-00000-of-00001.parquet  
dataset2/  
    data/  
        train-00000-of-00001.parquet  
dataset3/  
    data/  
        train-00000-of-00001.parquet

and add the following to the header of the README.md

dataset_info:  
  - config_name: dataset1  
    features:  
      - name: id  
        dtype: int64  
      - name: value1  
        dtype: int64  
    splits:  
      - name: train  
        num_bytes: 64  
        num_examples: 4  
    download_size: 1344  
    dataset_size: 64  
  - config_name: dataset2  
    features:  
      - name: id  
        dtype: int64  
      - name: value2  
        dtype: int64  
    splits:  
      - name: train  
        num_bytes: 64  
        num_examples: 4  
    download_size: 1344  
    dataset_size: 64  
  - config_name: dataset3  
    features:  
      - name: id  
        dtype: int64  
      - name: value3  
        dtype: int64  
    splits:  
      - name: train  
        num_bytes: 64  
        num_examples: 4  
    download_size: 1344  
    dataset_size: 64  
configs:  
  - config_name: dataset1  
    data_files:  
      - split: train  
        path: dataset1/data/train-*  
  - config_name: dataset2  
    data_files:  
      - split: train  
        path: dataset2/data/train-*  
  - config_name: dataset3  
    data_files:  
      - split: train  
        path: dataset3/data/train-*

To load these datasets from HuggingFace:

dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')  
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')  
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')

Format of a dataset

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:

  • Identifier columns
    • sequential key
      • For example: [1, 2, 3, ...]
    • primary key
      • a single column that uniquely identifies each row
        • distinct for every row
        • no missing values
      • For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
    • composite key
      • A set of columns that uniquely identify each row
        • Either hierarchical or complementary ids that characterize the observation
        • For example, for an observation of mutations, the (structure_id, residue_id, mutation_aa) is a unique identifier
    • additional/foreign key identifiers
      • identifiers to link the observation with other data
      • For example
        • for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
        • FDA drug name or the IUPAC substance name
  • Tidy key/value columns
    • Tidy vs array data
      • tidy data (sometimes called long) has one measurement per row
        • Multiple columns can be used to give details for each measurement including type, units, metadata
        • Often good for certain data science computational analysis workflows (e.g. tidyverse/dplyr)
        • Can handle variable number of measurements per object
        • Duplicates object identifier columns for each measurement
      • array data (sometimes called wide) has one object per row and multiple measurements as different columns
        • Each measurement is typically a single column
        • More compact, i.e. no duplication of identifier columns
        • Good for certain ML/matrix based computational workflows
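
The key properties above can be checked mechanically. A sketch using the (structure_id, residue_id, mutation_aa) composite key from the mutation example (the rows are hypothetical):

```python
# Verify that (structure_id, residue_id, mutation_aa) is a valid
# composite key: no missing values, and unique across rows.
rows = [
    {"structure_id": "1ABC", "residue_id": 42, "mutation_aa": "A"},
    {"structure_id": "1ABC", "residue_id": 42, "mutation_aa": "G"},
    {"structure_id": "2XYZ", "residue_id": 7,  "mutation_aa": "A"},
]
key_cols = ("structure_id", "residue_id", "mutation_aa")

keys = [tuple(row[col] for col in key_cols) for row in rows]
assert all(value is not None for key in keys for value in key), "missing key values"
assert len(keys) == len(set(keys)), "duplicate composite keys"
```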

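To make the tidy (long) vs. array (wide) distinction concrete, a small pure-Python sketch that pivots tidy rows to wide rows (the column and measurement names are hypothetical):

```python
# Tidy (long) form: one measurement per row, with the object
# identifier duplicated for each measurement.
tidy = [
    {"structure_id": "1ABC", "measure": "dG", "value": -7.2},
    {"structure_id": "1ABC", "measure": "Tm", "value": 55.0},
    {"structure_id": "2XYZ", "measure": "dG", "value": -4.1},
]

def tidy_to_wide(rows):
    """Pivot tidy rows to wide: one dict per structure_id,
    one key per measurement type."""
    wide = {}
    for row in rows:
        obj = wide.setdefault(row["structure_id"],
                              {"structure_id": row["structure_id"]})
        obj[row["measure"]] = row["value"]
    return list(wide.values())

wide = tidy_to_wide(tidy)
# Tidy handles a variable number of measurements per object:
# 2XYZ has no Tm measurement, so its wide row simply lacks that key.
```
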
Molecular formats

  • Store molecular structure in standard text formats
    • protein structure: PDB, mmCIF, modelCIF
    • small molecule: SMILES, InChI
    • use uncompressed, plaintext format
      • Easier to analyze computationally
      • the whole dataset will be compressed during serialization anyway
  • Filtering / Standardization / sanitization
    • Be clear about the methods used to process the molecular data
    • Be especially careful about inferred aspects of the data, for example:
      • protonation states,
      • salt form, stereochemistry for small molecules
      • data missingness including unstructured loops for proteins
    • Tools
      • MolVS is useful for small molecule sanitization

Computational data formats

  • On disk formats
    • parquet disk format
      • column-oriented (so only the needed columns can be loaded; easier to compress)
      • robust reader/writer libraries from Apache Arrow for Python, R, etc.
    • Arrow Table
      • In-memory format closely aligned with the on-disk parquet format
      • Native format for datasets loaded with the datasets Python package
    • tab/comma separated table
      • Prefer tab-separated: parsing is more consistent because values rarely need escaping
      • Widely used row-oriented text format for storing tabular data on disk
      • Does not store data types and often needs custom conversion/QC code for loading into Python/R
      • Can be compressed on disk, but row-oriented data is less compressible than .parquet
    • .pickle / .RData
      • language-specific serialization of complex data structures
      • Often very fast to read/write, but may not be robust across language/OS versions
      • Not easily interoperable across programming languages
  • In memory formats
    • R data.frame/dplyr::tibble
      • Widely used format for R data science
      • Fast out of the box for tidyverse data manipulation and split-apply-combine workflows
    • Python pandas DataFrame
      • Widely used for python data science
      • Out of the box, not especially fast for data manipulation
    • Python numpy array / R matrix
      • Uses a single data type for all elements
      • Useful for efficient numeric/matrix manipulation
    • Python PyTorch Dataset
      • Format specifically geared toward loading data for PyTorch deep learning
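
The tab-vs-comma point above can be demonstrated with Python's standard csv module: with a tab delimiter, a comma-containing value needs no quoting, while the default comma delimiter forces quoting (the example record is hypothetical):

```python
import csv
import io

# One record whose value contains a comma.
rows = [{"id": "1", "name": "aspirin, acetylsalicylic acid"}]

# Tab-separated: the comma is not special, so no quoting is needed.
tsv_buf = io.StringIO()
tsv_writer = csv.DictWriter(tsv_buf, fieldnames=["id", "name"], delimiter="\t")
tsv_writer.writeheader()
tsv_writer.writerows(rows)
tsv_text = tsv_buf.getvalue()

# Comma-separated: the same value must be quoted to parse correctly.
csv_buf = io.StringIO()
csv_writer = csv.DictWriter(csv_buf, fieldnames=["id", "name"])
csv_writer.writeheader()
csv_writer.writerows(rows)
csv_text = csv_buf.getvalue()
```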

Recommendations

  • On disk
    • For small, configuration-level tables use .tsv
    • For large data format use .parquet
      • Smaller on disk than .csv/.tsv
      • Robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
  • In memory
    • Use dplyr::tibble / pandas DataFrame for data science tables
    • Use numpy array / pytorch dataset for machine learning