
Practical Recommendations

Structure of data in a HuggingFace dataset

Datasets, sub-datasets, splits

  • A HuggingFace dataset repository can contain multiple sub-datasets, e.g. at different filter/stringency levels.
  • Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data has no splits, the single split is named 'train'.
  • The data in different splits of a single sub-dataset should be non-overlapping.
  • Example:
    • The MegaScale repository contains multiple sub-datasets, including:
      • dataset1 # all stability measurements
      • dataset2 # high-quality folding stabilities
      • dataset3 # ΔG measurements
      • dataset3_single # ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
      • dataset3_single_cv # 5-fold cross-validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
    • To load a specific sub-dataset:
      datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
      

Example: One .csv file dataset

Suppose a single table outcomes.csv is to be pushed to the HuggingFace dataset repository maomlab/example_dataset.
First load the dataset locally, then push it to the hub:

import datasets  
dataset = datasets.load_dataset(  
    "csv",  
    data_files = "outcomes.csv",  
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
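
For concreteness, here is a minimal standard-library sketch that generates an outcomes.csv matching the dataset_info example in this section (the integer id/value columns and four rows are illustrative assumptions):

```python
import csv

# Illustrative contents for outcomes.csv: integer "id" and "value"
# columns with four rows (hypothetical values).
rows = [{"id": i, "value": i * 10} for i in range(1, 5)]

with open("outcomes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "value"])
    writer.writeheader()
    writer.writerows(rows)
```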

This will create the following files in the repo

data/  
    train-00000-of-00001.parquet

and add the following to the header of README.md

dataset_info:  
  features:  
    - name: id  
      dtype: int64  
    - name: value  
      dtype: int64  
  splits:  
    - name: train  
      num_bytes: 64  
      num_examples: 4  
  download_size: 1332  
  dataset_size: 64  
configs:  
  - config_name: default  
    data_files:  
      - split: train  
        path: data/train-*

To load the data from HuggingFace:

dataset = datasets.load_dataset("maomlab/example_dataset")

Example: train/valid/test split .csv files

Suppose three tables train.csv, valid.csv, and test.csv are to be pushed to the HuggingFace dataset repository maomlab/example_dataset.
Load the three splits into one dataset and push it to the hub:

import datasets  
dataset = datasets.load_dataset(  
    'csv',  
    data_dir = "/tmp",  
    data_files = {  
      'train': 'train.csv',  
      'valid': 'valid.csv',  
      'test': 'test.csv'},  
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")

This will create the following files in the repo

data/  
    train-00000-of-00001.parquet  
    valid-00000-of-00001.parquet  
    test-00000-of-00001.parquet

and add the following to the header of the README.md

dataset_info:  
  features:  
    - name: id  
      dtype: int64  
    - name: value  
      dtype: int64  
  splits:  
    - name: train  
      num_bytes: 64  
      num_examples: 4  
    - name: valid  
      num_bytes: 64  
      num_examples: 4  
    - name: test  
      num_bytes: 64  
      num_examples: 4  
  download_size: 3996  
  dataset_size: 192  
configs:  
  - config_name: default  
    data_files:  
      - split: train  
        path: data/train-*  
      - split: valid  
        path: data/valid-*  
      - split: test  
        path: data/test-*

To load the data from HuggingFace:

dataset = datasets.load_dataset("maomlab/example_dataset")
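
The splits above must be non-overlapping partitions of one table. A minimal standard-library sketch of such a partition (the 80/10/10 ratios, the source rows, and the output filenames are illustrative assumptions):

```python
import csv
import random

# Hypothetical source table: one row per observation.
rows = [{"id": i, "value": i * 10} for i in range(100)]

# Shuffle deterministically, then cut into non-overlapping 80/10/10 splits.
random.seed(0)
random.shuffle(rows)
n = len(rows)
splits = {
    "train": rows[:int(0.8 * n)],
    "valid": rows[int(0.8 * n):int(0.9 * n)],
    "test": rows[int(0.9 * n):],
}

for name, split_rows in splits.items():
    with open(f"{name}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(split_rows)
```

Note that for molecular data a purely random split can overestimate model performance when similar molecules land in different splits; similarity-aware splitting (e.g. scaffold- or sequence-based) is often preferable.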

Example: sub-datasets

If you have related datasets (dataset1.csv, dataset2.csv, dataset3.csv) that should go into a single repository, but they contain different types of data and so are not just splits of the same dataset, then load each dataset separately and push it to the hub under its own config name.

import datasets  
dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)  
dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)  
dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')  
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')  
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')

This will create the following files in the repo

dataset1/  
    data/  
        train-00000-of-00001.parquet  
dataset2/  
    data/  
        train-00000-of-00001.parquet  
dataset3/  
    data/  
        train-00000-of-00001.parquet

and add the following to the header of the README.md

dataset_info:  
  - config_name: dataset1  
    features:  
      - name: id  
        dtype: int64  
      - name: value1  
        dtype: int64  
    splits:  
      - name: train  
        num_bytes: 64  
        num_examples: 4  
    download_size: 1344  
    dataset_size: 64  
  - config_name: dataset2  
    features:  
      - name: id  
        dtype: int64  
      - name: value2  
        dtype: int64  
    splits:  
      - name: train  
        num_bytes: 64  
        num_examples: 4  
    download_size: 1344  
    dataset_size: 64  
  - config_name: dataset3  
    features:  
      - name: id  
        dtype: int64  
      - name: value3  
        dtype: int64  
    splits:  
      - name: train  
        num_bytes: 64  
        num_examples: 4  
    download_size: 1344  
    dataset_size: 64  
configs:  
  - config_name: dataset1  
    data_files:  
      - split: train  
        path: dataset1/data/train-*  
  - config_name: dataset2  
    data_files:  
      - split: train  
        path: dataset2/data/train-*  
  - config_name: dataset3  
    data_files:  
      - split: train  
        path: dataset3/data/train-*

To load these datasets from HuggingFace:

dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')  
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')  
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')

Format of a dataset

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:

  • Identifier columns
    • sequential key
      • For example: [1, 2, 3, ...]
    • primary key
      • a single column that uniquely identifies each row
        • distinct for every row
        • no missing values
      • For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
    • composite key
      • A set of columns that uniquely identify each row
        • Either hierarchical or complementary ids that characterize the observation
        • For example, for an observation of mutations, the (structure_id, residue_id, mutation_aa) is a unique identifier
    • additional/foreign key identifiers
      • identifiers to link the observation with other data
      • For example
        • for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
        • FDA drug name or the IUPAC substance name
  • Tidy key/value columns
    • Tidy vs array data
      • tidy data (sometimes called long) has one measurement per row
        • Multiple columns can be used to give details for each measurement including type, units, metadata
        • Often good for certain data science computational analysis workflows (e.g. tidyverse/dplyr)
        • Can handle variable number of measurements per object
        • Duplicates object identifier columns for each measurement
      • array data (sometimes called wide) has one object per row and multiple measurements as different columns
        • Each measurement is typically a single column
        • More compact, i.e. no duplication of identifier columns
        • Good for certain ML/matrix based computational workflows
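
The key properties above can be checked mechanically. A sketch using the (structure_id, residue_id, mutation_aa) composite key from the mutation example (the rows are hypothetical):

```python
# Verify that (structure_id, residue_id, mutation_aa) is a valid
# composite key: no missing values, and unique across rows.
rows = [
    {"structure_id": "1ABC", "residue_id": 42, "mutation_aa": "A"},
    {"structure_id": "1ABC", "residue_id": 42, "mutation_aa": "G"},
    {"structure_id": "2XYZ", "residue_id": 7,  "mutation_aa": "A"},
]
key_cols = ("structure_id", "residue_id", "mutation_aa")

keys = [tuple(row[col] for col in key_cols) for row in rows]
assert all(value is not None for key in keys for value in key), "missing key values"
assert len(keys) == len(set(keys)), "duplicate composite keys"
```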

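To make the tidy (long) vs. array (wide) distinction concrete, a small pure-Python sketch that pivots tidy rows to wide rows (the column and measurement names are hypothetical):

```python
# Tidy (long) form: one measurement per row, with the object
# identifier duplicated for each measurement.
tidy = [
    {"structure_id": "1ABC", "measure": "dG", "value": -7.2},
    {"structure_id": "1ABC", "measure": "Tm", "value": 55.0},
    {"structure_id": "2XYZ", "measure": "dG", "value": -4.1},
]

def tidy_to_wide(rows):
    """Pivot tidy rows to wide: one dict per structure_id,
    one key per measurement type."""
    wide = {}
    for row in rows:
        obj = wide.setdefault(row["structure_id"],
                              {"structure_id": row["structure_id"]})
        obj[row["measure"]] = row["value"]
    return list(wide.values())

wide = tidy_to_wide(tidy)
# Tidy handles a variable number of measurements per object:
# 2XYZ has no Tm measurement, so its wide row simply lacks that key.
```
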
Molecular formats

  • Store molecular structure in standard text formats
    • protein structure: PDB, mmCIF, modelCIF
    • small molecule: SMILES, InChI
    • use uncompressed, plaintext format
      • Easier to analyze computationally
      • the whole dataset will be compressed during serialization anyway
  • Filtering / Standardization / sanitization
    • Be clear about the methods used to process the molecular data
    • Be especially careful about inferred aspects of the data, for example:
      • protonation states,
      • salt form, stereochemistry for small molecules
      • data missingness including unstructured loops for proteins
    • Tools
      • MolVS is useful for small molecule sanitization

Computational data formats

  • On disk formats
    • parquet disk format
      • column-oriented (so only the needed columns can be loaded; easier to compress)
      • robust reader/writer libraries from Apache Arrow for Python, R, etc.
    • Arrow Table
      • In-memory format closely aligned with the on-disk parquet format
      • Native format for datasets loaded with the datasets Python package
    • tab/comma separated table
      • Prefer tab-separated: parsing is more consistent because values rarely need escaping
      • Widely used row-oriented text format for storing tabular data on disk
      • Does not store data types and often needs custom conversion/QC code for loading into Python/R
      • Can be compressed on disk, but row-oriented data is less compressible than .parquet
    • .pickle / .RData
      • language-specific serialization of complex data structures
      • Often very fast to read/write, but may not be robust across language/OS versions
      • Not easily interoperable across programming languages
  • In memory formats
    • R data.frame/dplyr::tibble
      • Widely used format for R data science
      • Fast out of the box for tidyverse data manipulation and split-apply-combine workflows
    • Python pandas DataFrame
      • Widely used for python data science
      • Out of the box, not especially fast for data manipulation
    • Python numpy array / R matrix
      • Uses a single data type for all elements
      • Useful for efficient numeric/matrix manipulation
    • Python PyTorch Dataset
      • Format specifically geared toward loading data for PyTorch deep learning
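
The tab-vs-comma point above can be demonstrated with Python's standard csv module: with a tab delimiter, a comma-containing value needs no quoting, while the default comma delimiter forces quoting (the example record is hypothetical):

```python
import csv
import io

# One record whose value contains a comma.
rows = [{"id": "1", "name": "aspirin, acetylsalicylic acid"}]

# Tab-separated: the comma is not special, so no quoting is needed.
tsv_buf = io.StringIO()
tsv_writer = csv.DictWriter(tsv_buf, fieldnames=["id", "name"], delimiter="\t")
tsv_writer.writeheader()
tsv_writer.writerows(rows)
tsv_text = tsv_buf.getvalue()

# Comma-separated: the same value must be quoted to parse correctly.
csv_buf = io.StringIO()
csv_writer = csv.DictWriter(csv_buf, fieldnames=["id", "name"])
csv_writer.writeheader()
csv_writer.writerows(rows)
csv_text = csv_buf.getvalue()
```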

Recommendations

  • On disk
    • For small, configuration-level tables use .tsv
    • For large data format use .parquet
      • Smaller on disk than .csv/.tsv
      • Robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
  • In memory
    • Use dplyr::tibble / pandas DataFrame for data science tables
    • Use numpy array / pytorch dataset for machine learning