## Practical Recommendations
### **Structure of data in a HuggingFace dataset**
#### Datasets, sub-datasets, splits
* A HuggingFace dataset repository can contain multiple sub-datasets, e.g. at different filter/stringency levels.
* Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data does not have explicit splits, everything goes into 'train'.
* The data in different splits of a single sub-dataset should be non-overlapping.
* Example:
  * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) repository contains multiple sub-datasets, including:
    * `dataset1`: all stability measurements
    * `dataset2`: high-quality folding stabilities
    * `dataset3`: ΔG measurements
    * `dataset3_single`: ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
    * `dataset3_single_cv`: 5-fold cross-validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
* To load a specific sub-dataset:
```python
datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
```
#### Example: One .csv file dataset
One table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.

First load the dataset locally, then push it to the hub:

```python
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
```

and add the following to the header of `README.md`:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
```
To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```
#### Example: train/valid/test split .csv files
Three tables `train.csv`, `valid.csv`, and `test.csv` are to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.

Load the three splits into one dataset and push it to the hub:

```python
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_dir = "/tmp",
    data_files = {
        "train": "train.csv",
        "valid": "valid.csv",
        "test": "test.csv"},
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
    valid-00000-of-00001.parquet
    test-00000-of-00001.parquet
```

and add the following to the header of the `README.md`:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  - name: valid
    num_bytes: 64
    num_examples: 4
  - name: test
    num_bytes: 64
    num_examples: 4
  download_size: 3996
  dataset_size: 192
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
```
To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```
#### Example: sub-datasets
If you have related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository, but that contain different types of data (so they aren't just splits of the same dataset), load each dataset separately and push it to the hub with its own config name:

```python
import datasets

dataset1 = datasets.load_dataset("csv", data_files = "/tmp/dataset1.csv", keep_in_memory = True)
dataset2 = datasets.load_dataset("csv", data_files = "/tmp/dataset2.csv", keep_in_memory = True)
dataset3 = datasets.load_dataset("csv", data_files = "/tmp/dataset3.csv", keep_in_memory = True)

dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset1", data_dir = "dataset1/data")
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset2", data_dir = "dataset2/data")
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset3", data_dir = "dataset3/data")
```
This will create the following files in the repo:

```
dataset1/
    data/
        train-00000-of-00001.parquet
dataset2/
    data/
        train-00000-of-00001.parquet
dataset3/
    data/
        train-00000-of-00001.parquet
```
and add the following to the header of the `README.md`:

```yaml
dataset_info:
- config_name: dataset1
  features:
  - name: id
    dtype: int64
  - name: value1
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset2
  features:
  - name: id
    dtype: int64
  - name: value2
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset3
  features:
  - name: id
    dtype: int64
  - name: value3
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
configs:
- config_name: dataset1
  data_files:
  - split: train
    path: dataset1/data/train-*
- config_name: dataset2
  data_files:
  - split: train
    path: dataset2/data/train-*
- config_name: dataset3
  data_files:
  - split: train
    path: dataset3/data/train-*
```
To load these datasets from HuggingFace:

```python
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = "dataset1", data_dir = "dataset1")
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = "dataset2", data_dir = "dataset2")
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = "dataset3", data_dir = "dataset3")
```
### **Format of a dataset**
A dataset should consist of a single table where each row is a single observation. The columns should follow typical database design guidelines:
* Identifier columns
  * sequential key
    * For example: `[1, 2, 3, ...]`
  * primary key
    * a single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that jointly and uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation with other data
    * For example:
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called "long") has one measurement per row
    * multiple columns can be used to give details for each measurement, including type, units, and metadata
    * often good for data-science workflows (e.g. tidyverse/dplyr split-apply-combine)
    * can handle a variable number of measurements per object
    * duplicates the object identifier columns for each measurement
  * array data (sometimes called "wide") has one object per row and multiple measurements as different columns
    * typically each measurement is a single column
    * more compact, i.e. no duplication of identifier columns
    * good for certain ML/matrix-based computational workflows
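As a concrete sketch of the key and tidy/wide ideas above (the table, column names, and composite key are hypothetical, chosen only for illustration), the snippet below checks that a composite key uniquely identifies each row and converts between wide and tidy layouts with pandas:

```python
import pandas as pd

# Hypothetical wide table: one row per (structure, residue), two measurements per row.
wide = pd.DataFrame({
    "structure_id": ["1abc", "1abc", "2xyz"],
    "residue_id":   [10, 11, 10],
    "ddG":          [0.5, -1.2, 2.3],
    "sasa":         [35.0, 12.5, 80.1],
})

# Verify the composite key (structure_id, residue_id) uniquely identifies each row
# and has no missing values.
key = ["structure_id", "residue_id"]
assert not wide.duplicated(subset = key).any(), "composite key is not unique"
assert wide[key].notna().all().all(), "key columns contain missing values"

# Wide -> tidy: one measurement per row, with a column naming the measurement type.
tidy = wide.melt(id_vars = key, var_name = "measurement", value_name = "value")

# Tidy -> wide: one object per row, one column per measurement.
back = tidy.pivot(index = key, columns = "measurement", values = "value").reset_index()
```

Note how the tidy form duplicates the identifier columns once per measurement, while the wide form stores each measurement as its own column.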
#### Molecular formats
* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use uncompressed, plaintext formats
  * easier to computationally analyze
  * the whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful about inferred aspects of the data:
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops, for proteins
* Tools
  * MolVS is useful for small-molecule sanitization
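As a minimal sketch of SMILES standardization (assuming RDKit is installed; MolVS builds on RDKit and provides a more complete `Standardizer`), the snippet below canonicalizes SMILES strings and strips a salt by keeping the largest fragment. The example molecules are arbitrary choices for illustration:

```python
from rdkit import Chem

def canonicalize(smiles: str) -> str:
    """Parse a SMILES string and return RDKit's canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return Chem.MolToSmiles(mol)

def strip_salt(smiles: str) -> str:
    """Keep only the largest fragment (by heavy-atom count), e.g. to drop counter-ions."""
    fragments = smiles.split(".")
    largest = max(fragments, key = lambda s: Chem.MolFromSmiles(s).GetNumAtoms())
    return canonicalize(largest)
```

Recording the canonical form (and documenting how salts, protonation states, and stereochemistry were handled) makes the dataset reproducible for downstream users.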
#### Computational data formats
* On-disk formats
  * parquet
    * column-oriented, so a reader can load only the data that is needed, and it compresses better
    * robust reader/writer implementations from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * the native format for datasets stored with the `datasets` Python package
  * tab/comma-separated table (.tsv/.csv)
    * prefer tab-separated: more consistent parsing without needing to escape values
    * widely used row-oriented text format for storing tabular data on disk
    * does not store data types, and often needs custom conversion/QC code when loading into Python/R
    * can be compressed on disk, but being row-oriented it is less compressible than .parquet
  * .pickle / .RData
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be robust across language/OS versions
    * not easily interoperable across programming languages
* In-memory formats
  * R `data.frame` / `dplyr::tibble`
    * widely used format for R data science
    * fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas `DataFrame`
    * widely used for Python data science
    * not especially fast out of the box for heavy data-science workloads
  * Python numpy array / R `matrix`
    * uses a single data type for all data
    * useful for efficient matrix manipulation
  * Python PyTorch `Dataset`
    * format specifically geared for loading data for PyTorch deep learning
#### Recommendations
* On disk
  * for small, config-level tables, use .tsv
  * for large data, use .parquet
    * smaller than .csv/.tsv
    * robust open-source libraries in the major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * use `dplyr::tibble` / pandas `DataFrame` for data-science tables
  * use numpy arrays / PyTorch datasets for machine learning