## Practical Recommendations
### **Structure of data in a HuggingFace dataset**
#### Datasets, sub-datasets, splits
* A HuggingFace dataset can contain multiple sub-datasets, e.g. at different filter/stringency levels.
* Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data does not have splits, everything is placed in 'train'.
* The data in different splits of a single sub-dataset should be non-overlapping.
* Example:
    * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) repository contains several sub-datasets, including:
        * dataset1 # all stability measurements
        * dataset2 # high-quality folding stabilities
        * dataset3 # ΔG measurements
        * dataset3_single # ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
        * dataset3_single_cv # 5-fold cross-validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
* To load a specific sub-dataset:
```python
datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
```
#### Example: One .csv file dataset
A single table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
First load the dataset locally, then push it to the hub:
```python
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:
```
data/
    train-00000-of-00001.parquet
```
and add the following to the header of `README.md`:
```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
```
To load these data from HuggingFace:
```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```
#### Example: train/valid/test split .csv files
Three tables `train.csv`, `valid.csv`, and `test.csv` are to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
Load the three splits into one dataset and push it to the hub:
```python
import datasets

dataset = datasets.load_dataset(
    'csv',
    data_dir = "/tmp",
    data_files = {
        'train': 'train.csv',
        'valid': 'valid.csv',
        'test': 'test.csv'},
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:
```
data/
    train-00000-of-00001.parquet
    valid-00000-of-00001.parquet
    test-00000-of-00001.parquet
```
and add the following to the header of `README.md`:
```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  - name: valid
    num_bytes: 64
    num_examples: 4
  - name: test
    num_bytes: 64
    num_examples: 4
  download_size: 3996
  dataset_size: 192
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
```
To load these data from HuggingFace:
```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```
#### Example: sub-datasets
If you have related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that belong in a single repository but contain different types of data, so they are not just splits of the same dataset, then load each dataset separately and push it to the hub with a given config name:
```python
import datasets

dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')
```
This will create the following files in the repo:
```
dataset1/
    data/
        train-00000-of-00001.parquet
dataset2/
    data/
        train-00000-of-00001.parquet
dataset3/
    data/
        train-00000-of-00001.parquet
```
and add the following to the header of `README.md`:
```yaml
dataset_info:
- config_name: dataset1
  features:
  - name: id
    dtype: int64
  - name: value1
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset2
  features:
  - name: id
    dtype: int64
  - name: value2
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset3
  features:
  - name: id
    dtype: int64
  - name: value3
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
configs:
- config_name: dataset1
  data_files:
  - split: train
    path: dataset1/data/train-*
- config_name: dataset2
  data_files:
  - split: train
    path: dataset2/data/train-*
- config_name: dataset3
  data_files:
  - split: train
    path: dataset3/data/train-*
```
To load these datasets from HuggingFace:
```python
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')
```
### **Format of a dataset**
A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:
* Identifier columns
    * sequential key
        * For example: `[1, 2, 3, ...]`
    * primary key
        * a single column that uniquely identifies each row
        * distinct for every row
        * no missing values
        * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
    * composite key
        * a set of columns that together uniquely identify each row
        * either hierarchical or complementary ids that characterize the observation
        * For example, for an observation of mutations, (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
    * additional/foreign key identifiers
        * identifiers to link the observation with other data
        * For example
            * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
            * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
    * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
    * tidy data (sometimes called "long") has one measurement per row
        * Multiple columns can be used to give details for each measurement, including type, units, and metadata
        * Often good for certain data-science computational analysis workflows (e.g. tidyverse/dplyr)
        * Can handle a variable number of measurements per object
        * Duplicates object identifier columns for each measurement
    * array data (sometimes called "wide") has one object per row and multiple measurements as different columns
        * Typically each measurement is a single column
        * More compact, i.e. no duplication of identifier columns
        * Good for certain ML/matrix-based computational workflows
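The key and tidy/array concepts above can be illustrated with a small standard-library sketch. The table below is hypothetical toy data (the column names follow the mutation example above); it checks that the composite key is unique and pivots the tidy (long) layout into an array (wide) layout:

```python
# Hypothetical toy measurements in tidy (long) form: one measurement per row.
# The composite key (structure_id, residue_id, mutation_aa) plus the
# measurement type should uniquely identify each row.
rows = [
    {"structure_id": "1ABC", "residue_id": 42, "mutation_aa": "A", "measurement": "ddG", "value": 1.2},
    {"structure_id": "1ABC", "residue_id": 42, "mutation_aa": "A", "measurement": "expression", "value": 0.8},
    {"structure_id": "1ABC", "residue_id": 57, "mutation_aa": "W", "measurement": "ddG", "value": -0.3},
]

# Check key uniqueness: duplicate keys would silently collapse in a pivot.
keys = [(r["structure_id"], r["residue_id"], r["mutation_aa"], r["measurement"]) for r in rows]
assert len(keys) == len(set(keys)), "composite key is not unique"

# Pivot tidy (long) -> array (wide): one object per row, one column per
# measurement type. Note objects may have different numbers of measurements.
wide = {}
for r in rows:
    key = (r["structure_id"], r["residue_id"], r["mutation_aa"])
    wide.setdefault(key, {})[r["measurement"]] = r["value"]

print(wide[("1ABC", 42, "A")])  # {'ddG': 1.2, 'expression': 0.8}
```

In practice this pivot is what `tidyr::pivot_wider` (R) or `pandas.DataFrame.pivot` (Python) does; the sketch only shows why the composite key must be unique before pivoting.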
#### Molecular formats
* Store molecular structures in standard text formats
    * protein structure: PDB, mmCIF, ModelCIF
    * small molecule: SMILES, InChI
* Use an uncompressed, plaintext format
    * Easier to computationally analyze
    * The whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
    * Be clear about the methods used to process the molecular data
    * Be especially careful about inferred aspects of the data
        * protonation states
        * salt form and stereochemistry for small molecules
        * data missingness, including unstructured loops for proteins
    * Tools
        * MolVS is useful for small-molecule sanitization
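As a minimal sketch of one such check: in SMILES, a `.` separates disconnected components, so a multi-component string often indicates a salt form that should be reviewed before modeling. The function names below are hypothetical; for real curation use a dedicated tool such as MolVS or RDKit rather than string splitting:

```python
# Minimal salt-form screen for SMILES strings (illustration only).
# In SMILES, "." is the disconnection operator, so "CCN.Cl" encodes
# two separate components (here, an amine plus HCl).
def smiles_components(smiles: str) -> list:
    """Split a SMILES string into its disconnected components."""
    return smiles.split(".")

def looks_like_salt(smiles: str) -> bool:
    """Flag multi-component SMILES for manual review / desalting."""
    return len(smiles_components(smiles)) > 1

assert not looks_like_salt("CCO")     # ethanol: single component
assert looks_like_salt("CCN.Cl")      # two components: likely a salt form
```

Recording which rows were flagged and how they were resolved (e.g. largest-fragment kept, charges neutralized) is exactly the kind of processing detail worth documenting in the dataset card.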
#### Computational data formats
* On-disk formats
    * parquet
        * column-oriented (only the columns that are needed can be loaded; easier to compress)
        * robust reader/writer implementations from Apache Arrow for Python, R, etc.
    * Arrow Table
        * in-memory format closely aligned with the on-disk parquet format
        * native format for datasets stored with the `datasets` Python package
    * tab/comma-separated table
        * Prefer tab-separated: parsing is more consistent because values rarely need escaping
        * Widely used row-oriented text format for storing tabular data to disk
        * Does not store column types and often needs custom conversion/QC code when loading into Python/R
        * Can be compressed on disk, but being row-oriented it is less compressible than .parquet
    * .pickle / .Rdata
        * language-specific serialization of complex data structures
        * Often very fast to read/write, but may not be robust across language/OS versions
        * Not easily interoperable across programming languages
* In-memory formats
    * R `data.frame`/`dplyr::tibble`
        * Widely used format for R data science
        * Out of the box, fast for tidyverse data manipulation and split-apply-combine workflows
    * Python pandas `DataFrame`
        * Widely used for Python data science
        * Out of the box, not especially fast for large-scale data manipulation
    * Python numpy array / R Matrix
        * Uses a single data type for all entries
        * Useful for efficient matrix manipulation
    * Python PyTorch Dataset
        * Format specifically geared toward loading data for PyTorch deep learning
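The tab-separated preference above can be demonstrated with the standard library alone. The toy rows below are hypothetical; the point is that a comma inside an IUPAC name round-trips through a `.tsv` without any quoting or escaping:

```python
import csv
import io

# Toy table with a comma embedded in a chemical name.
rows = [
    {"id": "1", "name": "2,4-dichlorophenol", "value": "0.5"},
    {"id": "2", "name": "benzene", "value": "1.3"},
]

# Write as tab-separated values (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames = ["id", "name", "value"], delimiter = "\t")
writer.writeheader()
writer.writerows(rows)

# Read it back: the embedded comma needed no escaping or quoting.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer, delimiter = "\t"))
assert round_trip == rows
```

With a comma delimiter the same name would have to be quoted, and any reader that mishandles quoting would silently split the field; this is why `.tsv` parses more consistently for chemistry data.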
#### Recommendations
* On disk
    * For small, config-level tables use .tsv
    * For large data use .parquet
        * Smaller than .csv/.tsv
        * Robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
    * Use `dplyr::tibble` / pandas `DataFrame` for data science tables
    * Use numpy arrays / PyTorch Datasets for machine learning