## Practical Recommendations

### **Structure of data in a HuggingFace dataset**

#### Datasets, sub-datasets, splits

* A HuggingFace dataset can contain multiple sub-datasets (configs), e.g. at different filter/stringency levels.
* Each sub-dataset has one or more splits, typically `train`, `validation`, and `test`. If the data has no splits, everything goes into `train`.
* The data in different splits of a single sub-dataset should be non-overlapping.
* Example:
  * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) dataset contains the following sub-datasets:
    * `dataset1` # all stability measurements
    * `dataset2` # high-quality folding stabilities
    * `dataset3` # ΔG measurements
    * `dataset3_single` # ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
    * `dataset3_single_cv` # 5-fold cross-validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
* To load a specific sub-dataset:
```
datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
```
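The non-overlap requirement above is worth checking programmatically once the splits are loaded. A minimal, dependency-free sketch (it assumes you have extracted a list of row identifiers from each split, e.g. from an `id` column; the function name is ours, not part of the `datasets` API):

```python
# Check that no identifier appears in more than one split.
def check_splits_disjoint(split_ids):
    """split_ids: dict mapping split name -> iterable of row identifiers.

    Returns the set of identifiers that appear in more than one split
    (empty if the splits are properly disjoint)."""
    seen = {}
    overlaps = set()
    for split, ids in split_ids.items():
        for row_id in ids:
            if row_id in seen and seen[row_id] != split:
                overlaps.add(row_id)
            seen[row_id] = split
    return overlaps

# Example with toy splits:
splits = {"train": [1, 2, 3], "valid": [4, 5], "test": [6]}
check_splits_disjoint(splits)  # -> set(), i.e. the splits are disjoint
```

With a real dataset, `split_ids` can be built with something like `{name: ds[name]["id"] for name in ds}` for whatever column serves as the primary key.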
#### Example: One .csv file dataset

One table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.

First load the dataset locally, then push it to the hub:
```
import datasets
dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:
```
data/
    train-00000-of-00001.parquet
```
and add the following to the header of `README.md`:
```
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
```
To load these data from HuggingFace:
```
dataset = datasets.load_dataset("maomlab/example_dataset")
```
#### Example: train/valid/test split .csv files

Three tables `train.csv`, `valid.csv`, and `test.csv` are to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.

Load the three splits into one dataset and push it to the hub:
```
import datasets
dataset = datasets.load_dataset(
    "csv",
    data_dir = "/tmp",
    data_files = {
        "train": "train.csv",
        "valid": "valid.csv",
        "test": "test.csv"},
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:
```
data/
    train-00000-of-00001.parquet
    valid-00000-of-00001.parquet
    test-00000-of-00001.parquet
```
and add the following to the header of `README.md`:
```
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  - name: valid
    num_bytes: 64
    num_examples: 4
  - name: test
    num_bytes: 64
    num_examples: 4
  download_size: 3996
  dataset_size: 192
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
```
To load these data from HuggingFace:
```
dataset = datasets.load_dataset("maomlab/example_dataset")
```
#### Example: sub-datasets

If you have different related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository but contain different types of data, so they aren't just splits of the same dataset, then load each dataset separately and push it to the hub under its own config name:
```
import datasets
dataset1 = datasets.load_dataset("csv", data_files = "/tmp/dataset1.csv", keep_in_memory = True)
dataset2 = datasets.load_dataset("csv", data_files = "/tmp/dataset2.csv", keep_in_memory = True)
dataset3 = datasets.load_dataset("csv", data_files = "/tmp/dataset3.csv", keep_in_memory = True)
dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset1", data_dir = "dataset1/data")
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset2", data_dir = "dataset2/data")
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset3", data_dir = "dataset3/data")
```
This will create the following files in the repo:
```
dataset1/
    data/
        train-00000-of-00001.parquet
dataset2/
    data/
        train-00000-of-00001.parquet
dataset3/
    data/
        train-00000-of-00001.parquet
```
and add the following to the header of `README.md`:
```
dataset_info:
- config_name: dataset1
  features:
  - name: id
    dtype: int64
  - name: value1
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset2
  features:
  - name: id
    dtype: int64
  - name: value2
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset3
  features:
  - name: id
    dtype: int64
  - name: value3
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
configs:
- config_name: dataset1
  data_files:
  - split: train
    path: dataset1/data/train-*
- config_name: dataset2
  data_files:
  - split: train
    path: dataset2/data/train-*
- config_name: dataset3
  data_files:
  - split: train
    path: dataset3/data/train-*
```
To load these datasets from HuggingFace:
```
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = "dataset1", data_dir = "dataset1")
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = "dataset2", data_dir = "dataset2")
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = "dataset3", data_dir = "dataset3")
```
### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation. The columns should follow typical database design guidelines:

* Identifier columns
  * Sequential key
    * For example: `[1, 2, 3, ...]`
  * Primary key
    * A single column that uniquely identifies each row
    * Distinct for every row
    * No missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * Composite key
    * A set of columns that together uniquely identify each row
    * Either hierarchical or complementary IDs that characterize the observation
    * For example, for an observation of mutations, (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
  * Additional / foreign-key identifiers
    * Identifiers that link the observation with other data
    * For example:
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * Tidy data (sometimes called "long") has one measurement per row
    * Multiple columns can give details for each measurement, including type, units, and metadata
    * Often good for certain data-science workflows (e.g. tidyverse/dplyr)
    * Can handle a variable number of measurements per object
    * Duplicates object identifier columns for each measurement
  * Array data (sometimes called "wide") has one object per row and multiple measurements as different columns
    * Each measurement is typically a single column
    * More compact, i.e. no duplication of identifier columns
    * Good for certain ML/matrix-based computational workflows
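The wide-to-tidy reshaping described above is usually done with `pandas.melt` or `tidyr::pivot_longer`; to make the transformation concrete, here is a dependency-free sketch (the column names `pdb_id`, `dG`, `tm` are hypothetical illustrations, not from any particular dataset):

```python
# Convert an array-format (wide) table, one object per row, into tidy
# (long) format, one measurement per row. Identifier columns are
# duplicated for each measurement; the measurement name and value
# become their own columns.
def wide_to_tidy(rows, id_cols):
    """rows: list of dicts; id_cols: identifier column names to keep."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key in id_cols:
                continue
            record = {c: row[c] for c in id_cols}
            record["measurement"] = key
            record["value"] = value
            tidy.append(record)
    return tidy

wide = [{"pdb_id": "1ABC", "dG": -3.2, "tm": 55.0}]
tidy = wide_to_tidy(wide, id_cols = ["pdb_id"])
# tidy == [{'pdb_id': '1ABC', 'measurement': 'dG', 'value': -3.2},
#          {'pdb_id': '1ABC', 'measurement': 'tm', 'value': 55.0}]
```

Note how the single wide row becomes two tidy rows, each repeating `pdb_id`: this is the duplication-versus-flexibility trade-off between the two layouts.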
#### Molecular formats

* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use uncompressed, plaintext formats
  * easier to analyze computationally
  * the whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful about inferred aspects of the data:
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops, for proteins
* Tools
  * MolVS is useful for small-molecule sanitization
#### Computational data formats

* On-disk formats
  * parquet
    * column-oriented (can load only the data that is needed; easier to compress)
    * robust reader/writer libraries from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * native format for datasets stored with the `datasets` Python package
  * tab/comma-separated tables (.tsv/.csv)
    * Prefer tab-separated: parsing is more consistent and values rarely need escaping
    * Widely used row-oriented text format for storing tabular data on disk
    * Does not store data types, so loading into Python/R often needs custom conversion/QC code
    * Can be compressed on disk, but row-oriented, so less compressible than .parquet
  * .pickle / .RData
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be robust across language/OS versions
    * not easily interoperable across programming languages
* In-memory formats
  * R `data.frame` / `dplyr::tibble`
    * widely used for R data science
    * fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas `DataFrame`
    * widely used for Python data science
    * not especially fast out of the box for data science
  * Python numpy array / R matrix
    * uses a single data type for all elements
    * useful for efficient matrix manipulation
  * Python PyTorch dataset
    * format specifically geared toward loading data for PyTorch deep learning

Recommendations:

* On disk
  * For small, config-level tables use .tsv
  * For large datasets use .parquet
    * smaller than .csv/.tsv
    * robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use `dplyr::tibble` / pandas `DataFrame` for data-science tables
  * Use numpy arrays / PyTorch datasets for machine learning
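As a minimal illustration of the small-table recommendation, here is a .tsv round trip using only the Python standard library (for .parquet, `pandas.DataFrame.to_parquet` with the `pyarrow` engine is the usual route; the filename and columns below are made up for the example):

```python
import csv

# Write a small config-level table as tab-separated values.
rows = [{"id": 1, "value": 10}, {"id": 2, "value": 20}]
with open("config.tsv", "w", newline = "") as f:
    writer = csv.DictWriter(f, fieldnames = ["id", "value"], delimiter = "\t")
    writer.writeheader()
    writer.writerows(rows)

# Read it back. Note that .tsv does not store dtypes, so every value
# comes back as a string and needs explicit conversion -- this is the
# "custom conversion code" caveat noted above, and one reason .parquet
# is preferred for large data.
with open("config.tsv", newline = "") as f:
    loaded = [{"id": int(r["id"]), "value": int(r["value"])}
              for r in csv.DictReader(f, delimiter = "\t")]

# loaded == rows
```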