## Practical Recommendations

### **Structure of data in a HuggingFace dataset**

#### Datasets, sub-datasets, splits

* A HuggingFace dataset can contain multiple sub-datasets, e.g. at different filter/stringency levels.
* Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data does not have splits, all of it will be in 'train'.
* The data in different splits of a single sub-dataset should be non-overlapping.
* Example:
    * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) dataset contains 6 datasets, including:
        * `dataset1`: all stability measurements
        * `dataset2`: high-quality folding stabilities
        * `dataset3`: ΔG measurements
        * `dataset3_single`: ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
        * `dataset3_single_cv`: 5-fold cross validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus et al., 2024) splits
    * To load a specific sub-dataset:

        ```
        import datasets
        datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
        ```

#### Example: One .csv file dataset

One table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.

First load the dataset locally, then push it to the hub:

    import datasets

    # With no split given, all rows of the CSV go into the 'train' split
    dataset = datasets.load_dataset(
        "csv",
        data_files = "outcomes.csv",
        keep_in_memory = True)
    dataset.push_to_hub(repo_id = "maomlab/example_dataset")

This will create the following files in the repo:

    data/
      train-00000-of-00001.parquet

and add the following to the header of `README.md`:

    dataset_info:
      features:
      - name: id
        dtype: int64
      - name: value
        dtype: int64
      splits:
      - name: train
        num_bytes: 64
        num_examples: 4
      download_size: 1332
      dataset_size: 64
    configs:
    - config_name: default
      data_files:
      - split: train
        path: data/train-*

To load these data from HuggingFace:

    dataset = datasets.load_dataset("maomlab/example_dataset")

#### Example: train/valid/test split .csv files

Three tables, `train.csv`, `valid.csv`, and `test.csv`, are to be pushed to the
HuggingFace dataset repository `maomlab/example_dataset`.

Load the three splits into one dataset and push it to the hub:

    import datasets

    # Each key of data_files becomes a split name
    dataset = datasets.load_dataset(
        'csv',
        data_dir = "/tmp",
        data_files = {
            'train': 'train.csv',
            'valid': 'valid.csv',
            'test': 'test.csv'},
        keep_in_memory = True)
    dataset.push_to_hub(repo_id = "maomlab/example_dataset")

This will create the following files in the repo:

    data/
      train-00000-of-00001.parquet
      valid-00000-of-00001.parquet
      test-00000-of-00001.parquet

and add the following to the header of `README.md`:

    dataset_info:
      features:
      - name: id
        dtype: int64
      - name: value
        dtype: int64
      splits:
      - name: train
        num_bytes: 64
        num_examples: 4
      - name: valid
        num_bytes: 64
        num_examples: 4
      - name: test
        num_bytes: 64
        num_examples: 4
      download_size: 3996
      dataset_size: 192
    configs:
    - config_name: default
      data_files:
      - split: train
        path: data/train-*
      - split: valid
        path: data/valid-*
      - split: test
        path: data/test-*

To load these data from HuggingFace:

    dataset = datasets.load_dataset("maomlab/example_dataset")

#### Example: sub-datasets

If you have several related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository but contain different types of data, so they aren't just splits of the same dataset, then load each dataset separately and push it to the hub with its own config name.
Load and push each table under its own config name:

    import datasets

    dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
    dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
    dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

    # Each config is written to its own directory in the repo
    dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
    dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
    dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')

This will create the following files in the repo:

    dataset1/
      data/
        train-00000-of-00001.parquet
    dataset2/
      data/
        train-00000-of-00001.parquet
    dataset3/
      data/
        train-00000-of-00001.parquet

and add the following to the header of `README.md`:

    dataset_info:
    - config_name: dataset1
      features:
      - name: id
        dtype: int64
      - name: value1
        dtype: int64
      splits:
      - name: train
        num_bytes: 64
        num_examples: 4
      download_size: 1344
      dataset_size: 64
    - config_name: dataset2
      features:
      - name: id
        dtype: int64
      - name: value2
        dtype: int64
      splits:
      - name: train
        num_bytes: 64
        num_examples: 4
      download_size: 1344
      dataset_size: 64
    - config_name: dataset3
      features:
      - name: id
        dtype: int64
      - name: value3
        dtype: int64
      splits:
      - name: train
        num_bytes: 64
        num_examples: 4
      download_size: 1344
      dataset_size: 64
    configs:
    - config_name: dataset1
      data_files:
      - split: train
        path: dataset1/data/train-*
    - config_name: dataset2
      data_files:
      - split: train
        path: dataset2/data/train-*
    - config_name: dataset3
      data_files:
      - split: train
        path: dataset3/data/train-*

To load these datasets from HuggingFace:

    dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1',
        data_dir = 'dataset1')
    dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2',
        data_dir = 'dataset2')
    dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3',
        data_dir = 'dataset3')

### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation. The columns should follow typical database design guidelines:

* Identifier columns
    * sequential key
        * For example: `[1, 2, 3, ...]`
    * primary key
        * single column that uniquely identifies each row
        * distinct for every row
        * no missing values
        * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
    * composite key
        * A set of columns that together uniquely identify each row
        * Either hierarchical or complementary ids that characterize the observation
        * For example, for an observation of mutations, (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
    * additional/foreign key identifiers
        * identifiers to link the observation with other data
        * For example
            * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
            * FDA drug name or the IUPAC substance name
* Tidy key/value columns
    * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
        * Tidy data (sometimes called long) has one measurement per row
            * Multiple columns can be used to give details for each measurement, including type, units, and metadata
            * Often good for certain data science computational analysis workflows (e.g. tidyverse/dplyr)
            * Can handle a variable number of measurements per object
            * Duplicates object identifier columns for each measurement
        * Array data (sometimes called wide) has one object per row and multiple measurements as different columns
            * Each measurement is typically a single column
            * More compact, i.e.
              no duplication of identifier columns
            * Good for certain ML/matrix-based computational workflows

#### Molecular formats

* Store molecular structure in standard text formats
    * protein structure: PDB, mmCIF, ModelCIF
    * small molecule: SMILES, InChI
    * use uncompressed, plaintext format
        * Easier to computationally analyze
        * the whole dataset will be compressed for data serialization
* Filtering / Standardization / sanitization
    * Be clear about the methods used to process the molecular data
    * Be especially careful about inferred aspects of the data
        * protonation states, salt form, and stereochemistry for small molecules
        * data missingness, including unstructured loops for proteins
* Tools
    * MolVS is useful for small molecule sanitization

#### Computational data formats

* On disk formats
    * Parquet format
        * column oriented (so can load only the data that is needed, easier to compress)
        * robust reader/writer libraries from Apache Arrow for Python, R, etc.
    * ArrowTable
        * In-memory format closely aligned with the on-disk Parquet format
        * Native format for datasets stored in the `datasets` Python package
    * tab/comma separated table
        * Prefer tab separated: more consistent parsing without needing to escape values
        * Widely used row-oriented text format for storing tabular data to disk
        * Does not store column data types and often needs custom format conversion code/QC for loading into Python/R
        * Can be compressed on disk but is row-oriented, so less compressible than .parquet
    * .pickle / .Rdata
        * language-specific serialization of complex data structures
        * Often very fast to read/write, but may not be robust across language/OS versions
        * Not easily interoperable across programming languages
* In memory formats
    * R `data.frame`/`dplyr::tibble`
        * Widely used format for R data science
        * Out of the box fast for tidyverse data manipulation and split-apply-combine workflows
    * Python pandas DataFrame
        * Widely used for Python data science
        * Out of the box not especially fast for data science
    * Python
      numpy array / R Matrix
        * Uses a single data type for all data
        * Useful for efficient matrix manipulation
    * Python PyTorch dataset
        * Format specifically geared for loading data for PyTorch deep learning

Recommendations

* On disk
    * For small, config-level tables use .tsv
    * For large data use .parquet
        * Smaller than .csv/.tsv
        * Robust open source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
    * Use `dplyr::tibble` / pandas DataFrame for data science tables
    * Use numpy array / PyTorch dataset for machine learning
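
The composite-key rule and the tidy (long) versus array (wide) layouts described under "Format of a dataset" can be illustrated with a small tab-separated table. The following is a minimal sketch using only the Python standard library, so it runs without pandas or pyarrow. The column names (`structure_id`, `residue_id`, `measurement`, `value`), the sample rows, and the helper names `read_tsv`, `check_unique_key`, and `long_to_wide` are all hypothetical and are not taken from any dataset mentioned in this guide.

```python
import csv
import io

# Hypothetical tidy (long) table: one measurement per row, tab separated.
LONG_TSV = (
    "structure_id\tresidue_id\tmeasurement\tvalue\n"
    "1ABC\t10\tdG\t-1.5\n"
    "1ABC\t10\tddG\t0.3\n"
    "1ABC\t11\tdG\t-2.0\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_unique_key(rows, key_columns):
    """Raise ValueError if the composite key does not identify each row uniquely."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            raise ValueError(f"duplicate key: {key}")
        seen.add(key)


def long_to_wide(rows, id_columns, name_column, value_column):
    """Pivot tidy rows into one row per object, one column per measurement.

    Measurements missing for an object are simply absent from its dict.
    """
    wide = {}
    for row in rows:
        key = tuple(row[c] for c in id_columns)
        record = wide.setdefault(key, {c: row[c] for c in id_columns})
        # TSV stores no types, so the value is converted explicitly here.
        record[row[name_column]] = float(row[value_column])
    return list(wide.values())


rows = read_tsv(LONG_TSV)
check_unique_key(rows, ["structure_id", "residue_id", "measurement"])
wide_rows = long_to_wide(rows, ["structure_id", "residue_id"], "measurement", "value")
```

In this sketch the second object has no `ddG` entry, which shows why the tidy layout copes with a variable number of measurements per object while the wide layout needs a missing-value convention. The explicit `float` conversion reflects the point above that tab-separated files do not store data types. With pandas available, a pivot on a DataFrame would likely be the more idiomatic way to do the same reshaping.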