### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation. The columns should follow typical database design guidelines.
* Identifier columns
  * sequential key
    * For example: \[1, 2, 3, …\]
  * primary key
    * a single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that together uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (structure\_id, residue\_id, mutation\_aa) is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation with other data
    * For example
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
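A minimal sketch of validating key columns with pandas, using the mutation example above; the DataFrame contents here are made up for illustration:

```python
import pandas as pd

# Hypothetical mutation observations keyed by (structure_id, residue_id, mutation_aa)
df = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ"],
    "residue_id":   [42, 42, 7],
    "mutation_aa":  ["A", "G", "W"],
    "ddG":          [1.2, -0.3, 0.8],
})

key = ["structure_id", "residue_id", "mutation_aa"]

# A valid composite key has no duplicated key tuples and no missing key values
assert not df.duplicated(subset=key).any(), "composite key is not unique"
assert df[key].notna().all().all(), "composite key has missing values"
```

Running the same checks with `subset=["structure_id"]` alone would fail here, which is exactly why the mutation dataset needs a composite rather than a single-column key.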
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called long) has one measurement per row
    * multiple columns can be used to give details for each measurement, including type, units, and metadata
    * often good for certain data science workflows (e.g. tidyverse/dplyr)
    * can handle a variable number of measurements per object
    * duplicates object identifier columns for each measurement
  * array data (sometimes called wide) has one object per row and multiple measurements as different columns
    * typically each measurement is a single column
    * more compact, i.e. no duplication of identifier columns
    * good for certain ML/matrix-based computational workflows
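The tidy/array distinction can be sketched with pandas, which converts between the two forms with `melt` and `pivot`; the structure-quality measurements here are illustrative:

```python
import pandas as pd

# Array (wide) form: one object per row, one column per measurement
wide = pd.DataFrame({
    "pdb_id":     ["1ABC", "2XYZ"],
    "resolution": [1.8, 2.4],
    "r_free":     [0.21, 0.25],
})

# Wide -> tidy (long): one measurement per row; note the pdb_id
# identifier column is duplicated for each measurement
long = wide.melt(id_vars="pdb_id", var_name="measurement", value_name="value")

# Tidy -> wide again, for matrix-style workflows
wide2 = long.pivot(index="pdb_id", columns="measurement", values="value").reset_index()
```

The long form has 2 objects × 2 measurements = 4 rows, and would stay well-formed even if one structure were missing a measurement, which is the "variable number of measurements per object" advantage noted above.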
#### Molecular formats
* Store molecular structure in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use an uncompressed, plaintext format
  * easier to computationally analyze
  * the whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
  * be clear about the methods used to process the molecular data
  * be especially careful with inferred aspects of the data
    * protonation states, salt form, and stereochemistry for small molecules
    * data missingness, including unstructured loops, for proteins
* Tools
  * MolVS is useful for small molecule sanitization
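To illustrate why plaintext formats are easy to analyze computationally: PDB ATOM records use fixed-width columns, so fields can be recovered with plain string slicing and no special libraries. The record below is a made-up example; the slice positions follow the wwPDB format specification (shifted to 0-indexed Python slices):

```python
# A single (fabricated) ATOM record in PDB fixed-width format
line = (
    "ATOM      1  N   MET A   1      38.010  12.123   9.553  1.00 20.00           N"
)

record = {
    "name":    line[12:16].strip(),  # atom name (columns 13-16)
    "resname": line[17:20].strip(),  # residue name (columns 18-20)
    "chain":   line[21],             # chain identifier (column 22)
    "resseq":  int(line[22:26]),     # residue sequence number (columns 23-26)
    "x":       float(line[30:38]),   # orthogonal coordinates in Angstroms
    "y":       float(line[38:46]),
    "z":       float(line[46:54]),
}
```

The same slicing works line by line over a whole uncompressed .pdb file, which is what makes quick filtering and QC scripts cheap to write against these formats.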
#### Computational data formats
* On disk formats
  * parquet
    * column-oriented, so only the columns that are needed can be loaded, and the data is easier to compress
    * robust reader/writer libraries from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * native format for datasets stored in the `datasets` Python package
  * tab/comma separated table (.tsv/.csv)
    * prefer tab separated: more consistent parsing without needing to escape values
    * widely used row-oriented text format for storing tabular data to disk
    * does not store column types, so loading into Python/R often needs custom conversion/QC code
    * can be compressed on disk, but row-oriented, so less compressible than .parquet
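A quick illustration of the "prefer tab separated" point, using only the standard-library `csv` module: a value containing a comma forces quoting in CSV output, while the same value passes through a TSV unquoted because it contains no tabs.

```python
import csv
import io

rows = [
    ["id", "name"],
    ["CID2244", "2-acetoxybenzoic acid, aspirin"],  # value contains a comma
]

# CSV: the embedded comma forces the writer to quote the field
csv_buf = io.StringIO()
csv.writer(csv_buf).writerows(rows)

# TSV: the same value needs no quoting, so every parser splits it identically
tsv_buf = io.StringIO()
csv.writer(tsv_buf, delimiter="\t").writerows(rows)
```

Naive line-splitting code (or a different CSV dialect) can mis-handle the quoted field, while `split("\t")` on the TSV never does, which is the consistency advantage noted above.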
  * .pickle / .Rdata
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be robust across language/OS versions
    * not easily interoperable across programming languages
* In memory formats
  * R data.frame / dplyr::tibble
    * widely used format for R data science
    * out of the box, fast for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * widely used for Python data science
    * out of the box, not especially fast for large-scale data manipulation
  * Python numpy array / R matrix
    * uses a single data type for all data
    * useful for efficient matrix manipulation
  * Python PyTorch Dataset
    * format specifically geared toward loading data for PyTorch deep learning
#### Recommendations
* On disk
  * for small, config-level tables use .tsv
  * for large data use .parquet
    * smaller than .csv/.tsv
    * robust open source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * use dplyr::tibble / pandas DataFrame for data science tables
  * use numpy array / PyTorch Dataset for machine learning
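The on-disk recommendations can be sketched with pandas; this assumes a parquet engine such as pyarrow is installed (pandas raises ImportError otherwise, which the sketch guards against). The DataFrame contents and in-memory buffers are illustrative stand-ins for real files:

```python
import io

import pandas as pd

df = pd.DataFrame({"pdb_id": ["1ABC", "2XYZ"], "resolution": [1.8, 2.4]})

# Small, config-level table: human-readable tab-separated text
tsv_text = df.to_csv(sep="\t", index=False)
back = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Large table: columnar parquet (needs a parquet engine, e.g. pyarrow)
try:
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    buf.seek(0)
    back_pq = pd.read_parquet(buf)
except ImportError:
    back_pq = None  # no parquet engine installed in this environment
```

Note that the .tsv round trip relies on `read_csv` re-inferring column types, while parquet stores the schema alongside the data, which is why it needs no custom conversion code on load.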