### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation. The columns should follow typical database design guidelines.

* Identifier columns
  * sequential key
    * For example: \[1, 2, 3, …\]
  * primary key
    * A single column that uniquely identifies each row
    * Distinct for every row
    * No missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * A set of columns that together uniquely identify each row
    * Either hierarchical or complementary IDs that characterize the observation
    * For example, for an observation of mutations, (structure\_id, residue\_id, mutation\_aa) is a unique identifier
  * additional/foreign key identifiers
    * Identifiers that link the observation with other data
    * For example:
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * Tidy data (sometimes called long) has one measurement per row
    * Multiple columns can give details for each measurement, including type, units, and metadata
    * Often good for certain data science workflows (e.g. tidyverse/dplyr)
    * Can handle a variable number of measurements per object
    * Duplicates object identifier columns for each measurement
  * Array data (sometimes called wide) has one object per row and multiple measurements as different columns
    * Each measurement is typically a single column
    * More compact, i.e.
no duplication of identifier columns
    * Good for certain ML/matrix-based computational workflows

#### Molecular formats

* Store molecular structure in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use an uncompressed, plaintext format
  * Easier to computationally analyze
  * The whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful with inferred aspects of the data:
    * protonation states, salt form, and stereochemistry for small molecules
    * data missingness, including unstructured loops, for proteins
  * Tools
    * MolVS is useful for small molecule sanitization

#### Computational data formats

* On-disk formats
  * parquet
    * Column-oriented, so only the columns that are needed can be loaded, and the data is easier to compress
    * Robust reader/writer code from Apache Arrow for Python, R, etc.
  * Arrow Table
    * In-memory format closely aligned with the on-disk parquet format
    * Native format for datasets stored in the `datasets` Python package
  * tab/comma separated table (.tsv/.csv)
    * Prefer tab separated: parsing is more consistent without needing to escape values
    * Widely used row-oriented text format for storing tabular data on disk
    * Does not store data types, so loading into Python/R often needs custom conversion code/QC
    * Can be compressed on disk, but being row-oriented it is less compressible than .parquet
  * .pickle / .Rdata
    * Language-specific serialization of complex data structures
    * Often very fast to read/write, but may not be robust across language/OS versions
    * Not easily interoperable across programming languages
* In-memory formats
  * R data.frame / dplyr::tibble
    * Widely used format for R data science
    * Fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * Widely used for Python data science
    * Not especially fast out of the box for data science
  * Python numpy
array / R matrix
    * Uses a single data type for all data
    * Useful for efficient matrix manipulation
  * Python PyTorch Dataset
    * Format specifically geared toward loading data for PyTorch deep learning

#### Recommendations

* On disk
  * For small, config-level tables, use .tsv
  * For large data, use .parquet
    * Smaller than .csv/.tsv
    * Robust open source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use dplyr::tibble / pandas DataFrame for data science tables
  * Use numpy array / PyTorch Dataset for machine learning