### **Format of a dataset**
A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:
* Identifier columns
* sequential key
* For example: \[1, 2, 3, …\]
* primary key
* A single column that uniquely identifies each row
* Distinct for every row
* No missing values
* For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
* composite key
* A set of columns that together uniquely identify each row
* Either hierarchical or complementary IDs that characterize the observation
* For example, for an observation of mutations, the (structure\_id, residue\_id, mutation\_aa) is a unique identifier
* additional/foreign key identifiers
* identifiers that link the observation to other datasets
* For example
* for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
* the FDA drug name or the IUPAC substance name could also serve as foreign keys
* Tidy key/value columns
* [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
* tidy data (sometimes called long) has one measurement per row
* Multiple columns can be used to give details for each measurement including type, units, metadata
* Often good for certain data science computational analysis workflows (e.g. tidyverse/dplyr)
* Can handle variable number of measurements per object
* Duplicates object identifier columns for each measurement
* array data (sometimes called wide) has one object per row and multiple measurements as different columns
* Each measurement is typically a single column
* More compact, i.e. no duplication of identifier columns
* Good for certain ML/matrix based computational workflows
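To make the tidy/array distinction concrete, here is a minimal sketch using pandas; the column names (`structure_id`, `residue_id`, `b_factor`, `sasa`) are made-up examples, not from the guide:

```python
import pandas as pd

# Tidy (long) table: one measurement per row, with composite-key
# identifier columns (structure_id, residue_id) plus key/value columns.
tidy = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ", "2XYZ"],
    "residue_id":   [10, 10, 25, 25],
    "measurement":  ["b_factor", "sasa", "b_factor", "sasa"],
    "value":        [22.5, 101.0, 35.1, 88.4],
})

# Pivot to array (wide) form: one object per row, one column per measurement.
# Note the identifier columns are no longer duplicated.
wide = tidy.pivot_table(
    index=["structure_id", "residue_id"],
    columns="measurement",
    values="value",
).reset_index()

# Melt back to tidy form for split-apply-combine style workflows.
long_again = wide.melt(
    id_vars=["structure_id", "residue_id"],
    value_vars=["b_factor", "sasa"],
    var_name="measurement",
    value_name="value",
)
```

Either form round-trips losslessly here; the tidy form would also accommodate a variable number of measurements per (structure_id, residue_id) pair.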
#### Molecular formats
* Store molecular structure in standard text formats
* protein structure: PDB, mmCIF, modelCIF
* small molecule: SMILES, InChI
* use an uncompressed, plaintext format
* Easier to analyze computationally
* the whole dataset will be compressed during serialization anyway
* Filtering / standardization / sanitization
* Be clear about the methods used to process the molecular data
* Be especially careful with inferred aspects of the data:
* protonation states
* salt form and stereochemistry for small molecules
* data missingness, including unstructured loops for proteins
* Tools
* MolVS is useful for small molecule sanitization
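One benefit of uncompressed plaintext formats is that even standard-library scripts can run data-quality checks. A minimal sketch (the PDB snippet and residue numbering below are fabricated for illustration) that flags missing residues, e.g. unstructured loops, by scanning for gaps in the residue numbering of ATOM records:

```python
# Detect gaps in residue numbering within one chain of a PDB-format string.
# PDB fixed columns: chain ID is column 22, residue sequence number is
# columns 23-26 (1-based), i.e. indices 21 and 22:26 in Python.
pdb_text = """\
ATOM      1  CA  ALA A  10      11.10  22.20  33.30  1.00 20.00           C
ATOM      2  CA  GLY A  11      12.10  23.20  34.30  1.00 21.00           C
ATOM      3  CA  SER A  15      15.10  26.20  37.30  1.00 30.00           C
"""

def residue_gaps(pdb_text, chain="A"):
    """Return (start, end) pairs of missing residue ranges for a chain."""
    seen = sorted({
        int(line[22:26])
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[21] == chain
    })
    # A gap exists wherever consecutive observed residues differ by more than 1.
    return [
        (a + 1, b - 1)
        for a, b in zip(seen, seen[1:])
        if b - a > 1
    ]

print(residue_gaps(pdb_text))  # residues 12-14 are missing
```

A real curation pipeline would record such gaps as an explicit missingness column rather than leaving readers to rediscover them.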
#### Computational data formats
* On disk formats
* parquet
* column-oriented (so only the data that is needed can be loaded; easier to compress)
* robust reader/writer libraries from Apache Arrow for Python, R, etc.
* ArrowTable
* In-memory format closely aligned with the on-disk parquet format
* Native format for datasets stored with the `datasets` Python package
* tab/comma separated table (.tsv/.csv)
* Prefer tab-separated: parsing is more consistent because values rarely need escaping
* Widely used row-oriented text format for storing tabular data on disk
* Does not store column types, so loading into Python/R often requires custom conversion/QC code
* Can be compressed on disk, but being row-oriented it compresses less well than .parquet
* .pickle / .Rdata
* language-specific serialization of complex data structures
* Often very fast to read/write, but may not be robust across language/OS versions
* Not easily interoperable across programming languages
* In memory formats
* R data.frame/dplyr::tibble
* Widely used format for R data science
* Fast out of the box for tidyverse data manipulation and split-apply-combine workflows
* Python pandas DataFrame
* Widely used for python data science
* Not especially fast out of the box; performance depends on using vectorized operations
* Python numpy array / R Matrix
* Uses single data type for all data
* Useful for efficient matrix manipulation
* Python Pytorch dataset
* Format specifically geared to loading data for PyTorch deep learning
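Moving between these in-memory formats is usually a single call; a sketch with pandas and NumPy (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Data-science table: mixed column types, labeled columns.
df = pd.DataFrame({
    "pdb_id": ["1ABC", "2XYZ"],
    "b_factor_mean": [22.5, 35.1],
    "resolution": [2.1, 1.8],
})

# ML/matrix workflows: select the numeric columns and convert to a
# single-dtype numpy array for efficient matrix manipulation.
features = df[["b_factor_mean", "resolution"]].to_numpy(dtype=np.float64)
```

A PyTorch dataset would typically wrap arrays like `features` and yield one row (or batch) per iteration.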
#### Recommendations
* On disk
* For small, config-level tables, use .tsv
* For large data, use .parquet
* Smaller than .csv/.tsv
* Robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
* Use dplyr::tibble / pandas DataFrame for data science tables
* Use numpy array / pytorch dataset for machine learning
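The .tsv recommendation comes with the type-loss caveat noted above: unlike .parquet, a .tsv stores no schema, so every value is read back as a string. A standard-library-only sketch (the column names are hypothetical; an in-memory buffer stands in for a file on disk):

```python
import csv
import io

rows = [
    {"pdb_id": "1ABC", "resolution": 2.1},
    {"pdb_id": "2XYZ", "resolution": 1.8},
]

# Write as tab-separated text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["pdb_id", "resolution"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read it back: the floats now come back as strings, so loading code
# must convert (and QC) them explicitly.
buf.seek(0)
loaded = list(csv.DictReader(buf, delimiter="\t"))
resolutions = [float(r["resolution"]) for r in loaded]
```

For small config tables this conversion step is trivial; for large, many-column datasets it is a recurring source of bugs, which is one reason .parquet is preferred there.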