### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation. The columns should follow typical database design guidelines.
* Identifier columns
  * sequential key
    * For example: \[1, 2, 3, …\]
  * primary key
    * a single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that together uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (structure\_id, residue\_id, mutation\_aa) is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation with other data
    * For example
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
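A minimal sketch of validating key columns with pandas, using the mutation example above; the DataFrame contents here are made up for illustration:

```python
import pandas as pd

# Hypothetical mutation observations keyed by (structure_id, residue_id, mutation_aa)
df = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ"],
    "residue_id":   [42, 42, 7],
    "mutation_aa":  ["A", "G", "W"],
    "ddG":          [1.2, -0.3, 0.8],
})

key = ["structure_id", "residue_id", "mutation_aa"]

# A valid composite key has no duplicated key tuples and no missing key values
assert not df.duplicated(subset=key).any(), "composite key is not unique"
assert df[key].notna().all().all(), "composite key has missing values"
```

Running the same checks with `subset=["structure_id"]` alone would fail here, which is exactly why the mutation dataset needs a composite rather than a single-column key.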
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called long) has one measurement per row
    * multiple columns can be used to give details for each measurement, including type, units, and metadata
    * often good for certain data science workflows (e.g. tidyverse/dplyr)
    * can handle a variable number of measurements per object
    * duplicates object identifier columns for each measurement
  * array data (sometimes called wide) has one object per row and multiple measurements as different columns
    * typically each measurement is a single column
    * more compact, i.e. no duplication of identifier columns
    * good for certain ML/matrix-based computational workflows
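The tidy/array distinction can be sketched with pandas, which converts between the two forms with `melt` and `pivot`; the structure-quality measurements here are illustrative:

```python
import pandas as pd

# Array (wide) form: one object per row, one column per measurement
wide = pd.DataFrame({
    "pdb_id":     ["1ABC", "2XYZ"],
    "resolution": [1.8, 2.4],
    "r_free":     [0.21, 0.25],
})

# Wide -> tidy (long): one measurement per row; note the pdb_id
# identifier column is duplicated for each measurement
long = wide.melt(id_vars="pdb_id", var_name="measurement", value_name="value")

# Tidy -> wide again, for matrix-style workflows
wide2 = long.pivot(index="pdb_id", columns="measurement", values="value").reset_index()
```

The long form has 2 objects × 2 measurements = 4 rows, and would stay well-formed even if one structure were missing a measurement, which is the "variable number of measurements per object" advantage noted above.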
#### Molecular formats
* Store molecular structure in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use an uncompressed, plaintext format
  * easier to computationally analyze
  * the whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
  * be clear about the methods used to process the molecular data
  * be especially careful with inferred aspects of the data
    * protonation states, salt form, and stereochemistry for small molecules
    * data missingness, including unstructured loops, for proteins
* Tools
  * MolVS is useful for small molecule sanitization
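To illustrate why plaintext formats are easy to analyze computationally: PDB ATOM records use fixed-width columns, so fields can be recovered with plain string slicing and no special libraries. The record below is a made-up example; the slice positions follow the wwPDB format specification (shifted to 0-indexed Python slices):

```python
# A single (fabricated) ATOM record in PDB fixed-width format
line = (
    "ATOM      1  N   MET A   1      38.010  12.123   9.553  1.00 20.00           N"
)

record = {
    "name":    line[12:16].strip(),  # atom name (columns 13-16)
    "resname": line[17:20].strip(),  # residue name (columns 18-20)
    "chain":   line[21],             # chain identifier (column 22)
    "resseq":  int(line[22:26]),     # residue sequence number (columns 23-26)
    "x":       float(line[30:38]),   # orthogonal coordinates in Angstroms
    "y":       float(line[38:46]),
    "z":       float(line[46:54]),
}
```

The same slicing works line by line over a whole uncompressed .pdb file, which is what makes quick filtering and QC scripts cheap to write against these formats.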
#### Computational data formats
* On disk formats
  * parquet
    * column-oriented, so only the columns that are needed can be loaded, and the data is easier to compress
    * robust reader/writer libraries from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * native format for datasets stored in the `datasets` Python package
  * tab/comma separated table (.tsv/.csv)
    * prefer tab separated: more consistent parsing without needing to escape values
    * widely used row-oriented text format for storing tabular data to disk
    * does not store column types, so loading into Python/R often needs custom conversion/QC code
    * can be compressed on disk, but row-oriented, so less compressible than .parquet
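A quick illustration of the "prefer tab separated" point, using only the standard-library `csv` module: a value containing a comma forces quoting in CSV output, while the same value passes through a TSV unquoted because it contains no tabs.

```python
import csv
import io

rows = [
    ["id", "name"],
    ["CID2244", "2-acetoxybenzoic acid, aspirin"],  # value contains a comma
]

# CSV: the embedded comma forces the writer to quote the field
csv_buf = io.StringIO()
csv.writer(csv_buf).writerows(rows)

# TSV: the same value needs no quoting, so every parser splits it identically
tsv_buf = io.StringIO()
csv.writer(tsv_buf, delimiter="\t").writerows(rows)
```

Naive line-splitting code (or a different CSV dialect) can mis-handle the quoted field, while `split("\t")` on the TSV never does, which is the consistency advantage noted above.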
  * .pickle / .Rdata
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be robust across language/OS versions
    * not easily interoperable across programming languages
* In memory formats
  * R data.frame / dplyr::tibble
    * widely used format for R data science
    * out of the box, fast for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * widely used for Python data science
    * out of the box, not especially fast for large-scale data manipulation
  * Python numpy array / R matrix
    * uses a single data type for all data
    * useful for efficient matrix manipulation
  * Python PyTorch Dataset
    * format specifically geared toward loading data for PyTorch deep learning
#### Recommendations
* On disk
  * for small, config-level tables use .tsv
  * for large data use .parquet
    * smaller than .csv/.tsv
    * robust open source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * use dplyr::tibble / pandas DataFrame for data science tables
  * use numpy array / PyTorch Dataset for machine learning
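The on-disk recommendations can be sketched with pandas; this assumes a parquet engine such as pyarrow is installed (pandas raises ImportError otherwise, which the sketch guards against). The DataFrame contents and in-memory buffers are illustrative stand-ins for real files:

```python
import io

import pandas as pd

df = pd.DataFrame({"pdb_id": ["1ABC", "2XYZ"], "resolution": [1.8, 2.4]})

# Small, config-level table: human-readable tab-separated text
tsv_text = df.to_csv(sep="\t", index=False)
back = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Large table: columnar parquet (needs a parquet engine, e.g. pyarrow)
try:
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    buf.seek(0)
    back_pq = pd.read_parquet(buf)
except ImportError:
    back_pq = None  # no parquet engine installed in this environment
```

Note that the .tsv round trip relies on `read_csv` re-inferring column types, while parquet stores the schema alongside the data, which is why it needs no custom conversion code on load.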