### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:
* Identifier columns
  * sequential key
    * For example: \[1, 2, 3, …\]
  * primary key
    * a single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that together uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (structure\_id, residue\_id, mutation\_aa) is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation to other data
    * For example
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * likewise the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called long) has one measurement per row
    * multiple columns can be used to give details for each measurement, including type, units, and metadata
    * often good for certain data science workflows (e.g. tidyverse/dplyr)
    * can handle a variable number of measurements per object
    * duplicates object identifier columns for each measurement
  * array data (sometimes called wide) has one object per row and multiple measurements as different columns
    * typically each measurement is a single column
    * more compact, i.e. no duplication of identifier columns
    * good for certain ML/matrix-based computational workflows
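The primary-key and composite-key properties above are easy to verify mechanically. A minimal sketch in pandas (the table and column names are illustrative, not from any real dataset):

```python
import pandas as pd

# Toy table of mutation observations; column names are illustrative
df = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ"],
    "residue_id":   [42, 57, 42],
    "mutation_aa":  ["A", "G", "A"],
})

# A primary key must be non-missing and distinct for every row
def is_primary_key(frame, col):
    s = frame[col]
    return bool(s.notna().all() and s.is_unique)

# A composite key: the combination of columns must be unique and non-missing
def is_composite_key(frame, cols):
    return bool(not frame.duplicated(subset=cols).any()
                and frame[cols].notna().all().all())

print(is_primary_key(df, "structure_id"))  # False: structure_id repeats
print(is_composite_key(df, ["structure_id", "residue_id", "mutation_aa"]))  # True
```

Running checks like these when a dataset is assembled catches duplicated or missing identifiers before they propagate into joins.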
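The tidy (long) and array (wide) layouts above are mutual conversions; a small pandas sketch with hypothetical columns shows the round trip:

```python
import pandas as pd

# Wide (array) form: one object per row, one column per measurement
wide = pd.DataFrame({
    "pdb_id":     ["1ABC", "2XYZ"],
    "resolution": [1.8, 2.3],
    "n_chains":   [2, 4],
})

# Wide -> tidy (long): one measurement per row, identifier duplicated
long = wide.melt(id_vars="pdb_id", var_name="measurement", value_name="value")

# Tidy -> wide again: pivot back on the identifier
wide_again = long.pivot(index="pdb_id", columns="measurement",
                        values="value").reset_index()
```

Note the trade-off from the list above in action: `long` duplicates `pdb_id` for every measurement, while `wide` stores it once.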
#### Molecular formats
* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use an uncompressed, plaintext format
  * easier to analyze computationally
  * the dataset as a whole will be compressed for serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful with inferred aspects of the data
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops for proteins
* Tools
  * MolVS is useful for small-molecule sanitization
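As an illustration of what sanitization means in practice: SMILES salt forms separate components with `.`, and a common standardization step keeps only the largest fragment (the parent molecule). A dependency-free sketch of that heuristic, using a crude atom-counting tokenizer as a proxy for fragment size (real tools like MolVS/RDKit do this properly on parsed molecules):

```python
import re

# Rough SMILES atom tokenizer: bracket atoms, two-letter organics, then
# one-letter organic-subset symbols. Good enough for a size comparison,
# not a real parser.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|B|C|N|O|P|S|F|I|b|c|n|o|p|s")

def heavy_atoms(fragment: str) -> int:
    return len(ATOM.findall(fragment))

def strip_salt(smiles: str) -> str:
    # Keep the fragment with the most atoms; counterions like [Na+] drop out
    return max(smiles.split("."), key=heavy_atoms)

print(strip_salt("CCO.[Na+]"))  # CCO
```

Whatever tool is used, the key point from the list above is to record that this step happened, since it silently changes the salt form of the reported compound.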
#### Computational data formats
* On-disk formats
  * parquet
    * column-oriented (can load only the data that is needed; easier to compress)
    * robust reader/writer libraries from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * native format for datasets stored with the `datasets` Python package
  * tab/comma-separated table
    * prefer tab-separated: more consistent parsing without needing to escape values
    * widely used row-oriented text format for storing tabular data on disk
    * does not store data types, so often needs custom conversion code/QC when loading into Python/R
    * can be compressed on disk, but row-oriented, so less compressible than .parquet
  * .pickle / .Rdata
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be robust across language/OS versions
    * not easily interoperable across programming languages
* In-memory formats
  * R data.frame / dplyr::tibble
    * widely used for R data science
    * out of the box, fast for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * widely used for Python data science
    * out of the box, not especially fast for data science
  * Python numpy array / R matrix
    * uses a single data type for all entries
    * useful for efficient matrix manipulation
  * Python PyTorch Dataset
    * format specifically geared toward loading data for PyTorch deep learning
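The "does not store data types" caveat for .csv/.tsv is worth seeing concretely. A small pandas sketch (the `pdb_id`/`chain` columns are hypothetical) round-trips a string column through TSV text:

```python
import io
import pandas as pd

df = pd.DataFrame({"pdb_id": ["1ABC"], "chain": ["007"]})  # chain is a string

# TSV text carries no type information, so "007" comes back as the
# integer 7 unless the dtype is supplied explicitly on load.
tsv = df.to_csv(sep="\t", index=False)
naive = pd.read_csv(io.StringIO(tsv), sep="\t")
fixed = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"chain": str})

print(naive["chain"].iloc[0])  # 7   (type silently lost)
print(fixed["chain"].iloc[0])  # 007 (explicit dtype preserved it)
```

A .parquet round-trip would preserve the column dtype without any per-column annotations, which is one reason it is recommended below for large tables.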
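A map-style PyTorch Dataset is essentially just the `__len__`/`__getitem__` protocol. A dependency-free sketch of its shape (in real code you would subclass `torch.utils.data.Dataset` and return tensors; the class and field names here are illustrative):

```python
# Minimal map-style dataset: index -> one training example
class TableDataset:
    def __init__(self, features, labels):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # One example per index: (feature vector, label)
        return self.features[idx], self.labels[idx]

ds = TableDataset([[0.1, 0.2], [0.3, 0.4]], [0, 1])
print(len(ds), ds[1])  # 2 ([0.3, 0.4], 1)
```

Because the protocol is this small, a tidy table with a clear primary key converts to a Dataset almost mechanically: rows become examples, the key column becomes the index.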
#### Recommendations

* On disk
  * For small, config-level tables use .tsv
  * For large datasets use .parquet
    * smaller than .csv/.tsv
    * robust open-source libraries in the major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use dplyr::tibble / pandas DataFrame for data science tables
  * Use numpy arrays / PyTorch Datasets for machine learning