
### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.  
The columns should follow typical database design guidelines:

* Identifier columns  
  * sequential key  
    * For example: [1, 2, 3, …]  
  * primary key  
    * a single column that uniquely identifies each row  
      * distinct for every row  
      * no missing values  
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key  
  * composite key  
    * A set of columns that together uniquely identifies each row  
      * Either hierarchical or complementary ids that characterize the observation  
      * For example, for an observation of mutations, the tuple `(structure_id, residue_id, mutation_aa)` is a unique identifier  
  * additional/foreign key identifiers  
    * identifiers to link the observation with other data  
    * For example  
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key  
      * FDA drug name or the IUPAC substance name  
* Tidy key/value columns  
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)  
    * tidy data (sometimes called long format) has one measurement per row  
      * Multiple columns can be used to give details for each measurement, including type, units, and metadata  
      * Often good for certain data science workflows (e.g. tidyverse/dplyr)  
      * Can handle a variable number of measurements per object  
      * Duplicates object identifier columns for each measurement  
    * array data (sometimes called wide format) has one object per row and multiple measurements as different columns  
      * Typically each measurement is a single column  
      * More compact, i.e. no duplication of identifier columns  
      * Good for certain ML/matrix based computational workflows
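
As a sketch of the two layouts, the same measurements can be pivoted between the long (tidy) and wide (array) forms with pandas; the table and column names here are hypothetical, chosen to echo the composite-key example above:

```python
import pandas as pd

# Hypothetical tidy (long) table: one measurement per row.
long_df = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ", "2XYZ"],
    "measurement":  ["resolution", "r_free", "resolution", "r_free"],
    "value":        [2.1, 0.25, 1.8, 0.22],
})

# Check that (structure_id, measurement) is a valid composite key:
# no two rows may share the same pair.
assert not long_df.duplicated(["structure_id", "measurement"]).any()

# Pivot to the wide (array) layout: one object per row,
# one column per measurement type.
wide_df = long_df.pivot(index="structure_id",
                        columns="measurement",
                        values="value")

# And back to the long layout with melt(), duplicating the
# identifier column for each measurement.
back = wide_df.reset_index().melt(id_vars="structure_id",
                                  value_name="value")
```

Note that the wide form silently turns a missing measurement into a NaN cell, while the long form simply omits the row, which is one reason the long form handles a variable number of measurements per object more gracefully.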

#### Molecular formats

* Store molecular structure in standard text formats   
  * protein structure: PDB, mmCIF, modelCIF  
  * small molecule: SMILES, InChI  
  * use uncompressed, plaintext formats  
    * Easier to analyze computationally  
    * the whole dataset will be compressed anyway when it is serialized  
* Filtering / standardization / sanitization  
  * Be clear about the methods used to process the molecular data  
  * Be especially careful with inferred aspects of the data  
    * protonation states  
    * salt form and stereochemistry for small molecules  
    * data missingness, including unstructured loops for proteins  
  * Tools  
    * MolVS is useful for small molecule sanitization
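
A real sanitization pipeline should use MolVS or RDKit; purely as a naive illustration of why salt forms matter, a crude desalting step might keep only the largest `.`-separated SMILES fragment (MolVS selects the parent fragment with chemistry-aware rules, not string length):

```python
def keep_largest_fragment(smiles: str) -> str:
    """Crude desalting sketch: keep the longest '.'-separated
    SMILES fragment and drop counterions.

    Illustration only -- MolVS/RDKit choose the parent fragment
    using chemical rules, not string length.
    """
    return max(smiles.split("."), key=len)

# Aspirin recorded as a sodium salt: the [Na+] counterion is dropped.
parent = keep_largest_fragment("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]")
print(parent)  # CC(=O)Oc1ccccc1C(=O)[O-]
```

Whichever tool is used, record the desalting/standardization choices alongside the dataset so downstream users know which form of each compound they are working with.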

#### Computational data formats

* On disk formats  
  * parquet  
    * column-oriented on-disk format (can load only the columns that are needed; easier to compress)  
    * robust reader/writer libraries from Apache Arrow for Python, R, etc.  
  * Arrow Table  
    * In-memory format closely aligned with the on-disk parquet format  
    * Native format for datasets stored in the `datasets` Python package  
  * tab/comma separated table  
    * Prefer tab-separated: more consistent parsing without needing to escape values  
    * Widely used row-oriented text format for storing tabular data on disk  
    * Does not store column types, so loading into Python/R often needs custom conversion/QC code  
    * Can be compressed on disk, but is row-oriented, so less compressible than .parquet  
  * .pickle / .Rdata  
    * language-specific serialization of complex data structures  
    * Often very fast to read/write, but may not be robust across language/OS versions  
    * Not easily interoperable across programming languages  
* In memory formats  
  * R data.frame/dplyr::tibble  
    * Widely used format for R data science  
    * Fast out of the box for tidyverse data manipulation and split-apply-combine workflows  
  * Python pandas DataFrame  
    * Widely used for python data science  
    * Not especially fast out of the box for data science workloads  
  * Python numpy array / R Matrix  
    * Uses a single data type for all elements  
    * Useful for efficient numerical/matrix manipulation  
  * Python PyTorch dataset  
    * Format specifically geared toward loading data for PyTorch deep learning
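
To illustrate the earlier point that .tsv files do not carry column types, a round trip through Python's stdlib `csv` module returns every field as a string, so the loading code must re-apply the schema itself (the column names here are hypothetical):

```python
import csv
import io

rows = [
    {"pdb_id": "1ABC", "resolution": 2.1},
    {"pdb_id": "2XYZ", "resolution": 1.8},
]

# Write a tab-separated table to an in-memory buffer.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["pdb_id", "resolution"],
                        delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read it back: the numeric column comes back as strings, because
# the file format stores no type information.
buf.seek(0)
loaded = list(csv.DictReader(buf, delimiter="\t"))
print(type(loaded[0]["resolution"]))  # <class 'str'>
```

Parquet avoids this round-trip problem by storing the column types in the file itself.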

#### Recommendations

* On disk  
  * For small, config-level tables use .tsv  
  * For large data use .parquet  
    * Smaller than .csv/.tsv  
    * Robust open-source libraries in the major languages can read and write .parquet files faster than .csv/.tsv  
* In memory  
  * Use dplyr::tibble / pandas DataFrame for data science tables  
  * Use numpy array / pytorch dataset for machine learning
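
As a sketch of the last recommendation, a feature table held in a pandas DataFrame can be handed to matrix-based ML code via `to_numpy()`; the column names below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table for three observations.
df = pd.DataFrame({
    "structure_id": ["1ABC", "2XYZ", "3DEF"],
    "hydrophobicity": [0.1, -0.4, 0.7],
    "charge": [1.0, 0.0, -1.0],
})

# Keep identifier columns out of the matrix; extract only the
# numeric feature columns as a single-dtype numpy array.
X = df[["hydrophobicity", "charge"]].to_numpy()
print(X.shape)  # (3, 2)
```

Keeping the identifier columns in the DataFrame (and out of the array) preserves the link back to each observation while the model sees only the numeric matrix.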