### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines.

* Identifier columns
  * sequential key
    * For example: [1, 2, 3, …]
  * primary key
    * single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * A set of columns that together uniquely identify each row
    * Either hierarchical or complementary IDs that characterize the observation
    * For example, for an observation of mutations, `(structure_id, residue_id, mutation_aa)` is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation with other data
    * For example
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called long) has one measurement per row
    * Multiple columns can be used to give details for each measurement, including type, units, and metadata
    * Often good for certain data science workflows (e.g. tidyverse/dplyr)
    * Can handle a variable number of measurements per object
    * Duplicates object identifier columns for each measurement
  * array data (sometimes called wide) has one object per row and multiple measurements as different columns
    * Typically each measurement is a single column
    * More compact, i.e. no duplication of identifier columns
    * Good for certain ML/matrix-based computational workflows

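
The long/wide distinction can be sketched with a small hypothetical table using pandas (the column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical wide/array-form table: one structure per row,
# one column per measured property.
wide = pd.DataFrame({
    "pdb_id": ["1ABC", "2XYZ"],     # primary key
    "resolution_A": [1.8, 2.4],
    "n_chains": [2, 1],
})

# Convert to tidy/long form: one measurement per row, with the
# identifier column duplicated for each measurement.
tidy = wide.melt(id_vars="pdb_id", var_name="measurement", value_name="value")

# And back to wide form for matrix-style workflows.
wide_again = tidy.pivot(index="pdb_id", columns="measurement", values="value")
```

Note the trade-off in miniature: `tidy` has the `pdb_id` repeated once per measurement, while `wide_again` is more compact but fixes the set of measurements as columns.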
#### Molecular formats

* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use an uncompressed, plaintext format
  * Easier to analyze computationally
  * the whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful with inferred aspects of the data
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops for proteins
* Tools
  * MolVS is useful for small molecule sanitization

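
One benefit of plaintext formats like PDB is that they can be analyzed with nothing but the standard library. As a minimal sketch, an ATOM record can be parsed by the fixed column positions in the PDB format specification (the coordinates below are a made-up example, not from a real entry):

```python
def parse_atom_record(line: str) -> dict:
    """Extract key fields from a PDB ATOM record.

    Field boundaries follow the PDB format spec's 1-indexed columns,
    converted here to 0-indexed Python slices.
    """
    return {
        "serial": int(line[6:11]),        # atom serial number, cols 7-11
        "name": line[12:16].strip(),      # atom name, cols 13-16
        "res_name": line[17:20].strip(),  # residue name, cols 18-20
        "chain_id": line[21],             # chain identifier, col 22
        "res_seq": int(line[22:26]),      # residue sequence number, cols 23-26
        "x": float(line[30:38]),          # x coordinate, cols 31-38
        "y": float(line[38:46]),          # y coordinate, cols 39-46
        "z": float(line[46:54]),          # z coordinate, cols 47-54
    }

# Hypothetical ATOM record (not from a real PDB entry).
record = "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00           C"
atom = parse_atom_record(record)
```

The same fixed-width approach does not carry over to mmCIF, which is token-based; for anything beyond quick scripts, a dedicated parser library is the safer choice.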
#### Computational data formats

* On-disk formats
  * Parquet
    * column-oriented (so only the needed columns can be loaded; easier to compress)
    * robust reader/writer implementations from Apache Arrow for Python, R, etc.
  * Arrow Table
    * In-memory format closely aligned with the on-disk Parquet format
    * Native format for datasets stored with the `datasets` Python package
  * tab/comma-separated table
    * Prefer tab-separated: parsing is more consistent without needing to escape values
    * Widely used row-oriented text format for storing tabular data on disk
    * Does not store data types, so it often needs custom conversion code/QC when loading into Python/R
    * Can be compressed on disk, but being row-oriented it is less compressible than .parquet
  * .pickle / .Rdata
    * language-specific serialization of complex data structures
    * Often very fast to read/write, but may not be robust across language/OS versions
    * Not easily interoperable across programming languages
* In-memory formats
  * R data.frame / dplyr::tibble
    * Widely used format for R data science
    * Out of the box, fast for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * Widely used for Python data science
    * Out of the box, not especially fast for large-scale data manipulation
  * Python numpy array / R matrix
    * Uses a single data type for all data
    * Useful for efficient matrix manipulation
  * Python PyTorch Dataset
    * Format specifically geared for loading data for PyTorch deep learning

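
The caveat that separated-value tables do not store data types can be seen with a standard-library-only round trip (an in-memory buffer stands in for a .tsv file here):

```python
import csv
import io

rows = [
    {"pdb_id": "1ABC", "resolution_A": 1.8},
    {"pdb_id": "2XYZ", "resolution_A": 2.4},
]

# Write a tab-separated table.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["pdb_id", "resolution_A"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Read it back: every value comes back as a string, so numeric columns
# need explicit conversion -- the custom conversion/QC step mentioned above.
buf.seek(0)
loaded = list(csv.DictReader(buf, delimiter="\t"))
resolutions = [float(r["resolution_A"]) for r in loaded]
```

A Parquet file, by contrast, records the schema alongside the data, so the float column comes back as a float column with no extra code.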
#### Recommendations

* On disk
  * For small, config-level tables use .tsv
  * For large datasets use .parquet
    * Smaller than .csv/.tsv
    * Robust open-source libraries in the major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use dplyr::tibble / pandas DataFrame for data science tables
  * Use numpy array / PyTorch Dataset for machine learning
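
The in-memory handoff from a data science table to a machine learning matrix is usually a one-liner. A minimal sketch with pandas and numpy (column names hypothetical):

```python
import numpy as np
import pandas as pd

# Data science table: mixed column types, identifier column included.
df = pd.DataFrame({
    "pdb_id": ["1ABC", "2XYZ", "3DEF"],   # identifier, not a feature
    "resolution_A": [1.8, 2.4, 3.1],
    "n_chains": [2, 1, 4],
})

# For matrix/ML work, drop the identifier columns and materialize the
# feature columns as a single-dtype numpy array.
features = df[["resolution_A", "n_chains"]].to_numpy(dtype=np.float64)
```

Keeping the identifier columns in the DataFrame (rather than the array) preserves the link back to each observation, so predictions can be joined back onto the original table by `pdb_id`.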