maom committed · verified
Commit 6ff9d5f · Parent(s): 120e0ad

Delete sections/08_how_to_structure_data

Files changed (1): sections/08_how_to_structure_data (+0 −93)
### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:

* Identifier columns
  * sequential key
    * For example: \[1, 2, 3, …\]
  * primary key
    * single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that together uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (structure\_id, residue\_id, mutation\_aa) is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation to other data
    * For example:
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called long) has one measurement per row
    * multiple columns can be used to give details for each measurement, including type, units, and metadata
    * often good for certain data science computational workflows (e.g. tidyverse/dplyr)
    * can handle a variable number of measurements per object
    * duplicates object identifier columns for each measurement
  * array data (sometimes called wide) has one object per row and multiple measurements as different columns
    * typically each measurement is a single column
    * more compact, i.e. no duplication of identifier columns
    * good for certain ML/matrix-based computational workflows

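As a minimal sketch of the two ideas above, using a hypothetical mutation table (the column names and ids are illustrative, not from a real dataset): plain Python is enough to check composite-key uniqueness and to pivot tidy (long) rows into array (wide) form.

```python
# Tidy/long form: one measurement per row, identified by a composite key.
# (structure_id, residue_id, key) plays the role of the composite key here.
tidy = [
    {"structure_id": "1ABC", "residue_id": 42, "key": "ddG",      "value": 1.3},
    {"structure_id": "1ABC", "residue_id": 42, "key": "b_factor", "value": 24.6},
    {"structure_id": "1ABC", "residue_id": 57, "key": "ddG",      "value": -0.4},
]

# The composite key must be distinct for every row.
keys = [(r["structure_id"], r["residue_id"], r["key"]) for r in tidy]
assert len(keys) == len(set(keys)), "duplicate composite key"

# Pivot to array/wide form: one object per row, one column per measurement.
# Note the identifier columns are no longer duplicated per measurement.
wide = {}
for r in tidy:
    obj = (r["structure_id"], r["residue_id"])
    wide.setdefault(obj, {})[r["key"]] = r["value"]

print(wide[("1ABC", 42)])  # {'ddG': 1.3, 'b_factor': 24.6}
```

In practice this conversion is what pandas `melt` / `pivot` and tidyr `pivot_longer` / `pivot_wider` do; the wide form here also shows why it handles a variable number of measurements per object less gracefully (residue 57 has no `b_factor` entry).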
#### Molecular formats

* Store molecular structure in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
  * use an uncompressed, plaintext format
    * easier to computationally analyze
    * the whole dataset will be compressed for data serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful about inferred aspects of the data:
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops for proteins
* Tools
  * MolVS is useful for small molecule sanitization

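One reason plaintext formats are easy to computationally analyze: a fixed-column PDB record can be parsed with a few lines of standard-library code, no binary reader required. A minimal sketch, using made-up ATOM records (not a real deposited entry):

```python
# Two hypothetical ATOM records in the fixed-column PDB text format.
pdb_text = """\
ATOM      1  N   MET A   1      38.198  19.582  43.542  1.00 24.64           N
ATOM      2  CA  MET A   1      38.951  18.410  43.122  1.00 24.40           C
"""

def parse_atoms(text):
    """Extract (atom_name, x, y, z) from ATOM records using the
    fixed column positions of the PDB format."""
    atoms = []
    for line in text.splitlines():
        if line.startswith("ATOM"):
            atoms.append((
                line[12:16].strip(),   # atom name, columns 13-16
                float(line[30:38]),    # x, columns 31-38
                float(line[38:46]),    # y, columns 39-46
                float(line[46:54]),    # z, columns 47-54
            ))
    return atoms

print(parse_atoms(pdb_text))
```

For real pipelines a dedicated parser (e.g. Biopython or gemmi) is more robust, but the point stands: plaintext formats keep the data inspectable with generic text tools.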
#### Computational data formats

* On disk formats
  * parquet
    * column-oriented (so only the columns that are needed can be loaded, and the data is easier to compress)
    * robust reader/writer libraries from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * native format for datasets stored in the `datasets` Python package
  * tab/comma separated table
    * prefer tab-separated: more consistent parsing without needing to escape values
    * widely used row-oriented text format for storing tabular data to disk
    * does not store data types and often needs custom conversion/QC code when loading into Python/R
    * can be compressed on disk, but being row-oriented it is less compressible than .parquet
  * .pickle / .Rdata
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be robust across language/OS versions
    * not easily interoperable across programming languages
* In memory formats
  * R data.frame / dplyr::tibble
    * widely used format for R data science
    * out of the box, fast for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * widely used for Python data science
    * out of the box, not especially fast for data manipulation
  * Python numpy array / R matrix
    * uses a single data type for all entries
    * useful for efficient matrix manipulation
  * Python PyTorch Dataset
    * format specifically geared toward loading data for PyTorch deep learning

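To illustrate the type-loss caveat for separated-value tables: a .tsv stores no schema, so every field comes back as a string and the loading code must re-impose types. A minimal sketch with only the standard library (column names are illustrative):

```python
import csv
import io

# Write a tiny tab-separated table to an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["pdb_id", "resolution"])
writer.writerow(["1ABC", 1.8])

# Read it back: .tsv stores no data types, so every field is a string.
buf.seek(0)
reader = csv.DictReader(buf, delimiter="\t")
row = next(reader)
print(row)  # {'pdb_id': '1ABC', 'resolution': '1.8'}

# Custom conversion code is needed to restore the intended types.
row["resolution"] = float(row["resolution"])
```

A .parquet file would have carried the float type through the round trip, which is part of why it needs less ad hoc QC code on load.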
#### Recommendations

* On disk
  * For small, config-level tables use .tsv
  * For large datasets use .parquet
    * smaller than .csv/.tsv
    * robust open source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use dplyr::tibble / pandas DataFrame for data science tables
  * Use numpy array / PyTorch Dataset for machine learning