### **Format of a dataset**

A dataset should consist of a single table in which each row is a single observation.
The columns should follow typical database design guidelines:

* Identifier columns
  * sequential key
    * For example: \[1, 2, 3, …\]
  * primary key
    * a single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that together uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, the (structure\_id, residue\_id, mutation\_aa) tuple is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation to other data
    * For example
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called long) has one measurement per row
    * Multiple columns can be used to give details for each measurement, including type, units, and metadata
    * Often good for certain data science workflows (e.g. tidyverse/dplyr)
    * Can handle a variable number of measurements per object
    * Duplicates the object identifier columns for each measurement
  * array data (sometimes called wide) has one object per row and multiple measurements as separate columns
    * Typically each measurement is a single column
    * More compact, i.e. no duplication of identifier columns
    * Good for certain ML/matrix-based computational workflows

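The composite-key and tidy/array ideas above can be sketched with pandas (the table, its column names such as `structure_id`, and the measurement values are hypothetical):

```python
import pandas as pd

# Hypothetical tidy (long) table: one measurement per row, with a
# composite key of (structure_id, measurement).
tidy = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ", "2XYZ"],
    "measurement": ["resolution", "r_free", "resolution", "r_free"],
    "value": [1.8, 0.22, 2.4, 0.25],
})

# A composite key must uniquely identify each row.
assert not tidy.duplicated(["structure_id", "measurement"]).any()

# Pivot to array (wide) form: one object per row, one column per measurement.
wide = tidy.pivot(index="structure_id", columns="measurement", values="value")

# Melt back to tidy form; the identifier column is duplicated per measurement.
long_again = wide.reset_index().melt(
    id_vars="structure_id", var_name="measurement", value_name="value"
)
```

Note that `pivot` raises an error if the composite key is not unique, so the conversion doubles as a consistency check.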
#### Molecular formats

* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
  * use an uncompressed, plaintext format
    * Easier to analyze computationally
    * The whole dataset will be compressed as part of serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful with inferred aspects of the data
    * protonation states, salt form, and stereochemistry for small molecules
    * data missingness, including unstructured loops, for proteins
* Tools
  * MolVS is useful for small-molecule sanitization

#### Computational data formats

* On-disk formats
  * .parquet
    * column-oriented (only the needed columns have to be loaded, and columns compress well)
    * robust reader/writer implementations from Apache Arrow for Python, R, etc.
  * Arrow Table
    * In-memory format closely aligned with the on-disk parquet format
    * Native format for datasets stored with the `datasets` Python package
  * tab/comma-separated table (.tsv/.csv)
    * Prefer tab-separated: parsing is more consistent without needing to escape values
    * Widely used row-oriented text format for storing tabular data on disk
    * Does not store column types, and often needs custom conversion/QC code when loading into Python/R
    * Can be compressed on disk, but being row-oriented it is less compressible than .parquet
  * .pickle / .RData
    * language-specific serialization of complex data structures
    * Often very fast to read/write, but may not be robust across language/library versions
    * Not easily interoperable across programming languages
* In-memory formats
  * R data.frame / dplyr::tibble
    * Widely used format for R data science
    * Fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * Widely used for Python data science
    * Not especially fast out of the box for heavy data manipulation
  * Python numpy array / R matrix
    * Uses a single data type for all values
    * Useful for efficient matrix manipulation
  * Python PyTorch Dataset
    * Format specifically geared toward loading data for PyTorch deep learning

#### Recommendations

* On disk
  * For small, config-level tables, use .tsv
  * For large datasets, use .parquet
    * Smaller than .csv/.tsv
    * Robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use dplyr::tibble / pandas DataFrame for data science tables
  * Use numpy array / PyTorch Dataset for machine learning
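The last recommendation, moving from a pandas DataFrame to a single-dtype numpy array for matrix-based machine learning, might look like this (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical feature table for a machine-learning workflow.
df = pd.DataFrame({
    "sample_id": ["a", "b", "c"],
    "feat_1": [0.1, 0.2, 0.3],
    "feat_2": [1.0, 2.0, 3.0],
})

# Drop the identifier column; the remaining numeric columns become a
# single-dtype matrix suitable for efficient numpy operations.
X = df.drop(columns="sample_id").to_numpy(dtype=np.float64)

row_norms = np.linalg.norm(X, axis=1)  # one value per observation
```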