### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:
* Identifier columns
  * sequential key
    * For example: \[1, 2, 3, …\]
  * primary key
    * a single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that together uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (structure\_id, residue\_id, mutation\_aa) is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation to other data
    * For example
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * likewise the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called long) has one measurement per row
    * multiple columns can be used to give details for each measurement, including type, units, and metadata
    * often good for certain data science workflows (e.g. tidyverse/dplyr)
    * can handle a variable number of measurements per object
    * duplicates object identifier columns for each measurement
  * array data (sometimes called wide) has one object per row and multiple measurements as different columns
    * typically each measurement is a single column
    * more compact, i.e. no duplication of identifier columns
    * good for certain ML/matrix-based computational workflows
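The primary-key and composite-key properties above are easy to verify mechanically. A minimal sketch in pandas (the table and column names are illustrative, not from any real dataset):

```python
import pandas as pd

# Toy table of mutation observations; column names are illustrative
df = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ"],
    "residue_id":   [42, 57, 42],
    "mutation_aa":  ["A", "G", "A"],
})

# A primary key must be non-missing and distinct for every row
def is_primary_key(frame, col):
    s = frame[col]
    return bool(s.notna().all() and s.is_unique)

# A composite key: the combination of columns must be unique and non-missing
def is_composite_key(frame, cols):
    return bool(not frame.duplicated(subset=cols).any()
                and frame[cols].notna().all().all())

print(is_primary_key(df, "structure_id"))  # False: structure_id repeats
print(is_composite_key(df, ["structure_id", "residue_id", "mutation_aa"]))  # True
```

Running checks like these when a dataset is assembled catches duplicated or missing identifiers before they propagate into joins.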
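The tidy (long) and array (wide) layouts above are mutual conversions; a small pandas sketch with hypothetical columns shows the round trip:

```python
import pandas as pd

# Wide (array) form: one object per row, one column per measurement
wide = pd.DataFrame({
    "pdb_id":     ["1ABC", "2XYZ"],
    "resolution": [1.8, 2.3],
    "n_chains":   [2, 4],
})

# Wide -> tidy (long): one measurement per row, identifier duplicated
long = wide.melt(id_vars="pdb_id", var_name="measurement", value_name="value")

# Tidy -> wide again: pivot back on the identifier
wide_again = long.pivot(index="pdb_id", columns="measurement",
                        values="value").reset_index()
```

Note the trade-off from the list above in action: `long` duplicates `pdb_id` for every measurement, while `wide` stores it once.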
#### Molecular formats
* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use an uncompressed, plaintext format
  * easier to analyze computationally
  * the dataset as a whole will be compressed for serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful with inferred aspects of the data
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops for proteins
* Tools
  * MolVS is useful for small-molecule sanitization
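As an illustration of what sanitization means in practice: SMILES salt forms separate components with `.`, and a common standardization step keeps only the largest fragment (the parent molecule). A dependency-free sketch of that heuristic, using a crude atom-counting tokenizer as a proxy for fragment size (real tools like MolVS/RDKit do this properly on parsed molecules):

```python
import re

# Rough SMILES atom tokenizer: bracket atoms, two-letter organics, then
# one-letter organic-subset symbols. Good enough for a size comparison,
# not a real parser.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|B|C|N|O|P|S|F|I|b|c|n|o|p|s")

def heavy_atoms(fragment: str) -> int:
    return len(ATOM.findall(fragment))

def strip_salt(smiles: str) -> str:
    # Keep the fragment with the most atoms; counterions like [Na+] drop out
    return max(smiles.split("."), key=heavy_atoms)

print(strip_salt("CCO.[Na+]"))  # CCO
```

Whatever tool is used, the key point from the list above is to record that this step happened, since it silently changes the salt form of the reported compound.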
#### Computational data formats
* On-disk formats
  * parquet
    * column-oriented (can load only the data that is needed; easier to compress)
    * robust reader/writer libraries from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * native format for datasets stored with the `datasets` Python package
  * tab/comma-separated table
    * prefer tab-separated: more consistent parsing without needing to escape values
    * widely used row-oriented text format for storing tabular data on disk
    * does not store data types, so often needs custom conversion code/QC when loading into Python/R
    * can be compressed on disk, but row-oriented, so less compressible than .parquet
  * .pickle / .Rdata
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be robust across language/OS versions
    * not easily interoperable across programming languages
* In-memory formats
  * R data.frame / dplyr::tibble
    * widely used for R data science
    * out of the box, fast for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * widely used for Python data science
    * out of the box, not especially fast for data science
  * Python numpy array / R matrix
    * uses a single data type for all entries
    * useful for efficient matrix manipulation
  * Python PyTorch Dataset
    * format specifically geared toward loading data for PyTorch deep learning
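The "does not store data types" caveat for .csv/.tsv is worth seeing concretely. A small pandas sketch (the `pdb_id`/`chain` columns are hypothetical) round-trips a string column through TSV text:

```python
import io
import pandas as pd

df = pd.DataFrame({"pdb_id": ["1ABC"], "chain": ["007"]})  # chain is a string

# TSV text carries no type information, so "007" comes back as the
# integer 7 unless the dtype is supplied explicitly on load.
tsv = df.to_csv(sep="\t", index=False)
naive = pd.read_csv(io.StringIO(tsv), sep="\t")
fixed = pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"chain": str})

print(naive["chain"].iloc[0])  # 7   (type silently lost)
print(fixed["chain"].iloc[0])  # 007 (explicit dtype preserved it)
```

A .parquet round-trip would preserve the column dtype without any per-column annotations, which is one reason it is recommended below for large tables.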
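A map-style PyTorch Dataset is essentially just the `__len__`/`__getitem__` protocol. A dependency-free sketch of its shape (in real code you would subclass `torch.utils.data.Dataset` and return tensors; the class and field names here are illustrative):

```python
# Minimal map-style dataset: index -> one training example
class TableDataset:
    def __init__(self, features, labels):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # One example per index: (feature vector, label)
        return self.features[idx], self.labels[idx]

ds = TableDataset([[0.1, 0.2], [0.3, 0.4]], [0, 1])
print(len(ds), ds[1])  # 2 ([0.3, 0.4], 1)
```

Because the protocol is this small, a tidy table with a clear primary key converts to a Dataset almost mechanically: rows become examples, the key column becomes the index.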
#### Recommendations

* On disk
  * For small, config-level tables use .tsv
  * For large datasets use .parquet
    * smaller than .csv/.tsv
    * robust open-source libraries in the major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use dplyr::tibble / pandas DataFrame for data science tables
  * Use numpy arrays / PyTorch Datasets for machine learning