maom committed on
Commit
120e0ad
·
verified ·
1 Parent(s): e7ea25a

Rename sections/07_how_to_structure_curation.md to sections/07_practical_recommendations.md

sections/{07_how_to_structure_curation.md → 07_practical_recommendations.md} RENAMED
@@ -1,3 +1,5 @@
 ### **Structure of data in HuggingFace datasets**

 #### Datasets, sub-datasets, splits
@@ -197,4 +199,98 @@ to load these datasets from HuggingFace

 `dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')`
 `dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')`
- `dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')`
+ ## Practical Recommendations
+
 ### **Structure of data in HuggingFace datasets**

 #### Datasets, sub-datasets, splits
 
 `dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')`
 `dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')`
+ `dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')`
+
+ ### **Format of a dataset**
+
+ A dataset should consist of a single table in which each row is a single observation.
+ The columns should follow standard database design guidelines:
+
+ * Identifier columns
+   * sequential key
+     * For example: \[1, 2, 3, …\]
+   * primary key
+     * a single column that uniquely identifies each row
+     * distinct for every row
+     * no missing values
+     * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
+   * composite key
+     * a set of columns that together uniquely identify each row
+     * hierarchical or complementary IDs that characterize the observation
+     * For example, for an observation of mutations, the tuple (structure\_id, residue\_id, mutation\_aa) uniquely identifies each row
+   * additional/foreign key identifiers
+     * identifiers that link the observation with other data
+     * For example:
+       * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
+       * the FDA drug name or the IUPAC substance name
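The key properties above are easy to check mechanically before publishing a dataset. Here is a minimal sketch in pandas, using hypothetical column names for the mutation example (`structure_id`, `residue_id`, `mutation_aa` are illustrative, not from any real dataset):

```python
import pandas as pd

# Hypothetical mutation observations keyed by the composite key
# (structure_id, residue_id, mutation_aa)
mutations = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ"],
    "residue_id":   [42, 87, 42],
    "mutation_aa":  ["A", "G", "A"],
    "ddG":          [1.2, -0.3, 0.8],
})

composite_key = ["structure_id", "residue_id", "mutation_aa"]

# A valid key has no missing values and no duplicate key tuples
assert mutations[composite_key].notna().all().all()
assert not mutations.duplicated(subset=composite_key).any()
```

Running these two assertions on every release catches duplicated or partially-missing identifiers early, before downstream joins silently drop or multiply rows.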
+ * Tidy key/value columns
+   * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
+   * tidy data (sometimes called "long") has one measurement per row
+     * multiple columns can give details for each measurement, including type, units, and metadata
+     * often good for data-science analysis workflows (e.g. tidyverse/dplyr)
+     * can handle a variable number of measurements per object
+     * duplicates the object identifier columns for each measurement
+   * array data (sometimes called "wide") has one object per row and multiple measurements as different columns
+     * typically each measurement is a single column
+     * more compact, i.e. no duplication of identifier columns
+     * good for ML/matrix-based computational workflows
+
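Converting between the two layouts is a one-liner in pandas; the sketch below uses invented compound measurements purely for illustration:

```python
import pandas as pd

# Wide (array) layout: one object per row, one column per measurement
wide = pd.DataFrame({
    "compound_id": ["CHEM1", "CHEM2"],
    "logP":        [2.1, 0.4],
    "solubility":  [-3.5, -1.2],
})

# Wide -> tidy (long): one measurement per row, identifier columns repeated
tidy = wide.melt(id_vars="compound_id",
                 var_name="measurement", value_name="value")

# Tidy -> wide again
wide_again = (tidy
    .pivot(index="compound_id", columns="measurement", values="value")
    .reset_index())
```

Note the trade-off described above: `tidy` has 4 rows with `compound_id` duplicated, while `wide` has 2 rows and one column per measurement.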
+ #### Molecular formats
+
+ * Store molecular structure in standard text formats
+   * protein structure: PDB, mmCIF, ModelCIF
+   * small molecule: SMILES, InChI
+ * Use an uncompressed, plaintext format
+   * easier to analyze computationally
+   * the whole dataset will be compressed for data serialization anyway
+ * Filtering / standardization / sanitization
+   * be clear about the methods used to process the molecular data
+   * be especially careful with inferred aspects of the data:
+     * protonation states
+     * salt form and stereochemistry for small molecules
+     * data missingness, including unstructured loops for proteins
+ * Tools
+   * MolVS is useful for small-molecule sanitization
+
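To make the "salt form" point concrete, here is a deliberately naive sketch of one sanitization step: dropping counter-ion fragments from a SMILES string by keeping the largest `.`-separated fragment. This is only a character-counting toy; real curation should use MolVS/RDKit, which apply actual chemical rules:

```python
def strip_salts(smiles: str) -> str:
    """Keep the largest dot-separated fragment of a SMILES string.

    Toy stand-in for a proper fragment-parent step (e.g. in MolVS):
    it compares fragments by string length only, not by chemistry.
    """
    fragments = smiles.split(".")
    return max(fragments, key=len)

# e.g. aspirin hydrochloride-style salt: keep the organic fragment, drop Cl
print(strip_salts("CC(=O)Oc1ccccc1C(=O)O.Cl"))  # -> CC(=O)Oc1ccccc1C(=O)O
```

Whichever tool is used, the dataset card should record that such a step was applied, since it changes the molecules relative to the raw source.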
+ #### Computational data formats
+
+ * On-disk formats
+   * parquet
+     * column-oriented, so only the needed columns have to be loaded, and it compresses well
+     * robust reader/writer implementations from Apache Arrow for Python, R, etc.
+   * Arrow Table
+     * in-memory format closely aligned with the on-disk parquet format
+     * native format for datasets stored with the `datasets` python package
+   * tab/comma-separated table
+     * prefer tab-separated: parsing is more consistent and rarely needs escaped values
+     * widely used row-oriented text format for storing tabular data on disk
+     * does not store column data types, so loading into python/R often needs custom conversion/QC code
+     * can be compressed on disk, but being row-oriented it is less compressible than .parquet
+   * .pickle / .Rdata
+     * language-specific serialization of complex data structures
+     * often very fast to read/write, but may not be robust across language/OS versions
+     * not easily interoperable across programming languages
+ * In-memory formats
+   * R data.frame / dplyr::tibble
+     * widely used for R data science
+     * fast out of the box for tidyverse data manipulation and split-apply-combine workflows
+   * Python pandas DataFrame
+     * widely used for python data science
+     * not especially fast out of the box
+   * Python numpy array / R matrix
+     * uses a single data type for all data
+     * useful for efficient matrix manipulation
+   * Python PyTorch Dataset
+     * format specifically geared to loading data for PyTorch deep learning
+
+ #### Recommendations
+
+ * On disk
+   * for small, config-level tables use .tsv
+   * for large data use .parquet
+     * smaller than .csv/.tsv
+     * robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
+ * In memory
+   * use dplyr::tibble / pandas DataFrame for data-science tables
+   * use numpy array / PyTorch Dataset for machine learning
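The in-memory hand-off suggested above is typically a single call: curate the table in pandas, then convert the numeric block to a single-dtype numpy array for model code (column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Curate as a DataFrame (mixed dtypes, named columns)...
features = pd.DataFrame({
    "mw":   [180.2, 46.1],
    "logP": [2.1, -0.3],
})

# ...then hand a dense, single-dtype matrix to the ML code
X = features.to_numpy()

assert X.shape == (2, 2)
assert X.dtype == np.float64
```

The same array can then be wrapped in a PyTorch `Dataset`/`DataLoader` for training without further table logic.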