Rename sections/07_how_to_structure_curation.md to sections/07_practical_recommendations.md

## Practical Recommendations

### **Structure of data in a HuggingFace dataset**

#### Datasets, sub-datasets, splits

…to load these datasets from HuggingFace:

`dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')`
`dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')`
`dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')`
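The three `data_dir` arguments above suggest one subdirectory per config; a plausible repository layout (an assumption inferred from the calls, not confirmed by the source) would be:

```
maomlab/example_dataset
├── dataset1/   # files for the 'dataset1' config
├── dataset2/   # files for the 'dataset2' config
└── dataset3/   # files for the 'dataset3' config
```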

### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:
* Identifier columns
  * sequential key
    * For example: [1, 2, 3, …]
  * primary key
    * a single column that uniquely identifies each row
    * distinct for every row
    * no missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * a set of columns that together uniquely identify each row
    * either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (structure_id, residue_id, mutation_aa) is a unique identifier
  * additional/foreign key identifiers
    * identifiers that link the observation with other data
    * For example:
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called long) has one measurement per row
    * multiple columns can be used to give details for each measurement, including type, units, and metadata
    * often good for certain data science workflows (e.g. tidyverse/dplyr)
    * can handle a variable number of measurements per object
    * duplicates object identifier columns for each measurement
  * array data (sometimes called wide) has one object per row and multiple measurements as different columns
    * typically each measurement is a single column
    * more compact, i.e. no duplication of identifier columns
    * good for certain ML/matrix-based computational workflows
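To make the key guidelines concrete, the uniqueness and completeness of a composite key can be checked with pandas; the column names below are illustrative, following the mutation example above:

```python
import pandas as pd

# Hypothetical mutation observations, keyed by (structure_id, residue_id, mutation_aa)
df = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ"],
    "residue_id":   [42, 87, 42],
    "mutation_aa":  ["A", "G", "A"],
    "ddg":          [1.2, -0.4, 0.7],
})

composite_key = ["structure_id", "residue_id", "mutation_aa"]

# A valid key is distinct for every row and has no missing values
assert not df.duplicated(subset=composite_key).any(), "composite key is not unique"
assert df[composite_key].notna().all().all(), "composite key has missing values"
```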
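The tidy (long) vs array (wide) distinction can be seen concretely with pandas, which converts between the two with `melt` and `pivot`; the compound table below is made up for illustration:

```python
import pandas as pd

# Wide/array form: one object per row, one column per measurement
wide = pd.DataFrame({
    "compound_id": ["C1", "C2"],
    "ic50_nM":     [12.0, 340.0],
    "logp":        [2.1, 3.7],
})

# Tidy/long form: one measurement per row; identifier columns are duplicated
tidy = wide.melt(id_vars="compound_id", var_name="measurement", value_name="value")

# And back again
wide2 = tidy.pivot(index="compound_id", columns="measurement", values="value").reset_index()
```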

#### Molecular formats

* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
  * use an uncompressed, plaintext format
    * easier to computationally analyze
    * the whole dataset will be compressed anyway during serialization
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful with inferred aspects of the data:
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops, for proteins
* Tools
  * MolVS is useful for small molecule sanitization
#### Computational data formats

* On-disk formats
  * parquet
    * column-oriented, so only the needed columns have to be loaded, and it is easier to compress
    * robust reader/writer code from Apache Arrow for Python, R, etc.
  * Arrow Table
    * in-memory format closely aligned with the on-disk parquet format
    * native format for datasets stored with the `datasets` Python package
  * tab/comma-separated tables (.tsv/.csv)
    * prefer tab-separated: parsing is more consistent because values rarely need escaping
    * widely used row-oriented text format for storing tabular data to disk
    * does not store data types, so loading into Python/R often needs custom conversion code and QC
    * can be compressed on disk, but being row-oriented it is less compressible than .parquet
  * .pickle / .Rdata
    * language-specific serialization of complex data structures
    * often very fast to read/write, but may not be stable across language/OS versions
    * not easily interoperable across programming languages
* In-memory formats
  * R data.frame / dplyr::tibble
    * widely used format for R data science
    * fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas DataFrame
    * widely used for Python data science
    * not especially fast out of the box
  * Python numpy array / R matrix
    * uses a single data type for all elements
    * useful for efficient matrix manipulation
  * Python PyTorch dataset
    * format specifically geared to loading data for PyTorch deep learning

#### Recommendations

* On disk
  * for small, config-level tables, use .tsv
  * for large data, use .parquet
    * smaller than .csv/.tsv
    * robust open-source libraries in the major languages can read and write .parquet files faster than .csv/.tsv
* In memory
  * use dplyr::tibble / pandas DataFrame for data science tables
  * use numpy array / PyTorch dataset for machine learning
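Following the in-memory recommendations, moving from a pandas table to a single-dtype numpy array for machine learning is a single call; the feature columns below are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "feature_1": [0.1, 0.5, 0.9],
    "feature_2": [1.0, 2.0, 3.0],
    "label":     [0, 1, 1],
})

# Select the numeric feature columns and convert to a single-dtype array
X = df[["feature_1", "feature_2"]].to_numpy(dtype=np.float32)
y = df["label"].to_numpy()

assert X.shape == (3, 2) and X.dtype == np.float32
```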