maom committed on
Commit
120e0ad
·
verified ·
1 Parent(s): e7ea25a

Rename sections/07_how_to_structure_curation.md to sections/07_practical_recommendations.md

sections/{07_how_to_structure_curation.md → 07_practical_recommendations.md} RENAMED
@@ -1,3 +1,5 @@
 ### **Structure of data in HuggingFace datasets**

 #### Datasets, sub-datasets, splits
@@ -197,4 +199,98 @@ to load these datasets from HuggingFace

 `dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')`
 `dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')`
- `dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')`
+ ## Practical Recommendations
+
 ### **Structure of data in HuggingFace datasets**

 #### Datasets, sub-datasets, splits
 
 `dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')`
 `dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')`
+ `dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')`
+
+ ### **Format of a dataset**
+
+ A dataset should consist of a single table in which each row is a single observation.
+ The columns should follow standard database design guidelines:
+
+ * Identifier columns
+   * sequential key
+     * For example: \[1, 2, 3, …\]
+   * primary key
+     * a single column that uniquely identifies each row
+     * distinct for every row
+     * no missing values
+     * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
+   * composite key
+     * a set of columns that together uniquely identify each row
+     * hierarchical or complementary IDs that characterize the observation
+     * For example, for an observation of mutations, the tuple (structure\_id, residue\_id, mutation\_aa) uniquely identifies each row
+   * additional/foreign key identifiers
+     * identifiers that link the observation with other data
+     * For example:
+       * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
+       * the FDA drug name or the IUPAC substance name
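The key properties above are easy to check mechanically before publishing a dataset. Here is a minimal sketch in pandas, using hypothetical column names for the mutation example (`structure_id`, `residue_id`, `mutation_aa` are illustrative, not from any real dataset):

```python
import pandas as pd

# Hypothetical mutation observations keyed by the composite key
# (structure_id, residue_id, mutation_aa)
mutations = pd.DataFrame({
    "structure_id": ["1ABC", "1ABC", "2XYZ"],
    "residue_id":   [42, 87, 42],
    "mutation_aa":  ["A", "G", "A"],
    "ddG":          [1.2, -0.3, 0.8],
})

composite_key = ["structure_id", "residue_id", "mutation_aa"]

# A valid key has no missing values and no duplicate key tuples
assert mutations[composite_key].notna().all().all()
assert not mutations.duplicated(subset=composite_key).any()
```

Running these two assertions on every release catches duplicated or partially-missing identifiers early, before downstream joins silently drop or multiply rows.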
+ * Tidy key/value columns
+   * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
+   * tidy data (sometimes called "long") has one measurement per row
+     * multiple columns can give details for each measurement, including type, units, and metadata
+     * often good for data-science analysis workflows (e.g. tidyverse/dplyr)
+     * can handle a variable number of measurements per object
+     * duplicates the object identifier columns for each measurement
+   * array data (sometimes called "wide") has one object per row and multiple measurements as different columns
+     * typically each measurement is a single column
+     * more compact, i.e. no duplication of identifier columns
+     * good for ML/matrix-based computational workflows
+
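Converting between the two layouts is a one-liner in pandas; the sketch below uses invented compound measurements purely for illustration:

```python
import pandas as pd

# Wide (array) layout: one object per row, one column per measurement
wide = pd.DataFrame({
    "compound_id": ["CHEM1", "CHEM2"],
    "logP":        [2.1, 0.4],
    "solubility":  [-3.5, -1.2],
})

# Wide -> tidy (long): one measurement per row, identifier columns repeated
tidy = wide.melt(id_vars="compound_id",
                 var_name="measurement", value_name="value")

# Tidy -> wide again
wide_again = (tidy
    .pivot(index="compound_id", columns="measurement", values="value")
    .reset_index())
```

Note the trade-off described above: `tidy` has 4 rows with `compound_id` duplicated, while `wide` has 2 rows and one column per measurement.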
+ #### Molecular formats
+
+ * Store molecular structure in standard text formats
+   * protein structure: PDB, mmCIF, ModelCIF
+   * small molecule: SMILES, InChI
+ * Use an uncompressed, plaintext format
+   * easier to analyze computationally
+   * the whole dataset will be compressed for data serialization anyway
+ * Filtering / standardization / sanitization
+   * be clear about the methods used to process the molecular data
+   * be especially careful with inferred aspects of the data:
+     * protonation states
+     * salt form and stereochemistry for small molecules
+     * data missingness, including unstructured loops for proteins
+ * Tools
+   * MolVS is useful for small-molecule sanitization
+
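To make the "salt form" point concrete, here is a deliberately naive sketch of one sanitization step: dropping counter-ion fragments from a SMILES string by keeping the largest `.`-separated fragment. This is only a character-counting toy; real curation should use MolVS/RDKit, which apply actual chemical rules:

```python
def strip_salts(smiles: str) -> str:
    """Keep the largest dot-separated fragment of a SMILES string.

    Toy stand-in for a proper fragment-parent step (e.g. in MolVS):
    it compares fragments by string length only, not by chemistry.
    """
    fragments = smiles.split(".")
    return max(fragments, key=len)

# e.g. aspirin hydrochloride-style salt: keep the organic fragment, drop Cl
print(strip_salts("CC(=O)Oc1ccccc1C(=O)O.Cl"))  # -> CC(=O)Oc1ccccc1C(=O)O
```

Whichever tool is used, the dataset card should record that such a step was applied, since it changes the molecules relative to the raw source.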
+ #### Computational data formats
+
+ * On-disk formats
+   * parquet
+     * column-oriented, so only the needed columns have to be loaded, and it compresses well
+     * robust reader/writer implementations from Apache Arrow for Python, R, etc.
+   * Arrow Table
+     * in-memory format closely aligned with the on-disk parquet format
+     * native format for datasets stored with the `datasets` python package
+   * tab/comma-separated table
+     * prefer tab-separated: parsing is more consistent and rarely needs escaped values
+     * widely used row-oriented text format for storing tabular data on disk
+     * does not store column data types, so loading into python/R often needs custom conversion/QC code
+     * can be compressed on disk, but being row-oriented it is less compressible than .parquet
+   * .pickle / .Rdata
+     * language-specific serialization of complex data structures
+     * often very fast to read/write, but may not be robust across language/OS versions
+     * not easily interoperable across programming languages
+ * In-memory formats
+   * R data.frame / dplyr::tibble
+     * widely used for R data science
+     * fast out of the box for tidyverse data manipulation and split-apply-combine workflows
+   * Python pandas DataFrame
+     * widely used for python data science
+     * not especially fast out of the box
+   * Python numpy array / R matrix
+     * uses a single data type for all data
+     * useful for efficient matrix manipulation
+   * Python PyTorch Dataset
+     * format specifically geared to loading data for PyTorch deep learning
+
+ #### Recommendations
+
+ * On disk
+   * for small, config-level tables use .tsv
+   * for large data use .parquet
+     * smaller than .csv/.tsv
+     * robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv
+ * In memory
+   * use dplyr::tibble / pandas DataFrame for data-science tables
+   * use numpy array / PyTorch Dataset for machine learning
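The in-memory hand-off suggested above is typically a single call: curate the table in pandas, then convert the numeric block to a single-dtype numpy array for model code (column names here are illustrative):

```python
import numpy as np
import pandas as pd

# Curate as a DataFrame (mixed dtypes, named columns)...
features = pd.DataFrame({
    "mw":   [180.2, 46.1],
    "logP": [2.1, -0.3],
})

# ...then hand a dense, single-dtype matrix to the ML code
X = features.to_numpy()

assert X.shape == (2, 2)
assert X.dtype == np.float64
```

The same array can then be wrapped in a PyTorch `Dataset`/`DataLoader` for training without further table logic.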