## Practical Recommendations

### **Structure of data in a HuggingFace datasets**

#### Datasets, sub-datasets, splits

* A HuggingFace dataset may contain multiple sub-datasets, e.g. at different filter/stringency levels.  
* Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data has no explicit splits, everything goes into 'train'.  
* The data in different splits of a single sub-dataset should be non-overlapping.  
* Example:  
  * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) dataset contains 5 sub-datasets:  
    * dataset1                # all stability measurements  
    * dataset2                # high-quality folding stabilities  
    * dataset3                # ΔG measurements  
    * dataset3_single         # ΔG measurements of single-point mutants, with ThermoMPNN (Dieckhaus et al., 2024) splits  
    * dataset3_single_cv      # 5-fold cross validation of ΔG measurements of single-point mutants, with ThermoMPNN (Dieckhaus et al., 2024) splits  
  * To load a specific subdataset:
    ```
    datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
    ```

#### Example: One .csv file dataset

One table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.  
First load the dataset locally, then push it to the hub:

    import datasets
    dataset = datasets.load_dataset(
        "csv",
        data_files = "outcomes.csv",
        keep_in_memory = True)

    dataset.push_to_hub(repo_id = "maomlab/example_dataset")

This will create the following files in the repo:

    data/
        train-00000-of-00001.parquet

and add the following to the header of README.md

    dataset_info:  
      features:  
        - name: id  
          dtype: int64  
        - name: value  
          dtype: int64  
      splits:  
        - name: train  
          num_bytes: 64  
          num_examples: 4  
      download_size: 1332  
      dataset_size: 64  
    configs:  
      - config_name: default  
        data_files:  
          - split: train  
            path: data/train-*

To load the data from HuggingFace:

    dataset = datasets.load_dataset("maomlab/example_dataset")

#### Example: train/valid/test split .csv files

Three tables `train.csv`, `valid.csv`, and `test.csv` are to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.  
Load the three splits into one dataset and push it to the hub:

    import datasets
    dataset = datasets.load_dataset(
        'csv',
        data_dir = "/tmp",
        data_files = {
          'train': 'train.csv',
          'valid': 'valid.csv',
          'test': 'test.csv'},
        keep_in_memory = True)

    dataset.push_to_hub(repo_id = "maomlab/example_dataset")

This will create the following files in the repo:

    data/
        train-00000-of-00001.parquet
        valid-00000-of-00001.parquet
        test-00000-of-00001.parquet

and add the following to the header of the README.md

    dataset_info:  
      features:  
        - name: id  
          dtype: int64  
        - name: value  
          dtype: int64  
      splits:  
        - name: train  
          num_bytes: 64  
          num_examples: 4  
        - name: valid  
          num_bytes: 64  
          num_examples: 4  
        - name: test  
          num_bytes: 64  
          num_examples: 4  
      download_size: 3996  
      dataset_size: 192  
    configs:  
      - config_name: default  
        data_files:  
          - split: train  
            path: data/train-*  
          - split: valid  
            path: data/valid-*  
          - split: test  
            path: data/test-*

To load the data from HuggingFace:

    dataset = datasets.load_dataset("maomlab/example_dataset")

#### Example: sub-datasets

If you have related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository, but that contain different types of data and so are not just splits of the same dataset, then load each dataset separately and push it to the hub with its own config name:

    import datasets
    dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
    dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
    dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

    dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
    dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
    dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')

This will create the following files in the repo:

    dataset1/
        data/
            train-00000-of-00001.parquet
    dataset2/
        data/
            train-00000-of-00001.parquet
    dataset3/
        data/
            train-00000-of-00001.parquet

and add the following to the header of the README.md

    dataset_info:  
      - config_name: dataset1  
        features:  
          - name: id  
            dtype: int64  
          - name: value1  
            dtype: int64  
        splits:  
          - name: train  
            num_bytes: 64  
            num_examples: 4  
        download_size: 1344  
        dataset_size: 64  
      - config_name: dataset2  
        features:  
          - name: id  
            dtype: int64  
          - name: value2  
            dtype: int64  
        splits:  
          - name: train  
            num_bytes: 64  
            num_examples: 4  
        download_size: 1344  
        dataset_size: 64  
      - config_name: dataset3  
        features:  
          - name: id  
            dtype: int64  
          - name: value3  
            dtype: int64  
        splits:  
          - name: train  
            num_bytes: 64  
            num_examples: 4  
        download_size: 1344  
        dataset_size: 64  
    configs:  
      - config_name: dataset1  
        data_files:  
          - split: train  
            path: dataset1/data/train-*  
      - config_name: dataset2  
        data_files:  
          - split: train  
            path: dataset2/data/train-*  
      - config_name: dataset3  
        data_files:  
          - split: train  
            path: dataset3/data/train-*

To load these datasets from HuggingFace:

    dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')
    dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')
    dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')


### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.  
The columns should follow typical database design guidelines:

* Identifier columns  
  * sequential key  
    * For example: `[1, 2, 3, ...]`  
  * primary key  
    * a single column that uniquely identifies each row  
      * distinct for every row  
      * no missing values  
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key  
  * composite key  
    * A set of columns that uniquely identify each row  
      * Either hierarchical or complementary ids that characterize the observation  
      * For example, for a dataset of mutation measurements, the tuple (`structure_id`, `residue_id`, `mutation_aa`) uniquely identifies each row  
  * additional/foreign key identifiers  
    * identifiers to link the observation with other data  
    * For example  
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key  
      * the FDA drug name or the IUPAC substance name  
* Tidy key/value columns  
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)  
    * tidy data (sometimes called long format) has one measurement per row  
      * Multiple columns can be used to give details for each measurement, including type, units, and metadata  
      * Often good for certain data science computational analysis workflows (e.g. tidyverse/dplyr)  
      * Can handle variable number of measurements per object  
      * Duplicates object identifier columns for each measurement  
    * array data (sometimes called wide format) has one object per row and multiple measurements in different columns  
      * Each measurement is typically a single column  
      * More compact, i.e. no duplication of identifier columns  
      * Good for certain ML/matrix based computational workflows
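
The tidy vs. array distinction above can be sketched with pandas (the column names here are illustrative, not from any particular dataset):

```python
import pandas as pd

# Array (wide) layout: one object per row, one column per measurement.
wide = pd.DataFrame({
    "structure_id": ["1ABC", "2XYZ"],
    "dG": [1.2, -0.5],
    "Tm": [55.0, 61.3],
})

# Tidy (long) layout: one measurement per row.
# The identifier column is duplicated for each measurement.
tidy = wide.melt(
    id_vars = "structure_id",
    var_name = "measurement",
    value_name = "value")
```

Going back from long to wide is `tidy.pivot(index = "structure_id", columns = "measurement", values = "value")`.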

#### Molecular formats

* Store molecular structure in standard text formats   
  * protein structure: PDB, mmCIF, modelCIF  
  * small molecule: SMILES, InChI  
  * use uncompressed, plaintext formats  
    * Easier to computationally analyze  
    * the whole dataset will be compressed anyway during serialization  
* Filtering / standardization / sanitization  
  * Be clear about the methods used to process the molecular data  
  * Be especially careful with inferred aspects of the data:  
    * protonation states, salt forms, and stereochemistry for small molecules  
    * data missingness, including unstructured loops, for proteins  
  * Tools  
    * MolVS is useful for small molecule sanitization

#### Computational data formats

* On disk formats  
  * Parquet  
    * column oriented, so only the columns that are needed can be loaded, and it compresses well  
    * robust reader/writer libraries from Apache Arrow for Python, R, etc.  
  * Arrow Table  
    * In-memory format closely aligned with the on-disk Parquet format  
    * Native format for datasets stored in the `datasets` Python package  
  * tab/comma separated table  
    * Prefer tab separated: parsing is more consistent because values rarely need escaping  
    * Widely used row-oriented text format for storing tabular data to disk  
    * Does not store data types, so it often needs custom conversion/QC code when loading into Python/R  
    * Can be compressed on disk, but being row-oriented it compresses less well than .parquet  
  * .pickle / .Rdata  
    * language specific serialization of complex data structures  
    * Often very fast to read/write, but may not be stable across language/OS versions  
    * Not easily interoperable across programming languages  
* In memory formats  
  * R `data.frame`/`dplyr::tibble`  
    * Widely used format for R data science  
    * Out of the box faster for tidyverse data manipulation, split-apply-combine workflows  
  * Python pandas DataFrame  
    * Widely used for python data science  
    * Not especially fast out of the box for data-science workloads  
  * Python numpy array / R matrix  
    * Uses a single data type for all elements  
    * Useful for efficient matrix manipulation  
  * Python PyTorch dataset  
    * Format specifically geared toward loading data for PyTorch deep learning
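
To illustrate the single-dtype point, here is a minimal sketch of moving a table from pandas into numpy for a matrix workflow (the columns are illustrative):

```python
import numpy as np
import pandas as pd

# A data-science table: columns may have different dtypes.
df = pd.DataFrame({
    "id": [1, 2, 3],            # int64
    "value": [0.5, 1.5, 2.5],   # float64
})

# Converting to a numpy array coerces everything to one common dtype
# (here float64), which is what matrix/ML workflows expect.
arr = df.to_numpy()
```

Keep non-numeric identifier columns out of the array, or the common dtype falls back to `object` and the matrix operations lose their efficiency.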

#### Recommendations

* On disk  
  * For small, config-level tables use .tsv  
  * For large data use .parquet  
    * Smaller than .csv/.tsv  
    * Robust open-source libraries in major languages can read and write .parquet files faster than .csv/.tsv  
* In memory  
  * Use `dplyr::tibble` / pandas DataFrame for data science tables  
  * Use numpy array / PyTorch dataset for machine learning
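
As a sketch of the .tsv recommendation for small tables, tab-separated files round-trip with only the Python standard library (the file name and columns are illustrative):

```python
import csv
import os
import tempfile

rows = [
    {"id": "1", "value": "0.5"},
    {"id": "2", "value": "1,5 (comma inside, no escaping needed)"},
]

path = os.path.join(tempfile.mkdtemp(), "config.tsv")

# Write tab-separated: values containing commas need no quoting.
with open(path, "w", newline = "") as f:
    writer = csv.DictWriter(f, fieldnames = ["id", "value"], delimiter = "\t")
    writer.writeheader()
    writer.writerows(rows)

# Read back: every value comes back as a string, since .tsv stores no types.
with open(path) as f:
    loaded = list(csv.DictReader(f, delimiter = "\t"))
```

Note that `loaded` equals `rows` only because the original values were already strings; numeric columns need explicit conversion after loading, which is one reason .parquet is preferred for large data.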