maom committed · verified
Commit 92dbf79 · Parent(s): 1788b20

Update sections/07_practical_recommendations.md

Files changed (1):
  1. sections/07_practical_recommendations.md +144 -142

sections/07_practical_recommendations.md CHANGED
@@ -14,192 +14,194 @@
  * dataset3 \# ΔG measurements
  * dataset3\_single \# ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024\) splits
  * dataset3\_single\_cv \# 5-fold cross validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024\) splits
- * To load a specific subdataset:
- * datasets.load\_dataset(path \= "RosettaCommons/MegaScale", name \= "dataset1", data\_dir \= "dataset1")
+ * To load a specific subdataset:
+ ```
+ datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
+ ```

  #### Example: One .csv file dataset

  One table named `outcomes.csv` to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
  First load the dataset locally, then push it to the hub:

- import datasets
- dataset \= datasets.load\_dataset(
- "csv",
- data\_files \="outcomes.csv",
- keep\_in\_memory \= True)
+ import datasets
+ dataset = datasets.load_dataset(
+     "csv",
+     data_files = "outcomes.csv",
+     keep_in_memory = True)

- dataset.push\_to\_hub(repo\_id \= "`maomlab/example_dataset`")
+ dataset.push_to_hub(repo_id = "maomlab/example_dataset")

  This will create the following files in the repo

- data/
- train-00000-of-00001.parquet
+ data/
+   train-00000-of-00001.parquet

  and add the following to the header of README.md

- dataset\_info:
- features:
- \- name: id
- dtype: int64
- \- name: value
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- download\_size: 1332
- dataset\_size: 64
- configs:
- \- config\_name: default
- data\_files:
- \- split: train
- path: data/train-\*
+ dataset_info:
+   features:
+   - name: id
+     dtype: int64
+   - name: value
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   download_size: 1332
+   dataset_size: 64
+ configs:
+ - config_name: default
+   data_files:
+   - split: train
+     path: data/train-*

  to load these data from HuggingFace

- `dataset = datasets.load_dataset("maomlab/example_dataset")`
+ dataset = datasets.load_dataset("maomlab/example_dataset")

  #### Example: train/valid/test split .csv files

  Three tables train.csv, valid.csv, test.csv to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
  Load the three splits into one dataset and push it to the hub:

- import datasets
- dataset \= datasets.load\_dataset(
- 'csv',
- data\_dir \= "/tmp",
- data\_files \= {
- 'train': 'train.csv',
- 'valid': 'valid.csv',
- 'test': 'test.csv'},
- keep\_in\_memory \= True)
+ import datasets
+ dataset = datasets.load_dataset(
+     'csv',
+     data_dir = "/tmp",
+     data_files = {
+         'train': 'train.csv',
+         'valid': 'valid.csv',
+         'test': 'test.csv'},
+     keep_in_memory = True)

- dataset.push\_to\_hub(repo\_id \= "maomlab/example\_dataset")
+ dataset.push_to_hub(repo_id = "maomlab/example_dataset")

  This will create the following files in the repo

- data/
- train-00000-of-00001.parquet
- valid-00000-of-00001.parquet
- test-00000-of-00001.parquet
+ data/
+   train-00000-of-00001.parquet
+   valid-00000-of-00001.parquet
+   test-00000-of-00001.parquet

  and add the following to the header of the README.md

- dataset\_info:
- features:
- \- name: id
- dtype: int64
- \- name: value
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- \- name: valid
- num\_bytes: 64
- num\_examples: 4
- \- name: test
- num\_bytes: 64
- num\_examples: 4
- download\_size: 3996
- dataset\_size: 192
- configs:
- \- config\_name: default
- data\_files:
- \- split: train
- path: data/train-\*
- \- split: valid
- path: data/valid-\*
- \- split: test
- path: data/test-\*
+ dataset_info:
+   features:
+   - name: id
+     dtype: int64
+   - name: value
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   - name: valid
+     num_bytes: 64
+     num_examples: 4
+   - name: test
+     num_bytes: 64
+     num_examples: 4
+   download_size: 3996
+   dataset_size: 192
+ configs:
+ - config_name: default
+   data_files:
+   - split: train
+     path: data/train-*
+   - split: valid
+     path: data/valid-*
+   - split: test
+     path: data/test-*

  to load these data from HuggingFace

- `dataset = datasets.load_dataset("maomlab/example_dataset")`
+ dataset = datasets.load_dataset("maomlab/example_dataset")

  #### Example: sub-datasets

  If you have different related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository but contain different types of data, so they aren't just splits of the same dataset, then load each dataset separately and push it to the hub with a given config name.

  import datasets
- dataset1 \= datasets.load\_dataset('csv', data\_files \= '/tmp/dataset1.csv', keep\_in\_memory \= True)
- dataset2 \= datasets.load\_dataset('csv', data\_files \= '/tmp/dataset2.csv', keep\_in\_memory \= True)
- dataset3 \= datasets.load\_dataset('csv', data\_files \= '/tmp/dataset3.csv', keep\_in\_memory \= True)
+ dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
+ dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
+ dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

- dataset1.push\_to\_hub(repo\_id \= "`maomlab/example_dataset`", config\_name \= 'dataset1', data\_dir \= 'dataset1/data')
- dataset2.push\_to\_hub(repo\_id \= "`maomlab/example_dataset`", config\_name \= 'dataset2', data\_dir \= 'dataset2/data')
- dataset3.push\_to\_hub(repo\_id \= "`maomlab/example_dataset`", config\_name \= 'dataset3', data\_dir \= 'dataset3/data')
+ dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
+ dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
+ dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')

  This will create the following files in the repo

- dataset1/
- data/
- train-00000-of-00001.parquet
- dataset2/
- data/
- train-00000-of-00001.parquet
- dataset3/
- data/
- train-00000-of-00001.parquet
+ dataset1/
+   data/
+     train-00000-of-00001.parquet
+ dataset2/
+   data/
+     train-00000-of-00001.parquet
+ dataset3/
+   data/
+     train-00000-of-00001.parquet

  and add the following to the header of the README.md

- dataset\_info:
- \- config\_name: dataset1
- features:
- \- name: id
- dtype: int64
- \- name: value1
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- download\_size: 1344
- dataset\_size: 64
- \- config\_name: dataset2
- features:
- \- name: id
- dtype: int64
- \- name: value2
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- download\_size: 1344
- dataset\_size: 64
- \- config\_name: dataset3
- features:
- \- name: id
- dtype: int64
- \- name: value3
- dtype: int64
- splits:
- \- name: train
- num\_bytes: 64
- num\_examples: 4
- download\_size: 1344
- dataset\_size: 64
- configs:
- \- config\_name: dataset1
- data\_files:
- \- split: train
- path: dataset1/data/train-\*
- \- config\_name: dataset2
- data\_files:
- \- split: train
- path: dataset2/data/train-\*
- \- config\_name: dataset3
- data\_files:
- \- split: train
- path: dataset3/data/train-\*
+ dataset_info:
+ - config_name: dataset1
+   features:
+   - name: id
+     dtype: int64
+   - name: value1
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   download_size: 1344
+   dataset_size: 64
+ - config_name: dataset2
+   features:
+   - name: id
+     dtype: int64
+   - name: value2
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   download_size: 1344
+   dataset_size: 64
+ - config_name: dataset3
+   features:
+   - name: id
+     dtype: int64
+   - name: value3
+     dtype: int64
+   splits:
+   - name: train
+     num_bytes: 64
+     num_examples: 4
+   download_size: 1344
+   dataset_size: 64
+ configs:
+ - config_name: dataset1
+   data_files:
+   - split: train
+     path: dataset1/data/train-*
+ - config_name: dataset2
+   data_files:
+   - split: train
+     path: dataset2/data/train-*
+ - config_name: dataset3
+   data_files:
+   - split: train
+     path: dataset3/data/train-*

  to load these datasets from HuggingFace

- `dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')`
- `dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')`
- `dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')`
+ dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')
+ dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')
+ dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')


  ### **Format of a dataset**
@@ -209,7 +211,7 @@ The columns should follow typical database design guidelines

  * Identifier columns
    * sequential key
-     * For example: \[1, 2, 3, …\]
+     * For example: `[1, 2, 3, ...]`
    * primary key
      * single column that uniquely identifies each row
      * distinct for every row
@@ -218,7 +220,7 @@ The columns should follow typical database design guidelines
    * composite key
      * A set of columns that uniquely identify each row
      * Either hierarchical or complementary ids that characterize the observation
-      * For example, for an observation of mutations, the (structure\_id, residue\_id, mutation\_aa) is a unique identifier
+      * For example, for an observation of mutations, the tuple (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
    * additional/foreign key identifiers
      * identifiers to link the observation with other data
      * For example
@@ -272,7 +274,7 @@ The columns should follow typical database design guidelines
    * Often very fast to read/write, but may not be robust across language/OS versions
    * Not easily interoperable across programming languages
  * In memory formats
-   * R data.frame/dplyr::tibble
+   * R `data.frame`/`dplyr::tibble`
      * Widely used format for R data science
      * Out of the box faster for tidyverse data manipulation, split-apply-combine workflows
    * Python pandas DataFrame
@@ -292,5 +294,5 @@ Recommendations
    * Smaller than .csv/.tsv
    * Robust open source libraries in major languages can read and write .parquet files faster than .csv/.tsv
  * In memory
-   * Use dplyr::tibble / pandas DataFrame for data science tables
+   * Use `dplyr::tibble` / pandas DataFrame for data science tables
    * Use numpy array / pytorch dataset for machine learning
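
The composite-key guidance above can be checked mechanically. A minimal sketch in pandas, reusing the hypothetical column names (`structure_id`, `residue_id`, `mutation_aa`) from the example; the toy values are made up:

```python
import pandas as pd

# Toy mutation table; the values are hypothetical.
mutations = pd.DataFrame({
    "structure_id": ["1abc", "1abc", "2xyz"],
    "residue_id":   [42,     42,     7],
    "mutation_aa":  ["A",    "G",    "W"],
    "ddG":          [0.8,    -1.2,   2.3],
})

# A composite key is only valid if no two rows share all key columns.
key = ["structure_id", "residue_id", "mutation_aa"]
assert not mutations.duplicated(subset = key).any(), "composite key is not unique"

# Indexing by the composite key gives unambiguous row lookup.
mutations = mutations.set_index(key).sort_index()
print(mutations.loc[("1abc", 42, "A")])
```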
 
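The .parquet recommendation above is easy to exercise from pandas (backed by pyarrow); a minimal sketch with a hypothetical file name:

```python
import pandas as pd

# Toy table; values are hypothetical.
outcomes = pd.DataFrame({"id": [1, 2, 3, 4], "value": [10, 20, 30, 40]})

# Parquet stores the schema with the data, so dtypes survive a round trip,
# unlike .csv where every column is re-parsed from text on read.
outcomes.to_parquet("outcomes.parquet")  # requires pyarrow or fastparquet
roundtrip = pd.read_parquet("outcomes.parquet")
assert roundtrip.equals(outcomes)
```

The same file can then be read from R with `arrow::read_parquet("outcomes.parquet")`, which is the interoperability argument for .parquet over language-specific binary formats.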
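For the last recommendation (numpy array / pytorch dataset for machine learning), a minimal sketch of handing a HuggingFace dataset to pytorch; the repository id reuses the hypothetical `maomlab/example_dataset` from the examples above, and `id`/`value` are its assumed columns:

```python
import datasets
import torch

# Load one split of the (hypothetical) example dataset and ask for torch tensors.
dataset = datasets.load_dataset("maomlab/example_dataset", split = "train")
dataset = dataset.with_format("torch")

# A Dataset with torch format can be consumed directly by a DataLoader.
loader = torch.utils.data.DataLoader(dataset, batch_size = 2, shuffle = True)
for batch in loader:
    print(batch["id"], batch["value"])  # each batch entry is a torch.Tensor
```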