maom committed · Commit 9a6e5f9 · verified · Parent: ece0e2f

Create sections/07_how_to_structure_curation.md
### **Structure of data in a HuggingFace dataset**

#### Datasets, sub-datasets, splits

* A HuggingFace dataset repository can contain multiple sub-datasets, e.g. at different filter/stringency levels.
* Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data does not have splits, everything goes in 'train'.
* The data in different splits of a single sub-dataset should be non-overlapping.
* Example:
  * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) repository contains the following sub-datasets:
    * dataset1 # all stability measurements
    * dataset2 # high-quality folding stabilities
    * dataset3 # ΔG measurements
    * dataset3_single # ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
    * dataset3_single_cv # 5-fold cross validation of ΔG measurements of single-point mutants with ThermoMPNN (Dieckhaus, et al., 2024) splits
  * To load a specific sub-dataset:
    * `datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")`

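The non-overlap requirement above can be checked mechanically. A minimal sketch, assuming each example carries a unique `id` column (the in-memory tables here are hypothetical; with the `datasets` library you would pull the same column out of each split of a loaded `DatasetDict`):

```python
# Hypothetical splits of one sub-dataset, keyed by a unique "id" column.
splits = {
    "train": [{"id": 1}, {"id": 2}, {"id": 3}],
    "validate": [{"id": 4}, {"id": 5}],
    "test": [{"id": 6}],
}

def overlapping_ids(splits):
    """Return the set of ids that appear in more than one split."""
    seen = {}  # id -> name of the split where it was first seen
    clashes = set()
    for name, rows in splits.items():
        for row in rows:
            if row["id"] in seen and seen[row["id"]] != name:
                clashes.add(row["id"])
            seen.setdefault(row["id"], name)
    return clashes

print(overlapping_ids(splits))  # an empty set means the splits are disjoint
```

An empty result confirms the splits are disjoint; any returned ids indicate leakage between splits that should be fixed before pushing.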
#### Example: One .csv file dataset

One table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
First load the dataset locally, then push it to the hub:

```python
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```

This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
```

and add the following to the header of the README.md:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
```

To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```

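To try this example end-to-end, a minimal `outcomes.csv` matching the feature schema in the README header above (integer `id` and `value` columns, four rows) can be generated with the standard library; the row values are illustrative:

```python
import csv

# Write a tiny outcomes.csv with the schema from the README snippet:
# an integer "id" column and an integer "value" column, four rows.
rows = [{"id": i, "value": i * 10} for i in range(1, 5)]

with open("outcomes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "value"])
    writer.writeheader()
    writer.writerows(rows)
```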
#### Example: train/valid/test split .csv files

Three tables, `train.csv`, `valid.csv`, and `test.csv`, are to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
Load the three splits into one dataset and push it to the hub:

```python
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_dir = "/tmp",
    data_files = {
        "train": "train.csv",
        "valid": "valid.csv",
        "test": "test.csv"},
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```

This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
    valid-00000-of-00001.parquet
    test-00000-of-00001.parquet
```

and add the following to the header of the README.md:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  - name: valid
    num_bytes: 64
    num_examples: 4
  - name: test
    num_bytes: 64
    num_examples: 4
  download_size: 3996
  dataset_size: 192
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
```

To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```

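If the data start as a single table, the three split files can be produced with a short standard-library sketch. The 80/10/10 proportions and input rows are illustrative assumptions; the point is that each row lands in exactly one split, so the splits stay non-overlapping:

```python
import csv
import random

# Hypothetical source rows; in practice these would come from the
# curated table being split.
rows = [{"id": i, "value": i * 10} for i in range(100)]

# Shuffle with a fixed seed so the split is reproducible, then cut
# into 80% train / 10% valid / 10% test.
random.Random(0).shuffle(rows)
n = len(rows)
splits = {
    "train": rows[: int(0.8 * n)],
    "valid": rows[int(0.8 * n): int(0.9 * n)],
    "test": rows[int(0.9 * n):],
}

for name, split_rows in splits.items():
    with open(f"{name}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"])
        writer.writeheader()
        writer.writerows(split_rows)
```

A fixed seed makes the split reproducible, which matters when others need to benchmark against exactly the same partition.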
#### Example: sub-datasets

If you have different related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository but contain different types of data, so they aren't just splits of the same dataset, then load each dataset separately and push it to the hub with a given config name.

```python
import datasets

dataset1 = datasets.load_dataset("csv", data_files = "/tmp/dataset1.csv", keep_in_memory = True)
dataset2 = datasets.load_dataset("csv", data_files = "/tmp/dataset2.csv", keep_in_memory = True)
dataset3 = datasets.load_dataset("csv", data_files = "/tmp/dataset3.csv", keep_in_memory = True)

dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset1", data_dir = "dataset1/data")
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset2", data_dir = "dataset2/data")
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = "dataset3", data_dir = "dataset3/data")
```

This will create the following files in the repo:

```
dataset1/
    data/
        train-00000-of-00001.parquet
dataset2/
    data/
        train-00000-of-00001.parquet
dataset3/
    data/
        train-00000-of-00001.parquet
```

and add the following to the header of the README.md:

```yaml
dataset_info:
- config_name: dataset1
  features:
  - name: id
    dtype: int64
  - name: value1
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset2
  features:
  - name: id
    dtype: int64
  - name: value2
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset3
  features:
  - name: id
    dtype: int64
  - name: value3
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
configs:
- config_name: dataset1
  data_files:
  - split: train
    path: dataset1/data/train-*
- config_name: dataset2
  data_files:
  - split: train
    path: dataset2/data/train-*
- config_name: dataset3
  data_files:
  - split: train
    path: dataset3/data/train-*
```

To load these datasets from HuggingFace:

```python
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = "dataset1", data_dir = "dataset1")
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = "dataset2", data_dir = "dataset2")
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = "dataset3", data_dir = "dataset3")
```
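For repositories with many sub-datasets, the `configs:` block of the README header follows a regular enough pattern that it can be generated rather than hand-edited. A standard-library sketch, not part of the `datasets` API, assuming each sub-dataset has a single train split under `<name>/data/`:

```python
def configs_block(names):
    """Render the README `configs:` header block for sub-datasets
    that each have one train split stored under <name>/data/."""
    lines = ["configs:"]
    for name in names:
        lines += [
            f"- config_name: {name}",
            "  data_files:",
            "  - split: train",
            f"    path: {name}/data/train-*",
        ]
    return "\n".join(lines)

print(configs_block(["dataset1", "dataset2", "dataset3"]))
```

The generated text can be pasted into the README header; pushing with `push_to_hub(..., config_name=...)` as above normally maintains this block automatically, so the sketch is mainly useful when assembling the header by hand.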