maom committed
Commit 26849b9 · verified · 1 Parent(s): 9a6e5f9

Create examples/01_megascale_dataset

Files changed (1)
  1. examples/01_megascale_dataset +58 -0
examples/01_megascale_dataset ADDED
#### MegaScale dataset example

Here is a working example from the [MegaScale dataset](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py).

1. Process data through scripts and store each split as a separate `.parquet` file:

       intermediate/<dataset_id>_<split_id>.parquet
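The split-file layout from step 1 can be sketched with the standard library; this is a minimal sketch, and the `build_split_paths` helper (and the `dataset3` tag) are illustrative assumptions, not part of the MegaScale scripts:

```python
from pathlib import Path

def build_split_paths(dataset_tag,
                      splits=("train", "validation", "test"),
                      out_dir="intermediate"):
    # Map each split to intermediate/<dataset_id>_<split_id>.parquet
    return {split: Path(out_dir) / f"{dataset_tag}_{split}.parquet"
            for split in splits}

paths = build_split_paths("dataset3")
print(paths["train"])  # intermediate/dataset3_train.parquet (POSIX paths)
```

Each script would then write its split's dataframe to the corresponding path before moving on to step 2.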
2. Use the `datasets` package to load the local dataset into memory. See below for more examples of how to load different types of datasets:

       import datasets

       dataset_tag = "dataset3"
       cache_dir = "path/to/scratch"
       dataset = datasets.load_dataset(
           "parquet",
           name=dataset_tag,
           data_dir="./intermediate",
           data_files={
               "train": f"{dataset_tag}_train.parquet",
               "validation": f"{dataset_tag}_validation.parquet",
               "test": f"{dataset_tag}_test.parquet"},
           cache_dir=cache_dir,
           keep_in_memory=True)
3. Set up Personal Access Keys on HuggingFace (see above)
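Step 3 is typically a one-time setup per machine; a common approach (an assumption here, not prescribed by the MegaScale scripts) is to create a token at huggingface.co/settings/tokens and register it locally:

```shell
# Interactive login stores the token in the local Hugging Face cache:
huggingface-cli login
# Or, for non-interactive jobs, export it as an environment variable
# (placeholder value shown):
export HF_TOKEN="hf_..."
```

The `datasets` package picks up the cached token (or `HF_TOKEN`) automatically when pushing in step 4.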
4. Use the `datasets` package to push the dataset to the hub, for example:

       repo_id = "RosettaCommons/MegaScale"
       dataset.push_to_hub(
           repo_id=repo_id,
           config_name=dataset_tag,
           data_dir=f"{dataset_tag}/data",
           commit_message=f"Upload {dataset_tag}")
5. This produces on HuggingFace:

       https://huggingface.co/datasets/{repo_id}/tree/main/{dataset_tag}/data/{split}-<split-chunk-index>-of-<n-split-chunks>.parquet
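The shard names in step 5 can be previewed with a short sketch; the zero-padded five-digit indices match the usual `datasets` shard naming (e.g. `train-00000-of-00002.parquet`), and the `shard_names` helper is an illustrative assumption:

```python
def shard_names(split, n_chunks):
    # {split}-<split-chunk-index>-of-<n-split-chunks>.parquet,
    # with both numbers zero-padded to five digits
    return [f"{split}-{i:05d}-of-{n_chunks:05d}.parquet"
            for i in range(n_chunks)]

print(shard_names("train", 2))
# ['train-00000-of-00002.parquet', 'train-00001-of-00002.parquet']
```

The number of chunks per split is chosen by `push_to_hub` based on shard size, so the exact `<n-split-chunks>` varies with the data.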