RosettaCommons/MolecularDatasetCurationGuide

maom committed on Feb 3

Commit 26849b9 · verified · 1 Parent(s): 9a6e5f9

Create examples/01_megascale_dataset

examples/01_megascale_dataset ADDED

+#### MegaScale dataset example
+Here is a working example from the [MegaScale dataset](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py).
+1. Process the data with scripts and store each split as a separate `.parquet` file:
+   `intermediate/<dataset_id>_<split_id>.parquet`
+2. Use the `datasets` package to load the local dataset into memory. See below for more examples of how to load different types of datasets.
+
+       import datasets
+
+       dataset_tag = "dataset3"
+       cache_dir = "path/to/scratch"
+       # Load one parquet file per split from ./intermediate
+       dataset = datasets.load_dataset(
+           "parquet",
+           name=dataset_tag,
+           data_dir="./intermediate",
+           data_files={
+               "train": f"{dataset_tag}_train.parquet",
+               "validation": f"{dataset_tag}_validation.parquet",
+               "test": f"{dataset_tag}_test.parquet"},
+           cache_dir=cache_dir,
+           keep_in_memory=True)
+3. Set up Personal Access Keys on HuggingFace (see above)
+4. Use the `datasets` package to push the dataset to the Hub, for example:
+
+       repo_id = "RosettaCommons/MegaScale"
+       # Upload every split under <dataset_tag>/data in the dataset repo
+       dataset.push_to_hub(
+           repo_id=repo_id,
+           config_name=dataset_tag,
+           data_dir=f"{dataset_tag}/data",
+           commit_message=f"Upload {dataset_tag}")
+5. This produces the following files on HuggingFace, under
+   `https://huggingface.co/datasets/{repo_id}/tree/main/{dataset_tag}/data/`:
+   `{split}-<split-chunk-index>-of-<n-split-chunks>.parquet`
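
Steps 1 to 5 above can be wrapped into a small script that checks the split files before loading them, which is one way to catch a misspelled file name early. The following is a minimal sketch, not the MegaScale authors' script: the helper names `split_files`, `missing_splits`, `hub_data_url`, and `load_and_push` are hypothetical, and it assumes the splits are named `train`, `validation`, and `test` and that a recent version of the `datasets` package is installed and a HuggingFace token is already configured.

```python
from pathlib import Path


def split_files(dataset_tag, splits=("train", "validation", "test")):
    """Map each split name to its expected file name, <dataset_tag>_<split>.parquet."""
    return {split: f"{dataset_tag}_{split}.parquet" for split in splits}


def missing_splits(data_dir, data_files):
    """Return the split names whose parquet file is absent from data_dir."""
    return [split for split, name in data_files.items()
            if not (Path(data_dir) / name).is_file()]


def hub_data_url(repo_id, dataset_tag):
    """Folder on the Hub where step 5 says the uploaded files appear."""
    return f"https://huggingface.co/datasets/{repo_id}/tree/main/{dataset_tag}/data/"


def load_and_push(dataset_tag, repo_id, data_dir="./intermediate", cache_dir=None):
    """Steps 2 and 4: load the local splits, then push them to the Hub."""
    import datasets  # imported lazily so the helpers above work without it

    data_files = split_files(dataset_tag)
    missing = missing_splits(data_dir, data_files)
    if missing:
        raise FileNotFoundError(f"missing split files for: {missing}")

    dataset = datasets.load_dataset(
        "parquet",
        name=dataset_tag,
        data_dir=data_dir,
        data_files=data_files,
        cache_dir=cache_dir,
        keep_in_memory=True)
    dataset.push_to_hub(
        repo_id=repo_id,
        config_name=dataset_tag,
        data_dir=f"{dataset_tag}/data",
        commit_message=f"Upload {dataset_tag}")
    return hub_data_url(repo_id, dataset_tag)
```

Calling `load_and_push("dataset3", "RosettaCommons/MegaScale", cache_dir="path/to/scratch")` would then mirror the example above and return the URL of the uploaded data folder. Building the `data_files` mapping from the split names keeps the file names consistent with the naming pattern in step 1.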