RosettaCommons/MolecularDatasetCurationGuide

maom committed on Feb 3

Commit cd44827 (verified) · 1 Parent(s): a59eefd

Create 03_create_dataset.md

sections/03_create_dataset.md +83 -0

	@@ -0,0 +1,83 @@

+## **3 Create Dataset**
+#### Set up Personal Access Tokens (PAT)
+See the Hugging Face help page on [security tokens](https://huggingface.co/docs/hub/en/security-tokens) for how to set one up. A token is needed to clone and push the repository using git.
+* Navigate to: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
+* Click Create New Token → fill out information
+* Save the token, e.g. in a password manager
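A saved token still has to reach git and the Python libraries. The sketch below shows one way to do that, assuming the token has been exported in the `HF_TOKEN` environment variable and the `huggingface_hub` package is installed. The helper names `get_hf_token`, `clone_url`, and `login_with_token` are hypothetical and not part of any library.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to your Hugging Face access token first.")
    return token


def clone_url(user_name: str, token: str, repo_path: str) -> str:
    """Build the HTTPS clone URL in the form used by the git workflow in this section."""
    return f"https://{user_name}:{token}@huggingface.co/{repo_path}"


def login_with_token() -> None:
    """Store the token locally so later uploads can authenticate."""
    # Imported lazily: assumes `pip install huggingface_hub` has been run.
    from huggingface_hub import login

    login(token=get_hf_token())
```

Reading the token from the environment keeps it out of scripts that might later be committed to the repository.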
+#### Data processing workflow overview
+1. Create pilot datasets in your personal space, then transfer them to the Rosetta Data Bazaar collection once they are ready
+   1. Click Name icon ⇒ [New → Dataset](https://huggingface.co/new)
+      1. Fill out dataset name
+   2. Navigate to "Files and Versions" → README.md
+   3. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
+2. Web-workflow
+   1. Edit README.md directly in the browser
+   2. Upload/delete other files directly
+3. Add any data processing scripts/workflows for reproducibility
+   1. `git clone https://<user_name>:<security_token>@huggingface.co/<repo_path>`
+   2. Create the analysis folder structure
+      src/            \# scripts for data curation
+      data/           \# store raw data for processing/curation
+      intermediate/   \# store processed/curated data for uploading
+   3. Add a `.gitignore` containing
+      data/\*
+      intermediate/\*
+   4. Use standard git workflow for modifying README.md and curation scripts
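The folder layout and `.gitignore` entries above can also be created by a small script run once inside the cloned repository. This is a minimal sketch: the function name `scaffold` and the constants `FOLDERS` and `IGNORED` are hypothetical, and only the folder names and ignore patterns come from the list above.

```python
from pathlib import Path

# Folder names and ignore patterns taken from the workflow list above.
FOLDERS = ("src", "data", "intermediate")
IGNORED = ("data/*", "intermediate/*")


def scaffold(repo_dir: str) -> Path:
    """Create the curation folders and make sure .gitignore lists the data folders."""
    root = Path(repo_dir)
    for name in FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)

    gitignore = root / ".gitignore"
    existing = gitignore.read_text().splitlines() if gitignore.exists() else []
    # Append only the patterns that are not already present, so reruns are harmless.
    missing = [pattern for pattern in IGNORED if pattern not in existing]
    if missing:
        gitignore.write_text("\n".join(existing + missing) + "\n")
    return root
```

Calling `scaffold(".")` from the repository root sets everything up. Because `data/` and `intermediate/` are ignored, only `src/`, `README.md`, and `.gitignore` go through the git workflow, while the data itself is uploaded separately.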
+#### Uploading data to HuggingFace
+Steps to upload data:
+1. Create the dataset locally using `datasets.load_dataset(...)`
+2. Call `dataset.push_to_hub(...)` on the result to upload the data
+
+For example:
+
+    import datasets
+
+    # Build a dataset from a local CSV file, held in memory
+    dataset = datasets.load_dataset(
+        "csv",
+        data_files="outcomes.csv",
+        keep_in_memory=True)
+    # Upload to the Hub under the given repository id
+    dataset.push_to_hub(repo_id="maomlab/example_dataset")
+
+***NOTE: Don't just drag-and-drop data, as it won't be possible to download the data remotely using `datasets.load_dataset(...)`***
+If your dataset is more complex:
+* See below in the section "**Structure of data in a HuggingFace datasets**" for guidance on how to organize the dataset
+* See other datasets in the Rosetta Data Bazaar
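One common step up in complexity is a dataset with separate train and test files. The sketch below assumes two CSV files with identical columns and a recent `datasets` release in which `push_to_hub` accepts `config_name` and `private`. The helper names `csv_header`, `check_same_columns`, and `upload_splits` are hypothetical.

```python
import csv


def csv_header(path: str) -> list[str]:
    """Read only the header row of a CSV file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def check_same_columns(data_files: dict[str, str]) -> list[str]:
    """Raise ValueError unless every split file has the same header; return that header."""
    headers = {split: csv_header(path) for split, path in data_files.items()}
    reference = next(iter(headers.values()))
    for split, header in headers.items():
        if header != reference:
            raise ValueError(f"Split {split!r} has columns {header}, expected {reference}")
    return reference


def upload_splits(repo_id: str, data_files: dict[str, str], config_name: str = "default"):
    """Load one CSV file per split and push them together as one configuration."""
    # Imported lazily: assumes `pip install datasets` has been run.
    import datasets

    check_same_columns(data_files)
    dataset = datasets.load_dataset("csv", data_files=data_files)
    # private=True is an assumption suited to pilot datasets; drop it for public data.
    dataset.push_to_hub(repo_id=repo_id, config_name=config_name, private=True)
    return dataset
```

A call might look like `upload_splits("maomlab/example_dataset", {"train": "train.csv", "test": "test.csv"})`, where the two file names are placeholders. Checking the headers first catches mismatched columns before anything is uploaded.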
+#### Downloading data from HuggingFace
+To load the dataset remotely:
+
+    import datasets
+
+    # repo_id is the "<user>/<dataset>" identifier, e.g. "maomlab/example_dataset"
+    dataset = datasets.load_dataset(path=repo_id)
+
+Optionally select a specific split and/or columns to download a subset:
+
+    dataset_tag = "<dataset_tag>"
+    dataset = datasets.load_dataset(
+        path=repo_id,
+        name=dataset_tag,
+        data_dir=dataset_tag,
+        cache_dir=cache_dir,
+        keep_in_memory=True)
+
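Split and column selection can be sketched with the `split` argument of `datasets.load_dataset` and the `select_columns` method. The helper names `load_kwargs` and `load_subset` are hypothetical. Note that `select_columns` narrows the table after loading and is not assumed here to reduce what is downloaded.

```python
from typing import Optional


def load_kwargs(
    repo_id: str,
    config: Optional[str] = None,
    split: Optional[str] = None,
    cache_dir: Optional[str] = None,
) -> dict:
    """Collect load_dataset arguments, leaving out any option that was not set."""
    candidates = {"path": repo_id, "name": config, "split": split, "cache_dir": cache_dir}
    return {key: value for key, value in candidates.items() if value is not None}


def load_subset(repo_id: str, columns: Optional[list] = None, **options):
    """Load a dataset (or one split of it) and optionally keep only some columns."""
    # Imported lazily: assumes `pip install datasets` has been run.
    import datasets

    dataset = datasets.load_dataset(**load_kwargs(repo_id, **options))
    if columns is not None:
        dataset = dataset.select_columns(columns)
    return dataset
```

For instance, `load_subset("maomlab/example_dataset", columns=["id"], split="train")` would return a single split with one column, where the column name `id` is a placeholder.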
+If needed, convert the data to pandas:
+
+    import pandas as pd
+
+    df = dataset["train"].to_pandas()