maom committed · verified · Commit cd44827 · 1 Parent(s): a59eefd

Create 03_create_dataset.md

sections/03_create_dataset.md ADDED (+83 −0)
## **3 Create Dataset**

#### Set up Personal Access Tokens (PAT)

See the help page on how to set up [security tokens](https://huggingface.co/docs/hub/en/security-tokens). A token is needed to clone/push the repository using git.

* Navigate to [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
* Click Create New Token → fill out the information
* Save the token, e.g. in a password manager

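To avoid pasting the token into commands, one option is to keep it in an environment variable; a minimal sketch (the variable name `HF_TOKEN` and its value here are placeholders, not real credentials):

```shell
# Keep the token in an environment variable (value below is a fake placeholder);
# in practice, set it from your shell profile or a password manager CLI
export HF_TOKEN="hf_xxxx_placeholder"

# The token can then be interpolated into authenticated URLs, e.g. for git clone
echo "https://<user_name>:${HF_TOKEN}@huggingface.co/<repo_path>"
```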
#### Data processing workflow overview

1. Create pilot datasets in your personal space and then, once ready, transfer them to the Rosetta Data Bazaar collection
    1. Click your name icon ⇒ [New → Dataset](https://huggingface.co/new)
        1. Fill out the dataset name
        2. Navigate to "Files and Versions" → README.md
        3. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
2. Web workflow
    1. Edit README.md directly in the browser
    2. Upload/delete other files directly
3. Add any data processing scripts/workflows for reproducibility
    1. `git clone https://<user_name>:<security_token>@huggingface.co/<repo_path>`
    2. Create the analysis folder structure:

            src/           # scripts for data curation
            data/          # raw data stored for processing/curation
            intermediate/  # processed/curated data stored for uploading

    3. Add a `.gitignore` so raw and intermediate data are not committed:

            data/*
            intermediate/*

    4. Use the standard git workflow for modifying README.md and the curation scripts

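The folder-structure and `.gitignore` steps above can be sketched as shell commands run inside the cloned repository (the clone itself is omitted here):

```shell
# Create the analysis folder structure inside the cloned repository
mkdir -p src data intermediate

# Ignore raw and intermediate data so only scripts and curated outputs are versioned
printf 'data/*\nintermediate/*\n' > .gitignore
cat .gitignore
```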
#### Uploading data to HuggingFace

Steps to upload data:

1. Create the dataset locally using `datasets.load_dataset(...)`
2. Call `dataset.push_to_hub(...)` to upload the data

For example:

    import datasets

    dataset = datasets.load_dataset(
        "csv",
        data_files = "outcomes.csv",
        keep_in_memory = True)

    dataset.push_to_hub(repo_id = "maomlab/example_dataset")

***NOTE: Don't just drag-and-drop data files into the repository, as it won't be possible to download the data remotely using `datasets.load_dataset(...)`***

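The upload example above assumes an `outcomes.csv` produced by your curation scripts; a minimal sketch of writing one with the standard library (the column names and values are hypothetical):

```python
import csv

# Hypothetical curated outcomes; in practice these come from your src/ scripts
rows = [
    {"compound_id": "C1", "outcome": 0.73},
    {"compound_id": "C2", "outcome": 0.12},
]

# Write a header row plus one row per record, matching what
# datasets.load_dataset("csv", ...) expects
with open("outcomes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["compound_id", "outcome"])
    writer.writeheader()
    writer.writerows(rows)
```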
If your dataset is more complex:

* see the section "**Structure of data in a HuggingFace datasets**" below for guidance on how to organize the dataset
* see other datasets in the Rosetta Data Bazaar

#### Downloading data from HuggingFace

To load the dataset remotely,

    dataset = datasets.load_dataset(path = repo_id)

Optionally, pass a configuration name and/or data directory to download a subset:

    dataset_tag = "<dataset_tag>"
    dataset = datasets.load_dataset(
        path = repo_id,
        name = dataset_tag,
        data_dir = dataset_tag,
        cache_dir = cache_dir,
        keep_in_memory = True)

If needed, convert the data to pandas:

    import pandas as pd

    df = dataset["train"].to_pandas()
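Once converted, the split is an ordinary DataFrame; a small sketch of typical downstream use (the columns here are hypothetical stand-ins for the `to_pandas()` output):

```python
import pandas as pd

# Hypothetical stand-in for dataset["train"].to_pandas()
df = pd.DataFrame({
    "compound_id": ["C1", "C2", "C3"],
    "outcome": [0.73, 0.12, 0.55],
})

# Typical next steps: filter and summarize
hits = df[df["outcome"] > 0.5]
print(len(hits))  # → 2
```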