## **3 Create Dataset**
#### Set up Personal Access Tokens (PAT)
See the help page on how to set up [security tokens](https://huggingface.co/docs/hub/en/security-tokens). A token is needed to clone and push the repository using git.
* Navigate to: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
* Click Create New Token → fill out information
* Save the token, e.g. in a password manager
After retrieving your personal access token, you can set up git with HuggingFace from the command line. Briefly, this looks like:
```
pip install huggingface_hub
huggingface-cli login
```
#### Data processing workflow overview
1. Create pilot datasets in your personal space, then once ready transfer them to the Rosetta Data Bazaar collection
   1. Click your name icon ⇒ [New → Dataset](https://huggingface.co/new)
      1. Fill out the dataset name
      2. Navigate to "Files and Versions" → README.md
      3. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
2. Web workflow
   1. Edit README.md directly in the browser
   2. Upload/delete other files directly
3. Add any data processing scripts/workflows for reproducibility
   1. `git clone https://huggingface.co/datasets/<username>/<repo-name>`
   2. Create an analysis folder structure, such as:
      ```
      src/          # scripts for data curation
      data/         # raw data for processing/curation
      intermediate/ # processed/curated data for uploading
      ```
   3. Add a `.gitignore`:
      ```
      data/*
      intermediate/*
      ```
   4. Use the standard git workflow for modifying README.md and curation scripts
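The repository setup in steps 1–3 above can be sketched as a short shell session (the clone URL uses the same `<username>`/`<repo-name>` placeholders, so it is left commented out):

```shell
# Clone the dataset repository, then work inside it:
# git clone https://huggingface.co/datasets/<username>/<repo-name>
# cd <repo-name>

# Create the analysis folder structure
mkdir -p src data intermediate

# Keep raw and intermediate data out of version control
printf 'data/*\nintermediate/*\n' > .gitignore
```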
#### Uploading data to HuggingFace
Steps to upload data:
1. Create the dataset locally using `datasets.load_dataset(...)`
2. Call `dataset.push_to_hub(...)` to upload the data
For example:
```
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
***NOTE: Don't just drag-and-drop data files into the repository, as then it won't be possible to download the data remotely using `datasets.load_dataset(...)`***
If your dataset is more complex:
* see the section "**Structure of data in a HuggingFace datasets**" below for guidance on how to organize the dataset
* see other datasets in the Rosetta Data Bazaar
#### Downloading data from HuggingFace
To load the dataset remotely:
```
dataset = datasets.load_dataset(path = repo_id)
```
Optionally, select a specific split and/or columns to download a subset:
```
dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path = repo_id,
    name = dataset_tag,
    data_dir = dataset_tag,
    cache_dir = cache_dir,
    keep_in_memory = True)
```
If needed, convert the data to a pandas DataFrame:
```
import pandas as pd

df = dataset["train"].to_pandas()
```