## **3 Create Dataset**
#### Set up Personal Access Tokens (PAT)
See the help page on how to set up [security tokens](https://huggingface.co/docs/hub/en/security-tokens). A token is needed to clone/push the repository using git.
* Navigate to: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
* Click Create New Token → fill out the form
* Save the token, e.g. in a password manager
After retrieving your personal access token, you can set up git with HuggingFace via the command line. Briefly, this looks like:
```
pip install huggingface_hub
huggingface-cli login  # paste the token when prompted
```
#### Data processing workflow overview
1. Create pilot datasets in your personal space, then transfer them to the Rosetta Data Bazaar collection once they are ready
   1. Click your name icon ⇒ [New → Dataset](https://huggingface.co/new)
   2. Fill out the dataset name
   3. Navigate to "Files and Versions" → README.md
   4. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
2. Web workflow
   1. Edit README.md directly in the browser
   2. Upload/delete other files directly
3. Add any data processing scripts/workflows for reproducibility
1. `git clone https://huggingface.co/datasets/<username>/<repo-name>`
2. Create an analysis folder structure, such as:
```
src/           # scripts for data curation
data/          # raw data for processing/curation
intermediate/  # processed/curated data for uploading
```
3. Add `.gitignore`
```
data/*
intermediate/*
```
4. Use the standard git workflow for modifying README.md and the curation scripts
#### Uploading data to HuggingFace
Steps to upload data:
1. Create the dataset locally using `datasets.load_dataset(...)`
2. Call `datasets.push_to_hub(...)` to upload the data
For example:
```
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
***NOTE: Don't just drag-and-drop data files into the web interface, as it then won't be possible to load the data remotely using `datasets.load_dataset(...)`***
If your dataset is more complex:
* see the section "**Structure of data in a HuggingFace datasets**" below for guidance on how to organize the dataset
* see other datasets in the Rosetta Data Bazaar for examples
#### Downloading data from HuggingFace
To load the dataset remotely:
```
import datasets

repo_id = "<username>/<repo-name>"
dataset = datasets.load_dataset(path = repo_id)
```
Optionally, select a specific split and/or subset to download only part of the data:
```
dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path = repo_id,
    name = dataset_tag,
    data_dir = dataset_tag,
    cache_dir = cache_dir,
    keep_in_memory = True)
```
If needed, convert the data to pandas:
```
import pandas as pd

df = dataset["train"].to_pandas()
```