3 Create Dataset
Set up Personal Access Tokens (PAT)
See the HuggingFace help page on how to set up access tokens. A token is needed to clone/push the repository using git.
- Navigate to: https://huggingface.co/settings/tokens
- Click Create New Token → fill out information
- Save the token, e.g. in a password manager
After retrieving your personal access token, you can set up git with HuggingFace via the command line. Briefly, this looks like:

```shell
pip install huggingface_hub
huggingface-cli login
```
Data processing workflow overview
- Create pilot datasets in your personal space and then, once ready, transfer them to the Rosetta Data Bazaar collection
- Click Name icon ⇒ New → Dataset
- Fill out dataset name
- Navigate to "Files and Versions" → README.md
- Fill out the top Dataset Card metadata (you can come back and fill out more details later)
- Web-workflow
  - Edit README.md directly in the browser
  - Upload/delete other files directly
- Git-workflow
  - Add any data processing scripts/workflows for reproducibility
  - Clone the repository:

    ```shell
    git clone https://huggingface.co/datasets/<username>/<repo-name>
    ```

  - Create an analysis folder structure, such as:

    ```
    src/           # scripts for data curation
    data/          # stored raw data for processing/curation
    intermediate/  # store processed/curated data for uploading
    ```

  - Add a `.gitignore`:

    ```
    data/*
    intermediate/*
    ```

  - Use the standard git workflow for modifying README.md and curation scripts
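The folder setup above can be sketched as a few shell commands (run inside the cloned dataset repository; the folder names are just the suggested convention):

```shell
# Create the suggested analysis folder structure.
mkdir -p src data intermediate

# Keep raw and intermediate data out of version control.
cat > .gitignore <<'EOF'
data/*
intermediate/*
EOF

# Confirm the layout.
ls -d src data intermediate
```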
Uploading data to HuggingFace
Steps to upload data
- Create the dataset locally using `datasets.load_dataset(...)`
- Call `dataset.push_to_hub(...)` to upload the data
For example:

```python
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files="outcomes.csv",
    keep_in_memory=True)
dataset.push_to_hub(repo_id="maomlab/example_dataset")
```
NOTE: Don't just drag-and-drop data files into the web interface, as it then won't be possible to download the data remotely using `datasets.load_dataset(...)`
If your dataset is more complex:
- see the section "Structure of data in a HuggingFace datasets" below for guidance on how to organize the dataset
- see other datasets in the Rosetta Data Bazaar
Downloading data from HuggingFace
To load the dataset remotely:

```python
dataset = datasets.load_dataset(path=repo_id)
```
Optionally, select specific splits and/or columns to download a subset:

```python
dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path=repo_id,
    name=dataset_tag,
    data_dir=dataset_tag,
    cache_dir=cache_dir,
    keep_in_memory=True)
```
If needed, convert the data to pandas:

```python
import pandas as pd

df = dataset["train"].to_pandas()
```