
3 Create Dataset

Set up Personal Access Tokens (PAT)

See the help page on how to set up access tokens. A token is needed to clone/push the repository using git.

After retrieving your personal access token, you can set up git with HuggingFace via the command line. Briefly, this looks like:

pip install huggingface_hub
huggingface-cli login

Data processing workflow overview

  1. Create pilot datasets in your personal space and, once ready, transfer them to the Rosetta Data Bazaar collection
    1. Click your name icon → New → Dataset
      1. Fill out dataset name
    2. Navigate to "Files and Versions" → README.md
    3. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
  2. Web-workflow
    1. Edit README.md directly in the browser
    2. Upload/delete other files directly
  3. Add any data processing scripts/workflows for reproducibility
    1. git clone https://huggingface.co/datasets/<username>/<repo-name>
    2. create analysis folder structure, such as:
      src/            # scripts for data curation
      data/           # raw data for processing/curation
      intermediate/   # processed/curated data for uploading
      
    3. Add .gitignore
      data/*
      intermediate/*
      
    4. Use standard git workflow for modifying README.md and curation scripts

Uploading data to HuggingFace

Steps to upload data

  1. Create the dataset locally using datasets.load_dataset(...)
  2. Call dataset.push_to_hub(...) on the resulting dataset object to upload the data

For example:

import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")

NOTE: Don't just drag-and-drop data files into the web interface, as it then won't be possible to download the data remotely using datasets.load_dataset(...)

If your dataset is more complex:

  • See below in the section "Structure of data in a HuggingFace datasets" for guidance on how to organize the dataset
  • See other datasets in the Rosetta Data Bazaar
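As a rough sketch of one common layout (an assumption, not a requirement: configurations stored one-per-subdirectory, which matches the `data_dir = dataset_tag` usage in the download example below):

```
example_dataset/
├── README.md          # dataset card; configurations declared in its YAML header
├── <dataset_tag>/     # one subdirectory per configuration
│   └── train.parquet
└── <other_tag>/
    └── train.parquet
```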

Downloading data from HuggingFace

To load the dataset remotely:

repo_id = "maomlab/example_dataset"
dataset = datasets.load_dataset(path = repo_id)

Optionally, select a specific configuration to download a subset:

dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path = repo_id,
    name = dataset_tag,
    data_dir = dataset_tag,
    cache_dir = cache_dir,
    keep_in_memory = True)

If needed, convert a split to a pandas DataFrame:

import pandas as pd

df = dataset['train'].to_pandas()