
3 Create Dataset

Set up Personal Access Tokens (PAT)

See the HuggingFace help page on how to set up security tokens. A token is needed to clone/push the repository using git.

Data processing workflow overview

  1. Create pilot datasets in your personal space and then, once ready, transfer them to the Rosetta Data Bazaar collection
    1. Click your name icon ⇒ New → Dataset
      1. Fill out the dataset name
    2. Navigate to "Files and Versions" → README.md
    3. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
  2. Web workflow
    1. Edit README.md directly in the browser
    2. Upload/delete other files directly
  3. Add any data processing scripts/workflows for reproducibility
    1. git clone https://<user_name>:<security_token>@huggingface.co/<repo_path>

    2. Create the analysis folder structure

      src/ # scripts for data curation

      data/ # stored raw data for processing/curation

      intermediate/ # store processed/curated data for uploading

    3. Add .gitignore

      data/*

      intermediate/*

    4. Use standard git workflow for modifying README.md and curation scripts
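The folder scaffold and .gitignore above can also be created programmatically. A minimal sketch in Python (the `scaffold` helper and the temporary directory are illustrative, not part of the guide's workflow):

```python
import pathlib
import tempfile

def scaffold(root):
    """Create the analysis folder structure and .gitignore inside root."""
    root = pathlib.Path(root)
    for subdir in ["src", "data", "intermediate"]:
        (root / subdir).mkdir(parents=True, exist_ok=True)
    # ignore the bulky raw and processed data; only scripts and
    # README.md are versioned in git
    (root / ".gitignore").write_text("data/*\nintermediate/*\n")
    return root

# demonstrate in a throw-away directory
with tempfile.TemporaryDirectory() as tmp:
    repo = scaffold(tmp)
    print(sorted(p.name for p in repo.iterdir()))
    # ['.gitignore', 'data', 'intermediate', 'src']
```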

Uploading data to HuggingFace

Steps to upload data

  1. Create the dataset locally using datasets.load_dataset(...)
  2. Call dataset.push_to_hub(...) to upload the data

For example

import datasets
dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")

NOTE: Don't just drag-and-drop data files, as it won't be possible to download the data remotely using datasets.load_dataset(...)

If your dataset is more complex

  • See the section "Structure of data in a HuggingFace dataset" below for guidance on how to organize the dataset
  • See other datasets in the Rosetta Data Bazaar

Downloading data from HuggingFace

To load the dataset remotely,

dataset = datasets.load_dataset(path = repo_id)

Optionally, select a specific configuration and/or split to download a subset

dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path = repo_id,
    name = dataset_tag,
    data_dir = dataset_tag,
    cache_dir = cache_dir,
    keep_in_memory = True)

If needed, convert the data to pandas

import pandas as pd
df = dataset['train'].to_pandas()