
3 Create Dataset

Set up Personal Access Tokens (PAT)

See the HuggingFace help page on how to set up security tokens. A token is needed to clone/push the repository using git.

Data processing workflow overview

  1. Create pilot datasets in your personal space and then, once ready, transfer them to the Rosetta Data Bazaar collection
    1. Click your name icon ⇒ New → Dataset
      1. Fill out the dataset name
    2. Navigate to "Files and Versions" → README.md
    3. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
  2. Web workflow
    1. Edit README.md directly in the browser
    2. Upload/delete other files directly
  3. Add any data processing scripts/workflows for reproducibility
    1. git clone https://<user_name>:<security_token>@huggingface.co/<repo_path>

    2. Create the analysis folder structure

      src/ # scripts for data curation

      data/ # stored raw data for processing/curation

      intermediate/ # store processed/curated data for uploading

    3. Add .gitignore

      data/*

      intermediate/*

    4. Use standard git workflow for modifying README.md and curation scripts
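The folder scaffold and .gitignore above can also be created programmatically. A minimal sketch in Python (the `scaffold` helper and the temporary directory are illustrative, not part of the guide's workflow):

```python
import pathlib
import tempfile

def scaffold(root):
    """Create the analysis folder structure and .gitignore inside root."""
    root = pathlib.Path(root)
    for subdir in ["src", "data", "intermediate"]:
        (root / subdir).mkdir(parents=True, exist_ok=True)
    # ignore the bulky raw and processed data; only scripts and
    # README.md are versioned in git
    (root / ".gitignore").write_text("data/*\nintermediate/*\n")
    return root

# demonstrate in a throw-away directory
with tempfile.TemporaryDirectory() as tmp:
    repo = scaffold(tmp)
    print(sorted(p.name for p in repo.iterdir()))
    # ['.gitignore', 'data', 'intermediate', 'src']
```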

Uploading data to HuggingFace

Steps to upload data

  1. Create the dataset locally using datasets.load_dataset(...)
  2. Call dataset.push_to_hub(...) to upload the data

For example

import datasets
dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")

NOTE: Don't just drag-and-drop data files, as it won't be possible to download the data remotely using datasets.load_dataset(...)

If your dataset is more complex

  • See the section "Structure of data in a HuggingFace dataset" below for guidance on how to organize the dataset
  • See other datasets in the Rosetta Data Bazaar

Downloading data from HuggingFace

To load the dataset remotely,

dataset = datasets.load_dataset(path = repo_id)

Optionally, select a specific configuration and/or split to download a subset

dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path = repo_id,
    name = dataset_tag,
    data_dir = dataset_tag,
    cache_dir = cache_dir,
    keep_in_memory = True)

If needed, convert the data to pandas

import pandas as pd
df = dataset['train'].to_pandas()