## **3 Create Dataset**
#### Set up Personal Access Tokens (PAT)
See the help page on how to set up [security tokens](https://huggingface.co/docs/hub/en/security-tokens). A token is needed to clone/push the repository using git.
* Navigate to: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
* Click Create New Token → fill out the form
* Save the token, e.g. in a password manager
After retrieving your personal access token, you can set up git with HuggingFace via the command line. Briefly, this looks like:
```
pip install huggingface_hub
huggingface-cli login  # paste the token when prompted
```
#### Data processing workflow overview
1. Create pilot datasets in your personal space, then transfer them to the Rosetta Data Bazaar collection once they are ready
   1. Click your name icon ⇒ [New → Dataset](https://huggingface.co/new)
   2. Fill out the dataset name
   3. Navigate to "Files and Versions" → README.md
   4. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
2. Web workflow
   1. Edit README.md directly in the browser
   2. Upload/delete other files directly
3. Add any data processing scripts/workflows for reproducibility
1. `git clone https://huggingface.co/datasets/<username>/<repo-name>`
2. Create an analysis folder structure, such as:
```
src/           # scripts for data curation
data/          # raw data for processing/curation
intermediate/  # processed/curated data for uploading
```
3. Add `.gitignore`
```
data/*
intermediate/*
```
4. Use the standard git workflow for modifying README.md and the curation scripts
#### Uploading data to HuggingFace
Steps to upload data:
1. Create the dataset locally using `datasets.load_dataset(...)`
2. Call `datasets.push_to_hub(...)` to upload the data
For example:
```
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
***NOTE: Don't just drag-and-drop data files into the web interface, as it then won't be possible to load the data remotely using `datasets.load_dataset(...)`***
If your dataset is more complex:
* see the section "**Structure of data in a HuggingFace datasets**" below for guidance on how to organize the dataset
* see other datasets in the Rosetta Data Bazaar for examples
#### Downloading data from HuggingFace
To load the dataset remotely:
```
import datasets

repo_id = "<username>/<repo-name>"
dataset = datasets.load_dataset(path = repo_id)
```
Optionally, select a specific split and/or subset to download only part of the data:
```
dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path = repo_id,
    name = dataset_tag,
    data_dir = dataset_tag,
    cache_dir = cache_dir,
    keep_in_memory = True)
```
If needed, convert the data to pandas:
```
import pandas as pd

df = dataset["train"].to_pandas()
```