## **3 Create Dataset**

#### Set up Personal Access Tokens (PAT)

See the help page on how to set up [security tokens](https://huggingface.co/docs/hub/en/security-tokens). A token is needed to clone/push the repository using git.

* Navigate to [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
* Click "Create new token" → fill out the information
* Save the token, e.g. in a password manager

After retrieving your personal access token, you can set up git with HuggingFace via the command line. Briefly, this looks like:

```
pip install huggingface_hub
huggingface-cli login
```

#### Data processing workflow overview

1. Create pilot datasets in your personal space and, once ready, transfer them to the Rosetta Data Bazaar collection
    1. Click your name icon ⇒ [New → Dataset](https://huggingface.co/new)
    2. Fill out the dataset name
    3. Navigate to "Files and Versions" → README.md
    4. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
2. Web workflow
    1. Edit README.md directly in the browser
    2. Upload/delete other files directly
3. Add any data processing scripts/workflows for reproducibility
    1. `git clone https://huggingface.co/datasets/<user>/<dataset>`
    2. Create an analysis folder structure, such as:

        ```
        src/           # scripts for data curation
        data/          # raw data stored for processing/curation
        intermediate/  # processed/curated data stored for uploading
        ```

    3. Add a `.gitignore`:

        ```
        data/*
        intermediate/*
        ```

    4. Use the standard git workflow for modifying README.md and the curation scripts

#### Uploading data to HuggingFace

Steps to upload data:

1. Create the dataset locally using `datasets.load_dataset(...)`
2. Call `dataset.push_to_hub(...)` to upload the data

For example:

```
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)
dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```

***NOTE: Don't just drag-and-drop data, as it won't be possible to download the data remotely using `datasets.load_dataset(...)`***

If your dataset is more complex:

* see the section "**Structure of data in a HuggingFace datasets**" below for guidance on how to organize the dataset
* see other datasets in the Rosetta Data Bazaar

#### Downloading data from HuggingFace

To load the dataset remotely:

```
dataset = datasets.load_dataset(path = repo_id)
```

Optionally, select a specific split and/or columns to download a subset:

```
dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path = repo_id,
    name = dataset_tag,
    data_dir = dataset_tag,
    cache_dir = cache_dir,
    keep_in_memory = True)
```

If needed, convert the data to pandas:

```
import pandas as pd

df = dataset["train"].to_pandas()
```