## **3 Create Dataset**

#### Set up Personal Access Tokens (PAT)

See the help page on how to set up [security tokens](https://huggingface.co/docs/hub/en/security-tokens). A token is needed to clone/push the repository using git.

* Navigate to: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)  
* Click Create New Token → fill out information  
* Save the token, e.g. in a password manager

After retrieving your personal access token, you can set up git with HuggingFace via the command line. Briefly, this looks like:
```
pip install huggingface_hub
huggingface-cli login
```

#### Data processing workflow overview

1. Create pilot datasets in personal space and then once ready transfer to the Rosetta Data Bazaar collection  
   1. Click Name icon ⇒ [New → Dataset](https://huggingface.co/new)  
      1. Fill out dataset name  
   2. Navigate to "Files and Versions" → README.md  
   3. Fill out the top Dataset Card metadata (you can come back and fill out more details later)  
2. Web-workflow  
   1. Edit README.md directly in the browser  
   2. Upload/delete other files directly  
3. Add any data processing scripts/workflows for reproducibility  
   1. `git clone https://huggingface.co/datasets/<username>/<repo-name>`  
   2. create analysis folder structure, such as:
      ```
      src/            # scripts for data curation
      data/           # stored raw data for processing/curation
      intermediate/   # store processed/curated data for uploading
      ```
   3. Add `.gitignore`
      ```
      data/*
      intermediate/*
      ```
   4. Use standard git workflow for modifying README.md and curation scripts
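Steps 3.2 and 3.3 above can be sketched as a short shell snippet. This assumes the repository has already been cloned (step 3.1); `example_dataset` is a hypothetical repo directory used for illustration:

```shell
# Assumes the dataset repo was already cloned (step 3.1);
# "example_dataset" is a hypothetical repo directory.
mkdir -p example_dataset/src example_dataset/data example_dataset/intermediate

# Keep raw and intermediate data out of version control
printf 'data/*\nintermediate/*\n' > example_dataset/.gitignore
```

From here, `README.md` and anything in `src/` are tracked with the usual `git add` / `git commit` / `git push` workflow, while `data/` and `intermediate/` stay local.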

#### Uploading data to HuggingFace

Steps to upload data:

1. Create the dataset locally using `datasets.load_dataset(...)`  
2. Call `dataset.push_to_hub(...)` on the resulting dataset object to upload the data

For example:

    import datasets  
    dataset = datasets.load_dataset(  
        "csv",  
        data_files = "outcomes.csv",  
        keep_in_memory = True)

    dataset.push_to_hub(repo_id = "maomlab/example_dataset")

***NOTE: Don't just drag-and-drop data files into the web interface, as it won't be possible to download the data remotely using `datasets.load_dataset(...)`***

If your dataset is more complex:

* See the section "**Structure of data in a HuggingFace datasets**" below for guidance on how to organize the dataset  
* See other datasets in the Rosetta Data Bazaar


#### Downloading data from HuggingFace

To load the dataset remotely:

    dataset = datasets.load_dataset(path = repo_id)

Optionally, select a specific configuration and data directory to download a subset:

    dataset_tag = "<dataset_tag>"  
    dataset = datasets.load_dataset(  
        path = repo_id,  
        name = dataset_tag,  
        data_dir = dataset_tag,  
        cache_dir = cache_dir,  
        keep_in_memory = True)

If needed, convert the data to pandas:

    import pandas as pd  
    df = dataset["train"].to_pandas()