## **4 Add an informative README.md**
The `README.md` is a markdown file that is displayed when a user visits the front page of the dataset. It should give appropriate context for the dataset and guidance on how to use it. As a template, consider having the following sections, where the parts in angle brackets should be filled in. See the [MIP](https://huggingface.co/datasets/RosettaCommons/MIP/blob/main/README.md) dataset as an example.
```markdown
# <DATASET TITLE>

<SHORT DESCRIPTIVE ABSTRACT OF THE DATASET>

## Quickstart Usage

### Install the HuggingFace Datasets package

Each subset can be loaded into Python using the HuggingFace
[datasets](https://huggingface.co/docs/datasets/index) library.
First, from the command line, install the `datasets` library

    $ pip install datasets

Optionally set the cache directory, e.g.

    $ HF_HOME=${HOME}/.cache/huggingface/
    $ export HF_HOME

then, from within Python, load the datasets library

    >>> import datasets

### Load model datasets

To load one of the <DATASET ID> model datasets, use `datasets.load_dataset(...)`:

    >>> dataset_tag = "<DATASET TAG>"
    >>> dataset = datasets.load_dataset(
            path = "<HF PATH TO DATASET>",
            name = f"{dataset_tag}",
            data_dir = f"{dataset_tag}")['train']

and the dataset is loaded as a `datasets.arrow_dataset.Dataset`

    >>> dataset
    <RESULT OF LOADING DATASET MODEL>

which is a column-oriented format that can be accessed directly, converted into a
`pandas.DataFrame`, or written to `parquet` format, e.g.

    >>> dataset.data.column('<COLUMN NAME IN DATASET>')
    >>> dataset.to_pandas()
    >>> dataset.to_parquet("dataset.parquet")

### <BRIEF EXAMPLE OF HOW TO USE DIFFERENT PARTS OF THE DATASET>

## Dataset Details

### Dataset Description

<DETAILED DESCRIPTION OF DATASET>

- **Acknowledgements:** <ACKNOWLEDGEMENTS>
- **License:** <LICENSE>

### Dataset Sources

- **Repository:** <URL FOR SOURCE OF DATA>
- **Paper:** <APA CITATION REFERENCE FOR SOURCE DATA>
- **Zenodo Repository:** <ZENODO LINK IF RELEVANT>

## Uses

<DESCRIPTION OF INTENDED USE OF DATASET>

### Out-of-Scope Use

<DESCRIPTION OF OUT-OF-SCOPE USES OF DATASET>

### Source Data

<DESCRIPTION OF SOURCE DATA>

## Citation

<BIBTEX REFERENCE FOR DATASET>

## Dataset Card Authors

<NAME/INFO OF DATASET AUTHORS>
```
| | :---- | | |
| ## **5 Add Metadata to the Dataset Card** | |
| ### **Overview** | |
At the top of the `README.md` file, include metadata about the dataset in YAML format:
| \--- | |
| language: … | |
| license: … | |
| size\_categories: … | |
| pretty\_name: '...' | |
| tags: … | |
| dataset\_summary: … | |
| dataset\_description: … | |
| acknowledgements: … | |
| repo: … | |
| citation\_bibtex: … | |
| citation\_apa: … | |
| \--- | |
For the full spec, see the Dataset Card specification:
| * [Dataset Card Documentation](https://huggingface.co/docs/hub/en/datasets-cards) | |
| * [Dataset Card Specification](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1) | |
| * [Dataset Card Template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md) | |
To allow the datasets to be loaded automatically through the `datasets` Python library, additional info needs to be included in the header of the `README.md`, in particular the `configs:` and `dataset_info:` fields. These should reflect how the [repository is structured](https://huggingface.co/docs/datasets/en/repository_structure).
While it is possible to create these by hand, it is highly recommended to let them be generated automatically at upload time: load the dataset locally with [datasets.load\_dataset(...)](https://huggingface.co/docs/datasets/en/loading), then push it to the hub with [datasets.push\_to\_hub(...)](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub)
* [Example of uploading data using `push_to_hub()`](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py)
* See below for more details about how to use `push_to_hub(...)` with different common formats
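The load-then-push workflow can be sketched as below. The repo id, config name, and file path are hypothetical placeholders, and the `datasets` import is deferred into the function so the sketch can be read without the library installed or network access.

```python
def upload_subset(csv_path, repo_id, config_name):
    """Load a local CSV as a HuggingFace dataset, then push it to the Hub.

    Uploading with push_to_hub() lets the Hub generate the `configs:` and
    `dataset_info:` README metadata automatically instead of writing it by
    hand. Requires `pip install datasets` and a prior `huggingface-cli login`.
    """
    import datasets  # deferred: only needed when actually uploading

    # Build the dataset locally; load_dataset infers the schema from the CSV.
    dataset = datasets.load_dataset("csv", data_files=csv_path)

    # Push under a named config so several subsets can share one repository.
    dataset.push_to_hub(repo_id, config_name=config_name)


# Example call (hypothetical identifiers):
# upload_subset("subset_a.csv", "RosettaCommons/ExampleDataset", "subset_a")
```

After the push, the Hub writes the `configs:` and `dataset_info:` entries into the README header for you, matching the actual repository layout.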
| ### **Metadata fields** | |
| #### License | |
| * If the dataset is licensed under an existing standard license, then use it | |
| * If it is unclear, then the authors need to be contacted for clarification | |
* To license it under the Rosetta License:
  * Add the following to the dataset card:

        license: other
        license_name: rosetta-license-1.0
        license_link: LICENSE.md

  * Upload the Rosetta [LICENSE.md](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md) to the dataset repository
| #### Citation | |
* If the dataset has a DOI (e.g. associated with a published paper), use [doi2bib.org](http://doi2bib.org) to generate the BibTeX citation
* Use a [DOI → APA converter](https://paperpile.com/t/doi-to-apa-converter/) for the APA citation
| #### tags | |
* Standard tags for searching HuggingFace datasets
| * typically: | |
  - biology
  - chemistry
| #### repo | |
* GitHub repository, figshare, or similar URL for the data or project
| #### citation\_bibtex | |
| * Citation in bibtex format | |
* You can use [doi2bib.org](https://www.doi2bib.org/) to generate it from a DOI
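For illustration, a doi2bib-style BibTeX entry has the shape below. The authors, title, journal, and DOI here are placeholders, not a real reference.

```bibtex
@article{Doe2024Example,
  author  = {Doe, Jane and Smith, John},
  title   = {An example dataset of protein stability measurements},
  journal = {Journal of Placeholder Science},
  year    = {2024},
  volume  = {1},
  pages   = {1--10},
  doi     = {10.0000/placeholder}
}
```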
| #### citation\_apa | |
| * Citation in APA format |