## **4 Add an informative README.md**

The `README.md` is a markdown file that is displayed when someone visits the front page of the dataset. It should give appropriate context for the dataset and guidance on how to use it. As a template, consider having these sections, where the parts in brackets should be filled in. See the [MIP](https://huggingface.co/datasets/RosettaCommons/MIP/blob/main/README.md) dataset as an example.

    # <...>
    <...>

    ## Quickstart Usage

    ### Install HuggingFace Datasets package
    Each subset can be loaded into python using the HuggingFace [datasets](https://huggingface.co/docs/datasets/index) library.
    First, from the command line install the `datasets` library

        $ pip install datasets

    Optionally set the cache directory, e.g.

        $ HF_HOME=${HOME}/.cache/huggingface/
        $ export HF_HOME

    then, from within python load the datasets library

        >>> import datasets

    ### Load model datasets
    To load one of the <...> model datasets, use `datasets.load_dataset(...)`:

        >>> dataset_tag = "<...>"
        >>> dataset = datasets.load_dataset(
              path = "<...>",
              name = f"{dataset_tag}",
              data_dir = f"{dataset_tag}")['train']

    and the dataset is loaded as a `datasets.arrow_dataset.Dataset`

        >>> dataset
        <...>

    which is a column-oriented format that can be accessed directly, converted to a `pandas.DataFrame`, or written in `parquet` format, e.g.

        >>> dataset.data.column('<...>')
        >>> dataset.to_pandas()
        >>> dataset.to_parquet("dataset.parquet")

    ### <...>

    ## Dataset Details

    ### Dataset Description
    <...>

    - **Acknowledgements:** <...>
    - **License:** <...>

    ### Dataset Sources
    - **Repository:** <...>
    - **Paper:** <...>
    - **Zenodo Repository:** <...>

    ## Uses
    <...>

    ### Out-of-Scope Use
    <...>

    ### Source Data
    <...>

    ## Citation
    <...>

    ## Dataset Card Authors
    <...>

## **5 Add Metadata to the Dataset Card**

### **Overview**

At the top of the `README.md` file, include metadata about the dataset in YAML format:

    ---
    language: …
    license: …
    size_categories: …
    pretty_name: '...'
    tags: …
    dataset_summary: …
    dataset_description: …
    acknowledgements: …
    repo: …
    citation_bibtex: …
    citation_apa: …
    ---

For the full spec, see the Dataset Card specification:

* [Dataset Card Documentation](https://huggingface.co/docs/hub/en/datasets-cards)
* [Dataset Card Specification](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1)
* [Dataset Card Template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md)

To allow the datasets to be loaded automatically through the `datasets` python library, additional information needs to be in the header of the `README.md`. It should reflect how the [repository is structured](https://huggingface.co/docs/datasets/en/repository_structure):

    configs:
    dataset_info:

While it is possible to write these sections by hand, it is highly recommended to let them be generated automatically on upload: load the dataset locally with [datasets.load\_dataset(...)](https://huggingface.co/docs/datasets/en/loading), then push it to the Hub with [DatasetDict.push\_to\_hub(...)](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub).

* [Example of uploading data using push\_to\_hub()](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py)
* See below for more details about how to use push\_to\_hub(...) for different common formats

### **Metadata fields**

#### License

* If the dataset is licensed under an existing standard license, then use it
* If the license is unclear, then the authors need to be contacted for clarification
* Licensing it under the Rosetta License
    * Add the following to the dataset card:

          license: other
          license_name: rosetta-license-1.0
          license_link: LICENSE.md

    * Upload the Rosetta [LICENSE.md](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md) to the Dataset

#### Citation

* If the dataset has a DOI (e.g. associated with a published paper), use [doi2bib.org](http://doi2bib.org) to generate a BibTeX citation
* For APA format, use the [DOI → APA converter](https://paperpile.com/t/doi-to-apa-converter/)

#### tags

* Standard tags for searching for HuggingFace datasets
* Typically:

      - biology
      - chemistry

#### repo

* GitHub, figshare, or other repository URL for the data or project

#### citation\_bibtex

* Citation in BibTeX format
* You can use https://www.doi2bib.org/

#### citation\_apa

* Citation in APA format
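
The fields above can be assembled into the YAML header programmatically. The following is a minimal sketch using only the Python standard library; `render_front_matter` is a hypothetical helper (not part of the `datasets` library), and every value in `example_metadata` is a placeholder to be replaced with the real dataset's details. It does not produce the `configs` or `dataset_info` sections, which are best left to `push_to_hub(...)`.

```python
import json


def render_front_matter(metadata: dict) -> str:
    """Render a flat metadata dict as a YAML front-matter block.

    Values are written as JSON literals. A JSON string is also a valid
    double-quoted YAML scalar, so colons, quotes, and newlines
    (e.g. in a BibTeX entry) need no special handling.
    """
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            # Lists (such as tags) become a YAML block sequence.
            lines.append(f"{key}:")
            lines.extend(f"- {json.dumps(item)}" for item in value)
        else:
            lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
example_metadata = {
    "license": "other",
    "license_name": "rosetta-license-1.0",
    "license_link": "LICENSE.md",
    "pretty_name": "Placeholder Dataset Name",
    "tags": ["biology", "chemistry"],
    "repo": "https://example.org/placeholder-project",
    "citation_bibtex": "@misc{placeholder,\n  title={Placeholder title}\n}",
    "citation_apa": "Placeholder citation in APA format.",
}

front_matter = render_front_matter(example_metadata)
print(front_matter)
```

The printed block can be pasted at the very top of `README.md`, above the first heading. Writing every string in double quotes is more verbose than hand-written YAML, but it avoids quoting mistakes in fields such as `citation_bibtex` that contain braces and line breaks.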