## **4 Add an informative README.md**
The `README.md` is a markdown file that is displayed on the front page of the dataset. It should give appropriate context for the dataset and guidance on how to use it. As a template, consider including the following sections, where the parts in angle brackets should be filled in. See the [MIP](https://huggingface.co/datasets/RosettaCommons/MIP/blob/main/README.md) dataset as an example.
    # <DATASET TITLE>

    <SHORT DESCRIPTIVE ABSTRACT OF THE DATASET>

    ## Quickstart Usage

    ### Install the HuggingFace Datasets package

    Each subset can be loaded into Python using the HuggingFace
    [datasets](https://huggingface.co/docs/datasets/index) library.
    First, from the command line, install the `datasets` library:

        $ pip install datasets

    Optionally set the cache directory, e.g.

        $ HF_HOME=${HOME}/.cache/huggingface/
        $ export HF_HOME

    Then, from within Python, load the datasets library:

        >>> import datasets

    ### Load model datasets

    To load one of the <DATASET ID> model datasets, use `datasets.load_dataset(...)`:

        >>> dataset_tag = "<DATASET TAG>"
        >>> dataset = datasets.load_dataset(
        ...     path="<HF PATH TO DATASET>",
        ...     name=f"{dataset_tag}",
        ...     data_dir=f"{dataset_tag}")['train']

    The dataset is loaded as a `datasets.arrow_dataset.Dataset`:

        >>> dataset
        <RESULT OF LOADING DATASET MODEL>

    which is a column-oriented format that can be accessed directly, or converted
    to a `pandas.DataFrame` or `parquet` format, e.g.

        >>> dataset.data.column('<COLUMN NAME IN DATASET>')
        >>> dataset.to_pandas()
        >>> dataset.to_parquet("dataset.parquet")

    ### <BRIEF EXAMPLE OF HOW TO USE DIFFERENT PARTS OF THE DATASET>

    ## Dataset Details

    ### Dataset Description

    <DETAILED DESCRIPTION OF DATASET>

    - **Acknowledgements:** <ACKNOWLEDGEMENTS>
    - **License:** <LICENSE>

    ### Dataset Sources

    - **Repository:** <URL FOR SOURCE OF DATA>
    - **Paper:** <APA CITATION REFERENCE FOR SOURCE DATA>
    - **Zenodo Repository:** <ZENODO LINK IF RELEVANT>

    ## Uses

    <DESCRIPTION OF INTENDED USE OF DATASET>

    ### Out-of-Scope Use

    <DESCRIPTION OF OUT OF SCOPE USES OF DATASET>

    ### Source Data

    <DESCRIPTION OF SOURCE DATA>

    ## Citation

    <BIBTEX REFERENCE FOR DATASET>

    ## Dataset Card Authors

    <NAME/INFO OF DATASET AUTHORS>
## **5 Add Metadata to the Dataset Card**
### **Overview**
At the top of the `README.md` file, include metadata about the dataset in YAML format:
    ---
    language: ...
    license: ...
    size_categories: ...
    pretty_name: '...'
    tags: ...
    dataset_summary: ...
    dataset_description: ...
    acknowledgements: ...
    repo: ...
    citation_bibtex: ...
    citation_apa: ...
    ---
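For illustration, a filled-in header might look like the following. Every value here is a hypothetical placeholder, to be replaced with the dataset's own details:

```yaml
---
language:
- en
license: mit
size_categories:
- 10K<n<100K
pretty_name: 'Example Molecular Dataset'
tags:
- biology
- chemistry
dataset_summary: One-sentence summary displayed in dataset search results.
---
```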
For the full spec, see the Dataset Card specification:
* [Dataset Card Documentation](https://huggingface.co/docs/hub/en/datasets-cards)
* [Dataset Card Specification](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1)
* [Dataset Card Template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md)
To allow the datasets to be loaded automatically through the `datasets` Python library, additional info needs to be in the header of the `README.md`. It should reflect how the [repository is structured](https://huggingface.co/docs/datasets/en/repository_structure):

    configs:
    dataset_info:
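As a sketch of the expected shape of these two blocks, a single-subset example might look like the following. The subset name, paths, and features are hypothetical placeholders, and `dataset_info` (including split sizes) is normally generated automatically rather than written by hand:

```yaml
configs:
- config_name: example_subset
  data_files:
  - split: train
    path: example_subset/train-*
dataset_info:
- config_name: example_subset
  features:
  - name: smiles
    dtype: string
  splits:
  - name: train
    num_examples: 1000
```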
While it is possible to create these by hand, it is highly recommended to let them be created automatically on upload, by loading the dataset locally with [datasets.load_dataset(...)](https://huggingface.co/docs/datasets/en/loading) and then pushing it to the Hub with [datasets.push_to_hub(...)](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub)
* [Example of uploading data using push\_to\_hub()](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py)
* See below for more details about how to use push\_to\_hub(...) for different common formats
### **Metadata fields**
#### License
* If the dataset is licensed under an existing standard license, then use it
* If it is unclear, then the authors need to be contacted for clarification
* If licensing it under the Rosetta License:
  * Add the following to the dataset card:

        license: other
        license_name: rosetta-license-1.0
        license_link: LICENSE.md

  * Upload the Rosetta [LICENSE.md](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md) to the dataset repository
#### Citation
* If the dataset has a DOI (e.g. associated with a published paper), use [doi2bib.org](http://doi2bib.org) to generate a BibTeX citation
* For APA format, use the [DOI → APA converter](https://paperpile.com/t/doi-to-apa-converter/)
#### tags
* Standard tags for searching for HuggingFace datasets
* Typically:

        - biology
        - chemistry
#### repo
* GitHub repository, figshare, etc. URL for the data or project
#### citation\_bibtex
* Citation in bibtex format
* You can use https://www.doi2bib.org/
#### citation\_apa
* Citation in APA format