## **4 Add an informative README.md**

The `README.md` is a markdown file that is displayed when someone visits the dataset's front page. It should give appropriate context for the dataset and guidance on how to use it. As a template, consider including the following sections, where the parts in angle brackets should be filled in. See the [MIP](https://huggingface.co/datasets/RosettaCommons/MIP/blob/main/README.md) dataset as an example.

    # <DATASET TITLE>
    <short descriptive abstract of the dataset>

    ## Quickstart Usage

    ### Install the HuggingFace Datasets package

    Each subset can be loaded into Python using the HuggingFace
    [datasets](https://huggingface.co/docs/datasets/index) library.
    First, from the command line, install the `datasets` library

        $ pip install datasets

    Optionally set the cache directory, e.g.

        $ HF_HOME=${HOME}/.cache/huggingface/
        $ export HF_HOME

    then, from within Python, load the datasets library

        >>> import datasets

    ### Load model datasets

    To load one of the <DATASET ID> model datasets, use `datasets.load_dataset(...)`:

        >>> dataset_tag = "<DATASET TAG>"
        >>> dataset = datasets.load_dataset(
        ...     path = "<HF PATH TO DATASET>",
        ...     name = f"{dataset_tag}",
        ...     data_dir = f"{dataset_tag}")['train']

    and the dataset is loaded as a `datasets.arrow_dataset.Dataset`

        >>> dataset
        <RESULT OF LOADING DATASET MODEL>

    which is a column-oriented format that can be accessed directly, converted into a `pandas.DataFrame`, or written to `parquet` format, e.g.

        >>> dataset.data.column('<COLUMN NAME IN DATASET>')
        >>> dataset.to_pandas()
        >>> dataset.to_parquet("dataset.parquet")

    ### <BRIEF EXAMPLE OF HOW TO USE DIFFERENT PARTS OF THE DATASET>

    ## Dataset Details

    ### Dataset Description
    <DETAILED DESCRIPTION OF DATASET>

    - **Acknowledgements:** <ACKNOWLEDGEMENTS>
    - **License:** <LICENSE>

    ### Dataset Sources
    - **Repository:** <URL FOR SOURCE OF DATA>
    - **Paper:** <APA CITATION REFERENCE FOR SOURCE DATA>
    - **Zenodo Repository:** <ZENODO LINK IF RELEVANT>

    ## Uses
    <DESCRIPTION OF INTENDED USE OF DATASET>

    ### Out-of-Scope Use
    <DESCRIPTION OF OUT-OF-SCOPE USES OF DATASET>

    ### Source Data
    <DESCRIPTION OF SOURCE DATA>

    ## Citation
    <BIBTEX REFERENCE FOR DATASET>

    ## Dataset Card Authors
    <NAME/INFO OF DATASET AUTHORS>

## **5 Add Metadata to the Dataset Card**

### **Overview**

At the top of the `README.md` file, include metadata about the dataset in YAML format:

    ---
    language: …
    license: …
    size_categories: …
    pretty_name: '...'
    tags: …
    dataset_summary: …
    dataset_description: …
    acknowledgements: …
    repo: …
    citation_bibtex: …
    citation_apa: …
    ---
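
As an illustrative sketch only, a filled-in header might look like the following; every value here is a hypothetical placeholder, not a real dataset's metadata:

```yaml
---
language:
- en
license: mit
size_categories:
- 10K<n<100K
pretty_name: 'Example Protein Stability Dataset'
tags:
- biology
- chemistry
dataset_summary: One-sentence summary shown in dataset search results.
repo: https://github.com/example-org/example-dataset
---
```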

For the full spec, see the Dataset Card specification:

* [Dataset Card Documentation](https://huggingface.co/docs/hub/en/datasets-cards)
* [Dataset Card Specification](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1)
* [Dataset Card Template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md)

To allow the datasets to be loaded automatically through the `datasets` Python library, additional information needs to be included in the header of the `README.md`. It should reflect how the [repository is structured](https://huggingface.co/docs/datasets/en/repository_structure):

    configs:
    dataset_info:
While it is possible to create these by hand, it is highly recommended to let them be created automatically on upload: load the dataset locally with [datasets.load_dataset(...)](https://huggingface.co/docs/datasets/en/loading), then push it to the hub with [datasets.push_to_hub(...)](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub).

* [Example of uploading data using push_to_hub()](https://huggingface.co/datasets/RosettaCommons/MegaScale/blob/main/src/03.1_upload_data.py)
* See below for more details about how to use `push_to_hub(...)` for different common formats

### **Metadata fields**

#### License

* If the dataset is licensed under an existing standard license, use that license
* If the licensing is unclear, contact the authors for clarification
* If licensing it under the Rosetta License:
  * Add the following to the dataset card:

        license: other
        license_name: rosetta-license-1.0
        license_link: LICENSE.md

  * Upload the Rosetta [LICENSE.md](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md) to the dataset

#### Citation

* If the dataset has a DOI (e.g. one associated with a published paper), use [doi2bib.org](http://doi2bib.org) to generate a BibTeX entry
* To produce an APA-style reference, use the [DOI → APA converter](https://paperpile.com/t/doi-to-apa-converter/)

#### tags

* Standard tags used when searching for HuggingFace datasets
* Typically:
  - biology
  - chemistry

#### repo

* GitHub repository, figshare, or other URL for the data or project

#### citation_bibtex

* Citation in BibTeX format
* You can use https://www.doi2bib.org/

#### citation_apa

* Citation in APA format