5 Add Metadata to the Dataset Card

Overview

At the top of the `README.md` file, include metadata about the dataset in YAML format:
---
language: …
license: …
size_categories: …
pretty_name: '...'
tags: …
dataset_summary: …
dataset_description: …
acknowledgements: …
repo: …
citation_bibtex: …
citation_apa: …
---

For the full spec, see the Dataset Card specification.
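As a quick sanity check, the header can be scanned for the fields in the template above. The following is a minimal sketch using only the Python standard library; the function names `front_matter_keys` and `missing_keys` and the example README text are hypothetical, and the parser only recognizes simple top-level `key:` lines rather than full YAML.

```python
import re

# Keys taken from the metadata template above.
EXPECTED_KEYS = [
    "language", "license", "size_categories", "pretty_name", "tags",
    "dataset_summary", "dataset_description", "acknowledgements",
    "repo", "citation_bibtex", "citation_apa",
]


def front_matter_keys(readme_text: str) -> list:
    """Return the top-level keys of the YAML block at the top of a README."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return []  # no metadata block at the top of the file
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            return keys  # closing delimiter reached
        # A top-level key starts in column 0 and is followed by a colon.
        match = re.match(r"([A-Za-z_][\w-]*):", line)
        if match:
            keys.append(match.group(1))
    return []  # the block was never closed


def missing_keys(readme_text: str) -> list:
    """List the expected keys that are absent from the header."""
    present = set(front_matter_keys(readme_text))
    return [key for key in EXPECTED_KEYS if key not in present]


# Hypothetical README content, for illustration only.
example = """---
license: mit
pretty_name: 'Example'
tags:
- biology
---
# Example dataset
"""
print(front_matter_keys(example))
print(missing_keys(example))
```

For anything beyond a quick check, a real YAML parser is the safer choice, since this sketch ignores nested structures and quoting rules.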

To allow the dataset to be loaded automatically through the `datasets` Python library, the header of the `README.md` needs additional information that reflects how the repository is structured:
configs:
dataset_info:

While it is possible to write these fields by hand, it is highly recommended to let them be generated automatically on upload: load the dataset locally with `datasets.load_dataset(...)`, then push it to the Hub by calling `push_to_hub(...)` on the loaded dataset.
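The load-then-push workflow might look like the sketch below. It assumes the `datasets` library is installed, that you are already authenticated with the Hub, and that the local directory has a layout `load_dataset` can infer. The function name `upload_dataset` and its arguments are hypothetical placeholders, not part of the guide.

```python
def upload_dataset(local_dir: str, repo_id: str, config_name: str = "default") -> None:
    """Load a dataset from a local directory and push it to the Hub.

    Pushing this way lets the library write the `configs` and
    `dataset_info` entries into the README.md header.
    """
    # Imported lazily so this sketch can be read without `datasets` installed.
    from datasets import load_dataset

    # Infer splits and file formats from the local directory layout.
    dataset = load_dataset(local_dir)

    # Upload the data and update the dataset card metadata on the Hub.
    dataset.push_to_hub(repo_id, config_name=config_name)
```

A call would then look like `upload_dataset("path/to/local_data", "your-org/your-dataset")`, where both arguments are placeholders for your own paths and repository name.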

Metadata fields

License

  • If the dataset is already released under a standard license, use that license
  • If the license is unclear, contact the authors for clarification
  • To license the dataset under the Rosetta License:
    • Add the following to the dataset card:

      license: other

      license_name: rosetta-license-1.0

      license_link: LICENSE.md

    • Upload the Rosetta LICENSE.md to the Dataset
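The two licensing cases above can be sketched in Python. The helper names `license_metadata` and `upload_license` are hypothetical; the upload step assumes the `huggingface_hub` package is installed, that you are authenticated, and that a copy of the Rosetta `LICENSE.md` already exists locally.

```python
from typing import Optional


def license_metadata(standard_license: Optional[str] = None) -> dict:
    """Return the license fields for the dataset card header."""
    if standard_license:
        # An existing standard license only needs its identifier.
        return {"license": standard_license}
    # Fields listed above for the Rosetta License.
    return {
        "license": "other",
        "license_name": "rosetta-license-1.0",
        "license_link": "LICENSE.md",
    }


def upload_license(repo_id: str, license_path: str = "LICENSE.md") -> None:
    """Upload a local license file to the root of a dataset repository."""
    # Imported lazily so the metadata helper works without huggingface_hub.
    from huggingface_hub import HfApi

    HfApi().upload_file(
        path_or_fileobj=license_path,
        path_in_repo="LICENSE.md",
        repo_id=repo_id,
        repo_type="dataset",
    )


print(license_metadata())
```

Keeping `path_in_repo` as `LICENSE.md` matches the `license_link` value in the header, so the link on the dataset card resolves to the uploaded file.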

Citation

tags

  • Standard tags used to search for datasets on Hugging Face

  • typically:

    - biology

    - chemistry

repo

  • URL of the GitHub repository, Figshare record, or similar location for the data or project

citation_bibtex

  • Citation in BibTeX format

citation_apa

  • Citation in APA format