---
language:
  - en
pipeline_tag: feature-extraction
library_name: transformers
license: apache-2.0
---

# AnnualBERT: A Time-Series of Language Models for Understanding the Evolution of Science

This repository contains the AnnualBERT series of language models, designed to capture the temporal evolution of scientific text. AnnualBERT uses whole words as tokens and consists of a base RoBERTa model pre-trained on arXiv papers published until 2008, along with a collection of annually trained models reflecting the progression of scientific knowledge over time.

Paper: *Towards understanding evolution of science through language model series*

## Model Details

AnnualBERT models offer several key advantages:

- **Specialized for scientific content:** trained on a large corpus of arXiv papers, ensuring deep familiarity with scientific terminology and concepts.
- **Versatile applications:** suitable for various NLP tasks, including text classification, keyword extraction, summarization, and citation prediction.
- **Evolutionary insights:** the time-series nature of the models captures long-term relationships and changes in scientific discourse.

## How to Use

The AnnualBERT models are accessed by year. For example, to load the 2020 model:

```python
from transformers import AutoTokenizer, AutoModel

model_year = "2020"  # Choose the year of the model (e.g., "2010", "2015", "2020")
model_path = f"jd445/AnnualBERTs/{model_year}"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# The tokenizer and model can now be used for downstream NLP tasks, e.g.:
# inputs = tokenizer("This is a sample sentence.", return_tensors="pt")
# outputs = model(**inputs)
```

Set `model_year` to the year you want; the resulting path must match the actual Hugging Face model ID for that year. As a starting point on scientific corpora, the 2020 model is recommended. Refer to the paper for details on model performance across different years.
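Because the pipeline tag is `feature-extraction`, a common next step is collapsing the token-level `outputs.last_hidden_state` into a single sentence embedding. Below is a minimal sketch of masked mean pooling; the pooling strategy and the `mean_pool` helper are illustrative assumptions, not something the model card or paper prescribes.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden_dim) model output
    attention_mask:    (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over hidden dims
    mask = attention_mask.unsqueeze(-1).float()
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero for all-padding rows
    return summed / counts
```

With the model loaded as above, `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` yields one fixed-size vector per input sentence, which can then feed classifiers or similarity search.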