---
language:
- en
pipeline_tag: feature-extraction
library_name: transformers
license: apache-2.0
---
# AnnualBERT: A Time-Series of Language Models for Understanding the Evolution of Science
This repository contains the AnnualBERT series of language models, designed to capture the temporal evolution of scientific text. AnnualBERT tokenizes text into whole words and consists of a base RoBERTa model pre-trained on arXiv papers published through 2008, together with a collection of annually trained models that reflect the progression of scientific knowledge over time.
Paper: *Towards understanding evolution of science through language model series*
## Model Details
AnnualBERT models offer several key advantages:
- Specialized for Scientific Content: Trained on a large corpus of arXiv papers, ensuring deep familiarity with scientific terminology and concepts.
- Versatile Applications: Suitable for a range of NLP tasks, including text classification, keyword extraction, summarization, and citation prediction.
- Evolutionary Insights: The time-series design of the models captures long-term shifts and relationships in scientific discourse; see the sketch after this list for one way to probe them.
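As an illustration of the last point, here is a minimal sketch (not from the paper) of probing how the association between two terms shifts across yearly models, by comparing their contextual embeddings *within* each model. The repository paths follow the pattern used in the How to Use section below and are an assumption; adjust them if the actual model IDs differ.

```python
# Minimal sketch: compare how strongly two terms associate in different
# yearly models. Paths follow the assumed "jd445/AnnualBERTs/<year>" pattern.
import torch
from transformers import AutoTokenizer, AutoModel

def embed(model, tokenizer, text: str) -> torch.Tensor:
    """Mean-pooled last-hidden-state embedding of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)

for year in ("2010", "2020"):
    path = f"jd445/AnnualBERTs/{year}"  # assumed path pattern
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModel.from_pretrained(path)
    sim = torch.cosine_similarity(
        embed(model, tokenizer, "transformer"),
        embed(model, tokenizer, "attention mechanism"),
        dim=0,
    )
    print(f"{year}: cos(transformer, attention mechanism) = {sim.item():.3f}")
```

Because each yearly model is evaluated in its own embedding space, this within-model comparison avoids the pitfalls of directly comparing vectors across independently trained models.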
## How to Use
The AnnualBERT models are accessed by year. For example, to load the 2020 model:
```python
from transformers import AutoTokenizer, AutoModel

model_year = "2020"  # choose the year of the model (e.g., "2010", "2015", "2020")
model_path = f"jd445/AnnualBERTs/{model_year}"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Use the tokenizer and model for your NLP task, for example:
inputs = tokenizer("This is a sample sentence.", return_tensors="pt")
outputs = model(**inputs)  # outputs.last_hidden_state holds the token embeddings
```
Remember to replace `jd445/AnnualBERTs/{model_year}` with the actual Hugging Face model ID for the year you want to use. For the best performance on scientific corpora, the 2020 model is recommended as a starting point. Refer to the paper for details on model performance across different years.
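Since the models are tagged for feature extraction, a common next step is to pool token embeddings into one fixed-size vector per document. Below is a hedged sketch that continues from the snippet above (reusing `tokenizer` and `model`); the abstracts are illustrative placeholders, and attention-mask-aware mean pooling is one reasonable choice among several.

```python
# Turn model outputs into fixed-size document vectors via mean pooling,
# ignoring padding positions. Reuses `tokenizer` and `model` from above.
import torch

abstracts = [
    "We study the evolution of terminology in machine learning papers.",
    "A new bound on the mixing time of random walks on expander graphs.",
]
inputs = tokenizer(abstracts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (batch, seq_len, dim)
mask = inputs["attention_mask"].unsqueeze(-1)       # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)
print(embeddings.shape)  # e.g., torch.Size([2, 768]) for a base-size model
```

The resulting vectors can feed a downstream classifier, a clustering routine, or a citation prediction model.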