Improve model card with metadata and clearer instructions
This PR improves the model card by:
- Adding essential metadata (`pipeline_tag`, `library_name`, and `license`) to ensure proper discoverability and functionality on the Hugging Face Hub (see the usage sketch after this list).
- Providing a more descriptive overview of the AnnualBERT model series and its capabilities.
- Clarifying the usage instructions with a more complete code example, showing how to load models by year.
Note: I have assumed an Apache 2.0 license based on common practices in similar open-source projects; please update the license in the metadata if this is incorrect.
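Since the new metadata sets `pipeline_tag: feature-extraction` and `library_name: transformers`, the models should also be usable through the high-level `pipeline` API in addition to the `AutoModel` route shown in the diff. A minimal sketch, assuming the per-year repository path follows the `jd445/AnnualBERTs/<year>` pattern used in the updated card:

```python
from transformers import pipeline

# NOTE: "jd445/AnnualBERTs/2020" follows the path pattern from the card;
# the actual per-year model IDs on the Hub may differ.
extractor = pipeline("feature-extraction", model="jd445/AnnualBERTs/2020")

# Returns nested lists shaped [batch][tokens][hidden_size].
features = extractor("Scientific language drifts over time.")
print(len(features[0]), len(features[0][0]))
```
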
README.md CHANGED

````diff
@@ -1,22 +1,42 @@
 ---
 language:
 - en
+pipeline_tag: feature-extraction
+library_name: transformers
+license: apache-2.0
 ---
-## Model Description
-arXivBERT is a series of models trained on a time-based unit. If you are looking for the best performance on scientific corpora, please use the model from 2020 directly.
 
-
-1. Specialized in Scientific Content: Trained on a large dataset of arXiv papers, ensuring high familiarity with scientific terminology and concepts.
-2. Versatile in Applications: Suitable for a range of NLP tasks, including but not limited to text classification, keyword extraction, summarization of scientific papers, and citation prediction.
-3. Evolutionary Insights: Continuous pre-training captures the long-term relationships and changes within the corpus.
+# AnnualBERT: A Time-Series of Language Models for Understanding the Evolution of Science
 
-
+This repository contains the AnnualBERT series of language models, designed to capture the temporal evolution of scientific text. AnnualBERT uses whole words as tokens and consists of a base RoBERTa model pre-trained on arXiv papers published until 2008, along with a collection of annually trained models reflecting the progression of scientific knowledge over time.
 
-
+[Towards understanding evolution of science through language model series](https://huggingface.co/papers/2409.09636)
+
+
+## Model Details
+
+AnnualBERT models offer several key advantages:
+
+* **Specialized for Scientific Content:** Trained on a massive dataset of arXiv papers, ensuring deep familiarity with scientific terminology and concepts.
+* **Versatile Applications:** Suitable for various NLP tasks, including text classification, keyword extraction, summarization, and citation prediction.
+* **Evolutionary Insights:** The time-series nature of the models captures the long-term relationships and changes in scientific discourse.
+
+## How to Use
+
+The AnnualBERT models are accessed by year. For example, to load the 2020 model:
+
+```python
 from transformers import AutoTokenizer, AutoModel
 
-
-
+model_year = "2020"  # Choose the year of the model (e.g., "2010", "2015", "2020")
+model_path = f"jd445/AnnualBERTs/{model_year}"
 
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModel.from_pretrained(model_path)
+
+# Now you can use the tokenizer and model for your NLP tasks. Example:
+# inputs = tokenizer("This is a sample sentence.", return_tensors="pt")
+# outputs = model(**inputs)
+```
 
-
+Remember to replace `"jd445/AnnualBERTs/{model_year}"` with the actual Hugging Face model ID for the year you want to use. For the best performance on scientific corpora, the 2020 model is recommended as a starting point. Refer to the paper for details on model performance across different years.
````
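Beyond the commented-out example in the updated card, here is a minimal sketch of turning the model outputs into fixed-size sentence embeddings, in line with the `feature-extraction` tag. Mean pooling over the last hidden state (masking out padding) is an illustrative choice rather than something the card prescribes, and the model path again assumes the per-year pattern used above:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed per-year path, as in the card; replace with the actual model ID.
model_path = "jd445/AnnualBERTs/2020"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

sentences = [
    "Dark matter halos shape galaxy formation.",
    "Transformer models rely on self-attention.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings, ignoring padded positions, to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, hidden_size])
```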