nielsr (HF Staff) committed · verified
Commit ee0af0e · 1 Parent(s): 470c190

Improve model card with metadata and clearer instructions


This PR improves the model card by:

- Adding essential metadata (`pipeline_tag`, `library_name`, and `license`) to ensure proper discoverability and functionality on the Hugging Face Hub.
- Providing a more descriptive overview of the AnnualBERT model series and its capabilities.
- Clarifying the usage instructions with a more complete code example, showing how to load models by year.
- Adding an Apache 2.0 license, assumed based on common practices in similar open-source projects. Please update the license in the metadata if this is incorrect.

Files changed (1):
README.md (+31 −11)
README.md CHANGED
@@ -1,22 +1,42 @@
  ---
  language:
  - en
  ---
- ## Model Description
- arXivBERT is a series of models trained on a time-based unit. If you are looking for the best performance on scientific corpora, please use the model from 2020 directly.

- ## Why ?arXivBERT
- 1. Specialized in Scientific Content: Trained on a large dataset of arXiv papers, ensuring high familiarity with scientific terminology and concepts.
- 2. Versatile in Applications: Suitable for a range of NLP tasks, including but not limited to text classification, keyword extraction, summarization of scientific papers, and citation prediction.
- 3. Evolutionary Insights: Continuous pre-training captures the long-term relationships and changes within the corpus.

- ## How to Use?

- ```
  from transformers import AutoTokenizer, AutoModel

- tokenizer = AutoTokenizer.from_pretrained("folderPath/year")
- model = AutoModel.from_pretrained("folderPath/wholewordtokenizer")


- ```
  ---
  language:
  - en
+ pipeline_tag: feature-extraction
+ library_name: transformers
+ license: apache-2.0
  ---

+ # AnnualBERT: A Time-Series of Language Models for Understanding the Evolution of Science

+ This repository contains the AnnualBERT series of language models, designed to capture the temporal evolution of scientific text. AnnualBERT uses whole words as tokens and consists of a base RoBERTa model pre-trained on arXiv papers published up to 2008, along with a collection of annually trained models reflecting the progression of scientific knowledge over time.

+ Paper: [Towards understanding evolution of science through language model series](https://huggingface.co/papers/2409.09636)
+
+
+ ## Model Details
+
+ AnnualBERT models offer several key advantages:
+
+ * **Specialized for Scientific Content:** Trained on a massive dataset of arXiv papers, ensuring deep familiarity with scientific terminology and concepts.
+ * **Versatile Applications:** Suitable for various NLP tasks, including text classification, keyword extraction, summarization, and citation prediction.
+ * **Evolutionary Insights:** The time-series nature of the models captures the long-term relationships and changes in scientific discourse.
+
+ ## How to Use
+
+ The AnnualBERT models are accessed by year. For example, to load the 2020 model:
+
+ ```python
  from transformers import AutoTokenizer, AutoModel

+ model_year = "2020"  # Choose the year of the model (e.g., "2010", "2015", "2020")
+ model_path = f"jd445/AnnualBERTs/{model_year}"

+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModel.from_pretrained(model_path)
+
+ # Now you can use the tokenizer and model for your NLP tasks. Example:
+ # inputs = tokenizer("This is a sample sentence.", return_tensors="pt")
+ # outputs = model(**inputs)
+ ```

+ Remember to replace `jd445/AnnualBERTs/{model_year}` with the actual Hugging Face model ID for the year you want to use. For the best performance on scientific corpora, the 2020 model is recommended as a starting point; refer to the paper for details on model performance across years.
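
For reference, here is a minimal end-to-end sketch of the feature-extraction use implied by the new `pipeline_tag`. It assumes the placeholder model path from the card resolves to a real repository, and the mean-pooling step is an illustrative choice rather than anything the card or paper prescribes:

```python
# Minimal feature-extraction sketch. The model path below is the
# placeholder ID from the card; swap in the real repository ID.
import torch
from transformers import AutoTokenizer, AutoModel

model_path = "jd445/AnnualBERTs/2020"  # placeholder, not a confirmed repo ID

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
model.eval()

inputs = tokenizer(
    "Dark matter halos dominate galactic rotation curves.",
    return_tensors="pt",
    truncation=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states over non-padding tokens to get one
# fixed-size embedding per input (an illustrative pooling choice).
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # e.g. torch.Size([1, 768]) for a base-sized model
```

Mean pooling over non-padding tokens is a common default for turning token-level RoBERTa outputs into a single sentence vector; other strategies, such as taking the first token's hidden state, can be swapped in the same way.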