jd445 committed on
Commit 5caae66 · verified · 1 Parent(s): c73ee0e

Update README.md

Files changed (1)
  1. README.md +4 -5
README.md CHANGED
@@ -3,21 +3,20 @@ language:
  - en
  ---
  ## Model Description
- arXivBERT is a cutting-edge language model specifically trained on a comprehensive corpus of scientific papers from the arXiv database, spanning from 2008 to 2020. This model leverages the robust architecture of RoBERTa and is fine-tuned to grasp the intricacies and nuances of academic language, making it an ideal tool for NLP tasks within the scientific domain.
-
+ arXivBERT is a series of models trained on a time-based unit. If you are looking for the best performance on scientific corpora, please use the model from 2020 directly.
 
  ## Why arXivBERT?
  1. Specialized in Scientific Content: Trained on a large dataset of arXiv papers, ensuring high familiarity with scientific terminology and concepts.
  2. Versatile in Applications: Suitable for a range of NLP tasks, including but not limited to text classification, keyword extraction, summarization of scientific papers, and citation prediction.
- 3. Evolutionary Insights: Offers unique insights into the evolution of scientific discourse and trends over a significant period (2008-2020).
+ 3. Evolutionary Insights: Continuous pre-training captures the long-term relationships and changes within the corpus.
 
  ## How to Use?
 
  ```
  from transformers import AutoTokenizer, AutoModel
 
- tokenizer = AutoTokenizer.from_pretrained("folderPath")
- model = AutoModel.from_pretrained("folderPath")
+ tokenizer = AutoTokenizer.from_pretrained("folderPath/year")
+ model = AutoModel.from_pretrained("folderPath/wholewordtokenizer")
 
  ```
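
The updated loading snippet can be wrapped in a small helper that picks one yearly snapshot. This is a minimal sketch, assuming the local layout the README implies (`folderPath/<year>`); `folderPath` is a placeholder from the README, not a real path, and `checkpoint_path`/`load_arxivbert` are hypothetical names introduced here for illustration.

```python
# Sketch: load a year-specific arXivBERT checkpoint from a local folder.
# "folderPath" and the per-year subfolder layout are placeholders taken
# from the README, not verified paths.

def checkpoint_path(base_dir: str, year: int) -> str:
    """Build the path to a single yearly snapshot, e.g. "folderPath/2020"."""
    return f"{base_dir}/{year}"

def load_arxivbert(base_dir: str, year: int):
    """Load the tokenizer and encoder for one yearly snapshot."""
    # Imported lazily so the path helper stays usable without transformers.
    from transformers import AutoTokenizer, AutoModel

    path = checkpoint_path(base_dir, year)
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModel.from_pretrained(path)
    return tokenizer, model
```

Per the description above, the 2020 snapshot is the recommended default for scientific corpora, i.e. something like `load_arxivbert("folderPath", 2020)`.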