mmukh committed
Commit b588eab · 1 Parent(s): b9907a5

Create README.md

Files changed (1): README.md (+31 -0)
README.md ADDED
# SOBertBase

## Model Description

SOBertBase is a 109M-parameter BERT model trained on 27 billion tokens of StackOverflow answer and comment text using the Megatron Toolkit.

SOBert is pre-trained on 19 GB of data presented as 15 million samples, where each sample contains an entire post and all of its corresponding comments. We also include all code in each answer, so the model is bimodal in nature. We use a SentencePiece tokenizer trained with Byte-Pair Encoding, which has the benefit over WordPiece of never labeling tokens as "unknown". Additionally, SOBert is trained with a maximum sequence length of 2048 tokens, based on the empirical length distribution of StackOverflow posts, and a relatively large batch size of 0.5M tokens. More details can be found in the paper
[Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models](https://arxiv.org/pdf/2306.03268).
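
As a quick, hedged illustration of the "no unknown tokens" property (this snippet is not from the original card; it assumes the tokenizer loads through `AutoTokenizer`, as in the usage example below), even code-heavy text with unusual characters is split into known subword pieces:

```python
from transformers import AutoTokenizer

# Load the SentencePiece/BPE tokenizer shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("mmukh/SOBertBase")

# Code-like text with punctuation and non-ASCII characters is still covered by
# byte-pair pieces, so nothing should map to the unknown token.
pieces = tokenizer.tokenize("std::unordered_map<std::string, int> counts; // naïve demo")
print(pieces)
print(tokenizer.unk_token in pieces)  # expected: False
```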

#### How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the SOBertBase tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("mmukh/SOBertBase")

# The token-classification head is newly initialized and is intended to be fine-tuned
# on a downstream task.
model = AutoModelForTokenClassification.from_pretrained("mmukh/SOBertBase")
```
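
If the goal is to use SOBertBase as a feature extractor rather than to fine-tune a task head, a minimal sketch is shown below. It assumes the checkpoint can be loaded with `AutoModel` to obtain the bare encoder, and that mean pooling over the last hidden states is an acceptable post representation; neither choice comes from the original card.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Bare encoder (no task head) for extracting contextual embeddings.
tokenizer = AutoTokenizer.from_pretrained("mmukh/SOBertBase")
encoder = AutoModel.from_pretrained("mmukh/SOBertBase")
encoder.eval()

# A StackOverflow-style sample: an answer containing code, plus a comment.
text = (
    "You can reverse a list in place with `my_list.reverse()` or get a new copy "
    "with `my_list[::-1]`.\n"
    "Comment: note that reverse() returns None, not the reversed list."
)

# Tokenize up to the model's 2048-token maximum sequence length.
inputs = tokenizer(text, truncation=True, max_length=2048, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Mean-pool the final hidden states into a single embedding for the whole sample.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # (1, hidden_size)
```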

### BibTeX entry and citation info

```bibtex
@article{mukherjee2023stack,
  title={Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models},
  author={Mukherjee, Manisha and Hellendoorn, Vincent J},
  journal={arXiv preprint arXiv:2306.03268},
  year={2023}
}
```