# SOBertBase
## Model Description

SOBertBase is a 109M parameter BERT model trained on 27 billion tokens of StackOverflow answer and comment text using the Megatron Toolkit.

SOBert is pre-trained on 19 GB of data presented as 15 million samples, where each sample contains an entire post and all of its corresponding comments. We also include
all code in each answer, so our model is bimodal in nature. We use a SentencePiece tokenizer trained with Byte-Pair Encoding, which has the benefit over WordPiece of never labeling tokens as "unknown".
Additionally, SOBert is trained with a maximum sequence length of 2048, based on the empirical length distribution of StackOverflow posts, and a relatively
large batch size of 0.5M tokens. More details can be found in the paper
[Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models](https://arxiv.org/pdf/2306.03268).
#### How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Load the SOBert tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mmukh/SOBertBase")
model = AutoModelForTokenClassification.from_pretrained("mmukh/SOBertBase")
```
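Below is a minimal usage sketch, not part of the original model card: it tokenizes a StackOverflow-style answer (including its code), truncates to the 2048-token maximum sequence length used during pre-training, and runs a single forward pass. Note that `AutoModelForTokenClassification` typically initializes a fresh classification head on top of the pre-trained encoder, so its logits are only meaningful after fine-tuning.

```python
# Illustrative sketch (assumes the tokenizer, model, and torch imports above).
text = (
    "You can reverse a list in place like this:\n"
    "nums = [3, 1, 2]\n"
    "nums.sort(reverse=True)"
)

# Truncate to the 2048-token maximum sequence length used during pre-training.
inputs = tokenizer(text, truncation=True, max_length=2048, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# logits shape: (batch_size, sequence_length, num_labels); fine-tune the
# classification head before relying on these predictions.
print(outputs.logits.shape)
```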
### BibTeX entry and citation info
```bibtex
@article{mukherjee2023stack,
  title={Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models},
  author={Mukherjee, Manisha and Hellendoorn, Vincent J},
  journal={arXiv preprint arXiv:2306.03268},
  year={2023}
}
```