# SocBERT model
Pretrained model on 20GB of English tweets and 72GB of Reddit comments using a masked language modeling (MLM) objective.
The tweets are from the Archive and were collected through the Twitter Streaming API.
The Reddit comments were randomly sampled from all subreddits from 2015 to 2019.
SocBERT-base was pretrained on 819M sequence blocks for 100K steps.
SocBERT-final was pretrained on 929M (819M+110M) sequence blocks for 112K (100K+12K) steps.
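
Because the model was pretrained with an MLM objective, it can be loaded directly for masked-token prediction. Below is a minimal sketch using the Hugging Face `transformers` library; the Hub identifier `sarkerlab/SocBERT-base` is an assumption and should be replaced with the actual model id if it differs.

```python
# Minimal sketch: masked-token prediction with SocBERT.
# The Hub id "sarkerlab/SocBERT-base" is an assumption -- replace if different.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "sarkerlab/SocBERT-base"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = f"I love scrolling {tokenizer.mask_token} all day."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring vocabulary token at the [MASK] position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```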
We benchmarked SocBERT on 40 text classification tasks with social media data (a fine-tuning sketch follows the citation below).
The experimental results can be found in our paper:
```
@inproceedings{socbert:2023,
    title = {{SocBERT: A Pretrained Model for Social Media Text}},
    author = {Yuting Guo and Abeed Sarker},
    booktitle = {Proceedings of the Fourth Workshop on Insights from Negative Results in NLP},
    year = {2023}
}
```
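
Since the benchmarks above are text classification tasks, here is a hedged sketch of how one might attach a classification head to the checkpoint for fine-tuning. The model id and `num_labels` are placeholders for illustration, not values from the paper.

```python
# Sketch: loading SocBERT with a sequence-classification head for fine-tuning.
# Model id and num_labels are assumptions; adjust them for your task.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "sarkerlab/SocBERT-base"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2  # e.g., a binary task; set per dataset
)

batch = tokenizer(["example tweet text"], return_tensors="pt", truncation=True)
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
```

From here, the model can be trained with any standard fine-tuning loop (e.g., the `transformers` `Trainer`) on the target classification dataset.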