# SocBERT model
Pretrained model on 20GB of English tweets and 72GB of Reddit comments using a masked language modeling (MLM) objective.
The tweets are from the Archive and were collected through the Twitter Streaming API.
The Reddit comments were randomly sampled from all subreddits from 2015 to 2019.
SocBERT-base was pretrained on 819M sequence blocks for 100K steps.
SocBERT-final was pretrained on 929M (819M+110M) sequence blocks for 112K (100K+12K) steps.
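
Because the model was pretrained with an MLM objective, it can be loaded directly for masked-token prediction. Below is a minimal sketch using the Hugging Face `transformers` library; the Hub identifier `sarkerlab/SocBERT-base` is an assumption and should be replaced with the actual model id if it differs.

```python
# Minimal sketch: masked-token prediction with SocBERT.
# The Hub id "sarkerlab/SocBERT-base" is an assumption -- replace if different.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "sarkerlab/SocBERT-base"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = f"I love scrolling {tokenizer.mask_token} all day."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring vocabulary token at the [MASK] position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```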
We benchmarked SocBERT on 40 text classification tasks with social media data (a fine-tuning sketch follows the citation below).
The experimental results can be found in our paper:
```
@inproceedings{socbert:2023,
    title = {{SocBERT: A Pretrained Model for Social Media Text}},
    author = {Yuting Guo and Abeed Sarker},
    booktitle = {Proceedings of the Fourth Workshop on Insights from Negative Results in NLP},
    year = {2023}
}
```
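
Since the benchmarks above are text classification tasks, here is a hedged sketch of how one might attach a classification head to the checkpoint for fine-tuning. The model id and `num_labels` are placeholders for illustration, not values from the paper.

```python
# Sketch: loading SocBERT with a sequence-classification head for fine-tuning.
# Model id and num_labels are assumptions; adjust them for your task.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "sarkerlab/SocBERT-base"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2  # e.g., a binary task; set per dataset
)

batch = tokenizer(["example tweet text"], return_tensors="pt", truncation=True)
outputs = model(**batch)
print(outputs.logits.shape)  # (batch_size, num_labels)
```

From here, the model can be trained with any standard fine-tuning loop (e.g., the `transformers` `Trainer`) on the target classification dataset.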