---
license: apache-2.0
language:
- bn
base_model:
- google/electra-small-discriminator
---

# VĀC-BERT

**VĀC-BERT** is a 17-million-parameter model trained on the Vācaspati literary dataset. Despite its compact size, VĀC-BERT achieves performance competitive with state-of-the-art masked-language and downstream models that are over seven times larger.

## Model Details

- **Architecture:** ELECTRA-small, reduced to 17 M parameters
- **Pretraining Corpus:** Vācaspati, a curated Bangla literary corpus
- **Parameter Count:** 17 M (≈ 1/7th the size of BERT-base)
- **Tokenizer:** WordPiece, vocabulary size 50 K

## Usage Example

```python
from transformers import BertTokenizer, AutoModelForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("Vacaspati/VAC-BERT")
model = AutoModelForSequenceClassification.from_pretrained("Vacaspati/VAC-BERT")
```

## Dataset

We are also releasing the Vācaspati dataset. To request access, please fill out this form: https://forms.gle/DiVm2fSVCyXXMbkU9

The Vācaspati dataset can also be accessed at: https://huggingface.co/datasets/Vacaspati/Vacaspati

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{bhattacharyya-etal-2023-vacaspati,
    title = "{VACASPATI}: A Diverse Corpus of {B}angla Literature",
    author = "Bhattacharyya, Pramit and Mondal, Joydeep and Maji, Subhadip and Bhattacharya, Arnab",
    editor = "Park, Jong C. and Arase, Yuki and Hu, Baotian and Lu, Wei and Wijaya, Derry and Purwarianti, Ayu and Krisnadhi, Adila Alfa",
    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = nov,
    year = "2023",
    address = "Nusa Dua, Bali",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.ijcnlp-main.72/",
    doi = "10.18653/v1/2023.ijcnlp-main.72",
    pages = "1118--1130"
}
```