Create README.md #2
by phucdev - opened

README.md ADDED

---
language:
- vi
---

# viBERT base model (cased)

<!-- Provide a quick summary of what the model is/does. -->

viBERT is a pretrained language model for Vietnamese, trained with a masked language modeling (MLM) objective.
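
Because it was trained with MLM, the model can be tried out directly with the `fill-mask` pipeline from the `transformers` library. The snippet below is only a minimal sketch: the Hub id `fpt-corp/viBERT` is a placeholder (taken from the GitHub repository name) and should be replaced with the actual model id of this repository.

```python
from transformers import pipeline

# Placeholder Hub id; replace with the actual model id of this repository.
fill_mask = pipeline("fill-mask", model="fpt-corp/viBERT")

# viBERT keeps BERT's WordPiece special tokens, so the mask token is [MASK].
for prediction in fill_mask("Hà Nội là thủ đô của [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```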

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

viBERT is based on [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased).
As such, it retains mBERT's architecture of 12 layers, 768 hidden units, and 12 attention heads, and it also uses a WordPiece tokenizer.
To specialize the model to Vietnamese, the authors collected a dataset from Vietnamese online newspapers, resulting in approximately 10 GB of text.
They reduced the original mBERT vocabulary to only the tokens that occur in this Vietnamese corpus, resulting in a vocabulary size of 38,168.
The model was then further pretrained on the Vietnamese corpus.

- **Model type:** BERT
- **Language(s) (NLP):** Vietnamese
- **Finetuned from model:** https://huggingface.co/google-bert/bert-base-multilingual-cased
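
As a rough illustration of the description above (WordPiece tokenizer, reduced vocabulary of roughly 38,168 tokens, BERT-base hidden size of 768), the following sketch loads the tokenizer and encoder and inspects both. The Hub id is again a placeholder, not a value stated in this card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "fpt-corp/viBERT"  # placeholder Hub id; replace as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# The reduced WordPiece vocabulary should contain roughly 38,168 entries.
print("vocab size:", tokenizer.vocab_size)

# Encode a Vietnamese sentence and extract contextual embeddings.
inputs = tokenizer("viBERT là mô hình ngôn ngữ cho tiếng Việt.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# BERT-base geometry: (batch_size, sequence_length, 768)
print("hidden states:", outputs.last_hidden_state.shape)
```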

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/fpt-corp/viBERT
- **Paper:** [Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models](https://aclanthology.org/2020.paclic-1.2/) (see the fine-tuning sketch below)
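
The paper above applies viBERT to Vietnamese sequence tagging tasks such as part-of-speech tagging and named entity recognition. A minimal, hypothetical setup for fine-tuning on such a task with the token-classification head from `transformers` is sketched here; the Hub id and label count are placeholders, not values from the paper.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "fpt-corp/viBERT"  # placeholder Hub id; replace as needed
num_labels = 9                # placeholder: size of a BIO tag set for a hypothetical NER task

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=num_labels)

# From here, align word-level tags to WordPiece tokens and fine-tune with the
# Trainer API or a custom training loop.
```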

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```tex
@inproceedings{bui-etal-2020-improving,
    title = "Improving Sequence Tagging for {V}ietnamese Text using Transformer-based Neural Models",
    author = "Bui, The Viet and
      Tran, Thi Oanh and
      Le-Hong, Phuong",
    editor = "Nguyen, Minh Le and
      Luong, Mai Chi and
      Song, Sanghoun",
    booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation",
    month = oct,
    year = "2020",
    address = "Hanoi, Vietnam",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.paclic-1.2",
    pages = "13--20",
}
```

**APA:**

Bui, T. V., Tran, T. O., & Le-Hong, P. (2020, October). Improving sequence tagging for Vietnamese text using transformer-based neural models. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation (pp. 13-20).

## Model Card Authors

[@phucdev](https://github.com/phucdev)