---
language:
- "eng"
thumbnail: "url to a thumbnail used in social sharing"
tags:
- "pytorch"
- "tensorflow"
license: "apache-2.0"
---

# vBERT-2021-BASE

### Model Info:
<ul>
<li> Authors: R&D AI Lab, VMware Inc.</li>
<li> Model date: April, 2022</li>
<li> Model version: 2021-base</li>
<li> Model type: Pretrained language model</li>
<li> License: Apache 2.0</li>
</ul>

#### Motivation
Traditional BERT models struggle with VMware-specific words (Tanzu, vSphere, etc.), technical terms, and compound words (<a href="https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99">Weaknesses of WordPiece Tokenization</a>).

We created vBERT to address these issues. We replaced the first 1,000 unused tokens of BERT's vocabulary with VMware-specific terms to create a modified vocabulary, then pretrained the 'bert-base-uncased' model for an additional 78K steps (71K with MSL_128 and 7K with MSL_512; approximately 5 epochs) on VMware domain data.

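As an illustrative sketch of the vocabulary modification described above (the helper and toy vocabulary are hypothetical, not VMware's actual script), swapping BERT's reserved `[unused]` slots for domain terms might look like:

```python
# Hypothetical sketch: replace BERT's "[unusedN]" placeholder tokens with
# domain-specific terms, in order. Toy vocabulary for illustration only.
vocab = ["[PAD]", "[unused0]", "[unused1]", "[unused2]", "[CLS]", "[SEP]"]
domain_terms = ["vsphere", "tanzu", "vcenter"]

def inject_domain_terms(vocab, terms):
    """Fill each [unusedN] slot with the next domain term, left to right."""
    out = list(vocab)
    term_iter = iter(terms)
    for i, token in enumerate(out):
        if token.startswith("[unused"):
            try:
                out[i] = next(term_iter)
            except StopIteration:
                break  # more unused slots than terms: leave the rest as-is
    return out

new_vocab = inject_domain_terms(vocab, domain_terms)
```

In practice this rewrite would be applied to the tokenizer's `vocab.txt` before continued pretraining, so token IDs stay stable while the placeholder strings gain meaning.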
#### Intended Use
The model functions as a VMware-specific language model.

#### How to Use
Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-base')
model = BertModel.from_pretrained('VMware/vbert-2021-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-base')
model = TFBertModel.from_pretrained('VMware/vbert-2021-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

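`output.last_hidden_state` holds one vector per input token. One common way to collapse these into a single sentence-level feature vector (an assumption about downstream use, not a recommendation from this model card) is mean pooling; the helper below is a framework-free illustration:

```python
# Hypothetical illustration of mean pooling: average per-token vectors
# (like one sentence's rows of output.last_hidden_state) into one vector.
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors element-wise."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Three toy 2-dimensional "token embeddings" -> one sentence embedding.
sentence_vec = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```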
### Training

#### - Datasets
Publicly available VMware text data, such as VMware Docs and Blogs, was used to create the pretraining corpus (~320,000 documents, sourced in May 2021).
#### - Preprocessing
<ul>
<li>Decoding HTML</li>
<li>Decoding Unicode</li>
<li>Stripping repeated characters</li>
<li>Splitting compound words</li>
<li>Spelling correction</li>
</ul>

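A minimal sketch of what the first three preprocessing steps could look like (the specific normalization form and regex are assumptions for illustration, not VMware's internal vNLP Preprocessor):

```python
import html
import re
import unicodedata

def preprocess(text: str) -> str:
    """Toy cleanup pipeline mirroring the list above."""
    text = html.unescape(text)                  # decoding HTML entities
    text = unicodedata.normalize("NFKC", text)  # decoding/normalizing Unicode
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # stripping repeated characters (3+ -> 2)
    return text

cleaned = preprocess("vSphere is gr&eacute;at!!!!!")
```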
#### - Model performance measures
We benchmarked vBERT on various VMware-specific NLP downstream tasks (IR, classification, etc.).
The model scored higher than the 'bert-base-uncased' model on all benchmarks.

### Limitations and bias
Since the model is further pretrained from the original BERT checkpoint, it may carry the same biases embedded in the original BERT model.

The data needs to be preprocessed using our internal vNLP Preprocessor (not available to the public) to maximize the model's performance.