---
language:
- "eng"
thumbnail: "URL to a thumbnail used in social sharing"
tags:
- "pytorch"
- "tensorflow"
license: "apache-2.0"
---

# vBERT-2021-LARGE

### Model Info:
<ul>
<li> Authors: R&D AI Lab, VMware Inc.
<li> Model date: April, 2022
<li> Model version: 2021-large
<li> Model type: Pretrained language model
<li> License: Apache 2.0
</ul>

#### Motivation
Traditional BERT models struggle with VMware-specific words (Tanzu, vSphere, etc.), technical terms, and compound words. (<a href=https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99>Weaknesses of WordPiece Tokenization</a>)

We have pretrained our vBERT model to address the aforementioned issues using our <a href=https://medium.com/vmware-data-ml-blog/pretraining-a-custom-bert-model-6e37df97dfc4>BERT Pretraining Library</a>.
<br>We replaced the first 1k unused tokens of BERT's vocabulary with VMware-specific terms to create a modified vocabulary. We then pretrained the 'bert-large-uncased' model for an additional 66K steps (60k with MSL_128 and 6k with MSL_512) on VMware domain data.
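
The vocabulary swap described above can be approximated with the standard Hugging Face tokenizer files. The snippet below is only a rough sketch of that idea, not the code from our BERT Pretraining Library; `vmware_terms.txt` is a hypothetical one-term-per-line list of domain words.

```
from transformers import BertTokenizer

# Start from the stock BERT vocabulary and save it locally.
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
tokenizer.save_pretrained('modified-bert-large-uncased')

# Hypothetical list of VMware-specific terms, one per line (e.g. 'vsphere', 'tanzu').
with open('vmware_terms.txt') as f:
    domain_terms = [line.strip() for line in f if line.strip()]

# Overwrite the [unusedN] placeholder entries in vocab.txt with domain terms.
vocab_path = 'modified-bert-large-uncased/vocab.txt'
with open(vocab_path) as f:
    vocab = f.read().splitlines()

replaced = 0
for i, token in enumerate(vocab):
    if token.startswith('[unused') and replaced < min(len(domain_terms), 1000):
        vocab[i] = domain_terms[replaced]
        replaced += 1

with open(vocab_path, 'w') as f:
    f.write('\n'.join(vocab) + '\n')

# Reload the tokenizer so the new terms are available as single word pieces.
tokenizer = BertTokenizer.from_pretrained('modified-bert-large-uncased')
```

A term written into the vocabulary this way is then kept as a single word piece instead of being fragmented by WordPiece.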

#### Intended Use
The model functions as a VMware-specific Language Model.

#### How to Use
Here is how to use this model to get the features of a given text in PyTorch:

```
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-large')
model = BertModel.from_pretrained('VMware/vbert-2021-large')

text = "Replace me with any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
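
`output` here is a standard `transformers` model output whose `last_hidden_state` holds the per-token features. Continuing from the PyTorch snippet above, this is one illustrative way (mean pooling over non-padding tokens, an assumption rather than anything the model prescribes) to reduce it to a single sentence-level vector:

```
# Per-token features: shape (batch_size, sequence_length, hidden_size).
token_embeddings = output.last_hidden_state

# Average the token vectors, ignoring padding positions.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

print(sentence_embedding.shape)  # e.g. torch.Size([1, 1024]) for this large model
```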

and in TensorFlow:

```
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained('VMware/vbert-2021-large')
model = TFBertModel.from_pretrained('VMware/vbert-2021-large')

text = "Replace me with any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

### Training

#### - Datasets
Publicly available VMware text data, such as VMware Docs and Blogs, was used to create the pretraining corpus (~320,000 documents, sourced in May 2021).

#### - Preprocessing
The raw text was cleaned up with the following steps (a minimal sketch of the first three appears after this list):
<ul>
<li>Decoding HTML
<li>Decoding Unicode
<li>Stripping repeated characters
<li>Splitting compound words
<li>Spelling correction
</ul>
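
Neither the corpus nor our internal vNLP Preprocessor is public, so the following is only a rough sketch of what the first three cleanup steps might look like; compound-word splitting and spelling correction depend on domain dictionaries and are not reproduced here.

```
import html
import re
import unicodedata

def basic_clean(text: str) -> str:
    # Decode HTML entities (e.g. '&amp;' -> '&').
    text = html.unescape(text)
    # Normalize Unicode (non-breaking spaces, ligatures, accented forms, etc.).
    text = unicodedata.normalize('NFKC', text)
    # Collapse runs of repeated punctuation (e.g. '-----' -> '-') and whitespace.
    text = re.sub(r'([^\w\s])\1{2,}', r'\1', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print(basic_clean('vSphere&nbsp;7.0 ----- release&amp;notes'))
```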

#### - Model performance measures
We benchmarked vBERT on various VMware-specific NLP downstream tasks (IR, classification, etc.).
The model scored higher than the 'bert-base-uncased' model on all benchmarks.

### Limitations and bias
Since the model is further pretrained from the original BERT model, it may carry the same biases embedded in that model.

The data needs to be preprocessed using our internal vNLP Preprocessor (not available to the public) to maximize the model's performance.