| --- |
| tags: |
| - generated_from_trainer |
| model-index: |
| - name: code-mixed-ijebertweet |
| results: [] |
| language: |
| - id |
| - jv |
| - en |
| pipeline_tag: fill-mask |
| widget: |
| - text: biasane nek arep [MASK] file bs pake software ini |
| --- |
| |
|
|
| # Indojave: IndoBERTweet-base |
|
|
| ## About |
This is a pre-trained masked language model for code-mixed Indonesian-Javanese-English tweet data.
The model is trained from the [IndoBERTweet](https://arxiv.org/pdf/2109.04607.pdf) checkpoint using
Hugging Face's [Transformers](https://huggingface.co/transformers) library.
|
|
| ## Pre-training Data |
The Twitter data were collected from January 2022 to January 2023 using 8,698 random keyword phrases.
To ensure the retrieved data are code-mixed, we use keyword phrases that contain code-mixed Indonesian, Javanese, or English words.
The following are a few examples of the keyword phrases:
| - travelling terus |
| - proud koncoku |
| - great kalian semua |
| - chattingane ilang |
| - baru aja launching |
|
|
We acquire 40,788,384 raw tweets and apply the following first-stage pre-processing steps (a minimal sketch appears after the list):
- remove duplicate tweets,
- remove tweets with fewer than 5 tokens,
- collapse multiple spaces,
- convert emoticons,
- lowercase all tweets.
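
The exact cleaning code is not released; the following is a minimal sketch of these first-stage steps in plain Python. The `EMOTICON_MAP` table is a hypothetical placeholder for the actual emoticon conversion table.

```python
import re

# Hypothetical emoticon-to-token table; the actual mapping is not published.
EMOTICON_MAP = {":)": ":smile:", ":(": ":sad:", ":D": ":laugh:"}

def first_stage(tweets):
    seen, cleaned = set(), []
    for tweet in tweets:
        text = tweet.lower()                      # lowercase all tweets
        text = re.sub(r"\s+", " ", text).strip()  # collapse multiple spaces
        for emoticon, token in EMOTICON_MAP.items():
            text = text.replace(emoticon, token)  # convert emoticons
        if len(text.split()) < 5:                 # drop tweets with fewer than 5 tokens
            continue
        if text in seen:                          # drop duplicate tweets
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```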
|
|
After the first-stage pre-processing, we obtain 17,385,773 tweets.
In the second stage, we apply the following pre-processing steps (sketched below):
- split the tweets into sentences,
- remove sentences with fewer than 4 tokens,
- convert ‘@username’ mentions to ‘@USER’,
- convert URLs to HTTPURL.
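
A comparable sketch for the second stage; the actual sentence splitter is unspecified, so a naive punctuation-based split is assumed here:

```python
import re

def second_stage(tweets):
    sentences = []
    for tweet in tweets:
        # Naive split on terminal punctuation; the real splitter is not specified.
        for sent in re.split(r"(?<=[.!?])\s+", tweet):
            sent = re.sub(r"@\w+", "@USER", sent)            # anonymize user mentions
            sent = re.sub(r"https?://\S+", "HTTPURL", sent)  # normalize URLs
            if len(sent.split()) >= 4:                       # keep sentences with >= 4 tokens
                sentences.append(sent)
    return sentences
```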
|
|
Finally, we have 28,121,693 sentences for the training process.
The pre-training data will not be released to the public due to Twitter's data policy.
|
|
| ## Model |
| | Model name | Base model | Size of training data | Size of validation data | |
| |-------------------------------------------|-----------------|----------------------------|-------------------------| |
| | `indojave-codemixed-indobertweet-base` | IndoBERTweet | 2.24 GB of text | 249 MB of text | |
|
|
| ## Evaluation Results |
We train the model for 3 epochs (296K total steps), which takes 4 days.
The following results are obtained from training:
|
|
| | train loss | eval loss | eval perplexity | |
| |------------|------------|-----------------| |
| | 2.7145 | 2.4854 | 12.0054 | |
|
|
| ## How to use |
| ### Load model and tokenizer |
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fathan/indojave-codemixed-indobertweet-base")
model = AutoModel.from_pretrained("fathan/indojave-codemixed-indobertweet-base")
```
| ### Masked language model |
| ```python |
| from transformers import pipeline |
| |
| pretrained_model = "fathan/indojave-codemixed-indobertweet-base" |
| |
| fill_mask = pipeline( |
| "fill-mask", |
| model=pretrained_model, |
| tokenizer=pretrained_model |
| ) |
| ``` |
|
|
## Training procedure

### Training hyperparameters
|
|
| The following hyperparameters were used during training: |
| - learning_rate: 5e-05 |
| - train_batch_size: 256 |
| - eval_batch_size: 256 |
| - seed: 42 |
| - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
| - lr_scheduler_type: linear |
| - num_epochs: 3.0 |
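
These settings correspond roughly to the following Hugging Face `TrainingArguments` (a sketch, assuming a single device so that the batch sizes map to the `per_device_*` arguments; the actual training script is not released and the output path is hypothetical):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="indojave-codemixed-indobertweet-base",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
)
```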
|
|
| ### Framework versions |
|
|
| - Transformers 4.26.0 |
| - Pytorch 1.12.0+cu102 |
| - Datasets 2.9.0 |
| - Tokenizers 0.12.1 |