IndoBERT is a pre-trained language model based on the BERT architecture for the Indonesian language.
This is the base, uncased version, which uses the bert-base configuration.
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sarahlintang/IndoBERT")
model = AutoModel.from_pretrained("sarahlintang/IndoBERT")
tokenizer.encode("hai aku mau makan.")
[2, 8078, 1785, 2318, 1946, 18, 4]
This model was pre-trained on 16 GB of raw text (about 2 billion words) from the OSCAR corpus (https://oscar-corpus.com/).
The model has the same architecture as bert-base, with a vocabulary size of 32,000.
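To give a sense of what a bert-base-sized model with a 32,000-token vocabulary means in practice, the sketch below estimates the encoder's parameter count by hand. The hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 token types) are the standard bert-base values and are an assumption here, since the card only names the configuration; the helper names are illustrative.

```python
# Rough parameter count for a bert-base-sized encoder.
# ASSUMPTION: standard bert-base hyperparameters; only the vocabulary
# size (32,000) comes from the model card.
HIDDEN = 768          # hidden size
LAYERS = 12           # number of transformer layers
INTERMEDIATE = 3072   # feed-forward inner size
MAX_POSITIONS = 512   # learned position embeddings
TYPE_VOCAB = 2        # segment (token type) embeddings
VOCAB = 32_000        # vocabulary size stated in the card


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale plus shift of a LayerNorm."""
    return 2 * n


def bert_encoder_params(vocab=VOCAB):
    """Parameters of embeddings + transformer layers + pooler."""
    embeddings = (vocab + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
    # Query, key, value and output projections, then a LayerNorm.
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    feed_forward = (
        linear(HIDDEN, INTERMEDIATE)
        + linear(INTERMEDIATE, HIDDEN)
        + layer_norm(HIDDEN)
    )
    pooler = linear(HIDDEN, HIDDEN)
    return embeddings + LAYERS * (attention + feed_forward) + pooler


print(f"{bert_encoder_params():,}")  # 110,617,344
```

Under these assumptions the encoder has roughly 110.6 million parameters, of which about 24.6 million (32,000 × 768) sit in the word-embedding table.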
The model was trained using Google’s original TensorFlow code on an eight-core Google Cloud TPU v2. We used a Google Cloud Storage bucket for persistent storage of training data and models.
We evaluate this model on three Indonesian NLP downstream tasks: