RoBERTurk is pretrained on the OSCAR Turkish split (27GB) and a small portion of the C4 Turkish split (1GB). It uses a SentencePiece BPE tokenizer trained on 30M sentences randomly selected from the training data, which consists of 90M sentences. In total, the training data contains 5.3B tokens, and the vocabulary size is 50K. The learning rate is warmed up to a peak value of 1e-5 over the first 10K updates and then linearly decayed at a rate of $0.01$. The model is pretrained for a maximum of 600K updates, using only sequences of length at most T=256.
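To make the warmup-then-decay schedule concrete, here is a minimal sketch in plain Python. It uses the peak learning rate (1e-5), warmup length (10K updates), and update budget (600K) stated above. How the $0.01$ decay rate enters the actual training setup is not specified here, so the sketch simply assumes the learning rate decays linearly to zero at the final update; that endpoint, and the function name `learning_rate`, are assumptions for illustration and not the training code used for this model.

```python
# Sketch of a linear warmup followed by a linear decay.
# ASSUMPTION: the rate decays linearly to 0 at TOTAL_STEPS; the model card
# does not say how its "0.01 rate" maps onto the decay.

PEAK_LR = 1e-5          # peak learning rate from the model card
WARMUP_STEPS = 10_000   # warmup updates from the model card
TOTAL_STEPS = 600_000   # maximum number of updates from the model card


def learning_rate(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
                  total_steps=TOTAL_STEPS):
    """Return the (hypothetical) learning rate at a given update step."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Linear decay from peak_lr at warmup_steps to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 305_000, 600_000):
        print(step, learning_rate(step))
```

Under these assumptions the rate reaches its peak at update 10K and is back to half the peak at update 305K, midway through the decay phase.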
Tokenizer
Load the pretrained tokenizer as follows:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Nuri-Tas/roberturk-base")
Model
Get the pretrained model with:
from transformers import RobertaModel
model = RobertaModel.from_pretrained("Nuri-Tas/roberturk-base")
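As a rough usage sketch, the tokenizer and model can be combined to produce sentence embeddings. The helper name `embed_sentences`, the choice of mean pooling over the last hidden states, and the assumption that the tokenizer defines a padding token are all illustrative choices, not something this model card prescribes. Inputs are truncated to 256 tokens to match the maximum pretraining sequence length.

```python
def embed_sentences(sentences, model_name="Nuri-Tas/roberturk-base", max_length=256):
    """Return one mean-pooled embedding per sentence (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies until it is actually called.
    import torch
    from transformers import AutoTokenizer, RobertaModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)
    model.eval()

    # ASSUMPTION: the tokenizer has a pad token, so batches can be padded.
    batch = tokenizer(
        sentences,
        padding=True,
        truncation=True,
        max_length=max_length,  # the model was pretrained with T=256
        return_tensors="pt",
    )

    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Mean pooling over real tokens only; padding positions are masked out.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts
```

Calling `embed_sentences(["Merhaba dünya!", "Bugün hava çok güzel."])` should return a tensor with one row per input sentence. Mean pooling is only one option; taking the hidden state of the first token is a common alternative.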
Caveats
There is a slight mismatch between our tokenizer and the default RobertaTokenizer, which results in some underperformance. I'm working on the issue and will update the tokenizer/model accordingly.
Additional TODOs (some of these may take a while, and I may publish them in separate repositories):
- Using Zemberek as an alternative tokenizer
- Adjusting the masking algorithm so that it can mask morphological units, not only complete words
- Preferably training the BPE tokenizer on the whole training data
- Pretraining with 512 max sequence length + more data