How to use chantha99m/khmer-1lms with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="chantha99m/khmer-1lms")
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("chantha99m/khmer-1lms")
model = AutoModelForMaskedLM.from_pretrained("chantha99m/khmer-1lms")
XLMRoBERTa for Khmer Language
Trained from scratch with the masked language modeling (MLM) task for 1M steps on 5M Khmer sentences (162M words, 578K unique words).
The training data was created by crawling publicly available news sites and Wikipedia.
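For readers new to masked language modeling, here is a minimal sketch of the idea: some tokens are hidden and the model is trained to predict them from context. The `mask_tokens` helper and the 0.15 default masking rate are assumptions for illustration (0.15 is the conventional BERT value); the masking settings actually used to train this model are not stated here. Real pipelines, such as `DataCollatorForLanguageModeling` in Transformers, also sometimes keep or randomize the selected tokens, which this sketch omits.

```python
import random

MASK_TOKEN = "<mask>"  # mask token used by XLM-RoBERTa-style tokenizers


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace tokens with MASK_TOKEN and return (masked, labels).

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so a loss would only be computed on masked positions.
    mask_prob=0.15 is an assumed, conventional value, not one taken from
    this model's training setup.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Toy English tokens; a real run would use the model's own subword tokens.
tokens = ["the", "model", "predicts", "hidden", "words", "from", "context"]
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```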
Why?
- xlm-roberta-base is big: 279M parameters, while this model has only 49M.
- xlm-roberta-base is not optimized for the Khmer language.
- xlm-roberta-base has a much larger vocabulary (250,002 tokens), while this model uses a vocabulary of 8,000 tokens.
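To see why vocabulary size matters for model size, the short calculation below counts the weights in a token embedding matrix, which has one row per vocabulary entry. It uses the hidden size of xlm-roberta-base (768) for both cases; the hidden size of this model is not stated here, so the second figure is only a same-width comparison, not this model's actual embedding size. The `embedding_params` helper is a name made up for this sketch.

```python
def embedding_params(vocab_size, hidden_size):
    """Number of weights in a token embedding matrix (vocab_size x hidden_size)."""
    return vocab_size * hidden_size


# Hidden size of xlm-roberta-base. Assumption: reused for the 8,000-token case
# purely for comparison; this model's own hidden size is not given here.
HIDDEN = 768

large_vocab = embedding_params(250_002, HIDDEN)
small_vocab = embedding_params(8_000, HIDDEN)

print(f"250,002-token vocabulary: {large_vocab:,} embedding weights")  # 192,001,536
print(f"8,000-token vocabulary:   {small_vocab:,} embedding weights")  # 6,144,000
```

At that width, the 250,002-token vocabulary alone accounts for roughly 192M of the 279M parameters in xlm-roberta-base, so shrinking the vocabulary is one of the most direct ways to shrink the model.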
Usage
from transformers import pipeline
pipe = pipeline("fill-mask", "seanghay/xlm-roberta-khmer-small")
result = pipe("αα½ααααΈααααα»<mask>!")
print(result)
[
{"score": 0.8130345344543457, "token": 11, "token_str": "ααΆ", "sequence": "αα½ααααΈααααα»ααΆ!"},
{"score": 0.17512884736061096, "token": 160, "token_str": "α", "sequence": "αα½ααααΈααααα»α!"},
{"score": 0.0034702506382018328, "token": 143, "token_str": "ααΆ", "sequence": "αα½ααααΈααααα» ααΆ!"},
{"score": 0.00305828545242548, "token": 16, "token_str": "α", "sequence": "αα½ααααΈααααα»α!"},
{"score": 0.0007526700501330197, "token": 133, "token_str": "α", "sequence": "αα½ααααΈααααα»α!"},
]
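The pipeline returns a list of candidate dictionaries, so a common next step is to pick the best one and optionally reject low-confidence guesses. The sketch below is one way to do that; `best_prediction` and its `min_score` parameter are hypothetical names for illustration and are not part of Transformers. The sample reuses the scores and token ids printed above so that it runs without downloading the model.

```python
def best_prediction(results, min_score=0.0):
    """Return the highest-scoring fill-mask candidate, or None.

    None is returned when the list is empty or the best score is below
    min_score (a hypothetical confidence threshold).
    """
    if not results:
        return None
    best = max(results, key=lambda r: r["score"])
    return best if best["score"] >= min_score else None


# Scores and token ids copied from the output above; other fields omitted.
sample = [
    {"score": 0.8130345344543457, "token": 11},
    {"score": 0.17512884736061096, "token": 160},
    {"score": 0.0034702506382018328, "token": 143},
]

print(best_prediction(sample)["token"])        # 11
print(best_prediction(sample, min_score=0.9))  # None
```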
License
Apache-2.0
Citation
No need. :)