---
library_name: transformers
datasets:
- oscar
- mc4
- rasyosef/amharic-sentences-corpus
language:
- am
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
  example_title: Example 1
- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
  example_title: Example 2
- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
  example_title: Example 3
- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
  example_title: Example 4
---

# bert-tiny-amharic

This model has the same architecture as [bert-tiny](https://huggingface.co/prajjwal1/bert-tiny) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, for a total of **290 million tokens**. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 28k.

It achieves the following results on the evaluation set:
- `Loss: 4.27`
- `Perplexity: 71.52`
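
The two numbers above are related: perplexity is conventionally the exponential of the mean cross-entropy loss. The sketch below assumes the reported loss is a mean masked-token cross-entropy measured in nats, which is an assumption rather than something stated in this card. The helper name `perplexity_from_loss` is illustrative only.

```python
import math


def perplexity_from_loss(loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


# Assumption: 4.27 is the rounded evaluation loss reported above.
reported_loss = 4.27
print(f"perplexity = {perplexity_from_loss(reported_loss):.2f}")
```

Because the loss is rounded to two decimals, the recomputed perplexity can only be expected to match the reported value approximately.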

This model has just `4.18M` parameters.

# How to use
You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-tiny-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

[{'score': 0.5629344582557678,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.3049253523349762,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.0681595504283905,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.028840897604823112,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.008974998258054256,
  'token': 15098,
  'token_str': 'ዘመናት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመናት ተቆጥሯል ።'}]
```
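
The pipeline returns a list of dictionaries sorted by score, so downstream code usually needs only a few fields. As a minimal sketch, the snippet below keeps the candidate tokens whose score clears a threshold. The `predictions` list is copied from the output above (only `score` and `token_str`), and both the helper `confident_fills` and the `0.05` threshold are illustrative assumptions, not part of the model or the `transformers` API.

```python
# Scores and tokens copied from the example output above.
predictions = [
    {"score": 0.5629344582557678, "token_str": "ዓመታት"},
    {"score": 0.3049253523349762, "token_str": "ዓመት"},
    {"score": 0.0681595504283905, "token_str": "አመታት"},
    {"score": 0.028840897604823112, "token_str": "አመት"},
    {"score": 0.008974998258054256, "token_str": "ዘመናት"},
]


def confident_fills(results, min_score=0.05):
    """Return (token, score) pairs whose score is at least min_score."""
    return [(r["token_str"], r["score"]) for r in results if r["score"] >= min_score]


for token, score in confident_fills(predictions):
    print(f"{token}\t{score:.3f}")
```

With this threshold, the first three candidates from the example are kept and the last two are dropped.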

# Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks:

- **Sentiment Classification**
  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
  - Code: https://github.com/rasyosef/amharic-sentiment-classification
- **Named Entity Recognition**
  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
  - Code: https://github.com/rasyosef/amharic-named-entity-recognition

### Finetuned Model Performance
The reported F1 scores are macro averages.

|Model|Size (# params)|Perplexity|Sentiment (F1)|Named Entity Recognition (F1)|
|-----|---------------|----------|--------------|-----------------------------|
|bert-medium-amharic|40.5M|13.74|0.83|0.68|
|bert-small-amharic|27.8M|15.96|0.83|0.68|
|bert-mini-amharic|10.7M|22.42|0.81|0.64|
|**bert-tiny-amharic**|**4.18M**|**71.52**|**0.79**|**0.54**|
|xlm-roberta-base|279M||0.83|0.73|
|am-roberta|443M||0.82|0.69|
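
For readers unfamiliar with macro averaging, the sketch below shows the idea: F1 is computed separately for each class and the per-class scores are then averaged with equal weight, so rare classes count as much as frequent ones. This is a plain-Python illustration on made-up toy labels. It is not the evaluation code from the linked repositories, which may use a library implementation or entity-level matching for NER.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes seen in the data."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined here as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

In this toy case class 0 has F1 of 2/3 and class 1 has F1 of 4/5, so the macro average is their plain mean.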