	---
	library_name: transformers
	datasets:
	- oscar
	- mc4
	- rasyosef/amharic-sentences-corpus
	language:
	- am
	metrics:
	- perplexity
	pipeline_tag: fill-mask
	widget:
	- text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
	  example_title: Example 1
	- text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
	  example_title: Example 2
	- text: ኬንያውያን ከዳር እስከዳር በአንድ ቆመው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
	  example_title: Example 3
	- text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
	  example_title: Example 4
	---

	# bert-tiny-amharic

	This model has the same architecture as [bert-tiny](https://huggingface.co/prajjwal1/bert-tiny) and was pretrained from scratch on the Amharic subsets of the [oscar](https://huggingface.co/datasets/oscar), [mc4](https://huggingface.co/datasets/mc4), and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, a total of 290 million tokens. The tokenizer was also trained from scratch on the same corpus and has a vocabulary size of 28k.

	It achieves the following results on the evaluation set:
	- `Loss: 4.27`
	- `Perplexity: 71.52`
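
The two numbers above are consistent with the usual definition of perplexity as the exponential of the cross-entropy loss. A minimal check, assuming the loss is measured in nats and using the rounded value reported here (`eval_loss` is just an illustrative variable name):

```python
import math

# Rounded evaluation loss from the results above (assumed to be in nats).
eval_loss = 4.27

# Perplexity is exp(cross-entropy loss).
perplexity = math.exp(eval_loss)
print(f"{perplexity:.2f}")  # prints 71.52
```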

	This model has just `4.18M` parameters.

	# How to use
	You can use this model directly with a pipeline for masked language modeling:

	```python
	>>> from transformers import pipeline
	>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-tiny-amharic')
	>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")

	[{'score': 0.5629344582557678,
	'token': 9617,
	'token_str': 'ዓመታት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
	{'score': 0.3049253523349762,
	'token': 9345,
	'token_str': 'ዓመት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
	{'score': 0.0681595504283905,
	'token': 10898,
	'token_str': 'አመታት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
	{'score': 0.028840897604823112,
	'token': 9913,
	'token_str': 'አመት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
	{'score': 0.008974998258054256,
	'token': 15098,
	'token_str': 'ዘመናት',
	'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመናት ተቆጥሯል ።'}]
	```
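
The pipeline returns a plain list of dicts with `score`, `token`, `token_str`, and `sequence` keys, as the output above shows, so it can be post-processed with ordinary Python. The sketch below keeps only candidates above a probability threshold. `top_tokens` and the `min_score` cutoff are hypothetical helpers for illustration, not part of `transformers`, and the sample entries are copied from the output above with rounded scores.

```python
def top_tokens(predictions, min_score=0.05):
    """Return (token_str, score) pairs with score >= min_score, best first."""
    kept = [(p["token_str"], p["score"]) for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Entries copied from the example output above (scores rounded, 'sequence' omitted).
predictions = [
    {"score": 0.5629, "token": 9617, "token_str": "ዓመታት"},
    {"score": 0.3049, "token": 9345, "token_str": "ዓመት"},
    {"score": 0.0682, "token": 10898, "token_str": "አመታት"},
    {"score": 0.0288, "token": 9913, "token_str": "አመት"},
]

candidates = top_tokens(predictions)
for token_str, score in candidates:
    print(f"{token_str}\t{score:.3f}")
```

In real use, `predictions` would be the list returned by `unmasker(...)`.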

	# Finetuning

	This model was finetuned and evaluated on the following Amharic NLP tasks:

	- Sentiment Classification
	  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
	  - Code: https://github.com/rasyosef/amharic-sentiment-classification
	- Named Entity Recognition
	  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
	  - Code: https://github.com/rasyosef/amharic-named-entity-recognition

	### Finetuned Model Performance
	The reported F1 scores are macro averages.

	|Model|Size (# params)|Perplexity|Sentiment (F1)|Named Entity Recognition (F1)|
	|-----|---------------|----------|--------------|-----------------------------|
	|bert-medium-amharic|40.5M|13.74|0.83|0.68|
	|bert-small-amharic|27.8M|15.96|0.83|0.68|
	|bert-mini-amharic|10.7M|22.42|0.81|0.64|
	|bert-tiny-amharic|4.18M|71.52|0.79|0.54|
	|xlm-roberta-base|279M| |0.83|0.73|
	|am-roberta|443M| |0.82|0.69|
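
For reference, a macro-averaged F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The following is a minimal pure-Python sketch of that definition on made-up labels. `macro_f1` is an illustrative helper, and the linked finetuning repositories may compute the metric differently (for example with a library, or at the entity level for NER).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels for illustration only (not taken from the datasets above).
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
score = macro_f1(y_true, y_pred)
print(f"{score:.4f}")
```

Here the `pos` class scores 0.8 and the `neg` class scores about 0.667, and the macro average weights them equally even though `pos` is three times as frequent.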