nikitast/lang-classifier-roberta


Add multilingual to the language tag (#1)

7c97c98 almost 3 years ago


	---
	language:
	- ru
	- uk
	- be
	- kk
	- az
	- hy
	- ka
	- he
	- en
	- de
	- multilingual
	tags:
	- language classification
	datasets:
	- open_subtitles
	- tatoeba
	- oscar
	---

	# RoBERTa for Single Language Classification
	## Training
	A RoBERTa model fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).

	| data source | languages |
	|----------------|----------------|
	| open_subtitles | ka, he, en, de |
	| oscar | be, kk, az, hy |
	| tatoeba | ru, uk |
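
A minimal usage sketch, assuming the model is loaded through the `transformers` text-classification pipeline under the repository name `nikitast/lang-classifier-roberta`. The helper names `load_classifier` and `best_label` are hypothetical, and the example prediction is made up to show the expected shape. The label strings the model actually returns depend on its config and are not stated in this card.

```python
from typing import Dict, List

MODEL_ID = "nikitast/lang-classifier-roberta"  # repository name of this card


def load_classifier():
    """Build a text-classification pipeline for the model (downloads weights)."""
    # Imported lazily so the helper below can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def best_label(predictions: List[Dict]) -> str:
    """Return the label with the highest score from pipeline-style output."""
    return max(predictions, key=lambda p: p["score"])["label"]


# Hypothetical output, only to illustrate the shape the helper expects.
example = [{"label": "ru", "score": 0.97}, {"label": "uk", "score": 0.03}]
print(best_label(example))  # ru
```

With `transformers` installed, `best_label(load_classifier()("Привет, как дела?"))` would return whatever label the model assigns to the sentence.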

	## Validation
	The metrics were obtained by validating on a separate part of the datasets (~1k samples per language).

	|index|class|f1-score|precision|recall|support|
	|---|---|---|---|---|---|
	|0|az|0.998|0.997|1.0|997|
	|1|be|0.996|0.998|0.994|1004|
	|2|de|0.976|0.966|0.987|979|
	|3|en|0.976|0.986|0.967|1020|
	|4|he|1.0|1.0|0.999|1001|
	|5|hy|0.994|0.991|0.998|993|
	|6|ka|0.999|0.999|0.999|1000|
	|7|kk|0.996|0.998|0.993|1005|
	|8|uk|0.982|0.997|0.968|1030|
	|9|ru|0.982|0.968|0.997|971|
	|10|macro avg|0.99|0.99|0.99|10000|
	|11|weighted avg|0.99|0.99|0.99|10000|
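
The table can be sanity-checked with a short script. The sketch below copies the per-class rows from the table, recomputes F1 as the harmonic mean of precision and recall, and derives the macro and weighted averages from the reported F1 scores. It is a consistency check on the rounded values, not the evaluation code used for the card.

```python
# Per-class rows copied from the validation table:
# class -> (precision, recall, reported f1, support)
rows = {
    "az": (0.997, 1.0, 0.998, 997),
    "be": (0.998, 0.994, 0.996, 1004),
    "de": (0.966, 0.987, 0.976, 979),
    "en": (0.986, 0.967, 0.976, 1020),
    "he": (1.0, 0.999, 1.0, 1001),
    "hy": (0.991, 0.998, 0.994, 993),
    "ka": (0.999, 0.999, 0.999, 1000),
    "kk": (0.998, 0.993, 0.996, 1005),
    "uk": (0.997, 0.968, 0.982, 1030),
    "ru": (0.968, 0.997, 0.982, 971),
}


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest difference between the recomputed and the reported F1.
max_gap = max(abs(f1_from(p, r) - f1) for p, r, f1, _ in rows.values())

total_support = sum(s for _, _, _, s in rows.values())

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1 for _, _, f1, _ in rows.values()) / len(rows)

# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1 * s for _, _, f1, s in rows.values()) / total_support

print(f"largest F1 gap: {max_gap:.4f}")
print(f"total support:  {total_support}")
print(f"macro F1:       {macro_f1:.2f}")
print(f"weighted F1:    {weighted_f1:.2f}")
```

Because the table reports precision and recall rounded to three decimals, the recomputed F1 can differ from the reported value in the last digit; the gap stays below 0.001, and both averages round to 0.99.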