nikitast/lang-segmentation-roberta

Token Classification

	---
	language:
	- ru
	- uk
	- be
	- kk
	- az
	- hy
	- ka
	- he
	- en
	- de
	- multilingual
	tags:
	- language classification
	- text segmentation
	datasets:
	- open_subtitles
	- tatoeba
	- oscar
	---

	# RoBERTa for Multilabel Language Segmentation
	## Training
	A RoBERTa model fine-tuned on small subsets of the Open Subtitles, OSCAR and Tatoeba datasets (~9k samples per language).

	Multilingual training samples and their target masks were generated with a heuristic algorithm, implemented at https://github.com/n1kstep/lang-classifier

	| Data source    | Languages      |
	|----------------|----------------|
	| open_subtitles | ka, he, en, de |
	| oscar          | be, kk, az, hy |
	| tatoeba        | ru, uk         |
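
The idea of building multilingual samples together with target masks can be illustrated with a minimal sketch. It is not the algorithm from the linked repository: the toy corpus, the function name `make_mixed_sample`, and the word-level mask format are all assumptions made for illustration.

```python
import random

# Hypothetical toy corpus; the real data came from Open Subtitles, OSCAR and Tatoeba.
CORPUS = {
    "en": ["where are you going", "this is a test"],
    "de": ["ich habe keine zeit", "das ist gut"],
    "ru": ["я не знаю", "это хорошо"],
}


def make_mixed_sample(corpus, n_segments, rng):
    """Join sentences from randomly chosen languages and label every word.

    Returns (words, mask), where mask[i] is the language code of words[i].
    Illustrative assumption only, not the heuristic used to train the model.
    """
    words, mask = [], []
    # Pick distinct languages so each segment switches language.
    langs = rng.sample(sorted(corpus), k=n_segments)
    for lang in langs:
        segment = rng.choice(corpus[lang]).split()
        words.extend(segment)
        mask.extend([lang] * len(segment))
    return words, mask


rng = random.Random(0)  # fixed seed for reproducibility
words, mask = make_mixed_sample(CORPUS, n_segments=2, rng=rng)
for word, lang in zip(words, mask):
    print(f"{word}\t{lang}")
```

Each printed line pairs a word with its language code, which is the kind of per-token target a token classification model would be trained to predict.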

	## Validation
	Metrics were obtained by validating on a separate part of the dataset (~1k samples per language).

	| Validation Loss | Precision | Recall   | F1-Score | Accuracy |
	|-----------------|-----------|----------|----------|----------|
	| 0.029172        | 0.919623  | 0.933586 | 0.926552 | 0.991883 |
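
As a quick consistency check, the reported F1-Score matches the harmonic mean of the reported precision and recall. The sketch below only recomputes that relationship from the table values. How the metrics were averaged (per token, per span, or per language) is not stated in this card, so no assumption is made about that.

```python
# Values copied from the validation table.
precision = 0.919623
recall = 0.933586
reported_f1 = 0.926552

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"computed F1: {f1:.6f}")
print(f"difference from reported: {abs(f1 - reported_f1):.1e}")
```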