nikitast
/

lang-segmentation-roberta

Token Classification

language classification

text segmentation

nikitast committed on Jul 18, 2022

Commit 2e44dd4 · 1 Parent(s): d0ae5d3

Create README.md

Files changed (1)

README.md +39 -0

README.md ADDED

	@@ -0,0 +1,39 @@

+---
+language:
+- ru
+- uk
+- be
+- kk
+- az
+- hy
+- ka
+- he
+- en
+- de
+tags:
+- language classification
+- text segmentation
+datasets:
+- open_subtitles
+- tatoeba
+- oscar
+---
+# RoBERTa for Multilabel Language Segmentation
+## Training
+RoBERTa fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).
+Multilingual training data and the corresponding target masks were generated with a heuristic algorithm; the implementation is available at https://github.com/n1kstep/lang-classifier
+| Data source    | Languages      |
+|----------------|----------------|
+| open_subtitles | ka, he, en, de |
+| oscar          | be, kk, az, hy |
+| tatoeba        | ru, uk         |
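
The README mentions a heuristic that builds multilingual samples together with target masks but does not describe it. The following is only a minimal sketch of one plausible approach, not the author's algorithm: it concatenates monolingual sentences and labels every token with the language of the sentence it came from. The function name `make_mixed_sample`, the toy corpus, and the whitespace tokenization are all assumptions made for illustration; the actual model operates on subword tokens.

```python
import random
from typing import Dict, List, Tuple


def make_mixed_sample(
    sentences_by_lang: Dict[str, List[str]],
    n_segments: int,
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Build one mixed-language sample and its per-token language mask.

    Hypothetical helper: picks `n_segments` sentences from random languages,
    concatenates their whitespace-separated tokens, and records the source
    language of every token.
    """
    tokens: List[str] = []
    mask: List[str] = []
    langs = sorted(sentences_by_lang)  # sorted for reproducible sampling
    for _ in range(n_segments):
        lang = rng.choice(langs)
        sentence = rng.choice(sentences_by_lang[lang])
        words = sentence.split()  # assumption: whitespace tokenization
        tokens.extend(words)
        mask.extend([lang] * len(words))
    return tokens, mask


# Toy corpus invented for this example; real data would come from the
# datasets listed in the table above.
corpus = {
    "en": ["good morning", "see you tomorrow"],
    "de": ["guten Morgen", "bis morgen"],
    "ru": ["доброе утро", "до завтра"],
}

tokens, mask = make_mixed_sample(corpus, n_segments=3, rng=random.Random(0))
for token, lang in zip(tokens, mask):
    print(f"{token}\t{lang}")
```

Each printed row pairs a token with its language label, which is the kind of target a token classification model can be trained on.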
+## Validation
+The following metrics were obtained by validating on a separate part of the dataset (~1k samples per language).
+| Validation Loss | Precision | Recall   | F1-Score | Accuracy |
+|-----------------|-----------|----------|----------|----------|
+| 0.029172        | 0.919623  | 0.933586 | 0.926552 | 0.991883 |
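
As a quick consistency check on the table above, the F1-score can be recomputed from the reported precision and recall, assuming it is the usual harmonic mean of the two (the README does not state how the metrics were aggregated):

```python
# Values copied from the validation table.
precision = 0.919623
recall = 0.933586
reported_f1 = 0.926552

# F1 as the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed F1 = {f1:.6f}, reported F1 = {reported_f1:.6f}")
```

Under that assumption, the recomputed value agrees with the reported F1-score to within rounding.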