nikitast/multilang-classifier-roberta

nikitast committed on Jul 18, 2022

Commit 475e27c · 1 Parent(s): 5f5fa73

Create README.md

Files changed (1)

README.md +38 -0

README.md ADDED

@@ -0,0 +1,38 @@

+---
+language:
+- ru
+- uk
+- be
+- kk
+- az
+- hy
+- ka
+- he
+- en
+- de
+tags:
+- language classification
+datasets:
+- open_subtitles
+- tatoeba
+- oscar
+---
+# RoBERTa for Multilabel Language Classification
+## Training
+RoBERTa fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).
+A heuristic algorithm was implemented to create the multilingual training data: https://github.com/n1kstep/lang-classifier
+| Data source    | Languages      |
+|----------------|----------------|
+| open_subtitles    | ka, he, en, de |
+| oscar          | be, kk, az, hy |
+| tatoeba     | ru, uk         |
+## Validation
+The metrics were obtained by validating on a separate part of the dataset (~1k samples per language).
+| Training Loss | Validation Loss | F1-Score | ROC AUC  | Accuracy | Support |
+|---------------|-----------------|----------|----------|----------|---------|
+| 0.161500      | 0.110949        | 0.947844 | 0.953939 | 0.762063 | 26858   |
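
The README does not say how these metrics were computed. One reading that fits a multilabel setup is that F1 is micro-averaged over every (sample, language) decision while Accuracy is exact-match (subset) accuracy, which would explain why Accuracy (0.76) sits well below F1 (0.95). The sketch below illustrates that gap under this assumption; the logits, the 0.5 threshold, and the three-language label subset are hypothetical and are not taken from the model.

```python
import math

# Hypothetical label subset; the real model covers ten languages.
LABELS = ["ru", "uk", "en"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Turn per-label logits into independent 0/1 decisions (multilabel)."""
    return [[1 if sigmoid(z) >= threshold else 0 for z in row] for row in logits]


def micro_f1(y_true, y_pred):
    """F1 computed over all (sample, label) pairs pooled together."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


# Hypothetical gold labels and logits for four texts.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
logits = [
    [2.0, -3.0, -1.0],
    [1.5, -0.5, -2.0],  # misses "uk" in a mixed ru+uk text
    [-2.0, -1.0, 3.0],
    [-1.0, 0.8, 1.2],
]
y_pred = decode(logits)

print("micro F1:", round(micro_f1(y_true, y_pred), 3))
print("subset accuracy:", subset_accuracy(y_true, y_pred))
```

A single missed language in one mixed-language sample costs only one of eleven pooled decisions in micro F1, but it marks the whole sample as wrong for subset accuracy.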