SebOchs/canine-c-lang-id

Text Classification

Language Identification

SebOchs committed on Dec 17, 2022

Commit

d06ff02

·

1 Parent(s): 2790ac2

Update README.md

Files changed (1)

README.md +21 -0

@@ -266,6 +266,27 @@ Canine model trained on WiLI-2018 dataset to identify the language of a text.
   - Accuracy: 94,92%
   - Macro F1-score: 94,91%
+### Inference
+Build a dictionary that maps each label id to its English language name:
+```python
+import datasets
+import pycountry
+def int_to_lang():
+    dataset = datasets.load_dataset('wili_2018')
+    # Names (taken from Wikipedia) for labels that are not ISO 639-3 codes
+    non_iso_languages = {'roa-tara': 'Tarantino', 'zh-yue': 'Cantonese', 'map-bms': 'Banyumasan',
+                         'nds-nl': 'Dutch Low Saxon', 'be-tarask': 'Belarusian'}
+    # Build a dictionary from dataset label ids to language names
+    lab_to_lang = {}
+    for i, lang in enumerate(dataset['train'].features['label'].names):
+        full_lang = pycountry.languages.get(alpha_3=lang)
+        if full_lang:
+            lab_to_lang[i] = full_lang.name
+        else:
+            lab_to_lang[i] = non_iso_languages[lang]
+    return lab_to_lang
+```
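The mapping built above is only useful once it is paired with the model's output. The following is a minimal sketch, not part of this commit, of how raw class scores could be turned into ranked language names. `top_languages`, `toy_mapping`, and `toy_logits` are hypothetical names chosen for illustration, and the three-entry mapping merely stands in for the full dictionary returned by `int_to_lang()`.

```python
import math


def top_languages(logits, lab_to_lang, k=3):
    """Return the k most likely (language name, probability) pairs."""
    # Numerically stable softmax over the raw class scores.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort label ids by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(lab_to_lang[i], probs[i]) for i in ranked]


# Toy stand-in for the real mapping (assumption: the real one has one entry per label).
toy_mapping = {0: "English", 1: "German", 2: "Tarantino"}
# Hypothetical scores for a single input text.
toy_logits = [0.2, 2.5, -1.0]
print(top_languages(toy_logits, toy_mapping, k=2))
```

In practice the scores would presumably come from the classifier's forward pass for one input text, with `lab_to_lang` being the dictionary returned by `int_to_lang()`. That assumes the model's output indices follow the same label order as the dataset, which this commit does not state explicitly.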
 ### Credit to
 ```
 @article{clark-etal-2022-canine,