guymorlan/levanti_arabic2diacritics

Token Classification

guymorlan committed on Jul 10, 2024

Commit 4eafff8 · verified · 1 Parent(s): bb0243e

Update README.md

Files changed (1)

README.md +54 -3

@@ -1,3 +1,54 @@
----
-license: cc-by-nc-4.0
----

+---
+license: cc-by-nc-4.0
+language:
+- ar
+pipeline_tag: token-classification
+datasets:
+- guymorlan/levanti
+- community-datasets/tashkeela
+---
+# Levanti Diacritizer
+This model adds diacritics to raw text in Palestinian colloquial Arabic.
+The model is trained on a diacritized subset of the Levanti dataset (to be released later).
+It is fine-tuned from Google's [CANINE-s](https://huggingface.co/google/canine-s) character-level language model, with a multi-label token classification head.
+CANINE-s is first pre-trained on the Tashkeela dataset of diacritized Classical Arabic text (after removing final diacritics from the text) and then trained for an additional 5 epochs on the diacritized subset of the Levanti dataset.
+Each token (letter) of the input is classified into 5 positive categories: Shadda, Fatha, Kasra, Damma and Sukun (see `model.config.id2label`). A multi-label model is used because a Shadda can co-occur with another diacritical mark on the same letter.
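The multi-label decision described above can be sketched in plain Python. This is only an illustration: the label order follows the usage example in this card (index 0 is Shadda), `decode_labels` is a hypothetical helper, and the logits are made-up numbers rather than real model output.

```python
import math

# Label order assumed from the usage example in this card:
# index 0 is Shadda, indices 1-4 are Fatha, Kasra, Damma and Sukun.
LABELS = ["Shadda", "Fatha", "Kasra", "Damma", "Sukun"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decode_labels(logits, threshold=0.5):
    """Return every label whose independent sigmoid probability exceeds the threshold."""
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) > threshold]

# Hypothetical logits for single characters (illustrative numbers only).
doubled_with_fatha = [2.1, 1.7, -3.0, -2.5, -4.0]
bare_letter = [-3.2, -2.8, -3.5, -3.1, -2.9]

print(decode_labels(doubled_with_fatha))  # ['Shadda', 'Fatha']
print(decode_labels(bare_letter))         # []
```

Because each label is thresholded independently, a letter can receive a Shadda together with a vowel mark, which a single softmax over mutually exclusive classes could not express without enumerating every combination.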
+# Transliterator
+This model can be used in conjunction with [Levanti Transliterator](https://huggingface.co/guymorlan/levanti_diacritics2translit/), which transliterates diacritized text in Palestinian Arabic.
+# Example Usage
+```python
+from transformers import CanineForTokenClassification, AutoTokenizer
+model = CanineForTokenClassification.from_pretrained("guymorlan/levanti_arabic2diacritics")
+tokenizer = AutoTokenizer.from_pretrained("guymorlan/levanti_arabic2diacritics")
+# Label index -> combining mark: Shadda, Fatha, Kasra, Damma, Sukun
+label2diacritic = {0: 'ّ', 1: 'َ', 2: 'ِ', 3: 'ُ', 4: 'ْ'}
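+def arabic2diacritics(text, model, tokenizer):
+    tokens = tokenizer(text, return_tensors="pt")
+    # Multi-label head: threshold each label's sigmoid independently.
+    preds = (model(**tokens).logits.sigmoid() > 0.5)[0]
+    # CANINE tokenizes one character per token and wraps the input in
+    # [CLS] ... [SEP]; drop both so that preds[i] lines up with text[i].
+    preds = preds[1:-1]
+    new_text = []
+    for p, c in zip(preds, text):
+        # Combining marks follow their base letter.
+        new_text.append(c)
+        for i in range(1, 5):
+            if p[i]:
+                new_text.append(label2diacritic[i])
+        # check shadda last
+        if p[0]:
+            new_text.append(label2diacritic[0])
+    return "".join(new_text)
+
+text = "بديش اروح عالمدرسة بكرا"
+print(arabic2diacritics(text, model, tokenizer))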
+```
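Since the model is described as operating on raw text, input that already carries some diacritics may need to be cleaned first. The following is a minimal sketch of such a preprocessing step, assuming the model expects fully undiacritized input; `strip_diacritics` is a hypothetical helper and not part of the model repository.

```python
import re

# Arabic combining marks U+064B-U+0652: the three tanween forms,
# Fatha, Damma, Kasra, Shadda and Sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks, leaving base letters and spacing untouched."""
    return DIACRITICS.sub("", text)

# The first word of the example sentence, with a Fatha and a Shadda added.
partly_marked = "\u0628\u064E\u062F\u0651\u064A\u0634"
print(strip_diacritics(partly_marked) == "\u0628\u062F\u064A\u0634")  # True
```

The cleaned string can then be passed to `arabic2diacritics` in place of the original input.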
+# Attribution
+Created by Guy Mor-Lan.<br>
+Contact: guy.mor AT mail.huji.ac.il