---
inference: false
---
# Model Card for Token-DI

[![GitHub](https://img.shields.io/badge/💻-GitHub%20-black.svg)](https://github.com/AMR-KELEG/ALDi) [![Huggingface Space](https://img.shields.io/badge/🤗-Demo%20-yellow.svg)](https://huggingface.co/spaces/AMR-KELEG/ALDi)

<!-- Provide a quick summary of what the model is/does. -->

A BERT-based model fine-tuned to tag each token in a sentence as Modern Standard Arabic (MSA) or Dialectal Arabic (DA).

Model | Link on 🤗
---|---
**Sentence-ALDi** (random seed: 42) | https://huggingface.co/AMR-KELEG/Sentence-ALDi
Sentence-ALDi (random seed: 30) | https://huggingface.co/AMR-KELEG/Sentence-ALDi-30
Sentence-ALDi (random seed: 50) | https://huggingface.co/AMR-KELEG/Sentence-ALDi-50
**Token-DI** (random seed: 42) | https://huggingface.co/AMR-KELEG/ALDi-Token-DI
Token-DI (random seed: 30) | https://huggingface.co/AMR-KELEG/ALDi-Token-DI-30
Token-DI (random seed: 50) | https://huggingface.co/AMR-KELEG/ALDi-Token-DI-50

### Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification


def tokenize_text(text):
    """Tokenize a string on whitespace."""
    return text.split()


def tag_sentence(text, print_tags=False):
    logits = model(
        **tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    ).logits

    # Ignore the labels for [CLS] and [SEP]
    subwords_labels = logits.argmax(axis=-1).numpy()[0][1:-1]
    tokens = tokenize_text(text)
    subwords = [
        tokenizer.tokenize(token) for token in tokens if tokenizer.tokenize(token)
    ]
    n_subwords = [len(l) for l in subwords]
    first_subword_indices = [sum(n_subwords[0:i]) for i in range(len(n_subwords))]

    # Each token inherits the label of its first subword; tokens whose first
    # subword was truncated away (beyond 510 subwords + [CLS]/[SEP]) are dropped
    tokens_labels = [
        subwords_labels[index] for index in first_subword_indices if index < 510
    ]
    tokens_tags = [INDICES_TO_TAGS[l] for l in tokens_labels]

    if print_tags:
        for token, token_tag in zip(tokens, tokens_tags):
            print(f"{token} -> {token_tag}")
        print()

    # Compute the CMI (Code-Mixing Index)
    # Ignore: "ambiguous", "ne" (named entity), "other" (emojis, ...)
    n_msa_tokens = sum(t == "lang1" for t in tokens_tags)
    n_da_tokens = sum(t in ["lang2", "mixed"] for t in tokens_tags)

    if n_msa_tokens + n_da_tokens != 0:
        return n_da_tokens / (n_msa_tokens + n_da_tokens)
    else:
        return 0


if __name__ == "__main__":
    model_name = "AMR-KELEG/ALDi-Token-DI"

    # lang1 -> MSA, lang2 -> Dialectal Arabic (DA)
    TAGS = ["ambiguous", "lang1", "lang2", "mixed", "ne", "other"]
    INDICES_TO_TAGS = {i: tag for i, tag in enumerate(TAGS)}

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=len(TAGS)
    )

    # Example usage
    sentence = "ما هذا يا فتي؟ أنت متأكد من كده؟"
    CMI_index = tag_sentence(sentence, print_tags=True)
    print(f"CMI Index (as a proxy for ALDi): {CMI_index:.2f}")
```
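
The two pure-Python steps inside `tag_sentence` (mapping each whitespace token to the index of its first subword, and computing the Code-Mixing Index from the resulting tags) can be checked in isolation, without loading the model. This is a minimal sketch; the helper names `first_subword_positions` and `compute_cmi` are illustrative and not part of the released code.

```python
def first_subword_positions(n_subwords):
    """Given the number of subwords per token, return the index of each
    token's first subword in the flattened subword sequence."""
    positions, offset = [], 0
    for n in n_subwords:
        positions.append(offset)
        offset += n
    return positions


def compute_cmi(tokens_tags):
    """Code-Mixing Index: share of DA tokens ("lang2"/"mixed") among all
    MSA+DA tokens; "ambiguous", "ne", and "other" tags are ignored."""
    n_msa = sum(t == "lang1" for t in tokens_tags)
    n_da = sum(t in ("lang2", "mixed") for t in tokens_tags)
    return n_da / (n_msa + n_da) if n_msa + n_da else 0


# Tokens split into 2, 1, and 3 subwords start at subword indices 0, 2, 3
print(first_subword_positions([2, 1, 3]))  # [0, 2, 3]
# 2 DA-like tags out of 3 MSA+DA tags -> CMI = 2/3; "ne" is ignored
print(round(compute_cmi(["lang1", "lang2", "mixed", "ne"]), 2))  # 0.67
```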

### Model Description

<!-- Provide a longer summary of what this model is. -->

<!-- - **Developed by:** Amr Keleg -->
- **Model type:** Token classification head on top of a BERT-based model.
- **Language(s) (NLP):** Arabic.
<!--- **License:** [More Information Needed] -->
- **Finetuned from model:** [MarBERT](https://huggingface.co/UBC-NLP/MARBERT)
- **Dataset:** [MSA-DA code-switched dataset](https://aclanthology.org/W16-5805.pdf)

### Citation

If you find the model useful, please cite the following paper:
```bibtex
@inproceedings{keleg-etal-2023-aldi,
    title = "{ALD}i: Quantifying the {A}rabic Level of Dialectness of Text",
    author = "Keleg, Amr  and
      Goldwater, Sharon  and
      Magdy, Walid",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.655",
    doi = "10.18653/v1/2023.emnlp-main.655",
    pages = "10597--10611",
    abstract = "Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17{\%} from news articles and 83{\%} from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers{'} stylistic choices in different situations, a useful property for sociolinguistic analyses.",
}
```