mahmoudsaalama committed on
Commit 9bdf310 · verified · 1 Parent(s): 4125723

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,166 @@
1
+ ---
2
+ language:
3
+ - ar
4
+ tags:
5
+ - arabic
6
+ - end-of-utterance
7
+ - eou-detection
8
+ - saudi-dialect
9
+ - conversational-ai
10
+ - turn-detection
11
+ - camelbert
12
+ base_model: CAMeL-Lab/bert-base-arabic-camelbert-msa
13
+ license: mit
14
+ ---
15
+
16
+ # Arabic End-of-Utterance Detection Model
17
+
18
+ Fine-tuned CAMeLBERT model for detecting end-of-utterance in Arabic conversations, with emphasis on Saudi dialect.
19
+
20
+ ## Model Description
21
+
22
+ This model is designed to detect when a speaker has finished their conversational turn in Arabic dialogue. It's particularly optimized for Saudi dialect patterns and real-time applications.
23
+
24
+ ### Model Details
25
+
26
+ - **Base Model**: CAMeLBERT-MSA (CAMeL-Lab/bert-base-arabic-camelbert-msa)
27
+ - **Task**: Binary classification (EOU vs. non-EOU)
28
+ - **Language**: Arabic (Modern Standard Arabic + Saudi dialect)
29
+ - **Parameters**: ~110M (base encoder) + classification head
30
+ - **Training Data**: ~2,000 Arabic conversation samples
31
+
32
+ ### Intended Use
33
+
34
+ - Real-time turn detection in conversational AI agents
35
+ - Voice assistants for Arabic speakers
36
+ - Dialogue systems
37
+ - LiveKit agent integration
38
+
39
+ ## How to Use
40
+
41
+ ### Installation
42
+
43
+ ```bash
44
+ pip install torch transformers
45
+ ```
46
+
47
+ ### Basic Usage
48
+
49
+ ```python
50
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
51
+ import torch
52
+
53
+ # Load model and tokenizer (AutoModel has no `.logits`; this assumes a single-logit classification head)
54
+ model_name = "mahmoudsaalama/arabic-eou-camelbert"
55
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
56
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
57
+
58
+ # Prepare input
59
+ text = "السلام عليكم ورحمة الله"
60
+ inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True)
61
+
62
+ # Get prediction
63
+ with torch.no_grad():
64
+     outputs = model(**inputs)
65
+     probability = torch.sigmoid(outputs.logits).item()
66
+ is_eou = probability > 0.5
67
+
68
+ print(f"EOU Probability: {probability:.4f}")
69
+ print(f"Is EOU: {is_eou}")
70
+ ```
71
+
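The decision rule used in the snippet above (sigmoid of a single logit, then a 0.5 cutoff) can be sketched without loading any model. This is a minimal illustration rather than the model's own code: `eou_decision` and its `threshold` parameter are hypothetical names, and the logits are made-up inputs.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def eou_decision(logit: float, threshold: float = 0.5):
    """Return (probability, is_eou) for one logit. Hypothetical helper."""
    probability = sigmoid(logit)
    return probability, probability > threshold


for logit in (-2.0, 0.0, 2.0):  # made-up logits, for illustration only
    probability, is_eou = eou_decision(logit)
    print(f"logit={logit:+.1f}  p={probability:.4f}  is_eou={is_eou}")
```

With a 0.5 threshold the rule is the same as checking whether the logit is positive. Raising the threshold typically trades recall for precision, which may matter when an early EOU decision would cut a speaker off.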
72
+ ### Using the SDK
73
+
74
+ For easier integration, use the Arabic EOU SDK:
75
+
76
+ ```bash
77
+ pip install arabic-eou-sdk
78
+ ```
79
+
80
+ ```python
81
+ from arabic_eou_sdk import ArabicEOUDetector
82
+
83
+ detector = ArabicEOUDetector(model_name="mahmoudsaalama/arabic-eou-camelbert")
84
+ result = detector.update_transcription("السلام عليكم", is_final=True)
85
+
86
+ print(f"Is EOU: {result['is_eou']}")
87
+ print(f"Probability: {result['probability']:.4f}")
88
+ print(f"Confidence: {result['confidence']:.4f}")
89
+ ```
90
+
91
+ ## Training Details
92
+
93
+ ### Training Data
94
+
95
+ - **Size**: ~2,000 samples (1,600 train, 200 val, 200 test)
96
+ - **Balance**: 50% positive (EOU), 50% negative (non-EOU)
97
+ - **Sources**: Synthetic Saudi Arabic conversations + public Arabic datasets
98
+
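The README does not say how the split was produced. As one possible approach, an 80/10/10 split that preserves the 50/50 class balance could look like the sketch below. `stratified_split` is a hypothetical helper and the samples are placeholders, not the actual training data.

```python
import random


def stratified_split(samples, seed=0):
    """Split (text, label) pairs 80/10/10 while keeping the label balance."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in (0, 1):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        n_val = n_test = len(group) // 10
        val += group[:n_val]
        test += group[n_val:n_val + n_test]
        train += group[n_val + n_test:]
    return train, val, test


# Placeholder data: 1,000 samples per class, mirroring the 50/50 balance.
samples = [(f"utt-{i}", i % 2) for i in range(2000)]
train, val, test = stratified_split(samples)
print(len(train), len(val), len(test))
```

Splitting each class separately keeps every subset balanced, so accuracy on the validation and test sets is not skewed by a majority class.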
99
+ ### Training Procedure
100
+
101
+ - **Optimizer**: AdamW
102
+ - **Learning Rate**: 2e-5
103
+ - **Batch Size**: 16
104
+ - **Epochs**: 10 (with early stopping)
105
+ - **Mixed Precision**: FP16
106
+ - **Hardware**: GPU (CUDA)
107
+
108
+ ### Evaluation Metrics
109
+
110
+ | Metric | Score |
111
+ |--------|-------|
112
+ | Accuracy | ~90% |
113
+ | Precision | ~88% |
114
+ | Recall | ~92% |
115
+ | F1 Score | ~90% |
116
+ | ROC AUC | ~95% |
117
+
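As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below plugs in the approximate reported values; the inputs are rounded figures from the table, so the result is only indicative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Approximate values taken from the metrics table.
precision, recall = 0.88, 0.92
print(f"F1 = {f1_score(precision, recall):.3f}")
```

The result is about 0.90, which agrees with the reported F1 score.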
118
+ ### Inference Speed
119
+
120
+ | Configuration | Latency |
121
+ |--------------|---------|
122
+ | GPU (FP32) | ~15-20 ms |
123
+ | GPU (INT8) | ~8-12 ms |
124
+ | CPU (FP32) | ~60-80 ms |
125
+ | CPU (INT8) | ~25-35 ms |
126
+
127
+ ## Limitations
128
+
129
+ - **Dialectal Coverage**: Optimized for Saudi dialect; may not generalize to other Arabic dialects
130
+ - **Synthetic Data**: Trained primarily on synthetic conversations
131
+ - **Domain**: Limited to common conversational topics
132
+ - **Dataset Size**: Relatively small training set (~2,000 samples)
133
+
134
+ ## Bias and Fairness
135
+
136
+ - Model may perform better on Saudi dialect than other Arabic dialects
137
+ - Training data focuses on common conversational patterns
138
+ - May not handle code-switching or mixed-language conversations well
139
+
140
+ ## Citation
141
+
142
+ ```bibtex
143
+ @misc{arabic_eou_camelbert_2025,
144
+ author = {Mahmoud Saalama},
145
+ title = {Arabic End-of-Utterance Detection Model},
146
+ year = {2025},
147
+ publisher = {Hugging Face},
148
+ url = {https://huggingface.co/mahmoudsaalama/arabic-eou-camelbert}
149
+ }
150
+ ```
151
+
152
+ ## License
153
+
154
+ MIT License
155
+
156
+ ## Contact
157
+
158
+ For questions or feedback:
159
+ - GitHub: [arabic-eou-livekit](https://github.com/mahmoudsaalama/arabic-eou-livekit)
160
+ - Hugging Face: [@mahmoudsaalama](https://huggingface.co/mahmoudsaalama)
161
+
162
+ ## Acknowledgments
163
+
164
+ - Base model: [CAMeLBERT](https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa) by CAMeL Lab
165
+ - Framework: [Transformers](https://huggingface.co/transformers) by Hugging Face
166
+ - Integration: [LiveKit](https://livekit.io) for real-time applications
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4b2b9847f8f73b4d7eb05bc48b3eda0fdb38f3d5efd0046699eac02b3600d0e0
3
+ size 437196367
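The Git LFS pointer above records only a hash and a byte count, but the size allows a rough sanity check against the "~110M parameters" figure in the README. The sketch below assumes FP32 weights (4 bytes per parameter) and ignores serialization overhead; `parse_lfs_pointer` is a hypothetical helper.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4b2b9847f8f73b4d7eb05bc48b3eda0fdb38f3d5efd0046699eac02b3600d0e0
size 437196367
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())


pointer = parse_lfs_pointer(POINTER)
size_bytes = int(pointer["size"])
# Assumption: 4 bytes per parameter (FP32), serialization overhead ignored.
approx_params = size_bytes / 4
print(f"{size_bytes / 2**20:.1f} MiB, ~{approx_params / 1e6:.1f}M parameters")
```

Under that assumption the file holds roughly 109M parameters, in line with a BERT-base encoder plus a small classification head.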
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,59 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "extra_special_tokens": {},
49
+ "full_tokenizer_file": null,
50
+ "mask_token": "[MASK]",
51
+ "model_max_length": 1000000000000000019884624838656,
52
+ "never_split": null,
53
+ "pad_token": "[PAD]",
54
+ "sep_token": "[SEP]",
55
+ "strip_accents": null,
56
+ "tokenize_chinese_chars": true,
57
+ "tokenizer_class": "BertTokenizer",
58
+ "unk_token": "[UNK]"
59
+ }
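The `model_max_length` value above is exactly what `int(1e30)` evaluates to, which Transformers uses as a placeholder when no length limit is configured. That is presumably why the README usage passes `max_length=128` explicitly. The sketch below verifies the value; `effective_max_length` is a hypothetical helper, and its 128 fallback is taken from the README example.

```python
# Value copied from tokenizer_config.json above.
MODEL_MAX_LENGTH = 1000000000000000019884624838656

# int(1e30) is not exactly 10**30 because 1e30 is a binary float.
print(MODEL_MAX_LENGTH == int(1e30))
print(MODEL_MAX_LENGTH == 10**30)


def effective_max_length(configured: int, fallback: int = 128) -> int:
    """Hypothetical helper: use `fallback` when no real limit is configured."""
    return fallback if configured >= int(1e30) else configured


print(effective_max_length(MODEL_MAX_LENGTH))
```

Relying on `truncation=True` alone would therefore not shorten inputs in any useful way; an explicit `max_length` is needed.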
training_config.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "model_name": "CAMeL-Lab/bert-base-arabic-camelbert-msa",
3
+ "hidden_size": 256,
4
+ "dropout": 0.1
5
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff