Upload folder using huggingface_hub

Files changed (7)

README.md +166 -0
pytorch_model.bin +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +59 -0
training_config.json +5 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,166 @@

+---
+language:
+- ar
+tags:
+- arabic
+- end-of-utterance
+- eou-detection
+- saudi-dialect
+- conversational-ai
+- turn-detection
+- camelbert
+base_model: CAMeL-Lab/bert-base-arabic-camelbert-msa
+license: mit
+---
+# Arabic End-of-Utterance Detection Model
+Fine-tuned CAMeLBERT model for detecting end-of-utterance in Arabic conversations, with emphasis on Saudi dialect.
+## Model Description
+This model detects when a speaker has finished a conversational turn in Arabic dialogue. It is tuned for Saudi dialect patterns and for real-time applications.
+### Model Details
+- **Base Model**: CAMeLBERT-MSA (CAMeL-Lab/bert-base-arabic-camelbert-msa)
+- **Task**: Binary classification (EOU vs. non-EOU)
+- **Language**: Arabic (Modern Standard Arabic + Saudi dialect)
+- **Parameters**: ~110M (base encoder) + classification head
+- **Training Data**: ~2,000 Arabic conversation samples
+### Intended Use
+- Real-time turn detection in conversational AI agents
+- Voice assistants for Arabic speakers
+- Dialogue systems
+- LiveKit agent integration
+## How to Use
+### Installation
+```bash
+pip install torch transformers
+```
+### Basic Usage
+```python
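+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer.
+# AutoModel returns only hidden states (no `.logits`), so a classification class is used here;
+# this assumes the checkpoint loads as a single-logit sequence classifier.
+model_name = "mahmoudsaalama/arabic-eou-camelbert"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)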
+# Prepare input
+text = "السلام عليكم ورحمة الله"
+inputs = tokenizer(text, return_tensors="pt", max_length=128, truncation=True)
+# Get prediction
+with torch.no_grad():
+    outputs = model(**inputs)
+    probability = torch.sigmoid(outputs.logits).item()
+    is_eou = probability > 0.5
+print(f"EOU Probability: {probability:.4f}")
+print(f"Is EOU: {is_eou}")
+```
+### Using the SDK
+For easier integration, use the Arabic EOU SDK:
+```bash
+pip install arabic-eou-sdk
+```
+```python
+from arabic_eou_sdk import ArabicEOUDetector
+detector = ArabicEOUDetector(model_name="mahmoudsaalama/arabic-eou-camelbert")
+result = detector.update_transcription("السلام عليكم", is_final=True)
+print(f"Is EOU: {result['is_eou']}")
+print(f"Probability: {result['probability']:.4f}")
+print(f"Confidence: {result['confidence']:.4f}")
+```
+## Training Details
+### Training Data
+- **Size**: ~2,000 samples (1,600 train, 200 val, 200 test)
+- **Balance**: 50% positive (EOU), 50% negative (non-EOU)
+- **Sources**: Synthetic Saudi Arabic conversations + public Arabic datasets
+### Training Procedure
+- **Optimizer**: AdamW
+- **Learning Rate**: 2e-5
+- **Batch Size**: 16
+- **Epochs**: 10 (with early stopping)
+- **Mixed Precision**: FP16
+- **Hardware**: GPU (CUDA)
+### Evaluation Metrics
+| Metric | Score |
+|--------|-------|
+| Accuracy | ~90% |
+| Precision | ~88% |
+| Recall | ~92% |
+| F1 Score | ~90% |
+| ROC AUC | ~95% |
+### Inference Speed
+| Configuration | Latency |
+|--------------|---------|
+| GPU (FP32) | ~15-20ms |
+| GPU (INT8) | ~8-12ms |
+| CPU (FP32) | ~60-80ms |
+| CPU (INT8) | ~25-35ms |
+## Limitations
+- **Dialectal Coverage**: Optimized for Saudi dialect; it may not generalize well to other Arabic dialects
+- **Synthetic Data**: Trained primarily on synthetic conversations
+- **Domain**: Limited to common conversational topics
+- **Dataset Size**: Relatively small training set
+## Bias and Fairness
+- Model may perform better on Saudi dialect than other Arabic dialects
+- Training data focuses on common conversational patterns
+- May not handle code-switching or mixed-language conversations well
+## Citation
+```bibtex
+@misc{arabic_eou_camelbert_2025,
+  author = {Mahmoud Saalama},
+  title = {Arabic End-of-Utterance Detection Model},
+  year = {2025},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/mahmoudsaalama/arabic-eou-camelbert}
+}
+```
+## License
+MIT License
+## Contact
+For questions or feedback:
+- GitHub: [arabic-eou-livekit](https://github.com/mahmoudsaalama/arabic-eou-livekit)
+- Hugging Face: [@mahmoudsaalama](https://huggingface.co/mahmoudsaalama)
+## Acknowledgments
+- Base model: [CAMeLBERT](https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa) by CAMeL Lab
+- Framework: [Transformers](https://huggingface.co/transformers) by Hugging Face
+- Integration: [LiveKit](https://livekit.io) for real-time applications
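
The evaluation table in the README reports precision, recall, and F1 as rounded values. As a quick sanity check, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall. The inputs are the approximate figures from the table, so the result is only as precise as they are.

```python
# Values copied from the README's evaluation table (approximate, as reported).
precision = 0.88
recall = 0.92

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 from reported precision/recall: {f1:.3f}")
```

The recomputed value rounds to 0.90, which agrees with the "~90%" F1 row.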

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4b2b9847f8f73b4d7eb05bc48b3eda0fdb38f3d5efd0046699eac02b3600d0e0
+size 437196367
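
The three lines above are a Git LFS pointer, not the weights themselves: each line is a key followed by a value. The sketch below parses the pointer and estimates the parameter count from the recorded size. It assumes the weights are stored as 32-bit floats and ignores serialization overhead. `parse_lfs_pointer` is a hypothetical helper written for this note, not part of any library.

```python
# Pointer text copied from the diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:4b2b9847f8f73b4d7eb05bc48b3eda0fdb38f3d5efd0046699eac02b3600d0e0
size 437196367
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(pointer_text)
size_bytes = int(pointer["size"])

BYTES_PER_FP32 = 4  # assumption: weights are stored as 32-bit floats
approx_params = size_bytes / BYTES_PER_FP32

print(f"{size_bytes / 1e6:.1f} MB, roughly {approx_params / 1e6:.1f}M parameters")
```

Under that assumption the file holds about 109M parameters, in line with the README's "~110M (base encoder) + classification head".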

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,59 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "full_tokenizer_file": null,
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
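
The `model_max_length` value in this config is not a real limit. It equals `int(1e30)`, the placeholder Transformers writes when a tokenizer has no configured maximum length. The tokenizer therefore will not cap inputs by default, which is presumably why the README snippet passes `max_length=128, truncation=True` explicitly. The sketch below checks for the placeholder using only the standard library. `NO_LIMIT_SENTINEL` is a name chosen for this note.

```python
import json

# Two fields copied from tokenizer_config.json above.
config_fragment = (
    '{"model_max_length": 1000000000000000019884624838656, '
    '"tokenizer_class": "BertTokenizer"}'
)
config = json.loads(config_fragment)

# Placeholder meaning "no limit configured" (hypothetical constant name).
NO_LIMIT_SENTINEL = int(1e30)

has_real_limit = config["model_max_length"] < NO_LIMIT_SENTINEL
print(f"Tokenizer has a configured length limit: {has_real_limit}")
```

Since no limit is configured here, callers need to pass `max_length` and `truncation=True` themselves to keep inputs within the encoder's position range.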

training_config.json ADDED

	@@ -0,0 +1,5 @@

+{
+  "model_name": "CAMeL-Lab/bert-base-arabic-camelbert-msa",
+  "hidden_size": 256,
+  "dropout": 0.1
+}
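
The commit does not show how `hidden_size` and `dropout` are used. One plausible reading, and only an assumption, is a small classification head on top of the 768-dimensional BERT-base output: a 768-to-256 linear layer, dropout of 0.1, then a 256-to-1 linear layer producing the EOU logit. The sketch below counts the parameters such a head would add. The layer layout and `linear_params` are hypothetical.

```python
import json

# Contents of training_config.json as shown in the diff above.
training_config = json.loads(
    '{"model_name": "CAMeL-Lab/bert-base-arabic-camelbert-msa", '
    '"hidden_size": 256, "dropout": 0.1}'
)

ENCODER_DIM = 768  # hidden width of a BERT-base encoder
head_hidden = training_config["hidden_size"]


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Assumed layout: Linear(768 -> 256), dropout, Linear(256 -> 1).
# Dropout adds no parameters.
head_params = linear_params(ENCODER_DIM, head_hidden) + linear_params(head_hidden, 1)

print(f"Assumed head adds {head_params:,} parameters")
```

If the head does look like this, it is tiny next to the encoder. It would also mean a plain `AutoModel` load omits it and returns hidden states with no `.logits`.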

vocab.txt ADDED

The diff for this file is too large to render.