---
language: ar
license: apache-2.0
tags:
- arabic
- saudi-arabic
- eou
- end-of-utterance
- conversational-ai
- livekit
- turn-detection
datasets:
- HossamEL-Dein/arabic-eou-dataset
base_model: aubmindlab/bert-base-arabertv02
---

# Arabic End-of-Utterance Detection Model

## Model Description

This model detects End-of-Utterance (EOU) in Arabic conversations, specifically optimized for Saudi dialects. It predicts the probability that a speaker has finished their conversational turn based on text transcription.

**Use Case**: Real-time conversational AI agents (voice assistants, chatbots, customer service)

## Performance

| Metric | Score |
|--------|-------|
| **Test Accuracy** | 99.6% |
| **Precision** | 100% |
| **Recall** | 99.45% |
| **F1 Score** | 99.73% |
| **AUC-ROC** | 99.96% |
| **Inference Time** | ~15-20 ms |
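
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values reported here. The variable names are illustrative, and a difference in the last digit relative to the reported F1 is expected because the inputs are themselves rounded.

```python
# Recompute F1 from the reported (rounded) precision and recall.
precision = 1.0    # 100% from the table
recall = 0.9945    # 99.45% from the table

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 from reported precision/recall: {f1:.2%}")
```

The result lands within about 0.01 percentage points of the reported 99.73%, so the three figures are mutually consistent up to rounding.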

## Training Data

- **Total samples**: 5,000
- **SADA22 (Real Saudi audio)**: 104 samples (2.1%)
- **Synthetic (Saudi patterns)**: 4,896 samples (97.9%)
- **Splits**: 80% train / 10% validation / 10% test
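
To make the split sizes concrete, here is a minimal sketch of an 80/10/10 split over 5,000 samples. It assumes a plain shuffled split with a fixed seed. The model card does not say how the split was actually produced (for example, whether it was stratified), so `split_indices` and its `seed` parameter are hypothetical.

```python
import random

def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle sample indices and cut them into train/validation/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # whatever remains is the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(5000)
print(len(train_idx), len(val_idx), len(test_idx))  # 4000 500 500
```

With these sizes, the test metrics in the Performance table would be computed on about 500 samples.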

## Quick Start

### Installation

```bash
pip install transformers torch
```

### Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer, then switch to inference mode
model = AutoModelForSequenceClassification.from_pretrained("HossamEL-Dein/arabic-eou-model")
tokenizer = AutoTokenizer.from_pretrained("HossamEL-Dein/arabic-eou-model")
model.eval()

# Predict EOU
text = "مرحبا كيف حالك اليوم"  # "Hello, how are you today"
# Truncate to the 128-token input limit listed under Model Architecture
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    eou_probability = probs[0][1].item()  # index 1 is the EOU class

print(f"EOU Probability: {eou_probability:.2%}")
# Example output: EOU Probability: 98.56%
```

### Integration with LiveKit

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


class EOUDetector:
    """Wraps the model so an agent can ask whether the user's turn has ended."""

    def __init__(self, threshold=0.7):
        self.model = AutoModelForSequenceClassification.from_pretrained("HossamEL-Dein/arabic-eou-model")
        self.tokenizer = AutoTokenizer.from_pretrained("HossamEL-Dein/arabic-eou-model")
        self.model.eval()
        self.threshold = threshold  # EOU probability above this counts as end of turn

    def check_eou(self, transcript_text):
        # Truncate to the 128-token input limit listed under Model Architecture
        inputs = self.tokenizer(
            transcript_text, return_tensors="pt", truncation=True, max_length=128
        )
        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)
            eou_prob = probs[0][1].item()  # index 1 is the EOU class

        return {
            'probability': eou_prob,
            'is_eou': eou_prob > self.threshold
        }


# Call this from your LiveKit agent's transcript handling (LiveKit wiring not shown)
detector = EOUDetector()
result = detector.check_eou("مرحبا كيف حالك")  # "Hello, how are you"
if result['is_eou']:
    print("User finished speaking!")
```

## Model Architecture

- **Base Model**: aubmindlab/bert-base-arabertv02
- **Task**: Binary sequence classification
- **Input**: Arabic text (up to 128 tokens)
- **Output**: 2-class probability distribution [Non-EOU, EOU]
- **Parameters**: 136M
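
The output layout above can be illustrated without loading the model. The sketch below applies a softmax to a pair of made-up logits and reads the EOU probability from index 1, following the class order listed above. The logit values and the `LABELS` constant are illustrative assumptions, not values produced by the model.

```python
import math

LABELS = ["Non-EOU", "EOU"]  # class order as listed in the model card

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 2.3]                 # made-up values for illustration
probs = softmax(example_logits)
eou_probability = probs[LABELS.index("EOU")]

for label, p in zip(LABELS, probs):
    print(f"{label}: {p:.2%}")
```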

## Training Details

- **Framework**: PyTorch + Transformers
- **Epochs**: 3
- **Batch Size**: 16
- **Learning Rate**: 2e-5
- **Optimizer**: AdamW
- **Training Time**: ~3 hours on T4 GPU
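
For a rough sense of scale, the hyperparameters above imply the following number of optimizer steps. This is a back-of-the-envelope sketch that assumes the 80% training split of the 5,000 samples, a single device, no gradient accumulation, and that the last partial batch is kept. None of these details are stated in the model card.

```python
import math

total_samples = 5000     # from Training Data
train_fraction = 0.8     # 80% train split
batch_size = 16
epochs = 3

train_samples = round(total_samples * train_fraction)
steps_per_epoch = math.ceil(train_samples / batch_size)  # keep the last partial batch
total_steps = steps_per_epoch * epochs

print(f"{train_samples} training samples")
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```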

## Intended Use

### Primary Use Cases

- ✅ Real-time voice assistants
- ✅ Arabic conversational AI
- ✅ Turn-taking detection in dialogues
- ✅ LiveKit agent integration

### Limitations

- Trained primarily on Saudi dialect patterns
- Requires text input (not raw audio), so a speech-to-text step must run first
- Best suited to short conversational context (about 5-10 seconds of speech)
- May need threshold tuning for specific use cases
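
Threshold tuning can be sketched as a sweep over candidate thresholds on a labelled validation set, keeping the one with the best F1 for the EOU class. The example below is one possible approach, not the author's procedure. The scores and labels are made up, and `f1_at_threshold` and `best_threshold` are hypothetical helpers. It uses the same strict `>` comparison as `EOUDetector.check_eou`.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 for the EOU class when predicting EOU if prob > threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    if tp == 0:
        return 0.0  # no true positives: treat F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, labels, candidates):
    """Return the candidate with the highest F1 (the first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))

# Made-up validation scores and labels (1 = EOU), for illustration only
val_probs = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
val_labels = [1, 1, 1, 0, 1, 0]
candidates = [0.3, 0.5, 0.7, 0.9]

chosen = best_threshold(val_probs, val_labels, candidates)
print(f"Best threshold: {chosen}")  # Best threshold: 0.5
```

In a voice agent, a higher threshold trades slower responses for fewer interruptions, so the metric to optimize may differ from F1 depending on the application.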

## Dataset

Training dataset available at: [HossamEL-Dein/arabic-eou-dataset](https://huggingface.co/datasets/HossamEL-Dein/arabic-eou-dataset)

## Citation

```bibtex
@misc{arabic-eou-2024,
  author = {HossamEL-Dein},
  title = {Arabic End-of-Utterance Detection Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/HossamEL-Dein/arabic-eou-model}
}
```

## License

Apache 2.0

## Contact

For questions or issues, please open an issue on the model repository.