# Arabic End-of-Utterance (EOU) Detection Model – Saudi Dialect

- **Base model:** UBC-NLP/marbertv2
- **Model type:** Sequence classification (binary: EOU vs CONTINUE)
- **Language:** Arabic (Saudi dialect focus)
- **Training data:** 30,000 examples (≈15k positive + 15k negative)
- **Framework:** Hugging Face Transformers
## Model Description
This model is a fine-tuned MarBERTv2 transformer for predicting End-of-Utterance (EOU) in Arabic conversations. It predicts whether a speaker has finished their turn based on transcription text, enabling real-time turn-taking in AI voice agents.
The model was trained specifically to integrate with LiveKit agents, which provide up to 6 previous turns to the EOU model. To leverage this, the training dataset includes sliding window context of up to 4 previous turns combined with the current utterance using the [SEP] token. This allows turn-aware predictions, rather than relying solely on the last utterance.
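The same sliding-window format can be reproduced at inference time: keep the most recent previous turns (up to four) and join them with the current utterance using `[SEP]`. A minimal sketch, assuming this joining convention (the function name is illustrative, not part of the released code):

```python
def build_eou_input(previous_turns: list[str], current: str, max_context: int = 4) -> str:
    """Join up to `max_context` most recent previous turns with the current
    utterance, mirroring the sliding-window format used in training."""
    window = previous_turns[-max_context:] if max_context > 0 else []
    return " [SEP] ".join(window + [current])
```

For example, `build_eou_input(["مرحبا كيف حالك"], "تمام الحمد لله.")` produces `"مرحبا كيف حالك [SEP] تمام الحمد لله."`, matching the format used in the inference example at the end of this card.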
## Training Data
- Source: Sada dataset (download here)
- Processed for EOU: Positive samples are full utterances; negative samples are incomplete prefixes (download here)
- Total samples: 30,000 (15k positive, 15k negative)
- Context: Up to 4 previous turns per example
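The positive/negative construction described above can be sketched as follows. This is an illustrative reconstruction, not the actual preprocessing script; the random cut-point policy for incomplete prefixes is an assumption:

```python
import random

def make_eou_examples(utterance: str, rng: random.Random) -> tuple[tuple[str, int], tuple[str, int]]:
    """From one full utterance, build one positive example (label 1 = EOU)
    and one negative example (label 0 = CONTINUE) from an incomplete prefix.
    Illustrative sketch of the preprocessing described above."""
    words = utterance.split()
    positive = (utterance, 1)
    # Cut so at least one word is kept and at least one is dropped (assumed policy)
    cut = rng.randint(1, len(words) - 1)
    negative = (" ".join(words[:cut]), 0)
    return positive, negative
```

Utterances with fewer than two words would need special handling (e.g. skipping the negative example), which this sketch omits.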
## Training Procedure

- Base model: UBC-NLP/marbertv2
- Fine-tuning task: Sequence classification (2 labels: EOU / CONTINUE)
- Labels:
  - 0: CONTINUE
  - 1: EOU
- Config updates:
  - `hidden_dropout_prob = 0.3`
  - `attention_probs_dropout_prob = 0.3`
- Training hyperparameters:
- Learning rate: 3e-5
- Weight decay: 0.05
- Batch size: 32
- Gradient accumulation: 4 steps
- FP16 mixed precision: True
- Epochs: 10
- Warmup ratio: 0.1
- Save strategy: per epoch, keep 1 best model
- Evaluation metric for best model: `eval_loss`
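The recipe above maps onto a Hugging Face `Trainer` setup roughly as follows. This is a sketch, not the released training script; dataset variables are placeholders, and the exact argument set is assumed from the hyperparameters listed:

```python
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

base = "UBC-NLP/marbertv2"
config = AutoConfig.from_pretrained(
    base,
    num_labels=2,                       # 0: CONTINUE, 1: EOU
    hidden_dropout_prob=0.3,
    attention_probs_dropout_prob=0.3,
)
model = AutoModelForSequenceClassification.from_pretrained(base, config=config)

args = TrainingArguments(
    output_dir="arabic-eou-model",
    learning_rate=3e-5,
    weight_decay=0.05,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,      # effective batch size 128
    fp16=True,
    num_train_epochs=10,
    warmup_ratio=0.1,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# train_ds / eval_ds are placeholders for the tokenized splits
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```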
## Evaluation Results
Validation metrics (last epoch):
- Loss: 0.2556
- Accuracy: 0.9011
- F1: 0.8956
- Precision: 0.9488
- Recall: 0.8480
Training metrics (last epoch):
- Loss: 0.2025
- Accuracy: 0.9158
- F1: 0.9111
- Precision: 0.9649
- Recall: 0.8630
Test set results:
- Loss: 0.2624
- Accuracy: 0.8958
- F1: 0.8900
- Precision: 0.9419
- Recall: 0.8436
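As a quick sanity check, the reported F1 scores are consistent with their precision/recall pairs, since F1 is the harmonic mean of the two:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9488, 0.8480), 4))  # validation -> 0.8956
print(round(f1(0.9649, 0.8630), 4))  # training   -> 0.9111
print(round(f1(0.9419, 0.8436), 4))  # test       -> 0.8900
```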
## Intended Uses

### Primary use-case
- LiveKit agents for real-time EOU detection
- Voice assistants
- Dialogue systems
- Turn-taking prediction in Arabic conversations
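In a real-time voice agent, one common way to consume the EOU probability is to map it onto an endpointing delay: high confidence that the turn ended triggers a fast response, while low confidence makes the agent wait for more speech. A minimal sketch; the function and default thresholds are hypothetical, not part of any LiveKit API:

```python
def endpointing_delay(eou_prob: float, min_delay: float = 0.2, max_delay: float = 1.5) -> float:
    """Seconds of silence to wait before treating the turn as finished.

    Linear interpolation: eou_prob = 1.0 -> min_delay, eou_prob = 0.0 -> max_delay.
    """
    eou_prob = min(max(eou_prob, 0.0), 1.0)  # clamp defensively
    return max_delay - eou_prob * (max_delay - min_delay)
```

The linear mapping is only one choice; a hard threshold or a learned schedule would slot in the same way.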
### Not intended for
- Speech-to-text training
- General language modeling
- Speaker diarization
## Limitations
- Primarily focused on Saudi dialect
- Negative prefixes may not capture all real-time partial ASR outputs
- Text-only model — does not process audio
- Performance may vary on other Arabic dialects
## Inference Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "salmamohammedhamed22/arabic-eou-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example input: a previous turn and the current utterance joined with [SEP]
text = "مرحبا كيف حالك [SEP] تمام الحمد لله."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
print(f"EOU probability: {probs[0][1].item():.3f}")  # probability of label 1 (end-of-utterance)
```