# Arabic End-of-Utterance (EOU) Detection Model – Saudi Dialect

- **Base model:** UBC-NLP/marbertv2
- **Model type:** Sequence classification (binary: EOU vs CONTINUE)
- **Language:** Arabic (Saudi dialect focus)
- **Training data:** 30,000 examples (≈15k positive + 15k negative)
- **Framework:** Hugging Face Transformers
## Model Description
This model is a fine-tuned MarBERTv2 transformer for predicting End-of-Utterance (EOU) in Arabic conversations. It predicts whether a speaker has finished their turn based on transcription text, enabling real-time turn-taking in AI voice agents.
The model was trained specifically to integrate with LiveKit agents, which provide up to 6 previous turns to the EOU model. To leverage this, the training dataset includes sliding window context of up to 4 previous turns combined with the current utterance using the [SEP] token. This allows turn-aware predictions, rather than relying solely on the last utterance.
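The same sliding-window format can be reproduced at inference time: keep the most recent previous turns (up to four) and join them with the current utterance using `[SEP]`. A minimal sketch, assuming this joining convention (the function name is illustrative, not part of the released code):

```python
def build_eou_input(previous_turns: list[str], current: str, max_context: int = 4) -> str:
    """Join up to `max_context` most recent previous turns with the current
    utterance, mirroring the sliding-window format used in training."""
    window = previous_turns[-max_context:] if max_context > 0 else []
    return " [SEP] ".join(window + [current])
```

For example, `build_eou_input(["مرحبا كيف حالك"], "تمام الحمد لله.")` produces `"مرحبا كيف حالك [SEP] تمام الحمد لله."`, matching the format used in the inference example at the end of this card.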
## Training Data
- Source: Sada dataset (download here)
- Processed for EOU: Positive samples are full utterances; negative samples are incomplete prefixes (download here)
- Total samples: 30,000 (15k positive, 15k negative)
- Context: Up to 4 previous turns per example
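The positive/negative construction described above can be sketched as follows. This is an illustrative reconstruction, not the actual preprocessing script; the random cut-point policy for incomplete prefixes is an assumption:

```python
import random

def make_eou_examples(utterance: str, rng: random.Random) -> tuple[tuple[str, int], tuple[str, int]]:
    """From one full utterance, build one positive example (label 1 = EOU)
    and one negative example (label 0 = CONTINUE) from an incomplete prefix.
    Illustrative sketch of the preprocessing described above."""
    words = utterance.split()
    positive = (utterance, 1)
    # Cut so at least one word is kept and at least one is dropped (assumed policy)
    cut = rng.randint(1, len(words) - 1)
    negative = (" ".join(words[:cut]), 0)
    return positive, negative
```

Utterances with fewer than two words would need special handling (e.g. skipping the negative example), which this sketch omits.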
## Training Procedure

- Base model: UBC-NLP/marbertv2
- Fine-tuning task: Sequence classification (2 labels: EOU / CONTINUE)
- Labels:
  - 0: CONTINUE
  - 1: EOU
- Config updates:
  - `hidden_dropout_prob = 0.3`
  - `attention_probs_dropout_prob = 0.3`
- Training hyperparameters:
- Learning rate: 3e-5
- Weight decay: 0.05
- Batch size: 32
- Gradient accumulation: 4 steps
- FP16 mixed precision: True
- Epochs: 10
- Warmup ratio: 0.1
- Save strategy: per epoch, keep 1 best model
- Evaluation metric for best model: `eval_loss`
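The recipe above maps onto a Hugging Face `Trainer` setup roughly as follows. This is a sketch, not the released training script; dataset variables are placeholders, and the exact argument set is assumed from the hyperparameters listed:

```python
from transformers import (
    AutoConfig,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

base = "UBC-NLP/marbertv2"
config = AutoConfig.from_pretrained(
    base,
    num_labels=2,                       # 0: CONTINUE, 1: EOU
    hidden_dropout_prob=0.3,
    attention_probs_dropout_prob=0.3,
)
model = AutoModelForSequenceClassification.from_pretrained(base, config=config)

args = TrainingArguments(
    output_dir="arabic-eou-model",
    learning_rate=3e-5,
    weight_decay=0.05,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,      # effective batch size 128
    fp16=True,
    num_train_epochs=10,
    warmup_ratio=0.1,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# train_ds / eval_ds are placeholders for the tokenized splits
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```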
## Evaluation Results
Validation metrics (last epoch):
- Loss: 0.2556
- Accuracy: 0.9011
- F1: 0.8956
- Precision: 0.9488
- Recall: 0.8480
Training metrics (last epoch):
- Loss: 0.2025
- Accuracy: 0.9158
- F1: 0.9111
- Precision: 0.9649
- Recall: 0.8630
Test set results:
- Loss: 0.2624
- Accuracy: 0.8958
- F1: 0.8900
- Precision: 0.9419
- Recall: 0.8436
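As a quick sanity check, the reported F1 scores are consistent with their precision/recall pairs, since F1 is the harmonic mean of the two:

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9488, 0.8480), 4))  # validation -> 0.8956
print(round(f1(0.9649, 0.8630), 4))  # training   -> 0.9111
print(round(f1(0.9419, 0.8436), 4))  # test       -> 0.8900
```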
## Intended Uses

### Primary use-case
- LiveKit agents for real-time EOU detection
- Voice assistants
- Dialogue systems
- Turn-taking prediction in Arabic conversations
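In a real-time voice agent, one common way to consume the EOU probability is to map it onto an endpointing delay: high confidence that the turn ended triggers a fast response, while low confidence makes the agent wait for more speech. A minimal sketch; the function and default thresholds are hypothetical, not part of any LiveKit API:

```python
def endpointing_delay(eou_prob: float, min_delay: float = 0.2, max_delay: float = 1.5) -> float:
    """Seconds of silence to wait before treating the turn as finished.

    Linear interpolation: eou_prob = 1.0 -> min_delay, eou_prob = 0.0 -> max_delay.
    """
    eou_prob = min(max(eou_prob, 0.0), 1.0)  # clamp defensively
    return max_delay - eou_prob * (max_delay - min_delay)
```

The linear mapping is only one choice; a hard threshold or a learned schedule would slot in the same way.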
### Not intended for
- Speech-to-text training
- General language modeling
- Speaker diarization
## Limitations
- Primarily focused on Saudi dialect
- Negative prefixes may not capture all real-time partial ASR outputs
- Text-only model — does not process audio
- Performance may vary on other Arabic dialects
## Inference Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "salmamohammedhamed22/arabic-eou-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example input: a previous turn and the current utterance joined with [SEP]
text = "مرحبا كيف حالك [SEP] تمام الحمد لله."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
print(f"EOU probability: {probs[0][1].item():.3f}")  # probability of label 1 (end-of-utterance)
```