Arabic End-of-Utterance (EOU) Detection Model – Saudi Dialect

  • Base model: UBC-NLP/marbertv2
  • Model type: Sequence classification (binary: EOU vs CONTINUE)
  • Language: Arabic (Saudi dialect focus)
  • Trained on: 30,000 examples (≈15k positive, ≈15k negative)
  • Framework: Hugging Face Transformers


Model Description

This model is a fine-tuned MarBERTv2 transformer for predicting End-of-Utterance (EOU) in Arabic conversations. It predicts whether a speaker has finished their turn based on transcription text, enabling real-time turn-taking in AI voice agents.

The model was trained specifically to integrate with LiveKit agents, which provide up to 6 previous turns to the EOU model. To leverage this, each training example combines a sliding-window context of up to 4 previous turns with the current utterance, joined by the [SEP] token. This enables turn-aware predictions rather than relying solely on the last utterance.
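The sliding-window context described above might be assembled as follows (the helper name and turn format are illustrative, not part of the released model):

```python
# Sketch: building the model's input string from a conversation history.
# The 4-turn window and "[SEP]" joining follow the description above;
# the function name is hypothetical.

def build_eou_input(turns, max_context=4):
    """Join up to `max_context` previous turns with the current
    utterance using the [SEP] token."""
    context = turns[:-1][-max_context:]  # at most 4 previous turns
    return " [SEP] ".join(context + [turns[-1]])

turns = ["مرحبا كيف حالك", "تمام الحمد لله", "وش أخبارك اليوم"]
print(build_eou_input(turns))
# -> "مرحبا كيف حالك [SEP] تمام الحمد لله [SEP] وش أخبارك اليوم"
```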


Training Data

  • Source: Sada dataset
  • Processed for EOU: Positive samples are full utterances; negative samples are incomplete prefixes
  • Total samples: 30,000 (15k positive, 15k negative)
  • Context: Up to 4 previous turns per example
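The positive/negative pairing described above could be derived roughly as follows. This is a sketch: the exact prefix-sampling policy is an assumption, not the card's recipe.

```python
# Sketch: deriving EOU training pairs from a full utterance.
# Positive = full utterance (label 1: EOU); negative = incomplete
# prefix (label 0: CONTINUE). The random cut point is illustrative.
import random

def make_samples(utterance, seed=0):
    rng = random.Random(seed)
    words = utterance.split()
    positive = (utterance, 1)                  # complete turn -> EOU
    if len(words) > 1:
        cut = rng.randint(1, len(words) - 1)   # drop at least one word
        negative = (" ".join(words[:cut]), 0)  # truncated turn -> CONTINUE
        return [positive, negative]
    return [positive]

samples = make_samples("تمام الحمد لله وش أخبارك")
```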

Training Procedure

  • Base model: UBC-NLP/marbertv2
  • Fine-tuning task: Sequence classification (2 labels: EOU / CONTINUE)
  • Labels:
    • 0: CONTINUE
    • 1: EOU
  • Config updates:
    • hidden_dropout_prob = 0.3
    • attention_probs_dropout_prob = 0.3
  • Training hyperparameters:
    • Learning rate: 3e-5
    • Weight decay: 0.05
    • Batch size: 32
    • Gradient accumulation: 4 steps
    • FP16 mixed precision: True
    • Epochs: 10
    • Warmup ratio: 0.1
    • Save strategy: per epoch, keep 1 best model
    • Evaluation metric for best model: eval_loss
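The hyperparameters above map onto a Hugging Face TrainingArguments configuration roughly as follows. This is a sketch: output_dir is illustrative, the exact argument set used for training is not published, and fp16=True requires a CUDA device.

```python
# Sketch: TrainingArguments mirroring the hyperparameters listed above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="arabic-eou-model",     # illustrative path
    learning_rate=3e-5,
    weight_decay=0.05,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,     # effective batch size 128
    fp16=True,                         # requires a CUDA device
    num_train_epochs=10,
    warmup_ratio=0.1,
    save_strategy="epoch",
    save_total_limit=1,
    evaluation_strategy="epoch",       # `eval_strategy` in newer transformers
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```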

Evaluation Results

Validation metrics (last epoch):

  • Loss: 0.2556
  • Accuracy: 0.9011
  • F1: 0.8956
  • Precision: 0.9488
  • Recall: 0.8480

Training metrics (last epoch):

  • Loss: 0.2025
  • Accuracy: 0.9158
  • F1: 0.9111
  • Precision: 0.9649
  • Recall: 0.8630

Test set results:

  • Loss: 0.2624
  • Accuracy: 0.8958
  • F1: 0.8900
  • Precision: 0.9419
  • Recall: 0.8436

Intended Uses

Primary use-case

  • LiveKit agents for real-time EOU detection
  • Voice assistants
  • Dialogue systems
  • Turn-taking prediction in Arabic conversations

Not intended for

  • Speech-to-text training
  • General language modeling
  • Speaker diarization

Limitations

  • Primarily focused on Saudi dialect
  • Negative prefixes may not capture all real-time partial ASR outputs
  • Text-only model — does not process audio
  • Performance may vary on other Arabic dialects

Inference Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "salmamohammedhamed22/arabic-eou-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

# Example input: previous turn and current utterance joined with [SEP]
text = "مرحبا كيف حالك [SEP] تمام الحمد لله."

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
print(f"EOU probability: {probs[0][1].item():.3f}")  # label 1 = EOU
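In a streaming agent, the EOU probability from the snippet above would typically be debounced before the agent takes the turn. A minimal sketch, with the caveat that the threshold and window size are illustrative assumptions, not values from LiveKit or this model:

```python
# Sketch: debouncing EOU predictions in a streaming setting.
# Requiring two consecutive confident predictions before ending the
# turn reduces premature cut-offs on partial ASR output.

def should_end_turn(recent_probs, threshold=0.85, consecutive=2):
    """True once the last `consecutive` EOU probabilities all
    exceed `threshold`."""
    if len(recent_probs) < consecutive:
        return False
    return all(p >= threshold for p in recent_probs[-consecutive:])

print(should_end_turn([0.90, 0.92]))  # -> True
print(should_end_turn([0.90, 0.30]))  # -> False
```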
Model Size

  • ~0.2B parameters (F32, Safetensors)