# Qwen2.5-1.5B-Instruct – End-of-Utterance (EOU) Detection

## Overview
This repository documents the fine-tuning and integration of Qwen/Qwen2.5-1.5B-Instruct for the task of End-of-Utterance (EOU) detection in conversational text. The model is adapted to determine whether a given user utterance represents the end of a conversational turn (1) or whether the speaker is likely to continue (0).
The work is part of a larger AI Voice Agent system, where accurate turn-boundary detection is critical for natural, low-latency human–machine interaction, especially in streaming speech-to-text (STT) scenarios and Saudi/Arabic dialogue.
This README focuses on:
- Model fine-tuning methodology
- Evaluation approach and metrics
- Project and system architecture
- Practical integration into a full voice agent pipeline
## Task Definition: End-of-Utterance Detection
EOU detection is framed as a binary decision task:
- `0`: The utterance is incomplete; the speaker is likely to continue
- `1`: The utterance represents the end of a conversational turn
Although conceptually a classification problem, the task is implemented using a causal language modeling (instruction-following) paradigm, where the model is prompted to generate a single token (0 or 1) as its response.
This design choice aligns with:
- Instruction-tuned LLM behavior
- Streaming dialogue systems
- Unified handling of Arabic and English text
## Model Selection Rationale

### Base Model: `Qwen/Qwen2.5-1.5B-Instruct`
The Qwen2.5-1.5B-Instruct model was selected due to:
- Strong multilingual and instruction-following capabilities
- Good performance on short-context reasoning tasks
- Suitability for parameter-efficient fine-tuning
- Better stability compared to smaller instruction-tuned models (e.g., SmolLM-Instruct), which produced weak and inconsistent EOU predictions in preliminary experiments
### Fine-Tuning Strategy
- Approach: Parameter-Efficient Fine-Tuning (PEFT) using LoRA
- Objective: Preserve the base model's general language understanding while adapting it to conversational turn-boundary detection
- Output Format: Single-token generation (`0` or `1`)
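The LoRA setup can be sketched as follows. This is an illustrative configuration only: the rank, alpha, dropout, and target modules shown here are assumptions, not the project's recorded hyperparameters.

```python
# Illustrative LoRA configuration; these hyperparameter values are
# assumptions, not the settings actually used in this project.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

lora_config = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```

Because only the LoRA adapter weights are updated, the base model's multilingual competence is largely preserved.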
## Model Fine-Tuning

### Training Objective
The model is fine-tuned to respond to a structured instruction prompt, such as:
"You are an end-of-utterance detection assistant. Reply with only 0 or 1."
Given a user utterance, the model learns to generate the correct binary output based on conversational completeness.
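A training example can be pictured as a single ChatML-style string, as used by Qwen's chat template. The exact serialization below is an assumption based on the Qwen2.5 template format, and the sample utterance is hypothetical:

```python
# Hypothetical sketch: serializing one (utterance, label) pair in the
# ChatML-style format that Qwen's chat template produces (assumed layout).
SYSTEM = "You are an end-of-utterance detection assistant. Reply with only 0 or 1."

def format_example(utterance: str, label: int) -> str:
    """Render a training example as a ChatML string with the label as the assistant reply."""
    return (
        f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n{utterance}<|im_end|>\n"
        f"<|im_start|>assistant\n{label}<|im_end|>\n"
    )

example = format_example("I was thinking that maybe we could", 0)
print(example)
```

In practice, `tokenizer.apply_chat_template` handles this serialization, so hand-written template strings are only needed for inspection.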
### Training Environment
- Fine-tuning was performed in a Google Colab notebook
- Mixed-precision training (FP16) was used
- Gradient accumulation was applied to support effective batch sizes under memory constraints
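The combination of FP16 and gradient accumulation maps naturally onto `transformers.TrainingArguments`. The values below are illustrative assumptions, not the project's recorded settings:

```python
# Illustrative training configuration; batch size, learning rate, and
# epoch count here are assumptions, not the project's actual settings.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-eou",
    per_device_train_batch_size=2,   # small batch fits Colab GPU memory
    gradient_accumulation_steps=8,   # effective batch size of 16
    fp16=True,                       # mixed-precision training
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=50,
)
```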
### Data
The fine-tuning dataset consists of conversational utterances annotated for EOU detection. The dataset includes:
- Arabic (with emphasis on Saudi dialect)
- Customer-service style conversations
- STT-like fragmented and overlapping utterances
(Full dataset preparation and annotation methodology is documented separately.)
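To make the annotation scheme concrete, a record might look like the JSONL rows below. The field names and example utterances are hypothetical; the real layout is part of the separately documented dataset preparation:

```python
import json

# Hypothetical record layout for the EOU dataset; field names and
# utterances are illustrative, not taken from the actual files.
rows = [
    '{"text": "اليوم حاولت أتصل فيكم بس", "label": 0}',  # incomplete (Saudi dialect)
    '{"text": "شكراً، هذا كل شي أحتاجه", "label": 1}',    # complete turn
    '{"text": "I wanted to ask about my", "label": 0}',    # STT-style fragment
]

dataset = [json.loads(r) for r in rows]
labels = [d["label"] for d in dataset]
print(labels)  # [0, 1, 0]
```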
## Evaluation Methodology

### Evaluation Setup
- The dataset was split into training and evaluation subsets
- The model was evaluated on unseen utterances
- Predictions were generated deterministically (no sampling)
### Metrics
Standard binary classification metrics were used:
- Accuracy
- Precision
- Recall
- F1-score
In addition, confusion matrix analysis was used to understand:
- False EOU predictions (premature cut-offs)
- Missed EOU predictions (latency and overlap issues)
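These metrics follow directly from the confusion-matrix counts. A minimal sketch, with illustrative (not actual) predictions:

```python
# Minimal metric computation from a confusion matrix; the labels and
# predictions below are illustrative, not real evaluation outputs.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correct EOU
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # premature cut-off
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed EOU (latency)
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Note that false positives (`fp`) correspond to premature cut-offs and false negatives (`fn`) to latency/overlap issues, which is why both are tracked separately.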
This evaluation strategy reflects real-world deployment concerns in voice agents and streaming dialogue systems.
### Interpretation
Rather than optimizing for raw accuracy alone, emphasis was placed on:
- Robustness to incomplete and noisy utterances
- Stability across short and long turns
- Sensible behavior under STT-style continuous input
## Full Project Architecture
This model is part of a larger AI Voice Agent system. The overall project structure is shown below:
```
AI-Voice-Agent/
│
├── data/
│   └── Dataset Preparation
│       └── (filtering, cleaning, annotation, EOU generation)
│
├── EOU_turn_detection/
│   └── detector.py
│       ├── Loads the fine-tuned Qwen2.5 EOU model
│       └── Exposes a simple EOU prediction interface
│
├── Qwen2.5-1.5B_arabic_EOU/
│   └── Fine-tuned model weights and tokenizer
│
├── Qwen_Qwen2.5-1.5B-Instruct_EOU.ipynb
│   └── Google Colab notebook used for fine-tuning and evaluation
│
├── voice_agent/
│   ├── LiveKit-based voice agent
│   ├── Real-time audio streaming
│   ├── STT integration
│   └── Uses EOU predictions to manage turn-taking
│
├── .env
│   └── Environment variables and API keys
│
├── requirements.txt
│   └── Python dependencies
│
└── README.md
```
This structure ensures modularity, reproducibility, and clean separation between model training, inference, and real-time voice interaction logic.
## Integration into the Voice Agent
Within the voice agent, the EOU model plays a central role in turn management:
1. User speech is streamed via LiveKit
2. Speech-to-Text (STT) produces partial and overlapping transcripts
3. Each transcript chunk is passed to the EOU detector (`detector.py`)
4. The model predicts whether the current utterance is complete
5. When `EOU = 1`:
   - The agent stops listening
   - The utterance is forwarded to the LLM / dialogue manager
6. When `EOU = 0`:
   - The agent continues listening and accumulating text
This approach enables:
- Natural turn-taking
- Reduced interruption
- Lower response latency
- Robust handling of streaming and noisy STT output
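The accumulate-until-EOU loop can be sketched as follows. The predictor here is a hypothetical punctuation-based stub standing in for the fine-tuned model, and `handle_stream` is an illustrative helper, not the actual `detector.py` interface:

```python
# Sketch of the turn-taking loop with a stubbed EOU predictor; a real
# deployment would call the fine-tuned Qwen2.5 EOU model instead.
def predict_eou_stub(text: str) -> str:
    # Hypothetical stand-in: treat utterances ending in terminal
    # punctuation as complete turns.
    return "1" if text.rstrip().endswith(("?", ".", "!")) else "0"

def handle_stream(transcript_chunks):
    """Accumulate STT chunks until the EOU predictor signals a complete turn."""
    buffer = ""
    turns = []
    for chunk in transcript_chunks:
        buffer = (buffer + " " + chunk).strip()
        if predict_eou_stub(buffer) == "1":
            turns.append(buffer)   # forward to the LLM / dialogue manager
            buffer = ""            # start listening for the next turn
    return turns

chunks = ["Hello, I wanted", "to ask about my order.", "Is it", "shipped yet?"]
turns = handle_stream(chunks)
print(turns)
```

The key design point is that partial transcripts are never forwarded on their own; they accumulate in a buffer until the detector declares the turn complete.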
## How to Use the Model (Inference)
Below is a minimal inference example. It demonstrates how to load the fine-tuned Qwen2.5-EOU-Detection model and generate a binary End-of-Utterance prediction.
### Python Inference Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Yahia-123/Qwen2.5-1.5B-Instruct_EOU"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

SYSTEM_PROMPT = "You are an end-of-utterance detection assistant. Reply with only 0 or 1."

def predict_eou(text):
    # Build the prompt using the official Qwen chat template
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,  # deterministic decoding
    )
    # Decode only the newly generated token
    return tokenizer.decode(
        output[0][inputs.input_ids.shape[-1]:],
        skip_special_tokens=True,
    ).strip()

# Example
text = "Hello, how are you?"
print(predict_eou(text))  # "0" or "1"
```
### Notes
- The model is prompted using the official Qwen chat template
- The output is a single token representing the EOU decision
- Deterministic decoding is used for stable behavior in production systems
## Limitations
- The model outputs a binary decision and does not represent uncertainty
- Performance may degrade on domains far from conversational dialogue
- Highly code-switched or extremely short utterances remain challenging
These limitations are acceptable within the current voice-agent design and can be addressed in future iterations.
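The missing uncertainty estimate could in principle be approximated without retraining: instead of decoding a hard token, compare the model's logits for the `0` and `1` tokens at the first generated position. The logit values below are illustrative; in practice they would come from the model's output logits:

```python
import math

# Sketch of recovering a confidence score from the model's logits for
# the "0" and "1" tokens; the logit values here are illustrative only.
logit_0, logit_1 = 1.2, 2.9  # hypothetical logits for tokens "0" and "1"

# Softmax restricted to the two candidate tokens
p_eou = math.exp(logit_1) / (math.exp(logit_0) + math.exp(logit_1))
print(p_eou)  # probability that the utterance is a complete turn
```

Thresholding `p_eou` (rather than taking a hard 0/1 decision) would let the voice agent trade off premature cut-offs against latency.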
## Summary

This project demonstrates how a modern instruction-tuned LLM can be adapted, using parameter-efficient fine-tuning, for real-time conversational turn detection. By integrating the fine-tuned Qwen2.5-1.5B-Instruct model into a live voice agent, the system achieves more natural, human-like dialogue flow, particularly in Arabic and Saudi dialect scenarios.