# Qwen2.5-1.5B-Instruct – End-of-Utterance (EOU) Detection

## Overview
This repository documents the fine-tuning and integration of Qwen/Qwen2.5-1.5B-Instruct for the task of End-of-Utterance (EOU) detection in conversational text. The model is adapted to determine whether a given user utterance represents the end of a conversational turn (1) or whether the speaker is likely to continue (0).
The work is part of a larger AI Voice Agent system, where accurate turn-boundary detection is critical for natural, low-latency human–machine interaction, especially in streaming speech-to-text (STT) scenarios and Saudi/Arabic dialogue.
This README focuses on:
- Model fine-tuning methodology
- Evaluation approach and metrics
- Project and system architecture
- Practical integration into a full voice agent pipeline
## Task Definition: End-of-Utterance Detection
EOU detection is framed as a binary decision task:
- `0`: The utterance is incomplete; the speaker is likely to continue
- `1`: The utterance represents the end of a conversational turn
Although conceptually a classification problem, the task is implemented using a causal language modeling (instruction-following) paradigm, where the model is prompted to generate a single token (0 or 1) as its response.
This design choice aligns with:
- Instruction-tuned LLM behavior
- Streaming dialogue systems
- Unified handling of Arabic and English text
## Model Selection Rationale

### Base Model: `Qwen/Qwen2.5-1.5B-Instruct`
The Qwen2.5-1.5B-Instruct model was selected due to:
- Strong multilingual and instruction-following capabilities
- Good performance on short-context reasoning tasks
- Suitability for parameter-efficient fine-tuning
- Better stability compared to smaller instruction-tuned models (e.g., SmolLM-Instruct), which produced weak and inconsistent EOU predictions in preliminary experiments
### Fine-Tuning Strategy
- Approach: Parameter-Efficient Fine-Tuning (PEFT) using LoRA
- Objective: Preserve the base model's general language understanding while adapting it to conversational turn-boundary detection
- Output Format: Single-token generation (`0` or `1`)
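The LoRA setup can be sketched as follows. This is an illustrative configuration only: the rank, alpha, dropout, and target modules shown here are assumptions, not the project's recorded hyperparameters.

```python
# Illustrative LoRA configuration; these hyperparameter values are
# assumptions, not the settings actually used in this project.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

lora_config = LoraConfig(
    r=16,                      # low-rank dimension
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```

Because only the LoRA adapter weights are updated, the base model's multilingual competence is largely preserved.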
## Model Fine-Tuning

### Training Objective
The model is fine-tuned to respond to a structured instruction prompt, such as:
"You are an end-of-utterance detection assistant. Reply with only 0 or 1."
Given a user utterance, the model learns to generate the correct binary output based on conversational completeness.
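A training example can be pictured as a single ChatML-style string, as used by Qwen's chat template. The exact serialization below is an assumption based on the Qwen2.5 template format, and the sample utterance is hypothetical:

```python
# Hypothetical sketch: serializing one (utterance, label) pair in the
# ChatML-style format that Qwen's chat template produces (assumed layout).
SYSTEM = "You are an end-of-utterance detection assistant. Reply with only 0 or 1."

def format_example(utterance: str, label: int) -> str:
    """Render a training example as a ChatML string with the label as the assistant reply."""
    return (
        f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n{utterance}<|im_end|>\n"
        f"<|im_start|>assistant\n{label}<|im_end|>\n"
    )

example = format_example("I was thinking that maybe we could", 0)
print(example)
```

In practice, `tokenizer.apply_chat_template` handles this serialization, so hand-written template strings are only needed for inspection.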
### Training Environment
- Fine-tuning was performed in a Google Colab notebook
- Mixed-precision training (FP16) was used
- Gradient accumulation was applied to support effective batch sizes under memory constraints
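The combination of FP16 and gradient accumulation maps naturally onto `transformers.TrainingArguments`. The values below are illustrative assumptions, not the project's recorded settings:

```python
# Illustrative training configuration; batch size, learning rate, and
# epoch count here are assumptions, not the project's actual settings.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2.5-eou",
    per_device_train_batch_size=2,   # small batch fits Colab GPU memory
    gradient_accumulation_steps=8,   # effective batch size of 16
    fp16=True,                       # mixed-precision training
    learning_rate=2e-4,
    num_train_epochs=3,
    logging_steps=50,
)
```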
### Data
The fine-tuning dataset consists of conversational utterances annotated for EOU detection. The dataset includes:
- Arabic (with emphasis on Saudi dialect)
- Customer-service style conversations
- STT-like fragmented and overlapping utterances
(Full dataset preparation and annotation methodology is documented separately.)
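To make the annotation scheme concrete, a record might look like the JSONL rows below. The field names and example utterances are hypothetical; the real layout is part of the separately documented dataset preparation:

```python
import json

# Hypothetical record layout for the EOU dataset; field names and
# utterances are illustrative, not taken from the actual files.
rows = [
    '{"text": "اليوم حاولت أتصل فيكم بس", "label": 0}',  # incomplete (Saudi dialect)
    '{"text": "شكراً، هذا كل شي أحتاجه", "label": 1}',    # complete turn
    '{"text": "I wanted to ask about my", "label": 0}',    # STT-style fragment
]

dataset = [json.loads(r) for r in rows]
labels = [d["label"] for d in dataset]
print(labels)  # [0, 1, 0]
```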
## Evaluation Methodology

### Evaluation Setup
- The dataset was split into training and evaluation subsets
- The model was evaluated on unseen utterances
- Predictions were generated deterministically (no sampling)
### Metrics
Standard binary classification metrics were used:
- Accuracy
- Precision
- Recall
- F1-score
In addition, confusion matrix analysis was used to understand:
- False EOU predictions (premature cut-offs)
- Missed EOU predictions (latency and overlap issues)
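These metrics follow directly from the confusion-matrix counts. A minimal sketch, with illustrative (not actual) predictions:

```python
# Minimal metric computation from a confusion matrix; the labels and
# predictions below are illustrative, not real evaluation outputs.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # correct EOU
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # premature cut-off
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed EOU (latency)
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Note that false positives (`fp`) correspond to premature cut-offs and false negatives (`fn`) to latency/overlap issues, which is why both are tracked separately.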
This evaluation strategy reflects real-world deployment concerns in voice agents and streaming dialogue systems.
### Interpretation
Rather than optimizing for raw accuracy alone, emphasis was placed on:
- Robustness to incomplete and noisy utterances
- Stability across short and long turns
- Sensible behavior under STT-style continuous input
## Full Project Architecture
This model is part of a larger AI Voice Agent system. The overall project structure is shown below:
```
AI-Voice-Agent/
│
├── data/
│   └── Dataset Preparation
│       └── (filtering, cleaning, annotation, EOU generation)
│
├── EOU_turn_detection/
│   └── detector.py
│       ├── Loads the fine-tuned Qwen2.5 EOU model
│       └── Exposes a simple EOU prediction interface
│
├── Qwen2.5-1.5B_arabic_EOU/
│   └── Fine-tuned model weights and tokenizer
│
├── Qwen_Qwen2.5-1.5B-Instruct_EOU.ipynb
│   └── Google Colab notebook used for fine-tuning and evaluation
│
├── voice_agent/
│   ├── LiveKit-based voice agent
│   ├── Real-time audio streaming
│   ├── STT integration
│   └── Uses EOU predictions to manage turn-taking
│
├── .env
│   └── Environment variables and API keys
│
├── requirements.txt
│   └── Python dependencies
│
└── README.md
```
This structure ensures modularity, reproducibility, and clean separation between model training, inference, and real-time voice interaction logic.
## Integration into the Voice Agent
Within the voice agent, the EOU model plays a central role in turn management:
1. User speech is streamed via LiveKit
2. Speech-to-Text (STT) produces partial and overlapping transcripts
3. Each transcript chunk is passed to the EOU detector (`detector.py`)
4. The model predicts whether the current utterance is complete
5. When `EOU = 1`:
   - The agent stops listening
   - The utterance is forwarded to the LLM / dialogue manager
6. When `EOU = 0`:
   - The agent continues listening and accumulating text
This approach enables:
- Natural turn-taking
- Reduced interruption
- Lower response latency
- Robust handling of streaming and noisy STT output
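The accumulate-until-EOU loop can be sketched as follows. The predictor here is a hypothetical punctuation-based stub standing in for the fine-tuned model, and `handle_stream` is an illustrative helper, not the actual `detector.py` interface:

```python
# Sketch of the turn-taking loop with a stubbed EOU predictor; a real
# deployment would call the fine-tuned Qwen2.5 EOU model instead.
def predict_eou_stub(text: str) -> str:
    # Hypothetical stand-in: treat utterances ending in terminal
    # punctuation as complete turns.
    return "1" if text.rstrip().endswith(("?", ".", "!")) else "0"

def handle_stream(transcript_chunks):
    """Accumulate STT chunks until the EOU predictor signals a complete turn."""
    buffer = ""
    turns = []
    for chunk in transcript_chunks:
        buffer = (buffer + " " + chunk).strip()
        if predict_eou_stub(buffer) == "1":
            turns.append(buffer)   # forward to the LLM / dialogue manager
            buffer = ""            # start listening for the next turn
    return turns

chunks = ["Hello, I wanted", "to ask about my order.", "Is it", "shipped yet?"]
turns = handle_stream(chunks)
print(turns)
```

The key design point is that partial transcripts are never forwarded on their own; they accumulate in a buffer until the detector declares the turn complete.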
## How to Use the Model (Inference)
Below is a minimal inference example. It demonstrates how to load the fine-tuned Qwen2.5-EOU-Detection model and generate a binary End-of-Utterance prediction.
### Python Inference Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Yahia-123/Qwen2.5-1.5B-Instruct_EOU"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

SYSTEM_PROMPT = "You are an end-of-utterance detection assistant. Reply with only 0 or 1."

def predict_eou(text):
    # Build the prompt using the official Qwen chat template
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,  # deterministic decoding
    )
    # Decode only the newly generated token
    return tokenizer.decode(
        output[0][inputs.input_ids.shape[-1]:],
        skip_special_tokens=True,
    ).strip()

# Example
text = "Hello, how are you?"
print(predict_eou(text))  # "0" or "1"
```
### Notes
- The model is prompted using the official Qwen chat template
- The output is a single token representing the EOU decision
- Deterministic decoding is used for stable behavior in production systems
## Limitations
- The model outputs a binary decision and does not represent uncertainty
- Performance may degrade on domains far from conversational dialogue
- Highly code-switched or extremely short utterances remain challenging
These limitations are acceptable within the current voice-agent design and can be addressed in future iterations.
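The missing uncertainty estimate could in principle be approximated without retraining: instead of decoding a hard token, compare the model's logits for the `0` and `1` tokens at the first generated position. The logit values below are illustrative; in practice they would come from the model's output logits:

```python
import math

# Sketch of recovering a confidence score from the model's logits for
# the "0" and "1" tokens; the logit values here are illustrative only.
logit_0, logit_1 = 1.2, 2.9  # hypothetical logits for tokens "0" and "1"

# Softmax restricted to the two candidate tokens
p_eou = math.exp(logit_1) / (math.exp(logit_0) + math.exp(logit_1))
print(p_eou)  # probability that the utterance is a complete turn
```

Thresholding `p_eou` (rather than taking a hard 0/1 decision) would let the voice agent trade off premature cut-offs against latency.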
## Summary

This project demonstrates how a modern instruction-tuned LLM can be adapted, using parameter-efficient fine-tuning, for real-time conversational turn detection. By integrating the fine-tuned Qwen2.5-1.5B-Instruct model into a live voice agent, the system achieves more natural, human-like dialogue flow, particularly in Arabic and Saudi dialect scenarios.