
Qwen2.5-1.5B-Instruct – End-of-Utterance (EOU) Detection

📌 Overview

This repository documents the fine-tuning and integration of Qwen/Qwen2.5-1.5B-Instruct for the task of End-of-Utterance (EOU) detection in conversational text. The model is adapted to determine whether a given user utterance represents the end of a conversational turn (1) or whether the speaker is likely to continue (0).

The work is part of a larger AI Voice Agent system, where accurate turn-boundary detection is critical for natural, low-latency human–machine interaction, especially in streaming speech-to-text (STT) scenarios and Arabic dialogue (including Saudi dialect).

This README focuses on:

  • Model fine-tuning methodology
  • Evaluation approach and metrics
  • Project and system architecture
  • Practical integration into a full voice agent pipeline

🎯 Task Definition: End-of-Utterance Detection

EOU detection is framed as a binary decision task:

  • 0 → The utterance is incomplete; the speaker is likely to continue
  • 1 → The utterance represents the end of a conversational turn

Although conceptually a classification problem, the task is implemented using a causal language modeling (instruction-following) paradigm, where the model is prompted to generate a single token (0 or 1) as its response.

This design choice aligns with:

  • Instruction-tuned LLM behavior
  • Streaming dialogue systems
  • Unified handling of Arabic and English text
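
Concretely, training and evaluation data reduce to utterance–label pairs. The utterances below are invented for illustration and are not taken from the project's dataset:

```python
# Hypothetical labeled examples for EOU detection.
# "1" = complete turn, "0" = speaker likely to continue.
examples = [
    ("Hello, how are you?", "1"),
    ("I wanted to ask about", "0"),
    ("Can you check my order status?", "1"),
    ("So the thing is, um,", "0"),
]

for utterance, label in examples:
    assert label in {"0", "1"}
```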

🧠 Model Selection Rationale

Base Model: Qwen/Qwen2.5-1.5B-Instruct

The Qwen2.5-1.5B-Instruct model was selected due to:

  • Strong multilingual and instruction-following capabilities
  • Good performance on short-context reasoning tasks
  • Suitability for parameter-efficient fine-tuning
  • Better stability compared to smaller instruction-tuned models (e.g., SmolLM-Instruct), which produced weak and inconsistent EOU predictions in preliminary experiments

Fine-Tuning Strategy

  • Approach: Parameter-Efficient Fine-Tuning (PEFT) using LoRA
  • Objective: Preserve the base model’s general language understanding while adapting it to conversational turn-boundary detection
  • Output Format: Single-token generation (0 or 1)

🧪 Model Fine-Tuning

Training Objective

The model is fine-tuned to respond to a structured instruction prompt, such as:

"You are an end-of-utterance detection assistant. Reply with only 0 or 1."

Given a user utterance, the model learns to generate the correct binary output based on conversational completeness.
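
One way to picture this is as chat-style message assembly, where the assistant turn carries the gold label during training and is left open at inference. The helper below is a sketch; in the actual project the final prompt string is rendered by the tokenizer's chat template, not by this function:

```python
SYSTEM_PROMPT = "You are an end-of-utterance detection assistant. Reply with only 0 or 1."

def build_messages(utterance, label=None):
    """Assemble chat messages for one EOU example.

    At training time the assistant turn holds the gold label ("0" or "1");
    at inference time it is omitted so the model generates it.
    """
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": utterance},
    ]
    if label is not None:
        messages.append({"role": "assistant", "content": label})
    return messages

train_example = build_messages("I was wondering if", "0")   # 3 messages
infer_example = build_messages("Thanks, that's all I needed.")  # 2 messages
```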

Training Environment

  • Fine-tuning was performed in a Google Colab notebook
  • Mixed-precision training (FP16) was used
  • Gradient accumulation was applied to support effective batch sizes under memory constraints
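
A typical setup combining LoRA, FP16, and gradient accumulation with the `peft` and `transformers` libraries looks like the sketch below. The rank, target modules, learning rate, and batch sizes shown are illustrative assumptions, not the project's actual hyperparameters:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# LoRA adapters on the attention projections (illustrative settings)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# FP16 + gradient accumulation to fit Colab memory limits
training_args = TrainingArguments(
    output_dir="qwen25-eou",
    fp16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size of 16
    learning_rate=2e-4,
    num_train_epochs=3,
)
```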

Data

The fine-tuning dataset consists of conversational utterances annotated for EOU detection. The dataset includes:

  • Arabic (with emphasis on Saudi dialect)
  • Customer-service style conversations
  • STT-like fragmented and overlapping utterances

(Full dataset preparation and annotation methodology is documented separately.)


📊 Evaluation Methodology

Evaluation Setup

  • The dataset was split into training and evaluation subsets
  • The model was evaluated on unseen utterances
  • Predictions were generated deterministically (no sampling)

Metrics

Standard binary classification metrics were used:

  • Accuracy
  • Precision
  • Recall
  • F1-score

In addition, confusion matrix analysis was used to understand:

  • False EOU predictions (premature cut-offs)
  • Missed EOU predictions (latency and overlap issues)
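
All of these metrics follow directly from the confusion-matrix counts; a minimal pure-Python sketch (with made-up predictions) treating EOU = 1 as the positive class:

```python
def eou_metrics(y_true, y_pred):
    """Binary classification metrics for EOU detection (positive class = 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false EOUs: premature cut-offs
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed EOUs: latency/overlap
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example
print(eou_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```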

This evaluation strategy reflects real-world deployment concerns in voice agents and streaming dialogue systems.

Interpretation

Rather than optimizing for raw accuracy alone, emphasis was placed on:

  • Robustness to incomplete and noisy utterances
  • Stability across short and long turns
  • Sensible behavior under STT-style continuous input

πŸ“ Full Project Architecture

This model is part of a larger AI Voice Agent system. The overall project structure is shown below:

AI-Voice-Agent/
│
├── data/
│   └── Dataset Preparation
│       └── (filtering, cleaning, annotation, EOU generation)
│
├── EOU_turn_detection/
│   ├── detector.py
│   │   └── Loads the fine-tuned Qwen2.5 EOU model
│   │   └── Exposes a simple EOU prediction interface
│
├── Qwen2.5-1.5B_arabic_EOU/
│   └── Fine-tuned model weights and tokenizer
│
├── Qwen_Qwen2.5-1.5B-Instruct_EOU.ipynb
│   └── Google Colab notebook used for fine-tuning and evaluation
│
├── voice_agent/
│   └── LiveKit-based voice agent
│       └── Real-time audio streaming
│       └── STT integration
│       └── Uses EOU predictions to manage turn-taking
│
├── .env
│   └── Environment variables and API keys
│
├── requirements.txt
│   └── Python dependencies
│
└── README.md

This structure ensures modularity, reproducibility, and clean separation between model training, inference, and real-time voice interaction logic.


🔊 Integration into the Voice Agent

Within the voice agent, the EOU model plays a central role in turn management:

  1. User speech is streamed via LiveKit

  2. Speech-to-Text (STT) produces partial and overlapping transcripts

  3. Each transcript chunk is passed to the EOU detector (detector.py)

  4. The model predicts whether the current utterance is complete

  5. When EOU = 1:

    • The agent stops listening
    • The utterance is forwarded to the LLM / dialogue manager
  6. When EOU = 0:

    • The agent continues listening and accumulating text

This approach enables:

  • Natural turn-taking
  • Reduced interruption
  • Lower response latency
  • Robust handling of streaming and noisy STT output
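
The turn-taking loop above can be sketched as follows, with a stubbed detector standing in for the real model (the chunking and the punctuation-based stub are assumptions for illustration only):

```python
def predict_eou(text):
    """Stub for the fine-tuned detector: here, a turn counts as 'complete'
    when it ends with terminal punctuation. The real system calls the
    Qwen2.5 EOU model instead."""
    return "1" if text.rstrip().endswith(("?", ".", "!", "؟")) else "0"

def run_turn(stt_chunks):
    """Accumulate streamed STT chunks until the detector signals EOU = 1,
    then return the completed utterance for the dialogue manager."""
    buffer = ""
    for chunk in stt_chunks:
        buffer = (buffer + " " + chunk).strip()
        if predict_eou(buffer) == "1":
            return buffer  # EOU = 1: stop listening, forward downstream
    return buffer  # stream ended without an explicit EOU

chunks = ["I wanted to", "ask about my", "order status, please."]
print(run_turn(chunks))  # "I wanted to ask about my order status, please."
```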

🧪 How to Use the Model (Inference)

Below is a minimal inference example. It demonstrates how to load the fine-tuned Qwen2.5-EOU-Detection model, apply the Qwen chat template, and generate a binary End-of-Utterance prediction.

Python Inference Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Yahia-123/Qwen2.5-1.5B-Instruct_EOU"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

SYSTEM_PROMPT = "You are an end-of-utterance detection assistant. Reply with only 0 or 1."

def predict_eou(text):
    # Build the prompt with the official Qwen chat template
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,  # deterministic decoding
    )
    # Decode only the newly generated token
    return tokenizer.decode(
        output[0][inputs.input_ids.shape[-1]:],
        skip_special_tokens=True,
    ).strip()

# Example
text = "Hello, how are you?"
print(predict_eou(text))  # "0" or "1"

Notes

  • The model is prompted using the official Qwen chat template
  • The output is a single token representing the EOU decision
  • Deterministic decoding is used for stable behavior in production systems

⚠️ Limitations

  • The model outputs a binary decision and does not represent uncertainty
  • Performance may degrade on domains far from conversational dialogue
  • Highly code-switched or extremely short utterances remain challenging

These limitations are acceptable within the current voice-agent design and can be addressed in future iterations.
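
For instance, if a confidence score is needed in a future iteration, one option (not implemented in the current pipeline) is to compare the model's final-step logits for the `0` and `1` tokens rather than taking the generated token at face value. The softmax over two logits reduces to simple arithmetic; the logit values below are illustrative:

```python
import math

def eou_probability(logit_zero, logit_one):
    """Softmax over just the '0' and '1' token logits, returning P(EOU = 1).
    The logits themselves would come from the model's last-position output,
    which is not shown here."""
    m = max(logit_zero, logit_one)  # subtract the max for numerical stability
    e0 = math.exp(logit_zero - m)
    e1 = math.exp(logit_one - m)
    return e1 / (e0 + e1)

# Equal logits -> maximum uncertainty
print(eou_probability(2.0, 2.0))  # 0.5
```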


📌 Summary

This project demonstrates how a modern instruction-tuned LLM can be adapted, using parameter-efficient fine-tuning, for real-time conversational turn detection. By integrating the fine-tuned Qwen2.5-1.5B-Instruct model into a live voice agent, the system achieves more natural, human-like dialogue flow, particularly in Arabic and Saudi dialect scenarios.

