Semantic Turn Policy

semantic-turn-policy - A text-based turn-taking policy model for voice AI agents.

A fine-tuned Qwen2.5-0.5B model for predicting turn-taking actions in conversations. Given a conversation context, the model predicts what action a voice AI agent should take next.

Model Description

This model is designed for semantic turn-taking in voice AI applications. Unlike acoustic-based approaches (VAD, silence detection), this model uses the semantic content of the conversation to decide when an AI agent should speak, listen, or continue its current action.

Action Tokens

The model predicts one of four action tokens:

Token                  | Description       | When to Use
<|continue_listening|> | Keep listening    | User is mid-utterance, not done speaking
<|start_speaking|>     | Begin speaking    | User finished, agent should respond
<|start_listening|>    | Start listening   | Agent finished speaking, await user
<|continue_speaking|>  | Continue speaking | User gave backchannel, agent should continue
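In downstream code it can help to map these four tokens onto a small enum rather than comparing raw strings. A minimal sketch (the token strings come from the table above; the `AgentAction` enum and `TOKEN_TO_ACTION` mapping are our own naming, not part of the model's API):

```python
from enum import Enum

class AgentAction(Enum):
    """Agent-side interpretation of each predicted action token."""
    CONTINUE_LISTENING = "<|continue_listening|>"
    START_SPEAKING = "<|start_speaking|>"
    START_LISTENING = "<|start_listening|>"
    CONTINUE_SPEAKING = "<|continue_speaking|>"

# Reverse lookup: predicted token string -> enum member
TOKEN_TO_ACTION = {a.value: a for a in AgentAction}

# Dispatch on the model's predicted token:
action = TOKEN_TO_ACTION["<|start_speaking|>"]
assert action is AgentAction.START_SPEAKING
```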

Usage

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "anyreach-ai/semantic-turn-taking"  # Update with actual repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Define action tokens
ACTION_TOKENS = [
    "<|continue_listening|>",
    "<|start_speaking|>",
    "<|start_listening|>",
    "<|continue_speaking|>"
]
action_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in ACTION_TOKENS]

# Format conversation - model predicts action after the last turn (Qwen ChatML format)
conversation = "<|im_start|>user\nI'd like to book a table for two at 7pm tonight\n"

# Get prediction
inputs = tokenizer(conversation, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits[0, -1, action_token_ids]
    probs = torch.softmax(logits, dim=-1)
    predicted_idx = probs.argmax().item()

print(f"Predicted action: {ACTION_TOKENS[predicted_idx]}")
print(f"Probabilities: {dict(zip(ACTION_TOKENS, probs.tolist()))}")
# Output: Predicted action: <|start_speaking|> (user finished their request)

ONNX Inference

For faster inference and deployment, an ONNX version is available in the onnx/ subfolder. It loads through the optimum library, with KV-cache support.

import numpy as np
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load tokenizer and ONNX model
model_name = "anyreach-ai/semantic-turn-taking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForCausalLM.from_pretrained(model_name, subfolder="onnx")

# Define action tokens
ACTION_TOKENS = [
    "<|continue_listening|>",
    "<|start_speaking|>",
    "<|start_listening|>",
    "<|continue_speaking|>"
]
action_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in ACTION_TOKENS]

# Format conversation (Qwen ChatML format)
conversation = "<|im_start|>user\nI'd like to book a table for two at 7pm tonight\n"

# Get prediction
inputs = tokenizer(conversation, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits[0, -1, action_token_ids].detach().numpy()

# Compute softmax
exp_logits = np.exp(logits - np.max(logits))
probs = exp_logits / exp_logits.sum()

predicted_idx = np.argmax(probs)
print(f"Predicted action: {ACTION_TOKENS[predicted_idx]}")
print(f"Probabilities: {dict(zip(ACTION_TOKENS, [f'{p:.1%}' for p in probs]))}")

Installation for ONNX:

pip install optimum[onnxruntime] transformers

Turn-Taking Decision Function

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

class TurnTakingPredictor:
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.model.eval()
        
        self.action_tokens = [
            "<|continue_listening|>",
            "<|start_speaking|>",
            "<|start_listening|>",
            "<|continue_speaking|>"
        ]
        self.action_ids = [
            self.tokenizer.convert_tokens_to_ids(t) 
            for t in self.action_tokens
        ]
    
    def predict(self, conversation: str) -> dict:
        """
        Predict turn-taking action for a conversation.
        
        Args:
            conversation: Conversation text in the expected format
            
        Returns:
            dict with 'action', 'probability', and 'all_probs'
        """
        inputs = self.tokenizer(
            conversation, 
            return_tensors="pt",
            truncation=True,
            max_length=2048
        ).to(self.model.device)
        
        with torch.no_grad():
            outputs = self.model(**inputs)
            logits = outputs.logits[0, -1, self.action_ids]
            probs = torch.softmax(logits, dim=-1).cpu().numpy()
        
        predicted_idx = probs.argmax()
        return {
            "action": self.action_tokens[predicted_idx],
            "probability": float(probs[predicted_idx]),
            "all_probs": {
                self.action_tokens[i]: float(probs[i]) 
                for i in range(len(self.action_tokens))
            }
        }
    
    def should_agent_speak(self, conversation: str, threshold: float = 0.5) -> bool:
        """
        Binary decision: should the agent start speaking?
        
        Returns True if start_speaking probability > threshold
        """
        result = self.predict(conversation)
        return result["all_probs"]["<|start_speaking|>"] > threshold


# Usage
predictor = TurnTakingPredictor("anyreach-ai/semantic-turn-taking")

# Helper to convert chat messages to model format (Qwen ChatML)
def format_messages(messages):
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}\n" for m in messages)

# Single turn - user completed a request
conversation = "<|im_start|>user\nI'd like to order a pizza\n"
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|start_speaking|>

# Single turn - user is mid-sentence
conversation = "<|im_start|>user\nI was thinking about maybe going to the um\n"
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|continue_listening|>

# Multi-turn conversation using chat format
messages = [
    {"role": "user", "content": "Hi, I need help booking a flight to New York"},
    {"role": "assistant", "content": "I'd be happy to help! When are you planning to travel?"},
    {"role": "user", "content": "I'm thinking maybe next, um"},
]
conversation = format_messages(messages)
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|continue_listening|> (user hesitating)

Conversation Format

The model expects conversations in Qwen ChatML format, using <|im_start|> as the role marker:

<|im_start|>user
User message here
<|action_token|>
<|im_start|>assistant
Assistant response here
<|action_token|>
<|im_start|>user
Another user message

Format rules:

  • Each turn starts with <|im_start|> followed by the role name (user or assistant) and a newline
  • Turn content follows on subsequent lines
  • The model predicts the action token after each turn's content
  • Action tokens (<|start_speaking|>, <|continue_listening|>, etc.) appear between turns in full conversations

Examples:

# Single user turn (model predicts what action to take)
"<|im_start|>user\nI need help booking a flight\n"

# Multi-turn conversation with action tokens
"<|im_start|>user\nWhat's the weather today\n<|start_speaking|>\n<|im_start|>assistant\nIt's sunny and 72 degrees\n"

Performance

Evaluated on a held-out test set of 540 conversations (16,978 prediction points) across 36 scenarios entirely unseen during training:

4-Class Action Prediction

Metric            | Score
Accuracy          | 92.2%
Balanced Accuracy | 91.7%
F1 Macro          | 91.3%
F1 Weighted       | 92.2%

Per-Action Performance

Action             | Precision | Recall | F1 Score | Support
continue_listening | 91.4%     | 92.7%  | 92.0%    | 3,214
start_speaking     | 95.6%     | 94.8%  | 95.2%    | 5,461
start_listening    | 94.3%     | 91.6%  | 92.9%    | 5,564
continue_speaking  | 82.9%     | 87.6%  | 85.2%    | 2,739

Training

  • Base Model: Qwen/Qwen2.5-0.5B-Instruct
  • Format: Qwen ChatML (<|im_start|>role\ncontent\n)
  • Training Data: ~9,800 synthetic conversations across 280 scenarios, 7 categories
  • Loss: Hybrid masked loss — class-balanced focal loss (gamma=2.0) on action tokens + 10% NTP weight on all tokens
  • Optimizer: AdamW with cosine LR schedule, differential learning rates (5x for embeddings, 1x for backbone)
  • Early Stopping: Based on eval F1 macro (patience=10)
  • Augmentation: On-the-fly ASR style augmentation + realistic streaming chunking
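The focal-loss component above can be made concrete with a small sketch of per-prediction focal loss (gamma=2.0 as stated; the function name, the alpha weights, and the example probabilities are illustrative assumptions, not the actual training code):

```python
import math

def focal_loss(probs, target_idx, alpha, gamma=2.0):
    """Class-balanced focal loss for one action-token prediction.

    probs:      softmax probabilities over the 4 action tokens
    target_idx: index of the true action
    alpha:      per-class weights (e.g. derived from class frequency)
    gamma:      focusing parameter; down-weights easy examples
    """
    p_t = probs[target_idx]
    return -alpha[target_idx] * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes almost no loss...
easy = focal_loss([0.05, 0.9, 0.03, 0.02], 1, alpha=[1.0] * 4)
# ...while an uncertain one is penalized far more heavily.
hard = focal_loss([0.4, 0.3, 0.2, 0.1], 1, alpha=[1.0] * 4)
assert easy < hard
```

With gamma=0 this reduces to ordinary weighted cross-entropy; raising gamma shifts training effort toward the hard, ambiguous turn boundaries.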

Limitations

  • Trained on synthetic English conversations; may not generalize to other languages
  • Optimized for customer service and assistant-style conversations
  • Does not incorporate acoustic features (prosody, pauses, etc.)
  • Best used in combination with VAD for real-time applications
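The last point can be made concrete with a simple fusion rule: hand the turn to the agent only when acoustic silence and the semantic signal agree. A hedged sketch (the function name, silence window, and thresholds are illustrative assumptions, not a prescribed integration):

```python
def should_take_turn(vad_silence_ms, start_speaking_prob,
                     min_silence_ms=200, prob_threshold=0.5):
    """Fuse a VAD silence measurement with the model's semantic signal.

    The agent takes the turn only when the user has been acoustically
    silent for a minimum window AND the model judges the utterance
    semantically complete (high <|start_speaking|> probability).
    """
    return vad_silence_ms >= min_silence_ms and start_speaking_prob > prob_threshold

# Semantically complete, but no acoustic pause yet -> keep waiting
assert should_take_turn(50, 0.9) is False
# Long pause, but user is mid-sentence semantically -> keep listening
assert should_take_turn(400, 0.2) is False
# Both signals agree -> agent speaks
assert should_take_turn(400, 0.9) is True
```

Requiring both signals avoids barge-ins on mid-utterance pauses (where VAD alone fires) and on trailing filler words (where the semantic model alone might fire early).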

Intended Use

  • Voice AI agents that need semantic turn-taking decisions
  • Conversational AI systems with natural turn management
  • Research on dialogue systems and turn-taking

Citation

@misc{semantic-turn-taking-2026,
  title={Semantic Turn-Taking Model},
  author={Shangeth Rajaa},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/anyreach-ai/semantic-turn-taking}
}

License

Apache 2.0
