# Semantic Turn Policy

**semantic-turn-policy**: a text-based turn-taking policy model for voice AI agents.
A fine-tuned Qwen2.5-0.5B model for predicting turn-taking actions in conversations. Given a conversation context, the model predicts what action a voice AI agent should take next.
## Model Description
This model is designed for semantic turn-taking in voice AI applications. Unlike acoustic-based approaches (VAD, silence detection), this model uses the semantic content of the conversation to decide when an AI agent should speak, listen, or continue its current action.
## Action Tokens
The model predicts one of four action tokens:
| Token | Description | When to Use |
|---|---|---|
| `<\|continue_listening\|>` | Keep listening | User is mid-utterance, not done speaking |
| `<\|start_speaking\|>` | Begin speaking | User finished, agent should respond |
| `<\|start_listening\|>` | Start listening | Agent finished speaking, await user |
| `<\|continue_speaking\|>` | Continue speaking | User gave backchannel, agent should continue |
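In an agent loop these tokens map naturally onto a speaking/listening state machine. A minimal, illustrative interpretation (the exact agent-side semantics are up to the integration, not defined by the model):

```python
# Illustrative mapping from predicted action tokens to the agent mode they
# imply: the two "start" tokens switch modes, the two "continue" tokens
# keep the current one. This follows the table above; it is not model code.
def next_agent_mode(token: str) -> str:
    return {
        "<|continue_listening|>": "listening",
        "<|start_speaking|>": "speaking",
        "<|start_listening|>": "listening",
        "<|continue_speaking|>": "speaking",
    }[token]
```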
## Usage

### Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "anyreach-ai/semantic-turn-taking"  # Update with actual repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Define action tokens
ACTION_TOKENS = [
    "<|continue_listening|>",
    "<|start_speaking|>",
    "<|start_listening|>",
    "<|continue_speaking|>"
]
action_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in ACTION_TOKENS]

# Format conversation - model predicts the action after the last turn (Qwen ChatML format)
conversation = "<|im_start|>user\nI'd like to book a table for two at 7pm tonight\n"

# Get prediction
inputs = tokenizer(conversation, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits[0, -1, action_token_ids]
probs = torch.softmax(logits, dim=-1)
predicted_idx = probs.argmax().item()

print(f"Predicted action: {ACTION_TOKENS[predicted_idx]}")
print(f"Probabilities: {dict(zip(ACTION_TOKENS, probs.tolist()))}")
# Output: Predicted action: <|start_speaking|> (user finished their request)
```
### ONNX Inference

For faster inference and deployment, an ONNX version is available in the `onnx/` subfolder. It uses the `optimum` library for seamless loading with KV-cache support.
```python
import numpy as np
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load tokenizer and ONNX model
model_name = "anyreach-ai/semantic-turn-taking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForCausalLM.from_pretrained(model_name, subfolder="onnx")

# Define action tokens
ACTION_TOKENS = [
    "<|continue_listening|>",
    "<|start_speaking|>",
    "<|start_listening|>",
    "<|continue_speaking|>"
]
action_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in ACTION_TOKENS]

# Format conversation (Qwen ChatML format)
conversation = "<|im_start|>user\nI'd like to book a table for two at 7pm tonight\n"

# Get prediction
inputs = tokenizer(conversation, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits[0, -1, action_token_ids].detach().numpy()

# Compute softmax
exp_logits = np.exp(logits - np.max(logits))
probs = exp_logits / exp_logits.sum()
predicted_idx = np.argmax(probs)

print(f"Predicted action: {ACTION_TOKENS[predicted_idx]}")
print(f"Probabilities: {dict(zip(ACTION_TOKENS, [f'{p:.1%}' for p in probs]))}")
```
Installation for ONNX:

```bash
pip install optimum[onnxruntime] transformers
```
### Turn-Taking Decision Function
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


class TurnTakingPredictor:
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.model.eval()
        self.action_tokens = [
            "<|continue_listening|>",
            "<|start_speaking|>",
            "<|start_listening|>",
            "<|continue_speaking|>"
        ]
        self.action_ids = [
            self.tokenizer.convert_tokens_to_ids(t)
            for t in self.action_tokens
        ]

    def predict(self, conversation: str) -> dict:
        """
        Predict turn-taking action for a conversation.

        Args:
            conversation: Conversation text in the expected format

        Returns:
            dict with 'action', 'probability', and 'all_probs'
        """
        inputs = self.tokenizer(
            conversation,
            return_tensors="pt",
            truncation=True,
            max_length=2048
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model(**inputs)

        logits = outputs.logits[0, -1, self.action_ids]
        probs = torch.softmax(logits, dim=-1).cpu().numpy()
        predicted_idx = probs.argmax()

        return {
            "action": self.action_tokens[predicted_idx],
            "probability": float(probs[predicted_idx]),
            "all_probs": {
                self.action_tokens[i]: float(probs[i])
                for i in range(len(self.action_tokens))
            }
        }

    def should_agent_speak(self, conversation: str, threshold: float = 0.5) -> bool:
        """
        Binary decision: should the agent start speaking?
        Returns True if start_speaking probability > threshold.
        """
        result = self.predict(conversation)
        return result["all_probs"]["<|start_speaking|>"] > threshold


# Usage
predictor = TurnTakingPredictor("anyreach-ai/semantic-turn-taking")

# Helper to convert chat messages to model format (Qwen ChatML)
def format_messages(messages):
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}\n" for m in messages)

# Single turn - user completed a request
conversation = "<|im_start|>user\nI'd like to order a pizza\n"
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|start_speaking|>

# Single turn - user is mid-sentence
conversation = "<|im_start|>user\nI was thinking about maybe going to the um\n"
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|continue_listening|>

# Multi-turn conversation using chat format
messages = [
    {"role": "user", "content": "Hi, I need help booking a flight to New York"},
    {"role": "assistant", "content": "I'd be happy to help! When are you planning to travel?"},
    {"role": "user", "content": "I'm thinking maybe next, um"},
]
conversation = format_messages(messages)
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|continue_listening|> (user hesitating)
```
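In a live agent, the predictor typically runs on every partial ASR update, so raw per-update decisions can flicker. One common mitigation (a sketch, not part of this model) is to require several consecutive `<|start_speaking|>` predictions before acting:

```python
class DebouncedTurnDecision:
    """Require n_consecutive <|start_speaking|> predictions before firing.

    `predict_fn` is any callable that returns an action token string, e.g.
    a hypothetical wrapper around TurnTakingPredictor.predict.
    """

    def __init__(self, predict_fn, n_consecutive: int = 2):
        self.predict_fn = predict_fn
        self.n_consecutive = n_consecutive
        self._streak = 0

    def should_speak(self, conversation: str) -> bool:
        if self.predict_fn(conversation) == "<|start_speaking|>":
            self._streak += 1
        else:
            self._streak = 0  # any other action resets the streak
        return self._streak >= self.n_consecutive
```

The value of `n_consecutive` trades response latency against robustness to transient misclassifications; `2` is just an illustrative default.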
## Conversation Format

The model expects conversations in Qwen ChatML format, using `<|im_start|>` as the role marker:
```
<|im_start|>user
User message here
<|action_token|>
<|im_start|>assistant
Assistant response here
<|action_token|>
<|im_start|>user
Another user message
```
**Format rules:**

- Each turn starts with `<|im_start|>` followed by the role name (`user` or `assistant`) and a newline
- Turn content follows on subsequent lines
- The model predicts the action token after each turn's content
- Action tokens (`<|start_speaking|>`, `<|continue_listening|>`, etc.) appear between turns in full conversations
**Examples:**

```python
# Single user turn (model predicts what action to take)
"<|im_start|>user\nI need help booking a flight\n"

# Multi-turn conversation with action tokens
"<|im_start|>user\nWhat's the weather today\n<|start_speaking|>\n<|im_start|>assistant\nIt's sunny and 72 degrees\n"
```
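For longer histories, the full-conversation form can be assembled from a message list plus the action taken after each completed turn. A hypothetical helper (assuming one action token per completed turn, `None` for the final turn whose action the model should predict):

```python
def build_conversation(turns):
    """Build a Qwen-ChatML conversation string with interleaved action tokens.

    `turns` is a list of (role, content, action_token_or_None) tuples.
    The final turn usually has action=None so the model predicts it.
    """
    parts = []
    for role, content, action in turns:
        parts.append(f"<|im_start|>{role}\n{content}\n")
        if action is not None:
            parts.append(f"{action}\n")
    return "".join(parts)

# Reproduces the multi-turn example above:
build_conversation([
    ("user", "What's the weather today", "<|start_speaking|>"),
    ("assistant", "It's sunny and 72 degrees", None),
])
```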
## Performance

Evaluated on a held-out test set of 540 conversations (16,978 prediction points) spanning 36 scenarios unseen during training:

### 4-Class Action Prediction
| Metric | Score |
|---|---|
| Accuracy | 92.2% |
| Balanced Accuracy | 91.7% |
| F1 Macro | 91.3% |
| F1 Weighted | 92.2% |
### Per-Action Performance
| Action | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| continue_listening | 91.4% | 92.7% | 92.0% | 3,214 |
| start_speaking | 95.6% | 94.8% | 95.2% | 5,461 |
| start_listening | 94.3% | 91.6% | 92.9% | 5,564 |
| continue_speaking | 82.9% | 87.6% | 85.2% | 2,739 |
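The aggregate F1 figures are consistent with the per-action table: macro F1 is the unweighted mean of the per-class F1 scores, while weighted F1 weights each class by its support.

```python
# Per-class F1 (%) and support, taken from the table above
f1 = {"continue_listening": 92.0, "start_speaking": 95.2,
      "start_listening": 92.9, "continue_speaking": 85.2}
support = {"continue_listening": 3214, "start_speaking": 5461,
           "start_listening": 5564, "continue_speaking": 2739}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[k] * support[k] for k in f1) / sum(support.values())

print(f"macro F1: {macro_f1:.1f}%")        # macro F1: 91.3%
print(f"weighted F1: {weighted_f1:.1f}%")  # weighted F1: 92.2%
```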
## Training
- Base Model: Qwen/Qwen2.5-0.5B-Instruct
- Format: Qwen ChatML (`<|im_start|>role\ncontent\n`)
- Training Data: ~9,800 synthetic conversations across 280 scenarios, 7 categories
- Loss: Hybrid masked loss — class-balanced focal loss (gamma=2.0) on action tokens + 10% NTP weight on all tokens
- Optimizer: AdamW with cosine LR schedule, differential learning rates (5x for embeddings, 1x for backbone)
- Early Stopping: Based on eval F1 macro (patience=10)
- Augmentation: On-the-fly ASR style augmentation + realistic streaming chunking
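The focal-loss component above down-weights examples the model already gets right, so training concentrates on hard action tokens. A minimal per-token sketch (the class-balancing weights and the 10% NTP mixing are only described here, not reproduced; `alpha` below is a plain scalar stand-in):

```python
import math

def focal_loss(p_correct: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one token: -alpha * (1 - p)^gamma * log(p).

    With gamma=2.0 (the value used here), confidently correct predictions
    contribute almost nothing to the loss.
    """
    return -alpha * (1.0 - p_correct) ** gamma * math.log(p_correct)

ce = -math.log(0.9)              # plain cross-entropy at p=0.9, ~0.105
fl = focal_loss(0.9, gamma=2.0)  # focal term at p=0.9, ~0.0011
```

At `p=0.9` the focal term is 100x smaller than plain cross-entropy (the `(1-p)^gamma` factor equals 0.01), which is the intended effect.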
## Limitations
- Trained on synthetic English conversations; may not generalize to other languages
- Optimized for customer service and assistant-style conversations
- Does not incorporate acoustic features (prosody, pauses, etc.)
- Best used in combination with VAD for real-time applications
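The last point can be made concrete. A common integration pattern (a sketch under assumptions, not a prescribed architecture) uses VAD as a cheap gate and only queries the semantic model once a short silence is detected; `min_silence_ms` is a tunable assumption, not a value from this model card:

```python
def should_respond(vad_silence_ms: float,
                   semantic_predict,
                   conversation: str,
                   min_silence_ms: float = 200.0) -> bool:
    """Gate the semantic model behind VAD.

    `semantic_predict` is any callable returning an action token string
    (e.g. built on TurnTakingPredictor - a hypothetical integration).
    """
    if vad_silence_ms < min_silence_ms:
        return False  # user audio still active: don't query the model at all
    return semantic_predict(conversation) == "<|start_speaking|>"
```

This keeps the model off the hot path during active speech while still letting semantics, not silence length alone, make the final turn decision.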
## Intended Use
- Voice AI agents that need semantic turn-taking decisions
- Conversational AI systems with natural turn management
- Research on dialogue systems and turn-taking
## Citation

```bibtex
@misc{semantic-turn-taking-2026,
  title={Semantic Turn-Taking Model},
  author={Shangeth Rajaa},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/anyreach-ai/semantic-turn-taking}
}
```
## License
Apache 2.0