# Semantic Turn Policy

**semantic-turn-policy**: a text-based turn-taking policy model for voice AI agents.
A fine-tuned Qwen2.5-0.5B model for predicting turn-taking actions in conversations. Given a conversation context, the model predicts what action a voice AI agent should take next.
## Model Description
This model is designed for semantic turn-taking in voice AI applications. Unlike acoustic-based approaches (VAD, silence detection), this model uses the semantic content of the conversation to decide when an AI agent should speak, listen, or continue its current action.
## Action Tokens
The model predicts one of four action tokens:
| Token | Description | When to Use |
|---|---|---|
| `<\|continue_listening\|>` | Keep listening | User is mid-utterance, not done speaking |
| `<\|start_speaking\|>` | Begin speaking | User finished, agent should respond |
| `<\|start_listening\|>` | Start listening | Agent finished speaking, await user |
| `<\|continue_speaking\|>` | Continue speaking | User gave backchannel, agent should continue |
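In an agent loop these tokens map naturally onto a speaking/listening state machine. A minimal, illustrative interpretation (the exact agent-side semantics are up to the integration, not defined by the model):

```python
# Illustrative mapping from predicted action tokens to the agent mode they
# imply: the two "start" tokens switch modes, the two "continue" tokens
# keep the current one. This follows the table above; it is not model code.
def next_agent_mode(token: str) -> str:
    return {
        "<|continue_listening|>": "listening",
        "<|start_speaking|>": "speaking",
        "<|start_listening|>": "listening",
        "<|continue_speaking|>": "speaking",
    }[token]
```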
## Usage

### Basic Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "anyreach-ai/semantic-turn-taking"  # Update with actual repo
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Define action tokens
ACTION_TOKENS = [
    "<|continue_listening|>",
    "<|start_speaking|>",
    "<|start_listening|>",
    "<|continue_speaking|>"
]
action_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in ACTION_TOKENS]

# Format conversation - model predicts the action after the last turn (Qwen ChatML format)
conversation = "<|im_start|>user\nI'd like to book a table for two at 7pm tonight\n"

# Get prediction
inputs = tokenizer(conversation, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits[0, -1, action_token_ids]
probs = torch.softmax(logits, dim=-1)
predicted_idx = probs.argmax().item()

print(f"Predicted action: {ACTION_TOKENS[predicted_idx]}")
print(f"Probabilities: {dict(zip(ACTION_TOKENS, probs.tolist()))}")
# Output: Predicted action: <|start_speaking|> (user finished their request)
```
### ONNX Inference

For faster inference and deployment, an ONNX version is available in the `onnx/` subfolder. It uses the `optimum` library for seamless loading with KV-cache support.
```python
import numpy as np
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

# Load tokenizer and ONNX model
model_name = "anyreach-ai/semantic-turn-taking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ORTModelForCausalLM.from_pretrained(model_name, subfolder="onnx")

# Define action tokens
ACTION_TOKENS = [
    "<|continue_listening|>",
    "<|start_speaking|>",
    "<|start_listening|>",
    "<|continue_speaking|>"
]
action_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in ACTION_TOKENS]

# Format conversation (Qwen ChatML format)
conversation = "<|im_start|>user\nI'd like to book a table for two at 7pm tonight\n"

# Get prediction
inputs = tokenizer(conversation, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits[0, -1, action_token_ids].detach().numpy()

# Compute softmax
exp_logits = np.exp(logits - np.max(logits))
probs = exp_logits / exp_logits.sum()
predicted_idx = np.argmax(probs)

print(f"Predicted action: {ACTION_TOKENS[predicted_idx]}")
print(f"Probabilities: {dict(zip(ACTION_TOKENS, [f'{p:.1%}' for p in probs]))}")
```
Installation for ONNX:

```bash
pip install optimum[onnxruntime] transformers
```
### Turn-Taking Decision Function
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


class TurnTakingPredictor:
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.model.eval()
        self.action_tokens = [
            "<|continue_listening|>",
            "<|start_speaking|>",
            "<|start_listening|>",
            "<|continue_speaking|>"
        ]
        self.action_ids = [
            self.tokenizer.convert_tokens_to_ids(t)
            for t in self.action_tokens
        ]

    def predict(self, conversation: str) -> dict:
        """
        Predict turn-taking action for a conversation.

        Args:
            conversation: Conversation text in the expected format

        Returns:
            dict with 'action', 'probability', and 'all_probs'
        """
        inputs = self.tokenizer(
            conversation,
            return_tensors="pt",
            truncation=True,
            max_length=2048
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model(**inputs)

        logits = outputs.logits[0, -1, self.action_ids]
        probs = torch.softmax(logits, dim=-1).cpu().numpy()
        predicted_idx = probs.argmax()

        return {
            "action": self.action_tokens[predicted_idx],
            "probability": float(probs[predicted_idx]),
            "all_probs": {
                self.action_tokens[i]: float(probs[i])
                for i in range(len(self.action_tokens))
            }
        }

    def should_agent_speak(self, conversation: str, threshold: float = 0.5) -> bool:
        """
        Binary decision: should the agent start speaking?
        Returns True if start_speaking probability > threshold.
        """
        result = self.predict(conversation)
        return result["all_probs"]["<|start_speaking|>"] > threshold


# Usage
predictor = TurnTakingPredictor("anyreach-ai/semantic-turn-taking")

# Helper to convert chat messages to model format (Qwen ChatML)
def format_messages(messages):
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}\n" for m in messages)

# Single turn - user completed a request
conversation = "<|im_start|>user\nI'd like to order a pizza\n"
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|start_speaking|>

# Single turn - user is mid-sentence
conversation = "<|im_start|>user\nI was thinking about maybe going to the um\n"
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|continue_listening|>

# Multi-turn conversation using chat format
messages = [
    {"role": "user", "content": "Hi, I need help booking a flight to New York"},
    {"role": "assistant", "content": "I'd be happy to help! When are you planning to travel?"},
    {"role": "user", "content": "I'm thinking maybe next, um"},
]
conversation = format_messages(messages)
result = predictor.predict(conversation)
print(f"Action: {result['action']}")  # <|continue_listening|> (user hesitating)
```
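In a live agent, the predictor typically runs on every partial ASR update, so raw per-update decisions can flicker. One common mitigation (a sketch, not part of this model) is to require several consecutive `<|start_speaking|>` predictions before acting:

```python
class DebouncedTurnDecision:
    """Require n_consecutive <|start_speaking|> predictions before firing.

    `predict_fn` is any callable that returns an action token string, e.g.
    a hypothetical wrapper around TurnTakingPredictor.predict.
    """

    def __init__(self, predict_fn, n_consecutive: int = 2):
        self.predict_fn = predict_fn
        self.n_consecutive = n_consecutive
        self._streak = 0

    def should_speak(self, conversation: str) -> bool:
        if self.predict_fn(conversation) == "<|start_speaking|>":
            self._streak += 1
        else:
            self._streak = 0  # any other action resets the streak
        return self._streak >= self.n_consecutive
```

The value of `n_consecutive` trades response latency against robustness to transient misclassifications; `2` is just an illustrative default.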
## Conversation Format

The model expects conversations in Qwen ChatML format, using `<|im_start|>` as the role marker:
```
<|im_start|>user
User message here
<|action_token|>
<|im_start|>assistant
Assistant response here
<|action_token|>
<|im_start|>user
Another user message
```
**Format rules:**

- Each turn starts with `<|im_start|>` followed by the role name (`user` or `assistant`) and a newline
- Turn content follows on subsequent lines
- The model predicts the action token after each turn's content
- Action tokens (`<|start_speaking|>`, `<|continue_listening|>`, etc.) appear between turns in full conversations
**Examples:**

```python
# Single user turn (model predicts what action to take)
"<|im_start|>user\nI need help booking a flight\n"

# Multi-turn conversation with action tokens
"<|im_start|>user\nWhat's the weather today\n<|start_speaking|>\n<|im_start|>assistant\nIt's sunny and 72 degrees\n"
```
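For longer histories, the full-conversation form can be assembled from a message list plus the action taken after each completed turn. A hypothetical helper (assuming one action token per completed turn, `None` for the final turn whose action the model should predict):

```python
def build_conversation(turns):
    """Build a Qwen-ChatML conversation string with interleaved action tokens.

    `turns` is a list of (role, content, action_token_or_None) tuples.
    The final turn usually has action=None so the model predicts it.
    """
    parts = []
    for role, content, action in turns:
        parts.append(f"<|im_start|>{role}\n{content}\n")
        if action is not None:
            parts.append(f"{action}\n")
    return "".join(parts)

# Reproduces the multi-turn example above:
build_conversation([
    ("user", "What's the weather today", "<|start_speaking|>"),
    ("assistant", "It's sunny and 72 degrees", None),
])
```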
## Performance

Evaluated on a held-out test set of 540 conversations (16,978 prediction points) spanning 36 scenarios unseen during training:

### 4-Class Action Prediction
| Metric | Score |
|---|---|
| Accuracy | 92.2% |
| Balanced Accuracy | 91.7% |
| F1 Macro | 91.3% |
| F1 Weighted | 92.2% |
### Per-Action Performance
| Action | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| continue_listening | 91.4% | 92.7% | 92.0% | 3,214 |
| start_speaking | 95.6% | 94.8% | 95.2% | 5,461 |
| start_listening | 94.3% | 91.6% | 92.9% | 5,564 |
| continue_speaking | 82.9% | 87.6% | 85.2% | 2,739 |
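The aggregate F1 figures are consistent with the per-action table: macro F1 is the unweighted mean of the per-class F1 scores, while weighted F1 weights each class by its support.

```python
# Per-class F1 (%) and support, taken from the table above
f1 = {"continue_listening": 92.0, "start_speaking": 95.2,
      "start_listening": 92.9, "continue_speaking": 85.2}
support = {"continue_listening": 3214, "start_speaking": 5461,
           "start_listening": 5564, "continue_speaking": 2739}

macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[k] * support[k] for k in f1) / sum(support.values())

print(f"macro F1: {macro_f1:.1f}%")        # macro F1: 91.3%
print(f"weighted F1: {weighted_f1:.1f}%")  # weighted F1: 92.2%
```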
## Training
- Base Model: Qwen/Qwen2.5-0.5B-Instruct
- Format: Qwen ChatML (`<|im_start|>role\ncontent\n`)
- Training Data: ~9,800 synthetic conversations across 280 scenarios, 7 categories
- Loss: Hybrid masked loss — class-balanced focal loss (gamma=2.0) on action tokens + 10% NTP weight on all tokens
- Optimizer: AdamW with cosine LR schedule, differential learning rates (5x for embeddings, 1x for backbone)
- Early Stopping: Based on eval F1 macro (patience=10)
- Augmentation: On-the-fly ASR style augmentation + realistic streaming chunking
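The focal-loss component above down-weights examples the model already gets right, so training concentrates on hard action tokens. A minimal per-token sketch (the class-balancing weights and the 10% NTP mixing are only described here, not reproduced; `alpha` below is a plain scalar stand-in):

```python
import math

def focal_loss(p_correct: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one token: -alpha * (1 - p)^gamma * log(p).

    With gamma=2.0 (the value used here), confidently correct predictions
    contribute almost nothing to the loss.
    """
    return -alpha * (1.0 - p_correct) ** gamma * math.log(p_correct)

ce = -math.log(0.9)              # plain cross-entropy at p=0.9, ~0.105
fl = focal_loss(0.9, gamma=2.0)  # focal term at p=0.9, ~0.0011
```

At `p=0.9` the focal term is 100x smaller than plain cross-entropy (the `(1-p)^gamma` factor equals 0.01), which is the intended effect.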
## Limitations
- Trained on synthetic English conversations; may not generalize to other languages
- Optimized for customer service and assistant-style conversations
- Does not incorporate acoustic features (prosody, pauses, etc.)
- Best used in combination with VAD for real-time applications
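The last point can be made concrete. A common integration pattern (a sketch under assumptions, not a prescribed architecture) uses VAD as a cheap gate and only queries the semantic model once a short silence is detected; `min_silence_ms` is a tunable assumption, not a value from this model card:

```python
def should_respond(vad_silence_ms: float,
                   semantic_predict,
                   conversation: str,
                   min_silence_ms: float = 200.0) -> bool:
    """Gate the semantic model behind VAD.

    `semantic_predict` is any callable returning an action token string
    (e.g. built on TurnTakingPredictor - a hypothetical integration).
    """
    if vad_silence_ms < min_silence_ms:
        return False  # user audio still active: don't query the model at all
    return semantic_predict(conversation) == "<|start_speaking|>"
```

This keeps the model off the hot path during active speech while still letting semantics, not silence length alone, make the final turn decision.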
## Intended Use
- Voice AI agents that need semantic turn-taking decisions
- Conversational AI systems with natural turn management
- Research on dialogue systems and turn-taking
## Citation

```bibtex
@misc{semantic-turn-taking-2026,
  title={Semantic Turn-Taking Model},
  author={Shangeth Rajaa},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/anyreach-ai/semantic-turn-taking}
}
```
## License
Apache 2.0