Turnsense: Turn-Detector Model
Turnsense is a lightweight end-of-utterance (EOU) detection model fine-tuned from SmolLM2-135M and optimized for Raspberry Pi and other low-power devices. It is trained on TURNS-2K, a dataset designed to cover varied STT output patterns, including backchannels, mispronunciations, code-switching, and different text formatting styles, which helps the model generalize across STT systems.
Key Features
- Lightweight: Built on SmolLM2-135M (~135M parameters)
- High accuracy: 97.50% (standard) / 93.75% (quantized)
- Edge-ready: Runs on Raspberry Pi and similar hardware
- ONNX support: Works with ONNX Runtime and Hugging Face Transformers
Performance
The model holds up well across configurations:
- Standard model: 97.50% accuracy
- Quantized model: 93.75% accuracy
- Average probability difference: 0.0323 between versions
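The "average probability difference" compares the standard and quantized models' output probabilities on the same inputs. The card doesn't include the evaluation script, but a minimal sketch of how such a number could be computed, assuming both variants emit per-class logits of shape (N, 2) over the same N evaluation inputs:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Row-wise softmax with max-subtraction for numerical stability
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_probability_difference(logits_a: np.ndarray, logits_b: np.ndarray) -> float:
    """Mean absolute difference between two models' class-1 probabilities,
    given their (N, 2) logit arrays over the same N evaluation inputs."""
    p_a = softmax(logits_a)[:, 1]
    p_b = softmax(logits_b)[:, 1]
    return float(np.mean(np.abs(p_a - p_b)))

# Hypothetical logits from the standard and quantized variants on 3 inputs
std = np.array([[0.1, 2.0], [1.5, -0.5], [0.0, 0.3]])
qnt = np.array([[0.2, 1.8], [1.4, -0.3], [0.1, 0.2]])
print(round(mean_probability_difference(std, qnt), 4))
```

The logit values above are illustrative, not actual model outputs; a small mean difference (here, 0.0323 between versions) indicates quantization barely shifts the model's decisions.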
Installation
pip install transformers onnxruntime numpy huggingface_hub
Quick Start
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download the tokenizer and the quantized ONNX model
model_id = "latishab/turnsense"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_path = hf_hub_download(repo_id=model_id, filename="model_quantized.onnx")

# Initialize an ONNX Runtime session on CPU
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Prepare input
# Note: the special token <|user|> is included, but <|im_end|> is not.
text = "Hello, how are you?"
inputs = tokenizer(
    f"<|user|> {text}",
    padding="max_length",
    max_length=256,
    return_tensors="np",  # returns NumPy arrays directly; no .numpy() call needed
)

# Run inference (cast to int64, the integer type ONNX graphs typically expect)
ort_inputs = {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
}
all_logits = session.run(None, ort_inputs)[0]

# Take the logits for the first (and only) item in the batch
prediction = int(np.argmax(all_logits[0]))
print(f"Text: '{text}'")
print(f"Prediction (0 or 1): {prediction}")
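Instead of a hard argmax, the logits can be converted to a probability and thresholded, which is useful when you want to tune how eagerly the system decides a turn has ended. A minimal sketch, assuming a two-class logit output and an illustrative 0.5 threshold (neither is specified by this card):

```python
import numpy as np

def eou_probability(logits: np.ndarray) -> float:
    """Softmax over a two-class logit vector; returns the class-1 probability."""
    # Subtract the max for numerical stability before exponentiating
    z = logits - np.max(logits)
    probs = np.exp(z) / np.sum(np.exp(z))
    return float(probs[1])

def is_end_of_utterance(logits: np.ndarray, threshold: float = 0.5) -> bool:
    # threshold is an illustrative default, not a value from the model card
    return eou_probability(logits) >= threshold

print(is_end_of_utterance(np.array([0.2, 1.7])))  # prints True
```

In a voice pipeline, raising the threshold trades slower turn-taking for fewer interruptions mid-utterance, and lowering it does the opposite.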
Dataset: TURNS-2K
The model is trained on TURNS-2K, a dataset built for end-of-utterance detection. It covers:
- Backchannels and self-corrections
- Code-switching and language mixing
- Multiple text formatting styles
- Variations in STT output across different systems
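To illustrate (with made-up examples, not actual TURNS-2K rows), inputs of all these kinds would be wrapped in the same prompt format shown in the Quick Start before being fed to the model:

```python
def format_turn(text: str) -> str:
    # Prompt format from the Quick Start: <|user|> prefix, no <|im_end|> suffix
    return f"<|user|> {text.strip()}"

# Hypothetical inputs in the styles TURNS-2K covers (not real dataset rows)
samples = [
    "mm-hmm",                            # backchannel
    "I went to the... to the store",     # self-correction
    "can you play despacito por favor",  # code-switching / language mixing
    "OK SO THE MEETING IS AT 3PM",       # formatting variation
]
for s in samples:
    print(format_turn(s))
```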
Motivation and current state
I built Turnsense because I couldn't find a good open-source turn detection model for edge devices. Most options were either proprietary or too heavy to run on something like a Raspberry Pi.
The model is trained on English speech patterns using 2,000 samples via LoRA fine-tuning on SmolLM2-135M. It handles common STT outputs well, but there are edge cases and complex conversational patterns it doesn't cover yet. ONNX was a deliberate choice for device compatibility, though a port to Apple MLX is on the table.
License
Apache 2.0. See the LICENSE file for details.
Contributing
Contributions are welcome. Some areas that could use help: dataset expansion, model optimization, documentation, and bug reports. Feel free to open a PR or issue.
Citation
If you use this model in your research:
@software{latishab2025turnsense,
  author    = {Latisha Besariani HENDRA},
  title     = {Turnsense: A Lightweight End-of-Utterance Detection Model},
  month     = mar,
  year      = 2025,
  publisher = {GitHub},
  url       = {https://github.com/latishab/turnsense},
  note      = {https://huggingface.co/latishab/turnsense}
}