Turnsense: Turn-Detector Model
Turnsense is a lightweight end-of-utterance (EOU) detection model fine-tuned from SmolLM2-135M and optimized for Raspberry Pi and other low-power devices. It is trained on TURNS-2K, a dataset designed to cover varied STT output patterns, including backchannels, mispronunciations, code-switching, and different text formatting styles, which helps the model generalize across STT systems.
Key Features
- Lightweight: Built on SmolLM2-135M (~135M parameters)
- High accuracy: 97.50% (standard) / 93.75% (quantized)
- Edge-ready: Runs on Raspberry Pi and similar hardware
- ONNX support: Works with ONNX Runtime and Hugging Face Transformers
Performance
The model holds up well across configurations:
- Standard model: 97.50% accuracy
- Quantized model: 93.75% accuracy
- Average probability difference: 0.0323 between versions
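The "average probability difference" compares the standard and quantized models' output probabilities on the same inputs. The card doesn't include the evaluation script, but a minimal sketch of how such a number could be computed, assuming both variants emit per-class logits of shape (N, 2) over the same N evaluation inputs:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Row-wise softmax with max-subtraction for numerical stability
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_probability_difference(logits_a: np.ndarray, logits_b: np.ndarray) -> float:
    """Mean absolute difference between two models' class-1 probabilities,
    given their (N, 2) logit arrays over the same N evaluation inputs."""
    p_a = softmax(logits_a)[:, 1]
    p_b = softmax(logits_b)[:, 1]
    return float(np.mean(np.abs(p_a - p_b)))

# Hypothetical logits from the standard and quantized variants on 3 inputs
std = np.array([[0.1, 2.0], [1.5, -0.5], [0.0, 0.3]])
qnt = np.array([[0.2, 1.8], [1.4, -0.3], [0.1, 0.2]])
print(round(mean_probability_difference(std, qnt), 4))
```

The logit values above are illustrative, not actual model outputs; a small mean difference (here, 0.0323 between versions) indicates quantization barely shifts the model's decisions.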
Installation
pip install transformers onnxruntime numpy huggingface_hub
Quick Start
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download the tokenizer and the quantized ONNX model
model_id = "latishab/turnsense"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_path = hf_hub_download(repo_id=model_id, filename="model_quantized.onnx")

# Initialize an ONNX Runtime session on CPU
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Prepare input
# Note: the special token <|user|> is included, but <|im_end|> is not.
text = "Hello, how are you?"
inputs = tokenizer(
    f"<|user|> {text}",
    padding="max_length",
    max_length=256,
    return_tensors="np",  # returns NumPy arrays directly; no .numpy() call needed
)

# Run inference (cast to int64, the integer type ONNX graphs typically expect)
ort_inputs = {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
}
all_logits = session.run(None, ort_inputs)[0]

# Take the logits for the first (and only) item in the batch
prediction = int(np.argmax(all_logits[0]))
print(f"Text: '{text}'")
print(f"Prediction (0 or 1): {prediction}")
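Instead of a hard argmax, the logits can be converted to a probability and thresholded, which is useful when you want to tune how eagerly the system decides a turn has ended. A minimal sketch, assuming a two-class logit output and an illustrative 0.5 threshold (neither is specified by this card):

```python
import numpy as np

def eou_probability(logits: np.ndarray) -> float:
    """Softmax over a two-class logit vector; returns the class-1 probability."""
    # Subtract the max for numerical stability before exponentiating
    z = logits - np.max(logits)
    probs = np.exp(z) / np.sum(np.exp(z))
    return float(probs[1])

def is_end_of_utterance(logits: np.ndarray, threshold: float = 0.5) -> bool:
    # threshold is an illustrative default, not a value from the model card
    return eou_probability(logits) >= threshold

print(is_end_of_utterance(np.array([0.2, 1.7])))  # prints True
```

In a voice pipeline, raising the threshold trades slower turn-taking for fewer interruptions mid-utterance, and lowering it does the opposite.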
Dataset: TURNS-2K
The model is trained on TURNS-2K, a dataset built for end-of-utterance detection. It covers:
- Backchannels and self-corrections
- Code-switching and language mixing
- Multiple text formatting styles
- Variations in STT output across different systems
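To illustrate (with made-up examples, not actual TURNS-2K rows), inputs of all these kinds would be wrapped in the same prompt format shown in the Quick Start before being fed to the model:

```python
def format_turn(text: str) -> str:
    # Prompt format from the Quick Start: <|user|> prefix, no <|im_end|> suffix
    return f"<|user|> {text.strip()}"

# Hypothetical inputs in the styles TURNS-2K covers (not real dataset rows)
samples = [
    "mm-hmm",                            # backchannel
    "I went to the... to the store",     # self-correction
    "can you play despacito por favor",  # code-switching / language mixing
    "OK SO THE MEETING IS AT 3PM",       # formatting variation
]
for s in samples:
    print(format_turn(s))
```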
Motivation and current state
I built Turnsense because I couldn't find a good open-source turn detection model for edge devices. Most options were either proprietary or too heavy to run on something like a Raspberry Pi.
The model is trained on English speech patterns using 2,000 samples via LoRA fine-tuning on SmolLM2-135M. It handles common STT outputs well, but there are edge cases and complex conversational patterns it doesn't cover yet. ONNX was a deliberate choice for device compatibility, though a port to Apple MLX is on the table.
License
Apache 2.0. See the LICENSE file for details.
Contributing
Contributions are welcome. Some areas that could use help: dataset expansion, model optimization, documentation, and bug reports. Feel free to open a PR or issue.
Citation
If you use this model in your research:
@software{latishab2025turnsense,
  author    = {Latisha Besariani HENDRA},
  title     = {Turnsense: A Lightweight End-of-Utterance Detection Model},
  month     = mar,
  year      = 2025,
  publisher = {GitHub},
  url       = {https://github.com/latishab/turnsense},
  note      = {https://huggingface.co/latishab/turnsense}
}