Turnsense: Turn-Detector Model

A lightweight end-of-utterance (EOU) detection model fine-tuned from SmolLM2-135M, optimized for Raspberry Pi and other low-power devices. Trained on TURNS-2K, a dataset designed to cover varied STT output patterns, including backchannels, mispronunciations, code-switching, and different text formatting styles, which helps the model generalize across STT systems.

Key Features

  • Lightweight: Built on SmolLM2-135M (~135M parameters)
  • High accuracy: 97.50% (standard) / 93.75% (quantized)
  • Edge-ready: Runs on Raspberry Pi and similar hardware
  • ONNX support: Works with ONNX Runtime and Hugging Face Transformers

Performance

The model holds up well across configurations:

  • Standard model: 97.50% accuracy
  • Quantized model: 93.75% accuracy
  • Average probability difference between versions: 0.0323
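The probability-difference figure above can be read as the mean absolute difference between the two variants' predicted probabilities over the same evaluation samples. A sketch of that computation (the function name is hypothetical, not part of the repository):

```python
import numpy as np

def mean_prob_difference(p_standard, p_quantized) -> float:
    """Mean absolute difference between per-sample probabilities of two model variants."""
    return float(np.mean(np.abs(np.asarray(p_standard) - np.asarray(p_quantized))))

# e.g. EOU probabilities for the same samples from the standard and quantized ONNX sessions
print(mean_prob_difference([0.91, 0.12, 0.87], [0.88, 0.15, 0.84]))
```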

[Figure: accuracy comparison between standard and quantized models]

Speed

[Figure: inference speed benchmark]

Installation

pip install transformers onnxruntime numpy huggingface_hub

Quick Start

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download and load tokenizer and model
model_id = "latishab/turnsense"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_path = hf_hub_download(repo_id=model_id, filename="model_quantized.onnx")

# Initialize ONNX Runtime session
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Prepare input
# Note: The special token <|user|> is included, but <|im_end|> is not.
text = "Hello, how are you?"
inputs = tokenizer(
    f"<|user|> {text}",
    padding="max_length",
    truncation=True,
    max_length=256,
    return_tensors="np"
)

# Run inference (return_tensors="np" already yields NumPy arrays)
ort_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"]
}
all_logits = session.run(None, ort_inputs)[0]
logits_for_item = all_logits[0]          # first (and only) item in the batch
prediction = np.argmax(logits_for_item)  # predicted class index

print(f"Text: '{text}'")
print(f"Prediction (0 or 1): {prediction}")
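Instead of taking a hard argmax, the raw logits can be converted to a probability and thresholded before deciding the turn is over. A minimal sketch, assuming class index 1 is the end-of-utterance label (the label order is not documented here, so verify against your own outputs):

```python
import numpy as np

def eou_probability(logits: np.ndarray) -> float:
    """Softmax over the two class logits; returns the probability of class 1."""
    z = logits - np.max(logits)  # shift for numerical stability
    probs = np.exp(z) / np.sum(np.exp(z))
    return float(probs[1])

def is_end_of_utterance(logits: np.ndarray, threshold: float = 0.5) -> bool:
    """Decide EOU by thresholding the class-1 probability."""
    return eou_probability(logits) >= threshold

print(is_end_of_utterance(np.array([1.2, 3.4])))  # class 1 dominates -> True
```

Raising the threshold trades interruptions for latency: a voice agent that must not cut speakers off can require, say, 0.8 before responding.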

Dataset: TURNS-2K

The model is trained on TURNS-2K, a dataset built for end-of-utterance detection. It covers:

  • Backchannels and self-corrections
  • Code-switching and language mixing
  • Multiple text formatting styles
  • Variations in STT output across different systems
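Whatever shape the STT output takes, it reaches the model in the same prompt format shown in Quick Start: `<|user|>` prepended, no `<|im_end|>` appended. A small wrapper (the helper name is mine, not part of the library):

```python
def format_stt_output(text: str) -> str:
    """Wrap raw STT text in the model's expected prompt format:
    prepend <|user|>; do NOT append <|im_end|>."""
    return f"<|user|> {text.strip()}"

# The same wrapper applies to backchannels, partial sentences, or mixed-language text:
for sample in ["mm-hmm", "so I was thinking", "can you turn off the lights"]:
    print(format_stt_output(sample))
```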

Motivation and current state

I built Turnsense because I couldn't find a good open-source turn detection model for edge devices. Most options were either proprietary or too heavy to run on something like a Raspberry Pi.

The model is trained on English speech patterns using 2,000 samples via LoRA fine-tuning on SmolLM2-135M. It handles common STT outputs well, but there are edge cases and complex conversational patterns it doesn't cover yet. ONNX was a deliberate choice for device compatibility, though a port to Apple MLX is on the table.
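The LoRA fine-tuning mentioned above can be sketched with Hugging Face `peft`. The actual hyperparameters (rank, alpha, target modules) are not published for this model, so every value below is purely illustrative:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# Illustrative values only -- the real training configuration is not published.
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="SEQ_CLS",                  # binary EOU classification
)

base = AutoModelForSequenceClassification.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M", num_labels=2
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
```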

License

Apache 2.0. See the LICENSE file for details.

Contributing

Contributions are welcome. Some areas that could use help: dataset expansion, model optimization, documentation, and bug reports. Feel free to open a PR or issue.

Citation

If you use this model in your research, please cite:

@software{latishab2025turnsense,
  author    = {Latisha Besariani HENDRA},
  title     = {Turnsense: A Lightweight End-of-Utterance Detection Model},
  month     = mar,
  year      = 2025,
  publisher = {GitHub},
  url       = {https://github.com/latishab/turnsense},
  note      = {https://huggingface.co/latishab/turnsense}
}