Uhm: on-device filler-word detection

A frame-precise classifier that finds "uh", "um", "hmm", and other fillers in audio at 20 ms resolution. It is trained on English and also produces high-confidence detections on Spanish, French, German, and Dutch without retraining, although accuracy on those languages has not been measured against ground truth.

Try it

Browser demo: huggingface.co/spaces/desert-ant-labs/uhm-demo. Drop in any audio file and get a click-to-seek filler timeline.
Drop-in SDKs for iOS, Android, and Web are coming soon. Email contact@desertant.ai for early access.

Variants

Two tiers, each free under the license for up to 100,000 monthly active users (MAU).

Tier	Backbone	Character	When to use
`uhm-base`	HuBERT-base, 8-bit Core ML, ~90 MB	Higher recall; broadest device support	Default. Catches more fillers at the cost of a few more false positives.
`uhm-pro`	DistilHuBERT, fp16 Core ML, ~45 MB	Smaller, faster (~2.2× on-device), more precise	When flagged fillers are cut automatically without review.

Both variants preserve 100% argmax agreement with the fp32 PyTorch reference on test inputs.
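
To make "argmax agreement" concrete, here is a minimal sketch of one way such a figure could be computed: the fraction of frames where two sets of per-frame scores pick the same class. The function names are hypothetical, and the source does not say how its own comparison was run, so treat the per-frame definition as an assumption.

```python
def argmax(frame):
    """Index of the largest score in one frame."""
    return max(range(len(frame)), key=frame.__getitem__)


def argmax_agreement(reference, candidate):
    """Fraction of frames where both outputs choose the same class.

    `reference` and `candidate` are lists of per-frame score lists,
    e.g. fp32 PyTorch output vs. a quantized export on the same audio.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of frames")
    if not reference:
        raise ValueError("no frames to compare")
    matches = sum(argmax(r) == argmax(c) for r, c in zip(reference, candidate))
    return matches / len(reference)
```

A result of 1.0 means the quantized scores may differ numerically but never change which class wins.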

Files

Tier	File	Format	Size	Use
`uhm-base`	`uhm-base.mlpackage.zip`	Core ML 8-bit	~88 MB	iOS / macOS on-device
`uhm-base`	`uhm-base-web-fp16.onnx`	ONNX fp16	~189 MB	Browser, server, Python (`onnxruntime`)
`uhm-base`	`uhm-base.onnx`	ONNX fp32	~378 MB	Quantization-free reference
`uhm-pro`	`uhm-pro.mlpackage.zip`	Core ML fp16	~45 MB	iOS / macOS on-device
`uhm-pro`	`uhm-pro-web-fp16.onnx`	ONNX fp16	~51 MB	Browser, server, Python (`onnxruntime`)
`uhm-pro`	`uhm-pro.onnx`	ONNX fp32	~98 MB	Quantization-free reference

Source weights for fine-tuning live in safetensors-checkpoint/ (HuBERT-base fp32, alongside config.json, preprocessor_config.json, labels.json).

Use

Python (ONNX)

from huggingface_hub import hf_hub_download
import onnxruntime as ort

path    = hf_hub_download("desert-ant-labs/uhm", "uhm-base-web-fp16.onnx")
session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])

Python (PyTorch, fine-tuning starting point)

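from transformers import AutoModelForAudioFrameClassification, AutoFeatureExtractor

# Loads the fp32 HuBERT-base checkpoint. If from_pretrained cannot find the files at
# the repo root, try passing subfolder="safetensors-checkpoint" to both calls.
extractor = AutoFeatureExtractor.from_pretrained("desert-ant-labs/uhm")
model     = AutoModelForAudioFrameClassification.from_pretrained("desert-ant-labs/uhm")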

Inputs and outputs

Input: 16 kHz mono audio, up to 30-second windows.
Output: per-frame softmax over 6 classes, one prediction every 20 ms.
Class indices: 0 = not_filler, 1 = uh, 2 = um, 3 = hmm, 4 = and, 5 = other.
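
As an illustration of how the per-frame output maps to timestamps, the sketch below merges consecutive frames with the same argmax class into segments, using the class indices listed above and 20 ms per frame. `frames_to_segments` is a hypothetical helper, not part of any released SDK, and it applies no smoothing or minimum-duration filter.

```python
LABELS = ["not_filler", "uh", "um", "hmm", "and", "other"]
FRAME_MS = 20  # one prediction every 20 ms


def frames_to_segments(probs):
    """Turn per-frame class probabilities into filler segments.

    `probs` is a list of frames, each a list of 6 probabilities.
    Returns dicts with `label`, `start_ms`, and `end_ms`.
    """
    segments = []
    current = None  # the filler segment currently being extended
    for i, frame in enumerate(probs):
        cls = max(range(len(frame)), key=frame.__getitem__)
        label = LABELS[cls]
        start_ms = i * FRAME_MS
        if current is not None and current["label"] == label:
            # Same filler class as the previous frame: extend the segment.
            current["end_ms"] = start_ms + FRAME_MS
            continue
        if current is not None:
            # Class changed: close the open segment.
            segments.append(current)
            current = None
        if cls != 0:
            # Start a new segment for any class other than not_filler.
            current = {"label": label, "start_ms": start_ms,
                       "end_ms": start_ms + FRAME_MS}
    if current is not None:
        segments.append(current)
    return segments
```

Working in integer milliseconds avoids floating-point drift when the timestamps are later compared or used as seek positions.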

Core ML: input shape (1, 480000) float32 (30 s at 16 kHz); output shape (1, 1499, 6), i.e. 1499 frames × 6 classes. Requires iOS 17 / macOS 14 or newer.
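
The shapes above follow from the HuBERT convolutional feature encoder, whose strides multiply to 320 samples (20 ms at 16 kHz). The sketch below reproduces the 1499-frame count and shows one possible way to cut longer audio into fixed 480000-sample windows. Zero-padding the last window is an assumption; the source does not say how short inputs should be padded, and `split_into_windows` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000
WINDOW_SAMPLES = 30 * SAMPLE_RATE  # 480000 samples per 30-second window

# Kernel sizes and strides of the HuBERT convolutional feature encoder
# (the same stack is used by HuBERT-base and DistilHuBERT).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Number of output frames the encoder yields for `num_samples` inputs."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return max(length, 0)


def split_into_windows(samples, window=WINDOW_SAMPLES):
    """Split audio into fixed-size windows, zero-padding the last one.

    Returns (chunk, valid) pairs, where `valid` is the number of real
    (unpadded) samples in the chunk.
    """
    windows = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        valid = len(chunk)
        chunk.extend([0.0] * (window - valid))  # assumption: pad with silence
        windows.append((chunk, valid))
    return windows
```

For a padded final window, frames past `num_frames(valid)` come from the padding and can be discarded before building a timeline.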

Limitations

Trained on English; non-English performance is by acoustic transfer and has not been measured against per-language ground truth.
Best on podcast / meeting / talking-head audio. Heavy background music, laughter, or multi-speaker overlap degrades quality.
Type labels (uh / um / hmm / and / other) are secondary. Trust filler vs. not_filler more than the specific subtype.
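
One way to act on the last point is to collapse the six-class output into a single filler probability per frame and ignore the subtype. This is a minimal sketch; the 0.5 threshold is illustrative and not taken from the source.

```python
NOT_FILLER = 0  # class index of not_filler


def filler_probability(frame):
    """Probability that a frame is any kind of filler.

    `frame` is one softmax output of 6 probabilities, so the filler
    mass is whatever is not assigned to not_filler.
    """
    return 1.0 - frame[NOT_FILLER]


def is_filler(frame, threshold=0.5):
    """Binary decision; `threshold` is an assumed, tunable value."""
    return filler_probability(frame) >= threshold
```

Note that this can flag a frame that argmax would label not_filler, for example when not_filler scores 0.4 and the remaining 0.6 is spread across several subtypes. Raising the threshold trades recall for precision.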

Built on

Base architectures and pretrained weights: facebook/hubert-base-ls960 (Apache 2.0) and its distilled variant ntu-spml/distilhubert (Apache 2.0).
Public fine-tuning audio: AMI Meeting Corpus (edinburghcstr/ami, IHM split). CC BY 4.0, Edinburgh CSTR.
Internal video content created by the Desert Ant Labs team.

License

Released under the Desert Ant Labs Source-Available License v1.0 (see LICENSE.md).

Free for commercial use up to 100,000 Monthly Active Users (MAU).
Above 100,000 MAU a commercial license is required. Contact licensing@desertant.ai.

Citation

@software{uhm_2026,
  title  = {Uhm: on-device filler-word detection},
  author = {Desert Ant Labs},
  year   = {2026},
  url    = {https://huggingface.co/desert-ant-labs/uhm},
}
