Uhm: on-device filler-word detection

A frame-precise classifier that finds "uh", "um", "hmm", and other fillers in audio, with timestamps at 20 ms resolution. Trained on English; it also produces high-confidence detections on Spanish, French, German, and Dutch without retraining, although accuracy in those languages has not been measured (see Limitations).

Variants

Two tiers, both free under the license up to 100,000 MAU each.

| Tier | Backbone | Character | When to use |
| --- | --- | --- | --- |
| uhm-base | HuBERT-base, 8-bit Core ML, 90 MB | Higher recall; broadest device support | Default. Catches more fillers, accepts a few more false fires. |
| uhm-pro | DistilHuBERT, fp16 Core ML, 45 MB | Smaller, faster (~2.2× on-device), more precise | When a flagged filler gets auto-cut without review. |

Both variants preserve 100% argmax agreement with the fp32 PyTorch reference on test inputs.
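To check this kind of agreement on your own inputs, you can compare the per-frame argmax of two exports of the same model. The sketch below assumes both outputs are arrays of shape (frames, 6). The helper name argmax_agreement is illustrative and not part of this repository, and the authors' own evaluation procedure is not described here.

```python
import numpy as np

def argmax_agreement(reference, candidate):
    """Fraction of frames where two model outputs pick the same class.

    Both arguments are (frames, classes) arrays of scores or probabilities,
    for example from the fp32 reference and a quantized export.
    """
    ref = np.argmax(np.asarray(reference), axis=-1)
    cand = np.argmax(np.asarray(candidate), axis=-1)
    if ref.shape != cand.shape:
        raise ValueError("outputs must cover the same number of frames")
    return float(np.mean(ref == cand))
```

A return value of 1.0 corresponds to 100% agreement. Small numeric differences in the probabilities do not matter as long as the winning class is unchanged.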

Files

| Tier | File | Format | Size | Use |
| --- | --- | --- | --- | --- |
| uhm-base | uhm-base.mlpackage.zip | Core ML 8-bit | ~88 MB | iOS / macOS on-device |
| uhm-base | uhm-base-web-fp16.onnx | ONNX fp16 | ~189 MB | Browser, server, Python (onnxruntime) |
| uhm-base | uhm-base.onnx | ONNX fp32 | ~378 MB | Quantization-free reference |
| uhm-pro | uhm-pro.mlpackage.zip | Core ML fp16 | ~45 MB | iOS / macOS on-device |
| uhm-pro | uhm-pro-web-fp16.onnx | ONNX fp16 | ~51 MB | Browser, server, Python (onnxruntime) |
| uhm-pro | uhm-pro.onnx | ONNX fp32 | ~98 MB | Quantization-free reference |

Source weights for fine-tuning live in safetensors-checkpoint/ (HuBERT-base fp32, alongside config.json, preprocessor_config.json, labels.json).

Use

Python (ONNX)

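from huggingface_hub import hf_hub_download
import onnxruntime as ort

# Download the fp16 ONNX export from the Hub (cached locally after the first call).
path    = hf_hub_download("desert-ant-labs/uhm", "uhm-base-web-fp16.onnx")
# Create a CPU inference session; swap in another execution provider if one is available.
session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])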
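Once a session exists, running a window could look like the sketch below. It assumes the graph takes a raw waveform of shape (1, num_samples) as its first input and returns per-frame probabilities of shape (1, num_frames, 6) as its first output; neither the tensor names nor any required normalization are documented here, so the input name and dtype are read from the session instead of being hard-coded. run_window is an illustrative helper, not part of this repository.

```python
import numpy as np

def run_window(session, audio):
    """Run one window of 16 kHz mono audio through an ONNX Runtime session.

    Returns a float32 array of shape (num_frames, 6), assuming the first
    graph output holds the per-frame class probabilities.
    """
    inp = session.get_inputs()[0]
    # fp16 exports may declare a float16 input; follow what the graph reports.
    dtype = np.float16 if inp.type == "tensor(float16)" else np.float32
    batch = np.asarray(audio, dtype=dtype)[np.newaxis, :]  # (1, num_samples)
    outputs = session.run(None, {inp.name: batch})
    return np.asarray(outputs[0], dtype=np.float32)[0]
```

Inspect session.get_inputs() and session.get_outputs() on the real file before relying on these shape assumptions.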

Python (PyTorch, fine-tuning starting point)

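from transformers import AutoModelForAudioFrameClassification, AutoFeatureExtractor

# The checkpoint files are described above as living in safetensors-checkpoint/.
# If loading from the repo root fails, try passing subfolder="safetensors-checkpoint".
extractor = AutoFeatureExtractor.from_pretrained("desert-ant-labs/uhm")
model     = AutoModelForAudioFrameClassification.from_pretrained("desert-ant-labs/uhm")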

Inputs and outputs

  • Input: 16 kHz mono audio, up to 30-second windows.
  • Output: per-frame softmax over 6 classes, one prediction every 20 ms.
  • Class indices: 0 = not_filler, 1 = uh, 2 = um, 3 = hmm, 4 = and, 5 = other.
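To turn the per-frame output into timestamped filler segments, one simple approach is to take the argmax per frame and merge runs of consecutive frames that share a filler class. The sketch below assumes frame i covers the interval starting at i × 20 ms, which is an approximation of the model's real receptive field. frames_to_segments is an illustrative name, not part of this repository.

```python
import numpy as np

LABELS = ["not_filler", "uh", "um", "hmm", "and", "other"]
FRAME_SEC = 0.020  # one prediction every 20 ms

def frames_to_segments(probs, frame_sec=FRAME_SEC):
    """Merge consecutive frames with the same filler class into segments.

    probs: (frames, 6) array of per-frame class probabilities.
    Returns a list of (start_sec, end_sec, label) tuples; not_filler runs
    are skipped.
    """
    classes = np.argmax(np.asarray(probs), axis=-1)
    segments = []
    start = 0
    for i in range(1, len(classes) + 1):
        # Close the current run at the end of the array or on a class change.
        if i == len(classes) or classes[i] != classes[start]:
            if classes[start] != 0:
                segments.append(
                    (round(start * frame_sec, 3),
                     round(i * frame_sec, 3),
                     LABELS[classes[start]])
                )
            start = i
    return segments
```

In practice you may also want to drop very short segments or bridge one-frame gaps before cutting audio; those thresholds depend on your material.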

Core ML input shape is (1, 480000) float32, i.e. one 30-second window at 16 kHz; output shape is (1, 1499, 6), i.e. 1,499 frames by 6 classes. Requires iOS 17 / macOS 14 or newer.
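The fixed window implies that shorter clips have to be brought to 480,000 samples before they reach the Core ML model. The sketch below zero-pads (or truncates) a clip and estimates how many output frames correspond to real audio. Zero-padding is an assumption, since the padding convention is not stated here. The frame count assumes the standard HuBERT convolutional front end (kernels 10,3,3,3,3,2,2 with strides 5,2,2,2,2,2,2), which does reproduce the documented 1,499 frames for a full window. Both helper names are illustrative.

```python
import numpy as np

WINDOW_SAMPLES = 480_000  # 30 s at 16 kHz

def fit_to_window(audio, window=WINDOW_SAMPLES):
    """Zero-pad or truncate mono audio to shape (1, window), float32.

    Returns the padded batch and the number of real (unpadded) samples.
    """
    audio = np.asarray(audio, dtype=np.float32)
    n_valid = min(len(audio), window)
    padded = np.zeros((1, window), dtype=np.float32)
    padded[0, :n_valid] = audio[:n_valid]
    return padded, n_valid

def num_frames(num_samples):
    """Output frames for a given sample count, assuming the standard
    HuBERT conv feature extractor (no padding in any conv layer)."""
    kernels = (10, 3, 3, 3, 3, 2, 2)
    strides = (5, 2, 2, 2, 2, 2, 2)
    n = num_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n
```

After inference, keeping only the first num_frames(n_valid) rows of the output discards predictions that fall on padding.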

Limitations

  • Trained on English; non-English performance is by acoustic transfer and has not been measured against per-language ground truth.
  • Best on podcast / meeting / talking-head audio. Heavy background music, laughter, or multi-speaker overlap degrades quality.
  • Type labels (uh / um / hmm / and / other) are secondary. Trust filler vs. not_filler more than the specific subtype.
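Since the filler vs. not_filler decision is the more reliable one, a downstream tool can collapse the six classes into a single filler probability and threshold that. In the sketch below, the default threshold of 0.5 is an arbitrary starting point, not a value recommended by the authors, and filler_mask is an illustrative name.

```python
import numpy as np

def filler_mask(probs, threshold=0.5):
    """Per-frame boolean mask: True where the frame is likely a filler.

    probs: (frames, 6) softmax output with class 0 = not_filler.
    The filler probability is the total mass on classes 1..5, i.e.
    1 - P(not_filler), so the subtype split does not affect the decision.
    """
    probs = np.asarray(probs, dtype=np.float32)
    p_filler = 1.0 - probs[:, 0]
    return p_filler >= threshold
```

This can flag frames that a plain argmax would miss: a frame with probabilities 0.4 / 0.3 / 0.3 for not_filler / uh / um has not_filler as its argmax but a filler probability of 0.6. Raising the threshold trades recall for precision.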

Built on

License

Released under the Desert Ant Labs Source-Available License v1.0 (see LICENSE.md).

  • Free for commercial use up to 100,000 Monthly Active Users (MAU).
  • Above 100,000 MAU a commercial license is required. Contact licensing@desertant.ai.

Citation

@software{uhm_2026,
  title  = {Uhm: on-device filler-word detection},
  author = {Desert Ant Labs},
  year   = {2026},
  url    = {https://huggingface.co/desert-ant-labs/uhm},
}

© 2026 Desert Ant Labs · https://desertant.ai
