SmartTurn V3.1 FP32

SmartTurn V3.1 is a semantic turn-completion (endpoint detection) model. It takes the latest 8 seconds of 16 kHz audio, extracts Whisper log-mel features, and returns the probability that the current audio segment is complete.

This repository contains the FP32 Hugging Face save_pretrained export recovered from smart-turn-v3.1-gpu.onnx.

Files

config.json: SmartTurnV3 config with model_type = "smart_turn_v3".
model.safetensors: FP32 model weights.
preprocessor_config.json: Whisper feature extractor config for 8-second audio.
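
The 8-second window fixes the size of the model input. As a rough arithmetic sketch, the snippet below works out the number of samples and feature frames per window. The hop length of 160 samples is an assumption based on Whisper's usual setting; the value that actually applies is whatever preprocessor_config.json specifies.

```python
SAMPLE_RATE = 16_000   # Hz, as stated in this README
WINDOW_SECONDS = 8     # seconds, as stated in this README
HOP_LENGTH = 160       # ASSUMPTION: Whisper's usual hop length, not confirmed here

# Number of raw audio samples in one window.
window_samples = SAMPLE_RATE * WINDOW_SECONDS

# Number of log-mel frames the feature extractor would produce for that window.
num_frames = window_samples // HOP_LENGTH

print(f"samples={window_samples} frames={num_frames}")
```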

Important

This model repo intentionally does not ship remote code. To load it, install or otherwise make available the smart_turn Python package that defines and registers SmartTurnV3Config and SmartTurnV3Model.
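
Because loading depends on a locally available package, it can help to check that smart_turn is importable before running anything else. The following is a minimal sketch using only the standard library; the helper name smart_turn_available is hypothetical and not part of the smart_turn package.

```python
import importlib.util


def smart_turn_available() -> bool:
    """Return True if a top-level package named `smart_turn` can be found."""
    # find_spec looks the package up on sys.path without importing it.
    return importlib.util.find_spec("smart_turn") is not None


if not smart_turn_available():
    print("smart_turn not found: install it or add its src directory to PYTHONPATH")
```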

Installation

pip install torch==2.8.* transformers librosa numpy

# Install your smart_turn package, or run inside the smart-turn project with:
export PYTHONPATH=/path/to/smart-turn-onnx/src:${PYTHONPATH}

End-to-End Inference Demo

Save the script below as demo_smart_turn.py, then run it:

python demo_smart_turn.py /path/to/audio.wav

from __future__ import annotations

import sys
from pathlib import Path

import librosa
import numpy as np
import torch
from transformers import WhisperFeatureExtractor

from smart_turn.models import load_model


REPO_ID = "MigoXV/smart-turn-v3.1"
SAMPLE_RATE = 16000
WINDOW_SECONDS = 8
THRESHOLD = 0.5


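def load_audio_window(path: str | Path) -> np.ndarray:
    """Load audio as mono float32 at 16 kHz and fit it to an 8-second window."""
    # Load at the file's native sample rate, downmixed to mono.
    audio, sr = librosa.load(path, sr=None, mono=True)

    # Resample only when the file is not already at 16 kHz.
    if sr != SAMPLE_RATE:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=SAMPLE_RATE)

    if audio.dtype != np.float32:
        audio = audio.astype(np.float32)

    # Scale down only if the signal exceeds the [-1, 1] range.
    max_abs = np.max(np.abs(audio)) if audio.size else 0.0
    if max_abs > 1.0:
        audio = audio / max_abs

    # Keep the latest 8 seconds of long audio.
    max_samples = WINDOW_SECONDS * SAMPLE_RATE
    if audio.size >= max_samples:
        return audio[-max_samples:]

    # Left-pad short audio with zeros so the speech stays at the end of the window.
    padding = max_samples - audio.size
    return np.pad(audio, (padding, 0), mode="constant", constant_values=0)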


def predict(audio_path: str | Path, device: str = "cpu") -> tuple[int, float]:
    torch_device = torch.device(device)
    dtype = torch.float32

    model = load_model(REPO_ID).to(device=torch_device, dtype=dtype).eval()
    feature_extractor = WhisperFeatureExtractor.from_pretrained(REPO_ID)

    audio = load_audio_window(audio_path)
    inputs = feature_extractor(
        audio,
        sampling_rate=SAMPLE_RATE,
        return_tensors="np",
        padding="max_length",
        max_length=WINDOW_SECONDS * SAMPLE_RATE,
        truncation=True,
        do_normalize=True,
    )

    input_features = torch.from_numpy(
        inputs.input_features.astype(np.float32)
    ).to(device=torch_device, dtype=dtype)

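    # Per this README, the model output is already a sigmoid probability,
    # so the value stored under "logits" is used directly without another sigmoid.
    with torch.no_grad():
        probability = model(input_features=input_features)["logits"].view(-1).item()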

    prediction = 1 if probability > THRESHOLD else 0
    return prediction, probability


if __name__ == "__main__":
    if len(sys.argv) != 2:
        raise SystemExit("Usage: python demo_smart_turn.py /path/to/audio.wav")

    pred, prob = predict(sys.argv[1], device="cuda" if torch.cuda.is_available() else "cpu")
    print(f"prediction={pred} probability={prob:.8f}")

Input And Output

Input audio is converted to mono float32.
Audio is resampled to 16 kHz.
Only the latest 8 seconds are used.
Shorter audio is left-padded with zeros.
The model returns a sigmoid probability in [0, 1].
prediction = 1 (segment considered complete) when probability > 0.5; otherwise prediction = 0.
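
The windowing and thresholding rules above can be checked on synthetic arrays without loading the model. This is a small sketch that mirrors the behavior described in this section; fit_window and decide are illustrative helper names, not functions from the smart_turn package.

```python
import numpy as np

SAMPLE_RATE = 16_000
WINDOW_SECONDS = 8
THRESHOLD = 0.5


def fit_window(audio: np.ndarray) -> np.ndarray:
    """Keep the latest 8 s of audio, or left-pad with zeros up to 8 s."""
    max_samples = SAMPLE_RATE * WINDOW_SECONDS
    if audio.size >= max_samples:
        return audio[-max_samples:]
    return np.pad(audio, (max_samples - audio.size, 0), mode="constant")


def decide(probability: float) -> int:
    """Return 1 (complete) only when the probability is strictly above the threshold."""
    return 1 if probability > THRESHOLD else 0


short_clip = np.ones(SAMPLE_RATE * 2, dtype=np.float32)    # 2 s of audio
long_clip = np.arange(SAMPLE_RATE * 10, dtype=np.float32)  # 10 s of audio

padded = fit_window(short_clip)   # 6 s of zeros followed by the 2 s clip
trimmed = fit_window(long_clip)   # only the final 8 s survive

print(padded.shape, trimmed.shape, decide(0.5), decide(0.51))
```

Note that a probability of exactly 0.5 maps to prediction = 0, because the comparison is strict.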

Notes

This is an FP32 safetensors export.
The model architecture is WhisperEncoder + attention pooling + binary head.
The repository does not include training code or remote modeling code.
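
To illustrate what "attention pooling + binary head" means in general terms, here is a simplified NumPy sketch. It is not the actual SmartTurnV3Model code: the single-vector attention scorer, the single linear head, and all shapes are assumptions chosen for brevity, and the real layers may be larger or structured differently.

```python
import numpy as np


def attention_pool_and_classify(
    hidden: np.ndarray,   # (frames, dim) encoder outputs
    w_attn: np.ndarray,   # (dim,) attention scoring vector (illustrative)
    w_head: np.ndarray,   # (dim,) binary head weights (illustrative)
    b_head: float,        # binary head bias (illustrative)
) -> float:
    """Pool frames with softmax attention, then map to a probability in [0, 1]."""
    scores = hidden @ w_attn                 # one score per frame
    scores = scores - scores.max()           # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over frames
    pooled = weights @ hidden                # attention-weighted average, shape (dim,)
    logit = pooled @ w_head + b_head         # binary head
    return float(1.0 / (1.0 + np.exp(-logit)))        # sigmoid


rng = np.random.default_rng(0)
hidden = rng.standard_normal((10, 4))        # toy sizes, not the real model's
probability = attention_pool_and_classify(
    hidden, rng.standard_normal(4), rng.standard_normal(4), 0.0
)
print(f"probability={probability:.4f}")
```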

Model size: 8M params (safetensors, tensor type F32).