hviske-v5.3 — MLX INT8

Apple Silicon (MLX) INT8-quantized version of syvai/hviske-v5.3, the state-of-the-art Danish ASR model. Weights are quantized to 8-bit (affine, group_size=64) using mlx-speech.

Performance on Apple Silicon

Measured across the full CoRal v3 test set (17,560 clips, 25 hours of audio) on Apple M-series:

Split	Avg clip	RTFx	Latency p50	Latency p95
read_aloud	6.8 s	36×	177 ms	338 ms
conversation	3.4 s	25×	115 ms	281 ms

RTFx is the inverse real-time factor: seconds of audio transcribed per second of processing time, so higher is faster. The gap between splits comes from fixed per-clip overhead, which amortizes better over longer clips.
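As a rough illustration of how these metrics relate, the sketch below computes RTFx and latency percentiles from per-clip timings. The benchmark script is not shown here, so the nearest-rank percentile method, the function names, and every number in the example are assumptions chosen for illustration, not measurements.

```python
import math

def rtfx(audio_seconds, processing_seconds):
    """Seconds of audio processed per second of compute (higher is faster)."""
    return sum(audio_seconds) / sum(processing_seconds)

def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def modeled_rtfx(clip_seconds, overhead_s, cost_per_audio_second):
    """RTFx of one clip under a simple 'fixed overhead + linear cost' model."""
    processing = overhead_s + cost_per_audio_second * clip_seconds
    return clip_seconds / processing

# Made-up clip durations and latencies, in seconds.
durations = [3.0, 5.0, 7.0, 9.0]
latencies = [0.12, 0.15, 0.19, 0.22]

print(f"RTFx: {rtfx(durations, latencies):.1f}x")
print(f"p50:  {percentile(latencies, 50) * 1000:.0f} ms")
print(f"p95:  {percentile(latencies, 95) * 1000:.0f} ms")

# With a fixed overhead (hypothetical constants), longer clips score a higher RTFx.
for seconds in (3.4, 6.8):
    print(f"{seconds} s clip -> {modeled_rtfx(seconds, 0.08, 0.016):.1f}x")
```

Under this model the per-clip overhead is paid once regardless of length, which is why the longer read_aloud clips would show a higher RTFx than the shorter conversation clips.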

Accuracy vs. base model

Evaluated on the full CoRal v3 test sets (17,560 samples) with greedy decoding and strict normalization (lowercase + punctuation strip + Danish digit-to-word), matching the methodology on the base model card.
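To make the metric concrete, here is a minimal sketch of normalization followed by corpus-level WER. It is not the evaluation code used for this card: the single-digit map is a toy stand-in for a real Danish digit-to-word converter, `string.punctuation` only covers ASCII punctuation, and pooling errors over all reference words is one common convention that is assumed here.

```python
import string

# Toy map for single digits only (assumption); a real normalizer must also
# handle multi-digit numbers and the en/et distinction.
DANISH_DIGITS = {"0": "nul", "1": "en", "2": "to", "3": "tre", "4": "fire",
                 "5": "fem", "6": "seks", "7": "syv", "8": "otte", "9": "ni"}

def normalize(text):
    """Lowercase, strip ASCII punctuation, map lone digits to words, tokenize."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [DANISH_DIGITS.get(token, token) for token in text.split()]

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = normalize(ref), normalize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        words += len(ref_tokens)
    return errors / words

print(wer(["Jeg har 2 hunde."], ["jeg har to hunde"]))
print(wer(["Jeg hedder Veronica"], ["jeg hedder veronika"]))
```

CER follows the same pattern with characters in place of word tokens.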

Split	N	BF16 WER	INT8 WER	Δ WER	BF16 CER	INT8 CER	Δ CER
read_aloud	9,122	9.37%	10.19%	+0.82 pp	3.80%	4.07%	+0.27 pp
conversation	8,438	19.63%	25.08%	+5.45 pp	11.56%	16.32%	+4.76 pp
weighted avg	17,560	14.30%	17.35%	+3.05 pp	7.53%	9.96%	+2.43 pp

Read-aloud quality is largely preserved (+0.82 pp WER). Conversation degrades more noticeably (+5.45 pp WER); harder, more varied speech tends to be more sensitive to quantization error. If conversation accuracy is critical, use the full BF16 model.
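The weighted-average row can be checked from the per-split rows. The sketch below assumes the average is weighted by clip count N rather than by word count; that assumption reproduces the table's averages to two decimals. The dictionary layout and function name are illustrative only.

```python
# Per-split figures copied from the accuracy table (percentages).
splits = {
    "read_aloud":   {"n": 9122, "bf16_wer": 9.37,  "int8_wer": 10.19,
                     "bf16_cer": 3.80,  "int8_cer": 4.07},
    "conversation": {"n": 8438, "bf16_wer": 19.63, "int8_wer": 25.08,
                     "bf16_cer": 11.56, "int8_cer": 16.32},
}

def weighted_average(rows, key):
    """Average of rows[...][key], weighted by each split's clip count."""
    total = sum(row["n"] for row in rows.values())
    return sum(row["n"] * row[key] for row in rows.values()) / total

for key in ("bf16_wer", "int8_wer", "bf16_cer", "int8_cer"):
    print(f"{key}: {weighted_average(splits, key):.2f}%")
```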

Requirements

uv add mlx-speech soundfile scipy

Requires Python ≥ 3.10 and an Apple Silicon Mac (M1 or later).
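If you want to fail early on unsupported machines, a small pre-flight check along these lines may help. This is a sketch, not part of mlx-speech: the function name is hypothetical, and it assumes a native arm64 Python build (a Python running under Rosetta reports `x86_64` even on Apple Silicon).

```python
import platform
import sys

def is_supported(system, machine, version_info):
    """True for macOS on arm64 with Python 3.10 or newer."""
    return (system == "Darwin"
            and machine == "arm64"
            and tuple(version_info[:2]) >= (3, 10))

ok = is_supported(platform.system(), platform.machine(), sys.version_info)
print("Environment looks supported" if ok
      else "Needs Apple Silicon and Python >= 3.10")
```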

Usage

Quick start

from math import gcd

import soundfile as sf
from scipy.signal import resample_poly
from mlx_speech.generation.cohere_asr import CohereAsrModel

asr = CohereAsrModel.from_pretrained("rasgaard/hviske-v5.3-mlx-int8")

# Load audio and resample to 16 kHz
audio, sr = sf.read("your_audio.wav", dtype="float32", always_2d=False)
if audio.ndim > 1:
    audio = audio.mean(axis=1)
if sr != 16000:
    g = gcd(16000, sr)
    audio = resample_poly(audio, 16000 // g, sr // g).astype("float32")

result = asr.transcribe(audio, sample_rate=16000)
print(result.text)
# → "Jeg er subjekt A og jeg hedder Veronica"
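The resampling step in the quick start divides both sample rates by their greatest common divisor so that `resample_poly` receives the smallest integer up/down factors. The standalone sketch below (the helper name is hypothetical) shows the factors this yields for a few common source rates.

```python
from math import gcd

TARGET_SR = 16000  # sample rate the quick start resamples to

def resample_factors(source_sr, target_sr=TARGET_SR):
    """Smallest integer (up, down) pair with target_sr / source_sr == up / down."""
    g = gcd(target_sr, source_sr)
    return target_sr // g, source_sr // g

for sr in (8000, 44100, 48000):
    up, down = resample_factors(sr)
    print(f"{sr} Hz -> up={up}, down={down}")
# 8000 Hz -> up=2, down=1
# 44100 Hz -> up=160, down=441
# 48000 Hz -> up=1, down=3
```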

Load from a local directory

from mlx_speech.generation.cohere_asr import CohereAsrModel

asr = CohereAsrModel.from_dir("/path/to/hviske-v5.3-mlx-int8")

Transcribe multiple files

from math import gcd

import soundfile as sf
from scipy.signal import resample_poly
from mlx_speech.generation.cohere_asr import CohereAsrModel

def load_16k(path):
    audio, sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    if sr != 16000:
        g = gcd(16000, sr)
        audio = resample_poly(audio, 16000 // g, sr // g).astype("float32")
    return audio

asr = CohereAsrModel.from_pretrained("rasgaard/hviske-v5.3-mlx-int8")

for path in ["clip_a.wav", "clip_b.wav", "clip_c.wav"]:
    result = asr.transcribe(load_16k(path), sample_rate=16000)
    print(f"{path}: {result.text}")

Quantization details

Converted from syvai/hviske-v5.3 using mlx-speech/scripts/convert/cohere_asr.py:

python scripts/convert/cohere_asr.py \
  --input-dir models/hviske-v5.3 \
  --output-dir models/hviske-v5.3-mlx-int8 \
  --bits 8 --group-size 64 --mode affine

Linear layers whose output dimension is divisible by 64 are quantized to 8-bit affine; embeddings, norms, and conv layers remain in BF16.
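For intuition about what 8-bit affine quantization with group_size=64 does, here is a conceptual NumPy sketch. It is not the mlx-speech or MLX implementation: the function names are hypothetical, it stores one `uint8` per weight instead of packing values into wider words, and it assumes the total number of weights is divisible by the group size.

```python
import numpy as np

def quantize_affine(weights, group_size=64, bits=8):
    """Quantize each run of `group_size` weights with its own scale and bias."""
    levels = 2 ** bits - 1
    groups = weights.reshape(-1, group_size).astype(np.float32)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid 0/0 for constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, w_min

def dequantize_affine(q, scale, bias, shape):
    """Reconstruct approximate weights: w_hat = q * scale + bias."""
    return (q.astype(np.float32) * scale + bias).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 256)).astype(np.float32)  # stand-in weight matrix

q, scale, bias = quantize_affine(w)
w_hat = dequantize_affine(q, scale, bias, w.shape)
print("groups:", q.shape[0], "max abs error:", float(np.abs(w - w_hat).max()))
```

Each group keeps its own scale and bias, so the rounding error per weight is bounded by half that group's scale. Smaller groups track local weight ranges more closely at the cost of storing more scales and biases.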

License

CC BY-NC 4.0: non-commercial use only. See the base model card for commercial licensing.

Safetensors

Model size: 0.8B params. Tensor types: BF16, U32.

Model tree for rasgaard/hviske-v5.3-mlx-int8

Base model: syvai/hviske-v5.1
Finetuned: syvai/hviske-v5.3
Quantized: this model