hviske-v5.3 — MLX INT8

An INT8-quantized MLX build of syvai/hviske-v5.3, a state-of-the-art Danish ASR model, for Apple Silicon. Weights are quantized to 8-bit (affine, group_size=64) using mlx-speech.

Performance on Apple Silicon

Measured across the full CoRal v3 test set (17,560 clips, 25 hours of audio) on Apple M-series:

Split          Avg clip length   RTFx   Latency p50   Latency p95
read_aloud     6.8 s             36×    177 ms        338 ms
conversation   3.4 s             25×    115 ms        281 ms

RTFx = inverse real-time factor, i.e. seconds of audio transcribed per second of processing time (higher = faster). The gap between the splits comes from fixed per-clip overhead, which amortizes better over longer clips.
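As a rough illustration of the amortization effect, the sketch below computes RTFx under a toy latency model: a fixed per-clip cost plus a cost proportional to clip duration. The function names and the two constants are hypothetical and chosen only for illustration; they are not measurements of this model.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of compute (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def modeled_latency(audio_seconds: float, overhead_s: float, cost_per_audio_s: float) -> float:
    """Toy latency model: fixed per-clip overhead plus a duration-proportional cost."""
    return overhead_s + cost_per_audio_s * audio_seconds


# Hypothetical constants for illustration only, not measured values.
OVERHEAD_S = 0.05
COST_PER_AUDIO_S = 0.02

for clip_s in (3.4, 6.8):  # average clip lengths of the two splits
    latency = modeled_latency(clip_s, OVERHEAD_S, COST_PER_AUDIO_S)
    print(f"{clip_s:.1f} s clip: {latency * 1000:.0f} ms, RTFx {rtfx(clip_s, latency):.1f}")
```

With any positive fixed overhead, the longer clip ends up with the higher RTFx, which matches the direction of the difference between the two splits.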

Accuracy vs. base model

Evaluated on the full CoRal v3 test sets (17,560 samples) with greedy decoding and strict normalization (lowercase + punctuation strip + Danish digit-to-word), matching the methodology on the base model card.
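For readers who want to reproduce the metric, here is a minimal sketch of strict normalization followed by WER and CER. It is not the evaluation script used for these numbers: the function names are hypothetical, only ASCII punctuation is stripped, and the Danish digit-to-word step is left out because it needs a language-specific number expander.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    # A Danish digit-to-word expansion would also be applied here; omitted in this sketch.
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            substitution = prev[j - 1] + (r != h)  # costs 0 when tokens match
            curr.append(min(deletion, insertion, substitution))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: list, hyp_tokens: list) -> float:
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate on normalized text."""
    return error_rate(normalize(reference).split(), normalize(hypothesis).split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate on normalized text (spaces count as characters)."""
    return error_rate(list(normalize(reference)), list(normalize(hypothesis)))


print(wer("Jeg hedder Veronica.", "jeg hedder veronika"))
print(cer("Jeg hedder Veronica.", "jeg hedder veronika"))
```

Corpus-level figures are usually computed by summing edit distances and reference lengths over all clips before dividing, rather than by averaging per-clip rates.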

Split          N        BF16 WER   INT8 WER   Δ WER      BF16 CER   INT8 CER   Δ CER
read_aloud     9,122    9.37%      10.19%     +0.82 pp   3.80%      4.07%      +0.27 pp
conversation   8,438    19.63%     25.08%     +5.45 pp   11.56%     16.32%     +4.76 pp
weighted avg   17,560   14.30%     17.35%     +3.05 pp   7.53%      9.96%      +2.43 pp

Read-aloud quality is largely preserved (+0.82 pp WER). Conversation degrades more noticeably (+5.45 pp WER), which is consistent with quantization error having a larger effect on harder, more varied speech. If conversation accuracy is critical, use the full BF16 model.
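The weighted-average row can be checked from the per-split rows. The sketch below assumes the weights are the clip counts N, which is not stated explicitly; under that assumption the four averages agree with the reported values after rounding to two decimals.

```python
# Per-split values copied from the accuracy table: clip count and error rates in percent.
splits = {
    "read_aloud":   {"n": 9122, "bf16_wer": 9.37,  "int8_wer": 10.19, "bf16_cer": 3.80,  "int8_cer": 4.07},
    "conversation": {"n": 8438, "bf16_wer": 19.63, "int8_wer": 25.08, "bf16_cer": 11.56, "int8_cer": 16.32},
}


def weighted_mean(metric: str) -> float:
    """Average a metric over splits, weighting each split by its clip count."""
    total = sum(row["n"] for row in splits.values())
    return sum(row["n"] * row[metric] for row in splits.values()) / total


for metric in ("bf16_wer", "int8_wer", "bf16_cer", "int8_cer"):
    print(f"{metric}: {weighted_mean(metric):.2f}%")
```

Deltas taken from the unrounded averages can differ from the table in the last digit, since the table's deltas appear to be differences of already-rounded percentages.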

Requirements

uv add mlx-speech soundfile scipy

Requires Python ≥ 3.10 and an Apple Silicon Mac (M1 or later).
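A quick pre-flight check for these requirements might look like the sketch below. The helper name is hypothetical. It assumes a native arm64 Python build; an x86_64 interpreter running under Rosetta reports a different machine string and would be flagged as unsupported.

```python
import platform
import sys


def is_supported(system: str, machine: str, version: tuple) -> bool:
    """True for macOS on arm64 (Apple Silicon) with Python 3.10 or newer."""
    return system == "Darwin" and machine == "arm64" and tuple(version[:2]) >= (3, 10)


ok = is_supported(platform.system(), platform.machine(), sys.version_info)
print("supported" if ok else "unsupported: needs Apple Silicon macOS and Python >= 3.10")
```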

Usage

Quick start

from math import gcd

import soundfile as sf
from scipy.signal import resample_poly
from mlx_speech.generation.cohere_asr import CohereAsrModel

asr = CohereAsrModel.from_pretrained("rasgaard/hviske-v5.3-mlx-int8")

# Load audio and resample to 16 kHz
audio, sr = sf.read("your_audio.wav", dtype="float32", always_2d=False)
if audio.ndim > 1:
    audio = audio.mean(axis=1)
if sr != 16000:
    g = gcd(16000, sr)
    audio = resample_poly(audio, 16000 // g, sr // g).astype("float32")

result = asr.transcribe(audio, sample_rate=16000)
print(result.text)
# → "Jeg er subjekt A og jeg hedder Veronica"

Load from a local directory

from mlx_speech.generation.cohere_asr import CohereAsrModel

asr = CohereAsrModel.from_dir("/path/to/hviske-v5.3-mlx-int8")

Transcribe multiple files

from math import gcd

import soundfile as sf
from scipy.signal import resample_poly
from mlx_speech.generation.cohere_asr import CohereAsrModel

def load_16k(path):
    audio, sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    if sr != 16000:
        g = gcd(16000, sr)
        audio = resample_poly(audio, 16000 // g, sr // g).astype("float32")
    return audio

asr = CohereAsrModel.from_pretrained("rasgaard/hviske-v5.3-mlx-int8")

for path in ["clip_a.wav", "clip_b.wav", "clip_c.wav"]:
    result = asr.transcribe(load_16k(path), sample_rate=16000)
    print(f"{path}: {result.text}")

Quantization details

Converted from syvai/hviske-v5.3 using mlx-speech/scripts/convert/cohere_asr.py:

python scripts/convert/cohere_asr.py \
  --input-dir models/hviske-v5.3 \
  --output-dir models/hviske-v5.3-mlx-int8 \
  --bits 8 --group-size 64 --mode affine

Linear layers whose output dimension is divisible by 64 are quantized to 8-bit affine; embeddings, norms, and conv layers remain in BF16.
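To make "8-bit affine, group size 64" concrete, the following pure-Python sketch quantizes one weight row in groups of 64 values, storing an integer code per weight plus one scale and one bias per group. It is a conceptual illustration only, not the MLX or mlx-speech implementation: the function names are hypothetical, and the real kernels may round and pack the codes differently.

```python
import random

GROUP_SIZE = 64
BITS = 8
LEVELS = 2**BITS - 1  # 255 steps between a group's minimum and maximum


def quantize_group(weights):
    """Map one group of floats to integer codes in [0, 255] plus (scale, bias)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from codes: w ~ code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize_row(row):
    """Split a row into consecutive groups of GROUP_SIZE and quantize each one."""
    if len(row) % GROUP_SIZE:
        raise ValueError("row length must be a multiple of GROUP_SIZE")
    return [quantize_group(row[i:i + GROUP_SIZE]) for i in range(0, len(row), GROUP_SIZE)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(128)]  # synthetic weights
groups = quantize_row(row)
restored = [w for g in groups for w in dequantize_group(*g)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, max abs error {max_err:.2e}")
```

Because each group has its own scale and bias, the reconstruction error of any weight is bounded by half of that group's scale, so groups with a narrow value range are reproduced more precisely than groups containing outliers.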

License

CC BY-NC 4.0 — non-commercial use only. See base model card for commercial licensing.
