Hviske v5 — Danish ASR

Fine-tuned from CohereLabs/cohere-transcribe-03-2026 (2B parameters) for Danish speech recognition.

Trained on ~3.5M samples (~16,000 hours) of Danish speech across 7 datasets.

Training progression

| Version | Training data | Avg WER |
|---|---|---|
| Base model | — | >100% (no Danish pretraining) |
| hviske-v1 | CoRal-v3 | ~20% |
| hviske-v4 | + nota + ftspeech | 6.7% (parliamentary) |
| hviske-v5 | + VoxPopuli + nst-da + Common Voice | 14.1% (multi-domain avg) |

Note that the v4 and v5 numbers are not directly comparable: v4 was scored on parliamentary speech only, while v5's average spans all seven evaluation domains below.

Evaluation results

Evaluated on 200 samples per dataset (word and character error rates):

| Dataset | Domain | WER | CER |
|---|---|---|---|
| VoxPopuli | European Parliament | 11.8% | 6.3% |
| nota | Broadcast media | 5.7% | 1.8% |
| ftspeech | Danish Parliament | 6.4% | 3.1% |
| CoRal-v3 read_aloud | Read-aloud speech | 19.2% | 7.0% |
| CoRal-v3 conversation | Conversational | 17.1% | 9.3% |
| nst-da | General Danish | 14.0% | 8.4% |
| Common Voice 17 | Crowd-sourced | 24.3% | 7.4% |
| Average | | 14.1% | 6.2% |
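
WER and CER are word- and character-level edit-distance rates: substitutions, deletions, and insertions divided by the reference length. The card doesn't name the scoring tool, but the widely used jiwer library computes the same metrics; a minimal sketch:

from jiwer import cer, wer

refs = ["det er en god dag"]           # reference transcript
hyps = ["det er en go dag"]            # model output with one wrong word
print(f"WER: {wer(refs, hyps):.1%}")   # 1 error / 5 words = 20.0%
print(f"CER: {cer(refs, hyps):.1%}")   # 1 deleted character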

Training details

  • Architecture: full fine-tuning (encoder + decoder), with encoder BatchNorm layers kept in eval mode
  • Optimizer: AdamW 8-bit, LR 2.5e-5, cosine schedule with 500-step warmup
  • Effective batch size: 32 (per-device batch 4 × 8 gradient-accumulation steps)
  • Data mixing: 20k-sample shuffle buffer across all datasets for balanced domain exposure (see the sketch after this list)
  • Audio: 16 kHz mono, max 20 s per sample
  • Hardware: single NVIDIA RTX 3090 (24 GB)
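
As a rough illustration of how this recipe maps onto the Hugging Face stack, the sketch below wires up streaming data mixing and the listed hyperparameters. The dataset IDs are placeholders and bf16 is inferred from the published tensor type; treat it as a sketch, not the actual training script:

from datasets import load_dataset, interleave_datasets
from transformers import Seq2SeqTrainingArguments

# Placeholder dataset IDs; the card does not say how each corpus was loaded.
streams = [
    load_dataset("your-org/ftspeech-da", split="train", streaming=True),
    load_dataset("your-org/coral-v3-read-aloud", split="train", streaming=True),
    # ... one stream per training dataset ...
]

# Interleave the domain streams, then smooth their order with the
# 20k-sample shuffle buffer described above.
mixed = interleave_datasets(streams, stopping_strategy="all_exhausted")
mixed = mixed.shuffle(seed=42, buffer_size=20_000)

# Hyperparameters mirroring the bullet list; everything else stays at defaults.
args = Seq2SeqTrainingArguments(
    output_dir="hviske-v5-run",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch size 32
    learning_rate=2.5e-5,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    optim="adamw_bnb_8bit",          # 8-bit AdamW via bitsandbytes
    bf16=True,                       # assumed from the published BF16 weights
    max_steps=100_000,               # placeholder; streaming datasets need max_steps
)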

Training data

| Dataset | Samples | Hours | Description |
|---|---|---|---|
| VoxPopuli Danish | 1,775,578 | ~13,600 | European Parliament recordings |
| ftspeech | 995,677 | ~1,400 | Danish Parliament (Folketinget) |
| CoRal-v3 read_aloud | 299,255 | ~400 | Read-aloud Danish speech |
| nst-da | 182,605 | ~250 | NST Danish speech corpus |
| CoRal-v3 conversation | 147,249 | ~200 | Conversational Danish |
| nota | 98,600 | ~270 | Danish broadcast media |
| Common Voice 17 | 3,484 | ~5 | Crowd-sourced Danish |
| Total | ~3.5M | ~16,000 | |

A unified version of all training data is available at syvai/danish-asr-unified.
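
The unified set can be streamed without downloading all ~16,000 hours up front; the split name below is an assumption:

from datasets import load_dataset

# Stream rather than download; the "train" split name is assumed.
ds = load_dataset("syvai/danish-asr-unified", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # inspect the actual column names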

Usage

Installation

pip install transformers torch soundfile librosa

Basic transcription

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import soundfile as sf

# Load model
processor = AutoProcessor.from_pretrained("syvai/hviske-v5", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained("syvai/hviske-v5", trust_remote_code=True)
model = model.to("cuda")  # optional, for GPU inference

# Load audio (must be 16kHz mono)
audio, sr = sf.read("audio.wav")

# Transcribe
transcriptions = model.transcribe(
    processor=processor,
    audio_arrays=[audio],
    sample_rates=[sr],
    language="da",
    punctuation=True,
)
print(transcriptions[0])
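
If a file is not already 16 kHz mono, librosa (from the install line above) can resample and downmix it on load before transcribing:

import librosa

# librosa.load resamples to the requested rate and downmixes to mono.
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)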

Batch transcription

# Reuses the processor and model loaded in the basic example above
import soundfile as sf

files = ["file1.wav", "file2.wav", "file3.wav"]
arrays, rates = [], []
for f in files:
    audio, sr = sf.read(f)
    arrays.append(audio)
    rates.append(sr)

transcriptions = model.transcribe(
    processor=processor,
    audio_arrays=arrays,
    sample_rates=rates,
    language="da",
    punctuation=True,
)
for f, t in zip(files, transcriptions):
    print(f"{f}: {t}")

High-throughput inference with vLLM

For production workloads, serve the model with vLLM for significantly higher throughput:

pip install vllm
vllm serve syvai/hviske-v5 --trust-remote-code

Then send requests:

import requests, base64, soundfile as sf, io

audio, sr = sf.read("audio.wav")
buf = io.BytesIO()
sf.write(buf, audio, sr, format="WAV")
audio_b64 = base64.b64encode(buf.getvalue()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "syvai/hviske-v5",
    "messages": [{"role": "user", "content": [
        {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}}
    ]}],
})
print(response.json()["choices"][0]["message"]["content"])
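
vLLM batches concurrent requests on the server, so client-side throughput comes from keeping many requests in flight, for example with a thread pool (file names are placeholders):

import base64
import io
from concurrent.futures import ThreadPoolExecutor

import requests
import soundfile as sf

def transcribe(path):
    # Encode one file as base64 WAV, same request format as above.
    audio, sr = sf.read(path)
    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="WAV")
    payload = {
        "model": "syvai/hviske-v5",
        "messages": [{"role": "user", "content": [
            {"type": "input_audio", "input_audio": {
                "data": base64.b64encode(buf.getvalue()).decode(),
                "format": "wav",
            }}
        ]}],
    }
    r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    return r.json()["choices"][0]["message"]["content"]

files = ["file1.wav", "file2.wav", "file3.wav"]  # placeholders
with ThreadPoolExecutor(max_workers=8) as pool:
    for path, text in zip(files, pool.map(transcribe, files)):
        print(f"{path}: {text}")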