# Hviske v5 — Danish ASR
Hviske v5 is a fine-tune of CohereLabs/cohere-transcribe-03-2026 (2B parameters) for Danish speech recognition, trained on 3.5M samples (16,000 hours) of Danish speech across seven datasets.
## Training progression
| Version | Training data | Avg WER |
|---|---|---|
| Base model | — | >100% (no Danish pretraining) |
| hviske-v1 | CoRal-v3 | ~20% |
| hviske-v4 | + nota + ftspeech | 6.7% (parliamentary) |
| hviske-v5 | + VoxPopuli + nst-da + Common Voice | 14.1% (multi-domain avg) |
## Evaluation results
Evaluated on 200 samples per dataset (WER / CER):
| Dataset | Domain | WER | CER |
|---|---|---|---|
| VoxPopuli | European Parliament | 11.8% | 6.3% |
| nota | Broadcast media | 5.7% | 1.8% |
| ftspeech | Danish Parliament | 6.4% | 3.1% |
| CoRal-v3 read_aloud | Read-aloud speech | 19.2% | 7.0% |
| CoRal-v3 conversation | Conversational | 17.1% | 9.3% |
| nst-da | General Danish | 14.0% | 8.4% |
| Common Voice 17 | Crowd-sourced | 24.3% | 7.4% |
| Average | — | 14.1% | 6.2% |
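For reference, the WER figures above are the word-level edit distance between reference and hypothesis divided by the number of reference words, and CER is the same at the character level. A minimal pure-Python sketch of both metrics (the actual evaluation likely used a library such as `jiwer`; that is an assumption):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```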
## Training details
- Architecture: Full fine-tuning (encoder + decoder), encoder BatchNorm in eval mode
- Optimizer: AdamW 8-bit, LR 2.5e-5, cosine schedule with 500-step warmup
- Batch size: 32 (4 × 8 gradient accumulation)
- Data mixing: Shuffle buffer (20k samples) across all datasets for balanced domain exposure
- Audio: 16kHz mono, max 20s per sample
- Hardware: Single NVIDIA RTX 3090 (24GB)
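The cosine schedule with warmup listed above can be sketched as a plain function of the step count, mirroring what `transformers`' `get_cosine_schedule_with_warmup` computes. The peak LR (2.5e-5) and 500-step warmup come from the card; the total step count here is an illustrative assumption:

```python
import math

PEAK_LR = 2.5e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 100_000  # assumption for illustration; not stated in the card

def lr_at(step: int) -> float:
    """Linear warmup to the peak LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```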
## Training data
| Dataset | Samples | Hours | Description |
|---|---|---|---|
| VoxPopuli Danish | 1,775,578 | ~13,600 | European Parliament recordings |
| ftspeech | 995,677 | ~1,400 | Danish Parliament (Folketinget) |
| CoRal-v3 read_aloud | 299,255 | ~400 | Read-aloud Danish speech |
| nst-da | 182,605 | ~250 | NST Danish speech corpus |
| CoRal-v3 conversation | 147,249 | ~200 | Conversational Danish |
| nota | 98,600 | ~270 | Danish broadcast media |
| Common Voice 17 | 3,484 | ~5 | Crowd-sourced Danish |
| Total | ~3.5M | ~16,000 | |
A unified version of all training data is available at syvai/danish-asr-unified.
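The 20k-sample shuffle buffer used to mix these datasets during training can be sketched as a generator over an interleaved stream. The buffer size is from the card; the reservoir-style mechanics below are an assumption about how streaming mixing was implemented:

```python
import random

def shuffle_buffer(stream, buffer_size=20_000, seed=0):
    """Yield items from `stream` in approximately shuffled order using a
    fixed-size buffer: each incoming item evicts a random buffered item."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buf[idx]
        buf[idx] = item
    rng.shuffle(buf)
    yield from buf
```

With a buffer much larger than any single dataset's contiguous run, consecutive batches mix all seven domains rather than exhausting one dataset at a time.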
## Usage

### Installation

```bash
pip install transformers torch soundfile librosa
```
### Basic transcription

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import soundfile as sf

# Load the model and processor
processor = AutoProcessor.from_pretrained("syvai/hviske-v5", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained("syvai/hviske-v5", trust_remote_code=True)
model = model.to("cuda")  # optional, for GPU inference

# Load audio (must be 16 kHz mono)
audio, sr = sf.read("audio.wav")

# Transcribe
transcriptions = model.transcribe(
    processor=processor,
    audio_arrays=[audio],
    sample_rates=[sr],
    language="da",
    punctuation=True,
)
print(transcriptions[0])
```
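The model expects 16 kHz mono input, so files that don't match should be downmixed and resampled first. A minimal sketch of the downmix with NumPy; `librosa.resample` (librosa is already in the install line) handles the rate conversion:

```python
import numpy as np

TARGET_SR = 16_000

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average the channels of an (n_samples, n_channels) array to mono;
    pass 1-D (already-mono) audio through unchanged."""
    return audio.mean(axis=1) if audio.ndim == 2 else audio

# Rate conversion, assuming librosa is installed:
# import librosa
# audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
```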
### Batch transcription

```python
import soundfile as sf

# Reuses the `model` and `processor` loaded above
files = ["file1.wav", "file2.wav", "file3.wav"]
arrays, rates = [], []
for f in files:
    audio, sr = sf.read(f)
    arrays.append(audio)
    rates.append(sr)

transcriptions = model.transcribe(
    processor=processor,
    audio_arrays=arrays,
    sample_rates=rates,
    language="da",
    punctuation=True,
)
for f, t in zip(files, transcriptions):
    print(f"{f}: {t}")
```
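Training clips were capped at 20 s, so long recordings are best split into chunks before transcription. A sketch of index-based chunking (the 20 s cap is from the training details; splitting on fixed boundaries with no overlap is a simplifying assumption):

```python
def chunk_indices(n_samples: int, sr: int = 16_000, max_seconds: float = 20.0):
    """Return (start, end) sample ranges that cover the full recording
    in segments of at most `max_seconds`."""
    step = int(max_seconds * sr)
    return [(s, min(s + step, n_samples)) for s in range(0, n_samples, step)]
```

Each `audio[start:end]` slice can then be passed to `model.transcribe` as a separate array and the resulting texts joined.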
### High-throughput inference with vLLM

For production workloads, serve the model with vLLM for significantly higher throughput:

```bash
pip install vllm
vllm serve syvai/hviske-v5 --trust-remote-code
```
Then send requests:

```python
import base64
import io

import requests
import soundfile as sf

# Encode the audio file as base64 WAV
audio, sr = sf.read("audio.wav")
buf = io.BytesIO()
sf.write(buf, audio, sr, format="WAV")
audio_b64 = base64.b64encode(buf.getvalue()).decode()

response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "syvai/hviske-v5",
    "messages": [{"role": "user", "content": [
        {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}}
    ]}],
})
print(response.json()["choices"][0]["message"]["content"])
```
## Model tree for syvai/hviske-v5

Base model: CohereLabs/cohere-transcribe-03-2026