hviske-v5.3

Danish automatic speech recognition (ASR) model, fine-tuned from syvai/hviske-v5.1 on the CoRal v3 train splits for 5 epochs with layer-wise learning-rate decay (encoder LR = 0.75 × decoder LR).

A 2B-parameter Conformer encoder-decoder ASR model, optimized for Danish read-aloud and conversational speech.

Results on CoRal v3 full test sets

Evaluated on the complete test splits (17,560 samples). Two normalization conventions:

- raw: jiwer on un-normalized references and hypotheses
- strict: lowercase + punctuation strip + Danish digit-to-word (num2words(lang="da")); this is the apples-to-apples normalization for comparing against published Whisper-style numbers
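
As a rough illustration of what the strict convention does to a transcript, here is a minimal sketch. It assumes the steps run in the order lowercase, digit-to-word, punctuation strip, which this card does not specify. The helper `small_number_to_danish` is a hypothetical toy stand-in that only covers 0 to 10 so the snippet runs without extra packages; for real scoring you would pass a wrapper around `num2words(n, lang="da")` and feed the normalized strings to jiwer.

```python
import re

# Hypothetical stand-in for num2words(n, lang="da"): covers only 0-10 so the
# sketch runs without third-party packages.
_DANISH_SMALL = ["nul", "en", "to", "tre", "fire", "fem", "seks", "syv", "otte", "ni", "ti"]


def small_number_to_danish(n: int) -> str:
    """Toy converter for illustration; falls back to the digits themselves."""
    return _DANISH_SMALL[n] if 0 <= n < len(_DANISH_SMALL) else str(n)


def strict_normalize(text: str, number_to_words=small_number_to_danish) -> str:
    """Lowercase, spell out digit runs, strip punctuation, collapse whitespace."""
    text = text.lower()
    # Replace each run of digits with its spelled-out form.
    text = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)
    # Replace anything that is not a letter, digit, underscore or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


print(strict_normalize("Jeg har 2 katte, og de er 7 år!"))
```

The same function would be applied to both the reference and the hypothesis before computing WER or CER, so that casing, punctuation and digit formatting do not count as errors.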

Greedy decoding (`num_beams=1`)

Split	N	raw WER	strict WER	raw CER	strict CER
`read_aloud`	9,122	10.26%	9.37%	4.17%	3.80%
`conversation`	8,438	21.30%	19.63%	12.12%	11.56%
weighted avg	17,560	15.56%	14.30%	7.99%	7.53%

Beam search (`num_beams=5`, `length_penalty=1.0`)

Split	N	raw WER	strict WER	raw CER	strict CER
`read_aloud`	9,122	9.86%	9.01%	3.98%	3.63%
`conversation`	8,438	20.89%	19.21%	11.90%	11.35%
weighted avg	17,560	15.16%	13.91%	7.78%	7.34%

Beam search costs ~75% more inference time but lowers avg WER by 0.4 pp.
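
The "weighted avg" rows and the 0.4 pp figure can be checked with a few lines of arithmetic. This sketch assumes the averages are weighted by the sample count N of each split; that assumption is not stated in the card, but it reproduces the raw WER values in both tables.

```python
# (N, raw WER %) per split, copied from the two tables above.
GREEDY = {"read_aloud": (9122, 10.26), "conversation": (8438, 21.30)}
BEAM5 = {"read_aloud": (9122, 9.86), "conversation": (8438, 20.89)}


def weighted_avg(rows):
    """Average the per-split rates, weighting each by its sample count N."""
    total = sum(n for n, _ in rows.values())
    return sum(n * rate for n, rate in rows.values()) / total


greedy_wer = weighted_avg(GREEDY)
beam_wer = weighted_avg(BEAM5)
print(f"greedy {greedy_wer:.2f}%  beam=5 {beam_wer:.2f}%  gain {greedy_wer - beam_wer:.2f} pp")
```

Note that a sample-weighted average is not the same as a pooled WER over all words, since clips differ in length.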

Versus other Danish ASR models on CoRal v3 (CER)

The CoRal team publishes CER numbers on the same test splits. The hviske-v5.3 and capacit-ai/saga rows were evaluated here on the full test sets; all other entries are reproduced from the roest-v3-whisper-1.5b model card.

Conversation split

Model	Params	Trained on	conv CER
hviske-v5.3 (this model, beam=5, strict)	2.0B	read_aloud + conversation	11.35%
hviske-v5.3 (this model, greedy, strict)	2.0B	read_aloud + conversation	11.56%
hviske-v5.3 (this model, beam=5, raw)	2.0B	read_aloud + conversation	11.90%
CoRal-project/roest-whisper-1.5b-v2	1.54B	read_aloud + conversation	11.6%
CoRal-project/roest-wav2vec2-315m-v3	315M	read_aloud + conversation	13.7%
syvai/hviske-v3-conversation	1.54B	read_aloud + conversation	15.1%
capacit-ai/saga (greedy, strict)	2.0B	read_aloud + conversation	16.92%
CoRal-project/roest-wav2vec2-315m-v1	315M	read_aloud only	17.6%
CoRal-project/roest-wav2vec2-315m-v2	315M	read_aloud + conversation	24.2%
openai/whisper-large-v3	1.54B	—	27.5%
syvai/hviske-v2	1.54B	read_aloud only	29.4%
CoRal-project/roest-whisper-1.5b-v1	1.54B	read_aloud only	35.6%

Read-aloud split

Model	Params	Trained on	read_aloud CER
hviske-v5.3 (this model, beam=5, strict)	2.0B	read_aloud + conversation	3.63%
hviske-v5.3 (this model, greedy, strict)	2.0B	read_aloud + conversation	3.80%
hviske-v5.3 (this model, beam=5, raw)	2.0B	read_aloud + conversation	3.98%
CoRal-project/roest-whisper-1.5b-v1	1.54B	read_aloud only	4.0%
syvai/hviske-v2	1.54B	read_aloud only	4.0%
CoRal-project/roest-whisper-1.5b-v2	1.54B	read_aloud + conversation	4.5%
syvai/hviske-v3-conversation	1.54B	read_aloud + conversation	4.5%
CoRal-project/roest-wav2vec2-315m-v3	315M	read_aloud + conversation	5.9%
CoRal-project/roest-wav2vec2-315m-v2	315M	read_aloud + conversation	6.4%
capacit-ai/saga (greedy, strict)	2.0B	read_aloud + conversation	7.41%
CoRal-project/roest-wav2vec2-315m-v1	315M	read_aloud only	8.2%
openai/whisper-large-v3	1.54B	—	10.1%

The CoRal team's published numbers do not specify the normalization used; both raw and strict CER are shown for hviske-v5.3 to make the comparison fair. capacit-ai/saga was evaluated with the same methodology used here (full test splits via greedy vllm serve + /v1/audio/transcriptions); raw CER is 8.26% (read_aloud) and 17.49% (conversation).

Inference speed

On a single NVIDIA RTX 3090, hviske-v5.3 reaches RTFx ≈ 425 — i.e. it transcribes audio about 425× faster than real time. 60 minutes of audio is processed in ≈ 8.5 seconds.
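
The relationship between RTFx and wall-clock time is a single division. The sketch below uses a hypothetical helper name, `processing_seconds`, to show how the 8.5 second figure follows from RTFx ≈ 425.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """RTFx = audio duration / processing time, so time = duration / RTFx."""
    return audio_seconds / rtfx


one_hour = 60 * 60  # seconds of audio
print(f"{processing_seconds(one_hour, 425):.1f} s")
```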

Installation

pip install "transformers>=5.4.0" torch soundfile librosa huggingface_hub sentencepiece protobuf
pip install datasets  # only needed for the streaming examples below

Usage

Load the model with AutoModelForSpeechSeq2Seq and trust_remote_code=True. The model exposes both a high-level model.transcribe(...) helper and the standard model.generate(...) interface.

1. Quick start — single file

import torch, numpy as np, soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

audio, sr = sf.read("your_audio.wav")
audio = np.asarray(audio, dtype=np.float32)

hyp = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio],
    sample_rates=[sr],
)[0]
print(hyp)

Audio longer than ~35 s is automatically chunked. Input is resampled to 16 kHz internally.
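
To give a feel for what chunking means in terms of samples, here is a simplified sketch. It assumes plain back-to-back windows of at most 35 s at 16 kHz with no overlap; the processor's actual splitting logic is not documented in this card and may differ. `chunk_bounds` is a hypothetical helper, not part of the model's API.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16_000, max_seconds: float = 35.0):
    """Return (start, end) sample indices of consecutive windows of at most max_seconds."""
    step = int(max_seconds * sample_rate)
    return [(start, min(start + step, num_samples)) for start in range(0, num_samples, step)]


# A 90 s clip at 16 kHz falls into three windows: 35 s, 35 s and 20 s.
bounds = chunk_bounds(90 * 16_000)
print([(end - start) / 16_000 for start, end in bounds])
```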

2. Long-form audio (≥ 35 s)

The processor automatically splits long audio into chunks. Pass the resulting audio_chunk_index back into decode() to stitch the per-chunk hypotheses into a single transcript:

import time
import soundfile as sf
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

audio, sr = sf.read("long_lecture_da.wav")
duration_s = len(audio) / sr
print(f"Audio duration: {duration_s/60:.1f} min")

inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt", language="da")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs = inputs.to(model.device, dtype=model.dtype)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256)
text = processor.decode(
    outputs,
    skip_special_tokens=True,
    audio_chunk_index=audio_chunk_index,
    language="da",
)[0]
elapsed = time.time() - start
print(f"Transcribed in {elapsed:.1f}s — RTFx: {duration_s/elapsed:.1f}")
print(text)

3. Batched inference

import torch, soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("syvai/hviske-v5.3", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.3", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

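# Keep each clip's real sample rate instead of assuming 16 kHz.
audio_a, sr_a = sf.read("clip_a.wav", dtype="float32")
audio_b, sr_b = sf.read("clip_b.wav", dtype="float32")

texts = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio_a, audio_b],
    sample_rates=[sr_a, sr_b],
)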
for i, t in enumerate(texts):
    print(f"[{i}] {t}")

4. Beam search (lowest WER)

Beam=5 is the recipe used to produce the beam-search numbers in the table above (~0.4 pp lower avg WER, ~75% slower than greedy):

# Reuses processor, model, audio and sr from the long-form example above.
inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt", language="da")
audio_chunk_index = inputs.get("audio_chunk_index")
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    num_beams=5,
    length_penalty=1.0,
    do_sample=False,
)
text = processor.decode(
    outputs,
    skip_special_tokens=True,
    audio_chunk_index=audio_chunk_index,
    language="da",
)[0]
print(text)

5. Punctuation control

By default the processor inserts a punctuation prompt token. Disable it for plain-text output:

# Reuses processor, audio and sr from the examples above.
inputs_pnc   = processor(audio=audio, sampling_rate=sr, return_tensors="pt", language="da", punctuation=True)
inputs_nopnc = processor(audio=audio, sampling_rate=sr, return_tensors="pt", language="da", punctuation=False)

Training details

Starting point: syvai/hviske-v5.1
Data: CoRal-project/coral-v3 — both read_aloud (299,255 samples) and conversation (147,249) train splits, interleaved and per-epoch shuffled (seed=42)
Eval during training: 10% of each test split (912 + 843 samples) for best-checkpoint tracking
Best-checkpoint tracking: saved the eval point with lowest avg WER (hit at 90% of training)
Hyperparameters:
- Epochs: 5
- Batch: 16 micro × 8 grad-accum = 128 effective batch
- Layer-wise LR decay (LLRD): encoder = 0.75 × decoder
  - Decoder peak: 2e-4
  - Encoder peak: 1.5e-4
- Schedule: 1,500-step linear warmup → cosine decay to zero
- Optimizer: bnb AdamW8bit
- Augmentation: SpecAugment (2 freq × 27 bins, 2 time × 100 frames)
- Max audio length: 31 s (longer is dropped)
- Precision: bf16
Hardware: single NVIDIA RTX PRO 6000 Blackwell Max-Q (98 GB)
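
The learning-rate recipe above can be sketched in plain Python. This is an illustration under stated assumptions, not the training script: it assumes warmup starts from zero, that the encoder and decoder share the same schedule shape, and that no samples are dropped when estimating the total step count (in practice clips over 31 s are dropped, so the real count is somewhat lower). `lr_at` is a hypothetical helper name.

```python
import math

WARMUP_STEPS = 1_500
DECODER_PEAK = 2e-4
ENCODER_PEAK = 0.75 * DECODER_PEAK  # 1.5e-4, the LLRD ratio from the list above


def lr_at(step: int, peak: float, total_steps: int, warmup: int = WARMUP_STEPS) -> float:
    """Linear warmup from 0 to peak, then cosine decay from peak to 0."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Assumed step count: all train samples kept, 5 epochs, effective batch 16 * 8 = 128.
samples_per_epoch = 299_255 + 147_249
total_steps = math.ceil(samples_per_epoch / (16 * 8)) * 5

for step in (0, 750, 1_500, total_steps):
    dec = lr_at(step, DECODER_PEAK, total_steps)
    enc = lr_at(step, ENCODER_PEAK, total_steps)
    print(f"step {step:>6}: decoder {dec:.2e}  encoder {enc:.2e}")
```

In an actual optimizer setup the two peaks would typically be assigned through separate parameter groups for the encoder and decoder weights.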

License

This model is released under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).

Permitted: non-commercial use including research, education, evaluation, and personal projects, with attribution.
Not permitted without a separate commercial license: any use by or for a commercial entity, integration into a commercial product or service, or use to generate revenue (directly or indirectly).
Commercial licensing: contact mads@syv.ai.
