A newer model is available — please use syvai/hviske-v5.3 instead. v5.3 is the current recommended Danish ASR model from this family and reaches 13.91% strict WER on the CoRal v3 full test set (beam=5), down from v5.2's 14.90%. This v5.2 checkpoint is kept for reproducibility.

hviske-v5.2

A Danish ASR model, further fine-tuned from syvai/hviske-v5.1 on the CoRal v3 train splits for 4 epochs, targeting CoRal-quality read-aloud and conversational Danish.

Results on CoRal v3 full test sets

| Split | N | WER | CER |
|---|---|---|---|
| read_aloud | 9,122 | 10.94% | 4.34% |
| conversation | 8,438 | 21.87% | 12.38% |
| average | 17,560 | 16.41% | 8.36% |

These are evaluated on the complete test splits (17,560 samples total), not a subset.
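WER here is the standard word-level edit distance divided by reference length. As a minimal sketch of how such a score is computed (plain Levenshtein distance over whitespace tokens; the exact text normalization used for the numbers above is not specified here and is an assumption):

```python
def strict_wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance (no text normalization)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between the processed prefix of ref and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution or match
            prev = cur
    return dp[-1] / len(ref)

print(strict_wer("det er en god dag", "det er en go dag"))  # 1 error / 5 words = 0.2
```

CER is the same computation over characters instead of words; in practice a library such as jiwer is typically used rather than hand-rolled code.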

Progression across the model family

| Model | read_aloud WER | conversation WER |
|---|---|---|
| Base 2B Conformer (no Danish pretraining) | ~105% | ~126% |
| hviske-v5.1 (1 ep on syvai/danish-asr-unified) | 19.60% | 41.74% |
| hviske-v5.1-hilr-best (3 ep on CoRal) | 11.59% | 22.63% |
| hviske-v5.2 (4 ep on CoRal, long warmup) | 10.94% | 21.87% |

Usage

```python
import torch
import numpy as np
import soundfile as sf
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# trust_remote_code is required: transcribe() is defined in the model repo
processor = AutoProcessor.from_pretrained("syvai/hviske-v5.2", trust_remote_code=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "syvai/hviske-v5.2", trust_remote_code=True, dtype=torch.bfloat16
).to("cuda").eval()

# Load audio as a float32 array; any sample rate is accepted
audio, sr = sf.read("your_audio.wav")
audio = np.asarray(audio, dtype=np.float32)

hyp = model.transcribe(
    processor=processor,
    language="da",
    audio_arrays=[audio],
    sample_rates=[sr],
)[0]
print(hyp)
```

Audio longer than ~35 s is automatically chunked. Input is resampled to 16 kHz internally.
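`transcribe()` handles both of these steps internally, so no preprocessing is required. For intuition, a rough pure-NumPy sketch of what such preprocessing looks like (the linear-interpolation resampler, the exact 35 s window, and the absence of overlap are all simplifying assumptions, not the model's actual implementation):

```python
import numpy as np

TARGET_SR = 16_000
CHUNK_S = 35  # approximate chunk length (assumption)

def resample_linear(audio: np.ndarray, sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Crude linear-interpolation resampler; for real preprocessing prefer a
    polyphase resampler such as scipy.signal.resample_poly."""
    if sr == target_sr:
        return audio
    duration = len(audio) / sr
    n_out = int(round(duration * target_sr))
    t_out = np.linspace(0.0, duration, n_out, endpoint=False)
    t_in = np.arange(len(audio)) / sr
    return np.interp(t_out, t_in, audio).astype(np.float32)

def chunk(audio: np.ndarray, sr: int = TARGET_SR, chunk_s: int = CHUNK_S):
    """Split a long signal into consecutive windows of at most chunk_s seconds."""
    step = chunk_s * sr
    return [audio[i:i + step] for i in range(0, len(audio), step)]

audio = np.zeros(90 * TARGET_SR, dtype=np.float32)  # 90 s of silence
print(len(chunk(audio)))  # 90 s at a 35 s window -> 3 chunks
```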

Training details

  • Starting point: syvai/hviske-v5.1
  • Data: CoRal-project/coral-v3 — both read_aloud (299,255 samples) and conversation (147,249) train splits interleaved and per-epoch shuffled (seed=42)
  • Eval during training: 10% of each test split (912 + 843 samples)
  • Best-checkpoint tracking: saved the epoch with lowest average WER across both splits (hit at 90% of training, step 100,458)
  • Hyperparameters:
    • Epochs: 4
    • Batch size: 16 micro-batch × 8 gradient-accumulation steps = 128 effective
    • LR: 1.5e-4 peak, 1,500-step linear warmup, cosine decay
    • Optimizer: bnb AdamW8bit
    • Augmentation: SpecAugment (2 freq × 27 bins, 2 time × 100 frames)
    • Max audio length: 31 s (longer is dropped)
    • Precision: bf16
  • Hardware: single NVIDIA RTX PRO 6000 Blackwell Max-Q (98 GB)
  • Wall time: 10 h 53 min
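The interleaving and per-epoch shuffling of the two train splits could be sketched roughly as follows (pure-Python illustration; the actual data pipeline, the seed-derivation scheme, and any split-ratio weighting are assumptions):

```python
import random

def epoch_order(n_read_aloud: int, n_conversation: int, epoch: int, seed: int = 42):
    """Deterministic per-epoch shuffle of the combined sample indices.

    Samples from both splits are pooled and reshuffled each epoch with a
    seed derived from the base seed, so every epoch sees a different order
    while runs stay reproducible.
    """
    pool = [("read_aloud", i) for i in range(n_read_aloud)]
    pool += [("conversation", i) for i in range(n_conversation)]
    rng = random.Random(seed + epoch)
    rng.shuffle(pool)
    return pool

order = epoch_order(299_255, 147_249, epoch=0)
print(len(order))  # 446,504 samples per epoch
```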

License

This model is released under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).

  • Permitted: non-commercial use including research, education, evaluation, and personal projects, with attribution.
  • Not permitted without a separate commercial license: any use by or for a commercial entity, integration into a commercial product or service, or use to generate revenue (directly or indirectly).
  • Commercial licensing: contact mads@syv.ai.