🇫🇮 Finnish ASR Canary-v2: State-of-the-Art Finnish Speech Recognition

A high-performance fine-tuned version of NVIDIA's Canary-v2 (1B-parameter) model, optimized specifically for Finnish. This project provides a robust Finnish ASR solution through two rounds of fine-tuning, combined with a 6-gram KenLM language model for shallow fusion.

Round 2 (March 2026): Improved training corpus (28,857 samples), TTS-augmented long-form data, and transcript normalization. Best overall results on Common Voice and CSS10. See Round 2 Analysis below.


🚀 Performance Benchmarks (WER %)

All numbers use jiwer normalization (lowercase, punctuation stripped). Lower is better.
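For reference, the scoring scheme can be sketched in plain Python (a minimal stand-in for jiwer's defaults; `normalize` and `wer` are illustrative helpers, not part of this repo):

```python
import re
import string

def normalize(text: str) -> str:
    # jiwer-style normalization: lowercase, strip punctuation, collapse whitespace
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def wer(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance divided by reference word count
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)
```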

Best Configuration Per Dataset

| Dataset | R1 + KenLM 5M | R2 Greedy | R2 + KenLM 5M | Best |
|---|---|---|---|---|
| Common Voice | 5.98% | 5.41% | 4.58% | R2 + KenLM |
| FLEURS | 6.48% | 8.39% | 7.75% | R1 + KenLM |
| CSS10 (Audiobook) | 11.85% | 7.03% | 12.39% | R2 Greedy |
| VoxPopuli (Parliament) | 5.73% | 13.91% | 13.23% | R1 + KenLM |
| Global Average | 7.51% | 8.69% | 9.49% | R1 + KenLM |

VoxPopuli is the one domain where R1 still leads. The R2 regression is caused by transcript normalization during training (number words → digits) while the eval manifest retains word-form numbers. This will be corrected in Round 3.

Full Benchmark Table

| Model | CommonVoice | FLEURS | CSS10 | VoxPopuli | Avg |
|---|---|---|---|---|---|
| Base Canary-v2 | 17.95% | 7.79% | 17.07% | 7.96% | 12.69% |
| R1 Greedy | 12.82% | 8.33% | 12.19% | 4.46% | 9.45% |
| R1 + KenLM 5M | 5.98% | 6.48% | 11.85% | 5.73% | 7.51% |
| R2 Greedy | 5.41% | 8.39% | 7.03% | 13.91% | 8.69% |
| R2 + KenLM 5M | 4.58% | 7.75% | 12.39% | 13.23% | 9.49% |

KenLM Impact Within R2

| Dataset | R2 Greedy | R2 + KenLM | Δ | Verdict |
|---|---|---|---|---|
| Common Voice | 5.41% | 4.58% | −15.3% | KenLM helps |
| FLEURS | 8.39% | 7.75% | −7.6% | KenLM helps |
| CSS10 | 7.03% | 12.39% | +76% | KenLM hurts; use greedy |
| VoxPopuli | 13.91% | 13.23% | −4.9% | Marginal |

KenLM and CSS10: When the acoustic model is already very accurate (7% WER), the n-gram LM can override high-confidence acoustic decisions with mismatched web-text Finnish. Always benchmark KenLM on your target domain before deploying.


📖 Round 2 Analysis

What Changed in Round 2

| Change | Detail |
|---|---|
| Training corpus | 28,857 samples (+24% vs R1's 23,180) |
| TTS long-form data | 4,377 synthesized samples (mean 14.5s, max 25s) added to shift the duration distribution |
| max_duration | 20s → 30s to include TTS segments |
| Transcript normalization | Number words → digits, en-dash → ASCII |
| Init checkpoint | Base canary-1b-v2.nemo (fresh start, no R1 regressions inherited) |
| New eval sets | eval_tts (487 entries) and eval_long_form (200 entries, all >20s) |

R2 Results vs R1

| Dataset | R1 Greedy | R2 Greedy | Δ | Why |
|---|---|---|---|---|
| Common Voice | 12.82% | 5.41% | −57.8% | TSV contamination fixed + normalization |
| CSS10 | 12.19% | 7.03% | −42.3% | TTS data improved read-speech alignment |
| FLEURS | 8.33% | 8.39% | ≈ flat | Clean read speech; unchanged by TTS additions |
| VoxPopuli | 4.46% | 13.91% | +211% | Normalization mismatch + TTS distribution shift |

Key Lesson: Normalization Consistency

R2 normalized training transcripts (e.g. "kaksituhattaneljätoista" → "2014") but the eval_voxpopuli.json evaluation manifest was not updated to match. This inflates VoxPopuli WER for R2. A forthcoming Round 3 will normalize all eval manifests consistently.
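A minimal sketch of the lesson: run one shared normalizer over both the training and evaluation manifests. The mapping below is a toy subset; the actual rule set used in Round 2 is not shown in this README:

```python
# Toy subset of number-word rules; the real Round 2 normalizer is more extensive.
NUMBER_WORDS = {
    "kaksituhattaneljätoista": "2014",
    "kaksikymmentä": "20",
}

def normalize_transcript(text: str) -> str:
    # En-dash -> ASCII hyphen, then number words -> digits.
    # Applying this to BOTH train and eval manifests avoids the
    # VoxPopuli-style mismatch described above.
    text = text.replace("\u2013", "-")
    return " ".join(NUMBER_WORDS.get(w, w) for w in text.split())
```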


๐Ÿƒ Running Inference

This model requires NVIDIA NeMo (commit 557177a18d, included in this repo with two patches applied).

Short Audio (< 30s)

from nemo.collections.asr.models import EncDecMultiTaskModel
from omegaconf import OmegaConf

# Load R2 model (recommended for most use cases)
model = EncDecMultiTaskModel.restore_from("models/canary-finnish-v2.nemo")
model.eval().cuda()

# Greedy decoding: best for audiobooks, read speech
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
print(result[0].text)

Short Audio with KenLM (recommended for conversational / CV-style audio)

model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'beam',
        'beam': {
            'beam_size': 5,
            'ngram_lm_model': "models/kenlm_5M.nemo",
            'ngram_lm_alpha': 0.2,
        },
        'batch_size': 1
    })
)
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)

Long-Form Audio (podcasts, interviews, lectures)

We provide two scripts for long-form audio. The Pyannote-based pipeline is the recommended general-purpose approach: it handles speaker changes and provides the most stable transcription context for Canary.

1. Diarized Pipeline (Recommended): inference_pyannote.py

This script uses pyannote/speaker-diarization-community-1 to segment audio by speaker, then merges segments into ~25s chunks for Canary. This provides the best results for podcasts and multi-speaker audio.

# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
  --audio long_recording.wav \
  --model models/canary-finnish-v2.nemo \
  --kenlm models/kenlm_5M.nemo \
  --output transcript.json
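The segment-merging step described above can be sketched as a greedy pass over the diarizer's output (a simplified illustration; `merge_segments` is not the script's actual function, and the real pipeline may handle speaker turns differently):

```python
def merge_segments(segments, max_len=25.0):
    """Greedily merge consecutive same-speaker diarization segments
    into chunks of at most max_len seconds for Canary.

    segments: list of (start, end, speaker) tuples, sorted by start.
    """
    chunks = []
    for start, end, speaker in segments:
        if (chunks
                and chunks[-1][2] == speaker
                and end - chunks[-1][0] <= max_len):
            # Extend the current chunk while the duration budget allows
            chunks[-1] = (chunks[-1][0], end, speaker)
        else:
            # New speaker or budget exceeded: start a fresh chunk
            chunks.append((start, end, speaker))
    return chunks
```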

2. VAD-only Pipeline: inference_vad.py

A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.

python inference_vad.py \
  --audio long_recording.wav \
  --model models/canary-finnish-v2.nemo \
  --output transcript.txt

Example Output

See moo_merged_kenlm.json for a full 30-minute podcast transcription example using the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.


โš™๏ธ Parameter Recommendations

By Content Type

| Content Type | --min_silence_ms | --beam_size | KenLM | Notes |
|---|---|---|---|---|
| Podcast / interview | 150 | 5 | Yes | Conversational Finnish; KenLM helps most |
| Lecture / presentation | 500–1000 | 5 | Yes | Longer pauses → sentence-level VAD splits |
| Audiobook / read speech | 150 | n/a | No | R2 greedy already at 7% WER; KenLM hurts |
| Parliament / formal speech | 150 | 4 | No | Use R1 model; R2 regressed on this domain |
| Unknown / mixed | 150 (default) | 5 | Yes | Safe default |

KenLM Alpha Tuning

--alpha controls how strongly the LM influences decoding (0 = greedy, higher = more LM):

| α | Effect |
|---|---|
| 0.1 | Conservative: mostly acoustic |
| 0.2 | Recommended default |
| 0.3 | More LM correction; good for noisy audio |
| 0.5+ | Risky: the LM can override correct acoustic output |
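Shallow fusion scores each beam hypothesis as its acoustic log-probability plus α times the LM log-probability, which is why a large α can flip a confident acoustic decision. A toy illustration with made-up probabilities (not real model outputs):

```python
import math

def fused_score(am_logprob: float, lm_logprob: float, alpha: float) -> float:
    # Shallow fusion: acoustic log-prob plus alpha-weighted LM log-prob
    return am_logprob + alpha * lm_logprob

# Hypothetical candidates: one the acoustic model prefers,
# one the language model prefers.
acoustic_pick = (math.log(0.7), math.log(0.01))
lm_pick = (math.log(0.2), math.log(0.4))

# At alpha = 0.2 the acoustic choice still wins...
assert fused_score(*acoustic_pick, 0.2) > fused_score(*lm_pick, 0.2)
# ...but at alpha = 0.5 the LM overrides it, matching the
# "risky" row in the table above.
assert fused_score(*acoustic_pick, 0.5) < fused_score(*lm_pick, 0.5)
```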

Full CLI Reference

inference_vad.py
  --audio           Path to input audio file (WAV, 16kHz mono)
  --model           Path to .nemo acoustic model
  --kenlm           Path to .nemo KenLM bundle (omit for greedy)
  --output          Output path (.txt); .json written alongside automatically
  --chunk_len       Max chunk duration in seconds (default: 15)
  --beam_size       Beam width for KenLM decoding (default: 5)
  --alpha           KenLM language model weight (default: 0.2)
  --min_silence_ms  Min silence to split VAD segments (default: 150)
  --min_speech_ms   Min speech duration to keep a segment (default: 250)
  --speech_pad_ms   Padding added around each speech segment (default: 400)

๐Ÿ—๏ธ Methodology & Architecture

Acoustic Model

Built on NVIDIA's Canary-v2 (Fast-Conformer AED, 1B parameters). Both rounds use speech_to_text_finetune.py, which restores the full model architecture from the base .nemo checkpoint; only the dataloader, optimizer, and tokenizer (kept frozen, update_tokenizer: false) need to be specified.

KenLM Language Model

A 6-gram KenLM trained on 5 million lines of high-quality Finnish text:

| Source | Lines |
|---|---|
| Reddit (Finnish communities) | 1.5M |
| FinePDF (Finnish documents) | 1.5M |
| Wiki-Edu (Wikipedia + educational) | 1.0M |
| ASR transcripts | ~23k |

Zero eval leakage: 1,833 sentences overlapping with evaluation sets were removed before training. The model is token-aligned with the Canary BPE tokenizer and runs on GPU via NVIDIA's NGPU-LM engine (binary .nemo bundle, loads in <10s).
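The leakage filter can be sketched as a normalized set-difference (illustrative only; `deduplicate_corpus` and its normalization are assumptions, not the repo's actual script):

```python
def deduplicate_corpus(train_lines, eval_sentences):
    """Drop any LM training line that also appears in an eval set,
    comparing after light normalization (lowercase, collapsed spaces)."""
    def key(s):
        return " ".join(s.lower().split())
    blocked = {key(s) for s in eval_sentences}
    return [line for line in train_lines if key(line) not in blocked]
```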

Training Infrastructure

  • Hardware: RTX 6000 PRO Blackwell (96 GB VRAM), Verda.com, Finland
  • Container: nvcr.io/nvidia/pytorch:25.01-py3
  • NeMo: commit 557177a18d (r2.6.0 / v2.8.0rc0), editable install

📂 Repository Structure

.
├── NeMo/                              # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo         # Round 2 finetuned model (1B)
│   ├── canary-finnish.nemo            # Round 1 finetuned model (1B)
│   ├── canary-1b-v2.nemo              # Base Canary-v2 model
│   ├── kenlm_1M.nemo                  # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo                  # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo                  # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py              # Speaker-diarized inference (BEST for long audio)
├── inference_vad.py                   # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json              # 30-min podcast example (Diarized + KenLM)
├── moo_merged_greedy.json             # 30-min podcast example (Diarized, Greedy)
├── PLAN_AND_PROGRESS.md               # Detailed training & analysis log
└── README.md

๐Ÿ› ๏ธ Setup

Prerequisites

  • NVIDIA GPU with โ‰ฅ 48 GB VRAM (tested on 96 GB RTX 6000 Pro Blackwell)
  • Docker with NVIDIA Container Toolkit
  • Container: nvcr.io/nvidia/pytorch:25.01-py3

Install

git clone https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2
cd Finnish-ASR-Canary-v2

# NeMo with required patches already applied
cd NeMo && pip install -e .[asr]
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
            kaldialign wandb soundfile editdistance

Additional setup for long-form diarized inference (inference_pyannote.py)

inference_pyannote.py requires pyannote + transformers components on top of base NeMo:

pip install pyannote.audio transformers accelerate sentencepiece

# Required by torchaudio 2.10+ audio I/O path in this container
pip install torchcodec

Set your Hugging Face token before running diarization (used to download pyannote/speaker-diarization-community-1):

export HF_TOKEN=your_hf_token

Or place it in .env as:

HF_TOKEN=your_hf_token

Critical NeMo Patches (already applied in included NeMo)

  1. OneLogger Fix: makes proprietary telemetry optional for public containers
  2. Canary2 EOS Assertion Fix: relaxes a strict EOS check to allow inference with placeholder transcripts

๐Ÿ™ Acknowledgments

Citations

@inproceedings{park2019css10,
  title={CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages},
  author={Park, Kyubyong and Mulc, Thomas},
  booktitle={Interspeech},
  year={2019}
}

@inproceedings{wang2021voxpopuli,
  title={VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning,
         Semi-Supervised Learning and Interpretation},
  author={Wang, Changhan and Riviere, Morgane and Lee, Ann and Wu, Anne and
          Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and
          Pino, Juan and Dupoux, Emmanuel},
  booktitle={ACL 2021},
  year={2021}
}