# 🇫🇮 Finnish ASR Canary-v2: State-of-the-Art Finnish Speech Recognition
A high-performance fine-tuned version of NVIDIA's Canary-v2 (1B parameter) model, specifically optimized for the Finnish language. This project provides a robust Finnish ASR solution through two rounds of fine-tuning, combined with a 6-gram KenLM language model for shallow fusion.
**Round 2 (March 2026):** Improved training corpus (28,857 samples), TTS-augmented long-form data, and transcript normalization. Best overall result on Common Voice and CSS10. See Round 2 Analysis below.
## 📊 Performance Benchmarks (WER %)
All numbers use jiwer normalization (lowercase, punctuation stripped). Lower is better.
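The scoring convention can be reproduced in a few lines of standard-library Python. This is a hedged sketch of the same lowercase/strip-punctuation normalization plus word-level edit distance, not the exact jiwer pipeline:

```python
import string

def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, collapse whitespace; return word list."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over normalized word sequences."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Hyvää huomenta, Suomi!", "hyvää huomenta suomi"))  # 0.0
```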
### Best Configuration Per Dataset
| Dataset | R1 + KenLM 5M | R2 Greedy | R2 + KenLM 5M | Best |
|---|---|---|---|---|
| Common Voice | 5.98% | 5.41% | 4.58% | R2 + KenLM |
| FLEURS | 6.48% | 8.39% | 7.75% | R1 + KenLM |
| CSS10 (Audiobook) | 11.85% | 7.03% | 12.39% | R2 Greedy |
| VoxPopuli (Parliament) | 5.73% | 13.91% | 13.23% | R1 + KenLM |
| Global Average | 7.51% | 8.69% | 9.49% | R1 + KenLM |
VoxPopuli is the one domain where R1 still leads. The R2 regression is caused by transcript normalization during training (number words → digits) while the eval manifest retains word-form numbers. This will be corrected in Round 3.
### Full Benchmark Table
| Model | CommonVoice | FLEURS | CSS10 | VoxPopuli | Avg |
|---|---|---|---|---|---|
| Base Canary-v2 | 17.95% | 7.79% | 17.07% | 7.96% | 12.69% |
| R1 Greedy | 12.82% | 8.33% | 12.19% | 4.46% | 9.45% |
| R1 + KenLM 5M | 5.98% | 6.48% | 11.85% | 5.73% | 7.51% |
| R2 Greedy | 5.41% | 8.39% | 7.03% | 13.91% | 8.69% |
| R2 + KenLM 5M | 4.58% | 7.75% | 12.39% | 13.23% | 9.49% |
### KenLM Impact Within R2
| Dataset | R2 Greedy | R2 + KenLM | Δ | Verdict |
|---|---|---|---|---|
| Common Voice | 5.41% | 4.58% | −15.3% | KenLM helps |
| FLEURS | 8.39% | 7.75% | −7.6% | KenLM helps |
| CSS10 | 7.03% | 12.39% | +76% | KenLM hurts; use greedy |
| VoxPopuli | 13.91% | 13.23% | −4.9% | Marginal |
KenLM and CSS10: When the acoustic model is already very accurate (7% WER), the n-gram LM can override high-confidence acoustic decisions with mismatched web-text Finnish. Always benchmark KenLM on your target domain before deploying.
## 🔄 Round 2 Analysis
### What Changed in Round 2
| Change | Detail |
|---|---|
| Training corpus | 28,857 samples (+24% vs R1's 23,180) |
| TTS long-form data | 4,377 synthesized samples (mean 14.5s, max 25s) added to shift duration distribution |
| `max_duration` | 20s → 30s to include TTS segments |
| Transcript normalization | Number words → digits, en-dash → ASCII |
| Init checkpoint | Base `canary-1b-v2.nemo` (fresh start; no R1 regressions inherited) |
| New eval sets | `eval_tts` (487 entries) and `eval_long_form` (200 entries, all >20s) |
### R2 Results vs R1
| Dataset | R1 Greedy | R2 Greedy | Δ | Why |
|---|---|---|---|---|
| Common Voice | 12.82% | 5.41% | −57.8% | TSV contamination fixed + normalization |
| CSS10 | 12.19% | 7.03% | −42.3% | TTS data improved read-speech alignment |
| FLEURS | 8.33% | 8.39% | ≈ flat | Clean read speech; unchanged by TTS additions |
| VoxPopuli | 4.46% | 13.91% | +211% | Normalization mismatch + TTS distribution shift |
### Key Lesson: Normalization Consistency
R2 normalized training transcripts (e.g. "kaksituhattaneljätoista" → "2014") but the `eval_voxpopuli.json` evaluation manifest was not updated to match. This inflates VoxPopuli WER for R2. A forthcoming Round 3 will normalize all eval manifests consistently.
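The mismatch is easy to quantify: when training teaches the model to emit digits while the eval reference keeps number words, every such token scores as a substitution even though the transcription is semantically correct. A toy illustration (the `number_words` mapping below is an illustrative stand-in, not the actual normalizer):

```python
# Toy normalizer: maps a couple of Finnish number words to digits, as R2 training did.
number_words = {"kaksituhattaneljätoista": "2014", "kolme": "3"}

def normalize_numbers(text: str) -> str:
    return " ".join(number_words.get(w, w) for w in text.split())

reference = "vuonna kaksituhattaneljätoista"   # eval manifest (word form)
hypothesis = normalize_numbers(reference)       # model output (digit form)
print(hypothesis)  # vuonna 2014

# One of two words differs -> 50% WER on a semantically perfect transcript.
errors = sum(r != h for r, h in zip(reference.split(), hypothesis.split()))
print(errors / len(reference.split()))  # 0.5
```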
## 🚀 Running Inference
This model requires NVIDIA NeMo (commit `557177a18d`, included in this repo with two patches applied).
### Short Audio (< 30s)
```python
from nemo.collections.asr.models import EncDecMultiTaskModel
from omegaconf import OmegaConf

# Load R2 model (recommended for most use cases)
model = EncDecMultiTaskModel.restore_from("models/canary-finnish-v2.nemo")
model.eval().cuda()

# Greedy decoding: best for audiobooks and read speech
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes",
)
print(result[0].text)
```
### Short Audio with KenLM (recommended for conversational / CV-style audio)
```python
model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'beam',
        'beam': {
            'beam_size': 5,
            'ngram_lm_model': "models/kenlm_5M.nemo",
            'ngram_lm_alpha': 0.2,
        },
        'batch_size': 1,
    })
)
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes",
)
```
### Long-Form Audio (podcasts, interviews, lectures)
We provide two scripts for long-form audio. The Pyannote-based pipeline is the recommended general-purpose approach, as it handles speaker changes and provides the most stable transcription context for Canary.
#### 1. Diarized Pipeline (Recommended): `inference_pyannote.py`
This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25s chunks for Canary. This provides the best results for podcasts and multi-speaker audio.
```bash
# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --kenlm models/kenlm_5M.nemo \
    --output transcript.json
```
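The merge step described above can be sketched as a greedy pass over diarization output: consecutive segments from the same speaker are packed into chunks of at most ~25 s, and a chunk is flushed whenever the speaker changes or the cap would be exceeded. The segment format and helper below are illustrative, not the script's actual internals:

```python
def merge_segments(segments, max_len=25.0):
    """Greedily pack (start, end, speaker) tuples into <=max_len chunks,
    never mixing speakers within a chunk."""
    chunks = []
    cur = None  # [start, end, speaker] of the chunk being built
    for start, end, speaker in segments:
        if cur and speaker == cur[2] and end - cur[0] <= max_len:
            cur[1] = end                  # extend the current chunk
        else:
            if cur:
                chunks.append(tuple(cur))  # flush on speaker change / overflow
            cur = [start, end, speaker]
    if cur:
        chunks.append(tuple(cur))
    return chunks

segs = [(0.0, 10.0, "A"), (10.5, 22.0, "A"), (22.5, 30.0, "B"), (30.5, 40.0, "B")]
print(merge_segments(segs))
# [(0.0, 22.0, 'A'), (22.5, 40.0, 'B')]
```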
#### 2. VAD-only Pipeline: `inference_vad.py`
A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.
```bash
python inference_vad.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --output transcript.txt
```
### Example Output
See `moo_merged_kenlm.json` for a full 30-minute podcast transcription example using the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.
## ⚙️ Parameter Recommendations
### By Content Type
| Content Type | `--min_silence_ms` | `--beam_size` | KenLM | Notes |
|---|---|---|---|---|
| Podcast / interview | 150 | 5 | Yes | Conversational Finnish; KenLM helps most |
| Lecture / presentation | 500–1000 | 5 | Yes | Longer pauses → sentence-level VAD splits |
| Audiobook / read speech | 150 | n/a | No | R2 greedy already at 7% WER; KenLM hurts |
| Parliament / formal speech | 150 | 4 | No | Use R1 model; R2 regressed on this domain |
| Unknown / mixed | 150 (default) | 5 | Yes | Safe default |
### KenLM Alpha Tuning
`--alpha` controls how strongly the LM influences decoding (0 = greedy; higher = more LM):
| α | Effect |
|---|---|
| 0.1 | Conservative; mostly acoustic |
| 0.2 | Recommended default |
| 0.3 | More LM correction; good for noisy audio |
| 0.5+ | Risky; the LM can override correct acoustic output |
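Shallow fusion picks the hypothesis maximizing `log P_acoustic + α · log P_LM`, so α is a direct trade-off knob. The toy log-probabilities below (illustrative values, not real model scores) show how a larger α can flip the decision toward the LM's preference, which is exactly the CSS10 failure mode:

```python
# Toy candidates: (acoustic log-prob, LM log-prob). Values are illustrative.
candidates = {
    "oikea sana": (-1.0, -6.0),    # acoustically confident, rare in LM text
    "yleinen sana": (-3.0, -1.5),  # acoustically weaker, common in LM text
}

def best(alpha: float) -> str:
    """Return the shallow-fusion winner: argmax of acoustic + alpha * LM score."""
    return max(candidates, key=lambda h: candidates[h][0] + alpha * candidates[h][1])

print(best(0.2))  # oikea sana   -> acoustic evidence still wins
print(best(0.5))  # yleinen sana -> the LM overrides the acoustic model
```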
### Full CLI Reference
#### `inference_vad.py`

```text
--audio            Path to input audio file (WAV, 16 kHz mono)
--model            Path to .nemo acoustic model
--kenlm            Path to .nemo KenLM bundle (omit for greedy)
--output           Output path (.txt); .json written alongside automatically
--chunk_len        Max chunk duration in seconds (default: 15)
--beam_size        Beam width for KenLM decoding (default: 5)
--alpha            KenLM language model weight (default: 0.2)
--min_silence_ms   Min silence to split VAD segments (default: 150)
--min_speech_ms    Min speech duration to keep a segment (default: 250)
--speech_pad_ms    Padding added around each speech segment (default: 400)
```
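The three VAD thresholds amount to a simple post-processing pass over raw VAD output: drop segments shorter than `--min_speech_ms`, merge neighbors separated by less than `--min_silence_ms`, then pad each side by `--speech_pad_ms`. A hedged sketch of that logic (the actual script may differ in detail):

```python
def postprocess(segments, min_silence_ms=150, min_speech_ms=250, speech_pad_ms=400):
    """segments: list of (start_ms, end_ms). Returns cleaned, padded segments."""
    # 1. Drop segments shorter than min_speech_ms.
    segs = [(s, e) for s, e in segments if e - s >= min_speech_ms]
    # 2. Merge neighbors whose silence gap is below min_silence_ms.
    merged = []
    for s, e in segs:
        if merged and s - merged[-1][1] < min_silence_ms:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # 3. Pad both sides by speech_pad_ms (clamped at 0).
    return [(max(0, s - speech_pad_ms), e + speech_pad_ms) for s, e in merged]

raw = [(1000, 1100), (2000, 3000), (3100, 5000)]
print(postprocess(raw))
# [(1600, 5400)] -- the 100 ms blip is dropped, the 100 ms gap is merged
```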
## 🏗️ Methodology & Architecture
### Acoustic Model
Built on NVIDIA's Canary-v2 (Fast-Conformer AED, 1B parameters). Both rounds use `speech_to_text_finetune.py`, which restores the full model architecture from the base `.nemo` checkpoint; only the dataloader, optimizer, and tokenizer (kept frozen, `update_tokenizer: false`) need to be specified.
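A hedged sketch of the corresponding config overrides, assuming the standard layout of NeMo's `speech_to_text_finetune.py` config (field names other than `update_tokenizer: false`, which the text above states, are assumptions and should be checked against the script's actual config):

```yaml
# Illustrative finetuning overrides -- field names assumed, verify against NeMo
init_from_nemo_model: models/canary-1b-v2.nemo
model:
  tokenizer:
    update_tokenizer: false   # keep the base Canary BPE tokenizer frozen
  train_ds:
    manifest_filepath: data/train_manifest.json
    max_duration: 30          # raised from 20s to include TTS long-form segments
```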
### KenLM Language Model
A 6-gram KenLM trained on 5 million lines of high-quality Finnish text:
| Source | Lines |
|---|---|
| Reddit (Finnish communities) | 1.5M |
| FinePDF (Finnish documents) | 1.5M |
| Wiki-Edu (Wikipedia + educational) | 1.0M |
| ASR transcripts | ~23k |
Zero eval leakage: 1,833 sentences overlapping with evaluation sets were removed before training. The model is token-aligned with the Canary BPE tokenizer and runs on GPU via NVIDIA's NGPU-LM engine (binary .nemo bundle, loads in <10s).
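Removing eval overlap from an LM corpus reduces to a set-membership filter over a normalization key; this sketch assumes a simple lowercase/whitespace key, which may be looser or stricter than the matching actually used:

```python
def dedup_corpus(corpus_lines, eval_sentences):
    """Drop corpus lines whose normalized form appears in any eval set."""
    key = lambda s: " ".join(s.lower().split())
    blocked = {key(s) for s in eval_sentences}
    return [line for line in corpus_lines if key(line) not in blocked]

corpus = ["Tämä on testi", "täysin uniikki lause", "tämä on  testi"]
evals = ["tämä on testi"]
print(dedup_corpus(corpus, evals))  # ['täysin uniikki lause']
```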
### Training Infrastructure
- Hardware: RTX 6000 PRO Blackwell (96 GB VRAM), Verda.com, Finland
- Container: `nvcr.io/nvidia/pytorch:25.01-py3`
- NeMo: commit `557177a18d` (r2.6.0 / v2.8.0rc0), editable install
## 📁 Repository Structure
```text
.
├── NeMo/                       # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo  # Round 2 finetuned model (1B)
│   ├── canary-finnish.nemo     # Round 1 finetuned model (1B)
│   ├── canary-1b-v2.nemo       # Base Canary-v2 model
│   ├── kenlm_1M.nemo           # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo           # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo           # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py       # Speaker-diarized inference (BEST for long audio)
├── inference_vad.py            # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json       # 30-min podcast example (Diarized + KenLM)
├── moo_merged_greedy.json      # 30-min podcast example (Diarized, Greedy)
├── PLAN_AND_PROGRESS.md        # Detailed training & analysis log
└── README.md
```
## 🛠️ Setup
### Prerequisites
- NVIDIA GPU with ≥ 48 GB VRAM (tested on 96 GB RTX 6000 Pro Blackwell)
- Docker with NVIDIA Container Toolkit
- Container: `nvcr.io/nvidia/pytorch:25.01-py3`
### Install
```bash
git clone https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2
cd Finnish-ASR-Canary-v2

# NeMo with required patches already applied
cd NeMo && pip install -e .[asr]

pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
    kaldialign wandb soundfile editdistance
```
### Additional setup for long-form diarized inference

`inference_pyannote.py` requires pyannote + transformers components on top of base NeMo:
```bash
pip install pyannote.audio transformers accelerate sentencepiece

# Required by torchaudio 2.10+ audio I/O path in this container
pip install torchcodec
```
Set your Hugging Face token before running diarization (used to download `pyannote/speaker-diarization-community-1`):
```bash
export HF_TOKEN=your_hf_token
```
Or place it in `.env` as:

```text
HF_TOKEN=your_hf_token
```
### Critical NeMo Patches (already applied in the included NeMo)

- **OneLogger Fix**: makes proprietary telemetry optional for public containers
- **Canary2 EOS Assertion Fix**: relaxes a strict EOS check to allow inference with placeholder transcripts
## 🙏 Acknowledgments
- Foundation: Built on NVIDIA's Canary-v2 architecture
- Training Infrastructure: Verda.com GPU cloud, Finland
- Data Sources:
- Mozilla Common Voice v24.0
- Google FLEURS
- CSS10 Finnish
- VoxPopuli (European Parliament)
### Citations
```bibtex
@inproceedings{park2019css10,
  title     = {CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages},
  author    = {Park, Kyubyong and Mulc, Thomas},
  booktitle = {Interspeech},
  year      = {2019}
}

@inproceedings{wang2021voxpopuli,
  title     = {VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning,
               Semi-Supervised Learning and Interpretation},
  author    = {Wang, Changhan and Riviere, Morgane and Lee, Ann and Wu, Anne and
               Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and
               Pino, Juan and Dupoux, Emmanuel},
  booktitle = {ACL 2021},
  year      = {2021}
}
```
### Evaluation Results

Self-reported WER on test sets:

- Mozilla Common Voice v24.0: 4.58
- FLEURS Finnish: 7.75
- CSS10 Finnish: 7.03
- VoxPopuli Finnish: 11.65