---
language:
- fi
license: mit
tags:
- automatic-speech-recognition
- asr
- speech-recognition
- canary-v2
- kenlm
- finnish
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
- facebook/voxpopuli
base_model: nvidia/canary-1b-v2
pipeline_tag: automatic-speech-recognition
library_name: nemo
model-index:
- name: Finnish ASR Canary-v2 Round 2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice v24.0
      type: mozilla-foundation/common_voice_17_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 4.58
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS Finnish
      type: google/fleurs
      config: fi_fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.75
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CSS10 Finnish
      type: asr-benchmark
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.03
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli Finnish
      type: facebook/voxpopuli
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 11.65
---
# 🇫🇮 Finnish ASR Canary-v2: State-of-the-Art Finnish Speech Recognition

A high-performance fine-tuned version of NVIDIA's **Canary-v2** (1B-parameter) model, optimized for Finnish. This project provides a robust Finnish ASR solution through two rounds of fine-tuning, combined with a 6-gram KenLM language model for shallow fusion.

> **Round 2 (March 2026):** improved training corpus (28,857 samples), TTS-augmented long-form data, and transcript normalization. Best overall results on Common Voice and CSS10. See [Round 2 Analysis](#round-2-analysis) below.

---
|
## 🚀 Performance Benchmarks (WER %)

All numbers use jiwer normalization (lowercase, punctuation stripped). Lower is better.
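For reference, the scoring convention can be reproduced in a few lines of Python. This is a minimal re-implementation of the normalization plus word-level edit distance (the actual benchmarks used the `jiwer` package; `normalize` and `wer` here are illustrative helpers):

```python
import string

def normalize(text: str) -> str:
    # Benchmark convention: lowercase and strip punctuation before scoring
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

def wer(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance divided by reference word count
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j - 1] + 1, d[j] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

print(wer("Hyvää huomenta, Suomi!", "hyvää huomenta suomi"))  # 0.0
```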
|
### Best Configuration Per Dataset

| Dataset | R1 + KenLM 5M | R2 Greedy | R2 + KenLM 5M | **Best** |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice** | 5.98% | 5.41% | **4.58%** | R2 + KenLM |
| **FLEURS** | **6.48%** | 8.39% | 7.75% | R1 + KenLM |
| **CSS10 (Audiobook)** | 11.85% | **7.03%** | 12.39% | R2 Greedy |
| **VoxPopuli (Parliament)** | **5.73%** | 13.91% | 13.23% | R1 + KenLM |
| **Global Average** | 7.51% | 8.69% | 9.49% | R1 + KenLM |
|
> [!NOTE]
> VoxPopuli is the one domain where R1 still leads. The R2 regression is caused by transcript normalization during training (number words → digits) while the eval manifest retains word-form numbers. This will be corrected in Round 3.
|
### Full Benchmark Table

| Model | CommonVoice | FLEURS | CSS10 | VoxPopuli | Avg |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Base Canary-v2 | 17.95% | 7.79% | 17.07% | 7.96% | 12.69% |
| R1 Greedy | 12.82% | 8.33% | 12.19% | 4.46% | 9.45% |
| R1 + KenLM 5M | 5.98% | 6.48% | 11.85% | 5.73% | **7.51%** |
| R2 Greedy | 5.41% | 8.39% | **7.03%** | 13.91% | 8.69% |
| R2 + KenLM 5M | **4.58%** | **7.75%** | 12.39% | 13.23% | 9.49% |
|
### KenLM Impact Within R2

| Dataset | R2 Greedy | R2 + KenLM | Δ | Verdict |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 5.41% | **4.58%** | −15.3% | KenLM helps |
| FLEURS | 8.39% | **7.75%** | −7.6% | KenLM helps |
| CSS10 | **7.03%** | 12.39% | +76% | KenLM hurts; use greedy |
| VoxPopuli | 13.91% | **13.23%** | −4.9% | Marginal |

> [!IMPORTANT]
> **KenLM and CSS10**: when the acoustic model is already highly accurate (≈7% WER), the n-gram LM can override high-confidence acoustic decisions with mismatched web-text Finnish. Always benchmark KenLM on your target domain before deploying.
|
---
|
## 📖 Round 2 Analysis

### What Changed in Round 2

| Change | Detail |
| :--- | :--- |
| Training corpus | 28,857 samples (+24% vs. R1's 23,180) |
| TTS long-form data | 4,377 synthesized samples (mean 14.5 s, max 25 s) added to shift the duration distribution |
| `max_duration` | 20 s → 30 s to include TTS segments |
| Transcript normalization | Number words → digits, en-dash → ASCII |
| Init checkpoint | Base `canary-1b-v2.nemo` (fresh start, no R1 regressions inherited) |
| New eval sets | `eval_tts` (487 entries) and `eval_long_form` (200 entries, all >20 s) |
|
### R2 Results vs. R1

| Dataset | R1 Greedy | R2 Greedy | Δ | Why |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 12.82% | **5.41%** | −57.8% | TSV contamination fixed + normalization |
| CSS10 | 12.19% | **7.03%** | −42.3% | TTS data improved read-speech alignment |
| FLEURS | 8.33% | 8.39% | ≈ flat | Clean read speech; unchanged by TTS additions |
| VoxPopuli | **4.46%** | 13.91% | +212% | Normalization mismatch + TTS distribution shift |
|
### Key Lesson: Normalization Consistency

R2 normalized training transcripts (e.g. "kaksituhattaneljätoista" → "2014"), but the `eval_voxpopuli.json` evaluation manifest was not updated to match. This inflates R2's VoxPopuli WER. A forthcoming Round 3 will normalize all eval manifests consistently.
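The fix is mechanical: run one shared normalization function over every manifest, train and eval alike. A minimal sketch (the single-entry number-word mapping and the manifest line below are illustrative, not the full R2 normalizer):

```python
import json

# Illustrative subset of the mapping; a real normalizer covers
# Finnish number words generally.
NUMBER_WORDS = {"kaksituhattaneljätoista": "2014"}

def normalize_text(text: str) -> str:
    # Apply the SAME rules used on training transcripts:
    # number words -> digits, en-dash -> ASCII hyphen
    text = text.replace("–", "-")
    return " ".join(NUMBER_WORDS.get(w, w) for w in text.split())

# A NeMo manifest is one JSON object per line with a "text" field
line = '{"audio_filepath": "clip_001.wav", "text": "vuonna kaksituhattaneljätoista"}'
entry = json.loads(line)
entry["text"] = normalize_text(entry["text"])
print(json.dumps(entry, ensure_ascii=False))
```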
|
---
|
## 🏃 Running Inference

This model requires **NVIDIA NeMo** (commit `557177a18d`, included in this repo with two patches applied).

### Short Audio (< 30 s)

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load the R2 model (recommended for most use cases)
model = EncDecMultiTaskModel.restore_from("models/canary-finnish-v2.nemo")
model.eval().cuda()

# Greedy decoding: best for audiobooks and read speech
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes",
)
print(result[0].text)
```
|
### Short Audio with KenLM (recommended for conversational / CV-style audio)

```python
from omegaconf import OmegaConf

# Reuse the model loaded above; switch to beam search with the n-gram LM
model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'beam',
        'beam': {
            'beam_size': 5,
            'ngram_lm_model': "models/kenlm_5M.nemo",
            'ngram_lm_alpha': 0.2,
        },
        'batch_size': 1,
    })
)
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes",
)
```
|
### Long-Form Audio (podcasts, interviews, lectures)

We provide two scripts for long-form audio. The **pyannote-based pipeline** is the recommended general-purpose approach: it handles speaker changes and provides the most stable transcription context for Canary.

#### 1. Diarized Pipeline (Recommended): `inference_pyannote.py`

This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25 s chunks for Canary. It gives the best results for podcasts and multi-speaker audio.

```bash
# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --kenlm models/kenlm_5M.nemo \
    --output transcript.json
```
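The merge step described above (turning diarized segments into Canary-sized chunks) can be sketched roughly as follows. The segment tuples and the 25 s budget are illustrative, not the script's exact logic:

```python
def merge_segments(segments, max_len=25.0):
    """Greedily merge consecutive same-speaker segments into chunks <= max_len seconds.

    segments: list of (start, end, speaker) tuples sorted by start time.
    """
    chunks = []
    for start, end, speaker in segments:
        # Extend the current chunk only for the same speaker and
        # while the total span stays within the duration budget
        if chunks and chunks[-1][2] == speaker and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end, speaker)
        else:
            chunks.append((start, end, speaker))
    return chunks

segs = [(0.0, 8.0, "A"), (8.5, 17.0, "A"), (17.5, 30.0, "A"), (30.5, 34.0, "B")]
print(merge_segments(segs))  # [(0.0, 17.0, 'A'), (17.5, 30.0, 'A'), (30.5, 34.0, 'B')]
```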

#### 2. VAD-only Pipeline: `inference_vad.py`

A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.

```bash
python inference_vad.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --output transcript.txt
```

#### Example Output

See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription produced with the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.

---

## ⚙️ Parameter Recommendations

### By Content Type

| Content Type | `--min_silence_ms` | `--beam_size` | KenLM | Notes |
| :--- | :---: | :---: | :---: | :--- |
| **Podcast / interview** | 150 | 5 | Yes | Conversational Finnish; KenLM helps most |
| **Lecture / presentation** | 500–1000 | 5 | Yes | Longer pauses → sentence-level VAD splits |
| **Audiobook / read speech** | 150 | — | **No** | R2 greedy is already at 7% WER; KenLM hurts |
| **Parliament / formal speech** | 150 | 4 | No | Use the R1 model; R2 regressed on this domain |
| **Unknown / mixed** | 150 (default) | 5 | Yes | Safe default |

### KenLM Alpha Tuning

`--alpha` controls how strongly the LM influences decoding (0 = greedy; higher = more LM):

| α | Effect |
| :--- | :--- |
| 0.1 | Conservative: mostly acoustic |
| **0.2** | **Recommended default** |
| 0.3 | More LM correction; good for noisy audio |
| 0.5+ | Risky: the LM can override correct acoustic output |
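Under the hood, α is the standard shallow-fusion weight: at each decoding step the beam score is the acoustic log-probability plus α times the LM log-probability. A toy illustration (the probabilities are made up for the example):

```python
import math

def fused_score(acoustic_logp: float, lm_logp: float, alpha: float = 0.2) -> float:
    # Shallow fusion: add the alpha-weighted LM log-probability
    # to the acoustic log-probability at each decoding step
    return acoustic_logp + alpha * lm_logp

# Toy case: the acoustic model slightly prefers a misspelling,
# while the LM strongly prefers the real word "vuonna"
acoustic = {"vuonna": math.log(0.4), "vuona": math.log(0.6)}
lm = {"vuonna": math.log(0.9), "vuona": math.log(0.01)}

best_greedy = max(acoustic, key=lambda w: fused_score(acoustic[w], lm[w], alpha=0.0))
best_fused = max(acoustic, key=lambda w: fused_score(acoustic[w], lm[w], alpha=0.2))
print(best_greedy, best_fused)  # vuona vuonna
```

With α = 0 the misspelling wins; at α = 0.2 the LM tips the decision, which is exactly how it repairs conversational Common Voice output, and also why a too-large α can overrule correct acoustics on CSS10.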
### Full CLI Reference

```
inference_vad.py
  --audio           Path to input audio file (WAV, 16 kHz mono)
  --model           Path to .nemo acoustic model
  --kenlm           Path to .nemo KenLM bundle (omit for greedy decoding)
  --output          Output path (.txt); a .json is written alongside automatically
  --chunk_len       Max chunk duration in seconds (default: 15)
  --beam_size       Beam width for KenLM decoding (default: 5)
  --alpha           KenLM language-model weight (default: 0.2)
  --min_silence_ms  Min silence to split VAD segments (default: 150)
  --min_speech_ms   Min speech duration to keep a segment (default: 250)
  --speech_pad_ms   Padding added around each speech segment (default: 400)
```

---

## 🏗️ Methodology & Architecture

### Acoustic Model

Built on NVIDIA's **Canary-v2** (Fast-Conformer AED, 1B parameters). Both rounds use `speech_to_text_finetune.py`, which restores the full model architecture from the base `.nemo` checkpoint; only the dataloader, optimizer, and tokenizer (kept frozen via `update_tokenizer: false`) need to be specified.
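For orientation, a launch looks roughly like the following. This is a hypothetical sketch, not the exact R2 command: the override names follow standard NeMo Hydra conventions, and the config path and manifest path are illustrative.

```shell
# Hypothetical sketch of a fine-tune launch (paths and overrides illustrative)
python NeMo/examples/asr/speech_to_text_finetune.py \
    --config-path=conf --config-name=speech_to_text_finetune \
    +init_from_nemo_model=models/canary-1b-v2.nemo \
    model.tokenizer.update_tokenizer=false \
    model.train_ds.manifest_filepath=data/train_manifest.json \
    model.train_ds.max_duration=30
```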
### KenLM Language Model

A **6-gram KenLM** trained on 5 million lines of high-quality Finnish text:

| Source | Lines |
| :--- | :---: |
| Reddit (Finnish communities) | 1.5M |
| FinePDF (Finnish documents) | 1.5M |
| Wiki-Edu (Wikipedia + educational) | 1.0M |
| ASR transcripts | ~23k |

Zero eval leakage: 1,833 sentences overlapping with the evaluation sets were removed before training. The model is token-aligned with the Canary BPE tokenizer and runs on GPU via NVIDIA's **NGPU-LM** engine (binary `.nemo` bundle, loads in under 10 s).
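The leakage filter amounts to a normalized set-difference between the LM corpus and all eval transcripts. A minimal sketch (the normalization here mirrors the scoring convention; the real filter may differ in detail):

```python
import string

def norm(s: str) -> str:
    # Compare after lowercasing and stripping punctuation so trivial
    # variants of an eval sentence are also caught
    return s.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def remove_eval_overlap(corpus_lines, eval_sentences):
    # Drop any LM training line that matches an eval transcript
    banned = {norm(s) for s in eval_sentences}
    return [line for line in corpus_lines if norm(line) not in banned]

corpus = ["Tämä on testi.", "Uutinen päivältä", "tämä on testi"]
evals = ["Tämä on testi."]
print(remove_eval_overlap(corpus, evals))  # ['Uutinen päivältä']
```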
### Training Infrastructure

- **Hardware**: RTX 6000 PRO Blackwell (96 GB VRAM), [Verda.com](https://verda.com), Finland
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`
- **NeMo**: commit `557177a18d` (r2.6.0 / v2.8.0rc0), editable install

---

## 📂 Repository Structure

```
.
├── NeMo/                        # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo   # Round 2 fine-tuned model (1B)
│   ├── canary-finnish.nemo      # Round 1 fine-tuned model (1B)
│   ├── canary-1b-v2.nemo        # Base Canary-v2 model
│   ├── kenlm_1M.nemo            # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo            # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo            # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py        # Speaker-diarized inference (best for long audio)
├── inference_vad.py             # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json        # 30-min podcast example (diarized + KenLM)
├── moo_merged_greedy.json       # 30-min podcast example (diarized, greedy)
├── PLAN_AND_PROGRESS.md         # Detailed training & analysis log
└── README.md
```

---

## 🛠️ Setup

### Prerequisites

- NVIDIA GPU with ≥ 48 GB VRAM (tested on a 96 GB RTX 6000 PRO Blackwell)
- Docker with NVIDIA Container Toolkit
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`

### Install

```bash
git clone https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2
cd Finnish-ASR-Canary-v2

# NeMo with required patches already applied
cd NeMo && pip install -e '.[asr]'
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
    kaldialign wandb soundfile editdistance
```

### Additional setup for long-form diarized inference (`inference_pyannote.py`)

`inference_pyannote.py` requires pyannote and transformers components on top of base NeMo:

```bash
pip install pyannote.audio transformers accelerate sentencepiece

# Required by the torchaudio 2.10+ audio I/O path in this container
pip install torchcodec
```

Set your Hugging Face token before running diarization (used to download `pyannote/speaker-diarization-community-1`):

```bash
export HF_TOKEN=your_hf_token
```

Or place it in `.env` as:

```bash
HF_TOKEN=your_hf_token
```

### Critical NeMo Patches (already applied in the included NeMo)

1. **OneLogger Fix**: makes proprietary telemetry optional for public containers
2. **Canary2 EOS Assertion Fix**: relaxes a strict EOS check to allow inference with placeholder transcripts

---

## 🙏 Acknowledgments

- **Foundation**: Built on NVIDIA's [Canary-v2](https://huggingface.co/nvidia/canary-1b-v2) architecture
- **Training Infrastructure**: [Verda.com](https://verda.com) GPU cloud, Finland
- **Data Sources**:
  - [Mozilla Common Voice](https://commonvoice.mozilla.org/) v24.0
  - [Google FLEURS](https://huggingface.co/datasets/google/fleurs)
  - [CSS10 Finnish](https://github.com/Kyubyong/css10)
  - [VoxPopuli](https://github.com/facebookresearch/voxpopuli) (European Parliament)

### Citations

```bibtex
@inproceedings{park2019css10,
  title     = {CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages},
  author    = {Park, Kyubyong and Mulc, Thomas},
  booktitle = {Interspeech},
  year      = {2019}
}

@inproceedings{wang2021voxpopuli,
  title     = {VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning,
               Semi-Supervised Learning and Interpretation},
  author    = {Wang, Changhan and Rivi{\`e}re, Morgane and Lee, Ann and Wu, Anne and
               Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and
               Pino, Juan and Dupoux, Emmanuel},
  booktitle = {Proceedings of ACL},
  year      = {2021}
}
```
|