---
language:
- fi
license: mit
tags:
- automatic-speech-recognition
- asr
- speech-recognition
- canary-v2
- kenlm
- finnish
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
- facebook/voxpopuli
base_model: nvidia/canary-1b-v2
pipeline_tag: automatic-speech-recognition
library_name: nemo
model-index:
- name: Finnish ASR Canary-v2 Round 2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice v24.0
      type: mozilla-foundation/common_voice_17_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 4.58
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS Finnish
      type: google/fleurs
      config: fi_fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.75
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CSS10 Finnish
      type: asr-benchmark
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.03
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli Finnish
      type: facebook/voxpopuli
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 11.65
---

# 🇫🇮 Finnish ASR Canary-v2: State-of-the-Art Finnish Speech Recognition

A high-performance fine-tuned version of NVIDIA's **Canary-v2** (1B-parameter) model, optimized for Finnish. This project provides a robust Finnish ASR solution through two rounds of fine-tuning, combined with a 6-gram KenLM language model for shallow fusion.

> **Round 2 (March 2026)** – Improved training corpus (28,857 samples), TTS-augmented long-form data, and transcript normalization. Best overall result on Common Voice and CSS10. See [Round 2 Analysis](#round-2-analysis) below.

---

## 🚀 Performance Benchmarks (WER %)

All numbers use jiwer normalization (lowercase, punctuation stripped). Lower is better.
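To make the scoring convention concrete, here is a minimal pure-Python sketch of jiwer-style WER: lowercase, strip punctuation, then compute word-level edit distance. The `normalize` and `wer` helpers are illustrative stand-ins written for this README, not code from this repo or from jiwer itself.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation, keeping Finnish letters (ä, ö)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # \w is Unicode-aware in Python 3
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        return float(len(hyp) > 0)
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)


# Case and punctuation differences do not count as errors:
print(wer("Hyvää huomenta, Suomi!", "hyvää huomenta suomi"))  # 0.0
```

This is why a hypothesis that differs from the reference only in casing or punctuation scores 0% WER in the tables below.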
### Best Configuration Per Dataset

| Dataset | R1 + KenLM 5M | R2 Greedy | R2 + KenLM 5M | **Best** |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice** | 5.98% | 5.41% | **4.58%** | R2 + KenLM |
| **FLEURS** | **6.48%** | 8.39% | 7.75% | R1 + KenLM |
| **CSS10 (Audiobook)** | 11.85% | **7.03%** | 12.39% | R2 Greedy |
| **VoxPopuli (Parliament)** | **5.73%** | 13.91% | 13.23% | R1 + KenLM |
| **Global Average** | 7.51% | 8.69% | 9.49% | R1 + KenLM |

> [!NOTE]
> VoxPopuli is the one domain where R1 still leads. The R2 regression is caused by transcript normalization during training (number words → digits) while the eval manifest retains word-form numbers. This will be corrected in Round 3.

### Full Benchmark Table

| Model | CommonVoice | FLEURS | CSS10 | VoxPopuli | Avg |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Base Canary-v2 | 17.95% | 7.79% | 17.07% | 7.96% | 12.69% |
| R1 Greedy | 12.82% | 8.33% | 12.19% | 4.46% | 9.45% |
| R1 + KenLM 5M | 5.98% | 6.48% | 11.85% | 5.73% | **7.51%** |
| R2 Greedy | 5.41% | 8.39% | **7.03%** | 13.91% | 8.69% |
| R2 + KenLM 5M | **4.58%** | **7.75%** | 12.39% | 13.23% | 9.49% |

### KenLM Impact Within R2

| Dataset | R2 Greedy | R2 + KenLM | Δ | Verdict |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 5.41% | **4.58%** | −15.3% | KenLM helps |
| FLEURS | 8.39% | **7.75%** | −7.6% | KenLM helps |
| CSS10 | **7.03%** | 12.39% | +76% | KenLM hurts – use greedy |
| VoxPopuli | 13.91% | **13.23%** | −4.9% | Marginal |

> [!IMPORTANT]
> **KenLM and CSS10**: When the acoustic model is already very accurate (7% WER), the n-gram LM can override high-confidence acoustic decisions with mismatched web-text Finnish. Always benchmark KenLM on your target domain before deploying.
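The failure mode in the note above is easy to see in a toy shallow-fusion reranker: the decoder picks the hypothesis maximizing `am_logp + alpha * lm_logp`, so a strong LM preference can outvote a confident acoustic score. The function and the beam scores below are invented for illustration; this is not the NeMo decoder.

```python
def shallow_fusion_pick(hypotheses, alpha):
    """Rerank beam hypotheses by acoustic log-prob + alpha * LM log-prob."""
    return max(hypotheses, key=lambda h: h["am_logp"] + alpha * h["lm_logp"])


# Toy beam: the acoustic model is confident in the correct transcript,
# but the web-text LM strongly prefers a more common phrasing.
beam = [
    {"text": "correct rare phrasing", "am_logp": -2.0, "lm_logp": -30.0},
    {"text": "common wrong phrasing", "am_logp": -5.0, "lm_logp": -8.0},
]

print(shallow_fusion_pick(beam, alpha=0.0)["text"])  # acoustic alone wins
print(shallow_fusion_pick(beam, alpha=0.2)["text"])  # LM flips the decision
```

With `alpha=0.2` the fused scores are −8.0 vs −6.6, so the LM-preferred (wrong) hypothesis wins, which is exactly the CSS10 pattern: the better the acoustic model, the less headroom there is for the LM to help, and the more room for it to hurt.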
---

## 📖 Round 2 Analysis

### What Changed in Round 2

| Change | Detail |
| :--- | :--- |
| Training corpus | 28,857 samples (+24% vs R1's 23,180) |
| TTS long-form data | 4,377 synthesized samples (mean 14.5s, max 25s) added to shift duration distribution |
| `max_duration` | 20s → 30s to include TTS segments |
| Transcript normalization | Number words → digits, en-dash → ASCII |
| Init checkpoint | Base `canary-1b-v2.nemo` (fresh start, no R1 regressions inherited) |
| New eval sets | `eval_tts` (487 entries) and `eval_long_form` (200 entries, all >20s) |

### R2 Results vs R1

| Dataset | R1 Greedy | R2 Greedy | Δ | Why |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 12.82% | **5.41%** | −57.8% | TSV contamination fixed + normalization |
| CSS10 | 12.19% | **7.03%** | −42.3% | TTS data improved read-speech alignment |
| FLEURS | 8.33% | 8.39% | ≈ flat | Clean read speech; unchanged by TTS additions |
| VoxPopuli | **4.46%** | 13.91% | +211% | Normalization mismatch + TTS distribution shift |

### Key Lesson: Normalization Consistency

R2 normalized training transcripts (e.g. "kaksituhattaneljätoista" → "2014"), but the `eval_voxpopuli.json` evaluation manifest was not updated to match. This inflates VoxPopuli WER for R2. A forthcoming Round 3 will normalize all eval manifests consistently.

---

## 🏃 Running Inference

This model requires **NVIDIA NeMo** (commit `557177a18d`, included in this repo with two patches applied).
### Short Audio (< 30s)

```python
from nemo.collections.asr.models import EncDecMultiTaskModel
from omegaconf import OmegaConf

# Load R2 model (recommended for most use cases)
model = EncDecMultiTaskModel.restore_from("models/canary-finnish-v2.nemo")
model.eval().cuda()

# Greedy decoding – best for audiobooks, read speech
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
print(result[0].text)
```

### Short Audio with KenLM (recommended for conversational / CV-style audio)

```python
model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'beam',
        'beam': {
            'beam_size': 5,
            'ngram_lm_model': "models/kenlm_5M.nemo",
            'ngram_lm_alpha': 0.2,
        },
        'batch_size': 1
    })
)

result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
```

### Long-Form Audio (podcasts, interviews, lectures)

We provide two scripts for long-form audio. The **Pyannote-based pipeline** is the recommended general-purpose approach: it handles speaker changes and provides the most stable transcription context for Canary.

#### 1. Diarized Pipeline (Recommended) – `inference_pyannote.py`

This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25s chunks for Canary. This gives the best results for podcasts and multi-speaker audio.

```bash
# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --kenlm models/kenlm_5M.nemo \
    --output transcript.json
```

#### 2. VAD-only Pipeline – `inference_vad.py`

A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.
```bash
python inference_vad.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --output transcript.txt
```

#### Example Output

See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription example using the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.

---

## ⚙️ Parameter Recommendations

### By Content Type

| Content Type | `--min_silence_ms` | `--beam_size` | KenLM | Notes |
| :--- | :---: | :---: | :---: | :--- |
| **Podcast / interview** | 150 | 5 | Yes | Conversational Finnish, KenLM helps most |
| **Lecture / presentation** | 500–1000 | 5 | Yes | Longer pauses → sentence-level VAD splits |
| **Audiobook / read speech** | 150 | – | **No** | R2 greedy already at 7% WER; KenLM hurts |
| **Parliament / formal speech** | 150 | 4 | No | Use R1 model; R2 regressed on this domain |
| **Unknown / mixed** | 150 (default) | 5 | Yes | Safe default |

### KenLM Alpha Tuning

`--alpha` controls how strongly the LM influences decoding (0 = greedy, higher = more LM):

| α | Effect |
| :--- | :--- |
| 0.1 | Conservative – mostly acoustic |
| **0.2** | **Recommended default** |
| 0.3 | More LM correction – good for noisy audio |
| 0.5+ | Risky – LM can override correct acoustic output |

### Full CLI Reference

```
inference_vad.py
  --audio            Path to input audio file (WAV, 16kHz mono)
  --model            Path to .nemo acoustic model
  --kenlm            Path to .nemo KenLM bundle (omit for greedy)
  --output           Output path (.txt); .json written alongside automatically
  --chunk_len        Max chunk duration in seconds (default: 15)
  --beam_size        Beam width for KenLM decoding (default: 5)
  --alpha            KenLM language model weight (default: 0.2)
  --min_silence_ms   Min silence to split VAD segments (default: 150)
  --min_speech_ms    Min speech duration to keep a segment (default: 250)
  --speech_pad_ms    Padding added around each speech segment (default: 400)
```

---

## 🏗️ Methodology & Architecture

### Acoustic Model

Built
on NVIDIA's **Canary-v2** (Fast-Conformer AED, 1B parameters). Both rounds use `speech_to_text_finetune.py`, which restores the full model architecture from the base `.nemo` checkpoint – only the dataloader, optimizer, and tokenizer (kept frozen, `update_tokenizer: false`) need to be specified.

### KenLM Language Model

A **6-gram KenLM** trained on 5 million lines of high-quality Finnish text:

| Source | Lines |
| :--- | :---: |
| Reddit (Finnish communities) | 1.5M |
| FinePDF (Finnish documents) | 1.5M |
| Wiki-Edu (Wikipedia + educational) | 1.0M |
| ASR transcripts | ~23k |

Zero eval leakage: 1,833 sentences overlapping with evaluation sets were removed before training. The model is token-aligned with the Canary BPE tokenizer and runs on GPU via NVIDIA's **NGPU-LM** engine (binary `.nemo` bundle, loads in <10s).

### Training Infrastructure

- **Hardware**: RTX 6000 PRO Blackwell (96 GB VRAM), [Verda.com](https://verda.com), Finland
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`
- **NeMo**: commit `557177a18d` (r2.6.0 / v2.8.0rc0), editable install

---

## 📂 Repository Structure

```
.
├── NeMo/                        # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo   # Round 2 finetuned model (1B)
│   ├── canary-finnish.nemo      # Round 1 finetuned model (1B)
│   ├── canary-1b-v2.nemo        # Base Canary-v2 model
│   ├── kenlm_1M.nemo            # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo            # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo            # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py        # Speaker-diarized inference (BEST for long audio)
├── inference_vad.py             # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json        # 30-min podcast example (Diarized + KenLM)
├── moo_merged_greedy.json       # 30-min podcast example (Diarized, Greedy)
├── PLAN_AND_PROGRESS.md         # Detailed training & analysis log
└── README.md
```

---

## 🛠️ Setup

### Prerequisites

- NVIDIA GPU with ≥ 48 GB VRAM (tested on 96 GB RTX 6000 Pro Blackwell)
- Docker with NVIDIA Container Toolkit
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`

### Install

```bash
git clone https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2
cd Finnish-ASR-Canary-v2

# NeMo with required patches already applied
cd NeMo && pip install -e .[asr]

pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
    kaldialign wandb soundfile editdistance
```

### Additional setup for long-form diarized inference (`inference_pyannote.py`)

`inference_pyannote.py` requires pyannote + transformers components on top of base NeMo:

```bash
pip install pyannote.audio transformers accelerate sentencepiece

# Required by torchaudio 2.10+ audio I/O path in this container
pip install torchcodec
```

Set your Hugging Face token before running diarization (used to download `pyannote/speaker-diarization-community-1`):

```bash
export HF_TOKEN=your_hf_token
```

Or place it in `.env` as:

```bash
HF_TOKEN=your_hf_token
```

### Critical NeMo Patches (already applied in included NeMo)

1.
**OneLogger Fix** – makes proprietary telemetry optional for public containers
2. **Canary2 EOS Assertion Fix** – relaxes a strict EOS check to allow inference with placeholder transcripts

---

## 🙏 Acknowledgments

- **Foundation**: Built on NVIDIA's [Canary-v2](https://huggingface.co/nvidia/canary-1b-v2) architecture
- **Training Infrastructure**: [Verda.com](https://verda.com) GPU cloud, Finland
- **Data Sources**:
  - [Mozilla Common Voice](https://commonvoice.mozilla.org/) v24.0
  - [Google FLEURS](https://huggingface.co/datasets/google/fleurs)
  - [CSS10 Finnish](https://github.com/Kyubyong/css10)
  - [VoxPopuli](https://github.com/facebookresearch/voxpopuli) (European Parliament)

### Citations

```bibtex
@inproceedings{park2019css10,
  title={CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages},
  author={Park, Kyubyong and Mulc, Thomas},
  booktitle={Interspeech},
  year={2019}
}

@inproceedings{wang2021voxpopuli,
  title={VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
  author={Wang, Changhan and Riviere, Morgane and Lee, Ann and Wu, Anne and Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and Pino, Juan and Dupoux, Emmanuel},
  booktitle={ACL 2021},
  year={2021}
}
```