---
language:
- fi
license: mit
tags:
- automatic-speech-recognition
- asr
- speech-recognition
- canary-v2
- kenlm
- finnish
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
- facebook/voxpopuli
base_model: nvidia/canary-1b-v2
pipeline_tag: automatic-speech-recognition
library_name: nemo
model-index:
- name: Finnish ASR Canary-v2 Round 2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice v24.0
      type: mozilla-foundation/common_voice_17_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 4.58
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS Finnish
      type: google/fleurs
      config: fi_fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.75
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CSS10 Finnish
      type: asr-benchmark
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.03
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli Finnish
      type: facebook/voxpopuli
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 11.65
---

# 🇫🇮 Finnish ASR Canary-v2: State-of-the-Art Finnish Speech Recognition

A high-performance fine-tuned version of NVIDIA's **Canary-v2** (1B-parameter) model, optimized for Finnish. This project provides a robust Finnish ASR solution through two rounds of fine-tuning, combined with a 6-gram KenLM language model for shallow fusion.

> **Round 2 (March 2026)** – Improved training corpus (28,857 samples), TTS-augmented long-form data, and transcript normalization. Best overall result on Common Voice and CSS10. See [Round 2 Analysis](#round-2-analysis) below.

---

## 🚀 Performance Benchmarks (WER %)

All numbers use jiwer normalization (lowercase, punctuation stripped). Lower is better.
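To make the scoring convention concrete, here is a minimal pure-Python sketch of jiwer-style WER: lowercase, strip punctuation, then compute word-level edit distance. The `normalize` and `wer` helpers are illustrative stand-ins written for this README, not code from this repo or from jiwer itself.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation, keeping Finnish letters (ä, ö)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # \w is Unicode-aware in Python 3
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        return float(len(hyp) > 0)
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)


# Case and punctuation differences do not count as errors:
print(wer("Hyvää huomenta, Suomi!", "hyvää huomenta suomi"))  # 0.0
```

This is why a hypothesis that differs from the reference only in casing or punctuation scores 0% WER in the tables below.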
### Best Configuration Per Dataset

| Dataset | R1 + KenLM 5M | R2 Greedy | R2 + KenLM 5M | **Best** |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice** | 5.98% | 5.41% | **4.58%** | R2 + KenLM |
| **FLEURS** | **6.48%** | 8.39% | 7.75% | R1 + KenLM |
| **CSS10 (Audiobook)** | 11.85% | **7.03%** | 12.39% | R2 Greedy |
| **VoxPopuli (Parliament)** | **5.73%** | 13.91% | 13.23% | R1 + KenLM |
| **Global Average** | 7.51% | 8.69% | 9.49% | R1 + KenLM |

> [!NOTE]
> VoxPopuli is the one domain where R1 still leads. The R2 regression is caused by transcript normalization during training (number words → digits) while the eval manifest retains word-form numbers. This will be corrected in Round 3.

### Full Benchmark Table

| Model | CommonVoice | FLEURS | CSS10 | VoxPopuli | Avg |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Base Canary-v2 | 17.95% | 7.79% | 17.07% | 7.96% | 12.69% |
| R1 Greedy | 12.82% | 8.33% | 12.19% | 4.46% | 9.45% |
| R1 + KenLM 5M | 5.98% | 6.48% | 11.85% | 5.73% | **7.51%** |
| R2 Greedy | 5.41% | 8.39% | **7.03%** | 13.91% | 8.69% |
| R2 + KenLM 5M | **4.58%** | **7.75%** | 12.39% | 13.23% | 9.49% |

### KenLM Impact Within R2

| Dataset | R2 Greedy | R2 + KenLM | Δ | Verdict |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 5.41% | **4.58%** | −15.3% | KenLM helps |
| FLEURS | 8.39% | **7.75%** | −7.6% | KenLM helps |
| CSS10 | **7.03%** | 12.39% | +76% | KenLM hurts – use greedy |
| VoxPopuli | 13.91% | **13.23%** | −4.9% | Marginal |

> [!IMPORTANT]
> **KenLM and CSS10**: When the acoustic model is already very accurate (7% WER), the n-gram LM can override high-confidence acoustic decisions with mismatched web-text Finnish. Always benchmark KenLM on your target domain before deploying.
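The failure mode in the note above is easy to see in a toy shallow-fusion reranker: the decoder picks the hypothesis maximizing `am_logp + alpha * lm_logp`, so a strong LM preference can outvote a confident acoustic score. The function and the beam scores below are invented for illustration; this is not the NeMo decoder.

```python
def shallow_fusion_pick(hypotheses, alpha):
    """Rerank beam hypotheses by acoustic log-prob + alpha * LM log-prob."""
    return max(hypotheses, key=lambda h: h["am_logp"] + alpha * h["lm_logp"])


# Toy beam: the acoustic model is confident in the correct transcript,
# but the web-text LM strongly prefers a more common phrasing.
beam = [
    {"text": "correct rare phrasing", "am_logp": -2.0, "lm_logp": -30.0},
    {"text": "common wrong phrasing", "am_logp": -5.0, "lm_logp": -8.0},
]

print(shallow_fusion_pick(beam, alpha=0.0)["text"])  # acoustic alone wins
print(shallow_fusion_pick(beam, alpha=0.2)["text"])  # LM flips the decision
```

With `alpha=0.2` the fused scores are −8.0 vs −6.6, so the LM-preferred (wrong) hypothesis wins, which is exactly the CSS10 pattern: the better the acoustic model, the less headroom there is for the LM to help, and the more room for it to hurt.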
---

## 📖 Round 2 Analysis

### What Changed in Round 2

| Change | Detail |
| :--- | :--- |
| Training corpus | 28,857 samples (+24% vs R1's 23,180) |
| TTS long-form data | 4,377 synthesized samples (mean 14.5s, max 25s) added to shift duration distribution |
| `max_duration` | 20s → 30s to include TTS segments |
| Transcript normalization | Number words → digits, en-dash → ASCII |
| Init checkpoint | Base `canary-1b-v2.nemo` (fresh start, no R1 regressions inherited) |
| New eval sets | `eval_tts` (487 entries) and `eval_long_form` (200 entries, all >20s) |

### R2 Results vs R1

| Dataset | R1 Greedy | R2 Greedy | Δ | Why |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 12.82% | **5.41%** | −57.8% | TSV contamination fixed + normalization |
| CSS10 | 12.19% | **7.03%** | −42.3% | TTS data improved read-speech alignment |
| FLEURS | 8.33% | 8.39% | ≈ flat | Clean read speech; unchanged by TTS additions |
| VoxPopuli | **4.46%** | 13.91% | +211% | Normalization mismatch + TTS distribution shift |

### Key Lesson: Normalization Consistency

R2 normalized training transcripts (e.g. "kaksituhattaneljätoista" → "2014"), but the `eval_voxpopuli.json` evaluation manifest was not updated to match. This inflates VoxPopuli WER for R2. A forthcoming Round 3 will normalize all eval manifests consistently.

---

## 🏃 Running Inference

This model requires **NVIDIA NeMo** (commit `557177a18d`, included in this repo with two patches applied).
### Short Audio (< 30s)

```python
from nemo.collections.asr.models import EncDecMultiTaskModel
from omegaconf import OmegaConf

# Load R2 model (recommended for most use cases)
model = EncDecMultiTaskModel.restore_from("models/canary-finnish-v2.nemo")
model.eval().cuda()

# Greedy decoding – best for audiobooks, read speech
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
print(result[0].text)
```

### Short Audio with KenLM (recommended for conversational / CV-style audio)

```python
model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'beam',
        'beam': {
            'beam_size': 5,
            'ngram_lm_model': "models/kenlm_5M.nemo",
            'ngram_lm_alpha': 0.2,
        },
        'batch_size': 1
    })
)

result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
```

### Long-Form Audio (podcasts, interviews, lectures)

We provide two scripts for long-form audio. The **Pyannote-based pipeline** is the recommended general-purpose approach: it handles speaker changes and provides the most stable transcription context for Canary.

#### 1. Diarized Pipeline (Recommended) – `inference_pyannote.py`

This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25s chunks for Canary. This gives the best results for podcasts and multi-speaker audio.

```bash
# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --kenlm models/kenlm_5M.nemo \
    --output transcript.json
```

#### 2. VAD-only Pipeline – `inference_vad.py`

A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.
```bash
python inference_vad.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --output transcript.txt
```

#### Example Output

See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription example using the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.

---

## ⚙️ Parameter Recommendations

### By Content Type

| Content Type | `--min_silence_ms` | `--beam_size` | KenLM | Notes |
| :--- | :---: | :---: | :---: | :--- |
| **Podcast / interview** | 150 | 5 | Yes | Conversational Finnish, KenLM helps most |
| **Lecture / presentation** | 500–1000 | 5 | Yes | Longer pauses → sentence-level VAD splits |
| **Audiobook / read speech** | 150 | – | **No** | R2 greedy already at 7% WER; KenLM hurts |
| **Parliament / formal speech** | 150 | 4 | No | Use R1 model; R2 regressed on this domain |
| **Unknown / mixed** | 150 (default) | 5 | Yes | Safe default |

### KenLM Alpha Tuning

`--alpha` controls how strongly the LM influences decoding (0 = greedy, higher = more LM):

| α | Effect |
| :--- | :--- |
| 0.1 | Conservative – mostly acoustic |
| **0.2** | **Recommended default** |
| 0.3 | More LM correction – good for noisy audio |
| 0.5+ | Risky – LM can override correct acoustic output |

### Full CLI Reference

```
inference_vad.py
  --audio            Path to input audio file (WAV, 16kHz mono)
  --model            Path to .nemo acoustic model
  --kenlm            Path to .nemo KenLM bundle (omit for greedy)
  --output           Output path (.txt); .json written alongside automatically
  --chunk_len        Max chunk duration in seconds (default: 15)
  --beam_size        Beam width for KenLM decoding (default: 5)
  --alpha            KenLM language model weight (default: 0.2)
  --min_silence_ms   Min silence to split VAD segments (default: 150)
  --min_speech_ms    Min speech duration to keep a segment (default: 250)
  --speech_pad_ms    Padding added around each speech segment (default: 400)
```

---

## 🏗️ Methodology & Architecture

### Acoustic Model

Built
on NVIDIA's **Canary-v2** (Fast-Conformer AED, 1B parameters). Both rounds use `speech_to_text_finetune.py`, which restores the full model architecture from the base `.nemo` checkpoint – only the dataloader, optimizer, and tokenizer (kept frozen, `update_tokenizer: false`) need to be specified.

### KenLM Language Model

A **6-gram KenLM** trained on 5 million lines of high-quality Finnish text:

| Source | Lines |
| :--- | :---: |
| Reddit (Finnish communities) | 1.5M |
| FinePDF (Finnish documents) | 1.5M |
| Wiki-Edu (Wikipedia + educational) | 1.0M |
| ASR transcripts | ~23k |

Zero eval leakage: 1,833 sentences overlapping with evaluation sets were removed before training. The model is token-aligned with the Canary BPE tokenizer and runs on GPU via NVIDIA's **NGPU-LM** engine (binary `.nemo` bundle, loads in <10s).

### Training Infrastructure

- **Hardware**: RTX 6000 PRO Blackwell (96 GB VRAM), [Verda.com](https://verda.com), Finland
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`
- **NeMo**: commit `557177a18d` (r2.6.0 / v2.8.0rc0), editable install

---

## 📂 Repository Structure

```
.
├── NeMo/                        # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo   # Round 2 finetuned model (1B)
│   ├── canary-finnish.nemo      # Round 1 finetuned model (1B)
│   ├── canary-1b-v2.nemo        # Base Canary-v2 model
│   ├── kenlm_1M.nemo            # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo            # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo            # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py        # Speaker-diarized inference (BEST for long audio)
├── inference_vad.py             # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json        # 30-min podcast example (Diarized + KenLM)
├── moo_merged_greedy.json       # 30-min podcast example (Diarized, Greedy)
├── PLAN_AND_PROGRESS.md         # Detailed training & analysis log
└── README.md
```

---

## 🛠️ Setup

### Prerequisites

- NVIDIA GPU with ≥ 48 GB VRAM (tested on 96 GB RTX 6000 Pro Blackwell)
- Docker with NVIDIA Container Toolkit
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`

### Install

```bash
git clone https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2
cd Finnish-ASR-Canary-v2

# NeMo with required patches already applied
cd NeMo && pip install -e .[asr]

pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
    kaldialign wandb soundfile editdistance
```

### Additional setup for long-form diarized inference (`inference_pyannote.py`)

`inference_pyannote.py` requires pyannote + transformers components on top of base NeMo:

```bash
pip install pyannote.audio transformers accelerate sentencepiece

# Required by torchaudio 2.10+ audio I/O path in this container
pip install torchcodec
```

Set your Hugging Face token before running diarization (used to download `pyannote/speaker-diarization-community-1`):

```bash
export HF_TOKEN=your_hf_token
```

Or place it in `.env` as:

```bash
HF_TOKEN=your_hf_token
```

### Critical NeMo Patches (already applied in included NeMo)

1.
**OneLogger Fix** – makes proprietary telemetry optional for public containers
2. **Canary2 EOS Assertion Fix** – relaxes a strict EOS check to allow inference with placeholder transcripts

---

## 🙏 Acknowledgments

- **Foundation**: Built on NVIDIA's [Canary-v2](https://huggingface.co/nvidia/canary-1b-v2) architecture
- **Training Infrastructure**: [Verda.com](https://verda.com) GPU cloud, Finland
- **Data Sources**:
  - [Mozilla Common Voice](https://commonvoice.mozilla.org/) v24.0
  - [Google FLEURS](https://huggingface.co/datasets/google/fleurs)
  - [CSS10 Finnish](https://github.com/Kyubyong/css10)
  - [VoxPopuli](https://github.com/facebookresearch/voxpopuli) (European Parliament)

### Citations

```bibtex
@inproceedings{park2019css10,
  title={CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages},
  author={Park, Kyubyong and Mulc, Thomas},
  booktitle={Interspeech},
  year={2019}
}

@inproceedings{wang2021voxpopuli,
  title={VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
  author={Wang, Changhan and Riviere, Morgane and Lee, Ann and Wu, Anne and Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and Pino, Juan and Dupoux, Emmanuel},
  booktitle={ACL 2021},
  year={2021}
}
```