---
language:
- fi
license: mit
tags:
- automatic-speech-recognition
- asr
- speech-recognition
- canary-v2
- kenlm
- finnish
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
- facebook/voxpopuli
base_model: nvidia/canary-1b-v2
pipeline_tag: automatic-speech-recognition
library_name: nemo
model-index:
- name: Finnish ASR Canary-v2 Round 2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice v24.0
      type: mozilla-foundation/common_voice_17_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 4.58
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS Finnish
      type: google/fleurs
      config: fi_fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.75
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CSS10 Finnish
      type: asr-benchmark
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.03
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli Finnish
      type: facebook/voxpopuli
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 11.65
---

# 🇫🇮 Finnish ASR Canary-v2: State-of-the-Art Finnish Speech Recognition

A fine-tuned version of NVIDIA's **Canary-v2** (1B parameters), optimized for Finnish. The project covers two rounds of fine-tuning and ships a 6-gram KenLM language model for shallow fusion at decoding time.

> **Round 2 (March 2026)** — Improved training corpus (28,857 samples), TTS-augmented long-form data, and transcript normalization. Best overall result on Common Voice and CSS10. See [Round 2 Analysis](#round-2-analysis) below.

---

## 🚀 Performance Benchmarks (WER %)

All numbers use jiwer normalization (lowercase, punctuation stripped). Lower is better.
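
For readers who want to sanity-check a score by hand, the following is a minimal, dependency-free sketch of the metric. It is not jiwer itself and not necessarily the exact transform chain used for the tables: it assumes "normalization" means lowercasing and removing ASCII punctuation only, and the function names are illustrative.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, split on whitespace (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Case and punctuation differences do not count as errors.
print(word_error_rate("Hyvää huomenta, Suomi!", "hyvää huomenta suomi"))
# One substituted word out of three.
print(word_error_rate("tämä on testi", "tämä oli testi"))
```

Non-ASCII punctuation such as the en-dash is not covered by `string.punctuation`, so a real evaluation script would need to handle it explicitly.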

### Best Configuration Per Dataset

| Dataset | R1 + KenLM 5M | R2 Greedy | R2 + KenLM 5M | **Best** |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice** | 5.98% | 5.41% | **4.58%** | R2 + KenLM |
| **FLEURS** | **6.48%** | 8.39% | 7.75% | R1 + KenLM |
| **CSS10 (Audiobook)** | 11.85% | **7.03%** | 12.39% | R2 Greedy |
| **VoxPopuli (Parliament)** | **5.73%** | 13.91% | 13.23% | R1 + KenLM |
| **Global Average** | 7.51% | 8.69% | 9.49% | R1 + KenLM |

> [!NOTE]
> R1 still leads on FLEURS and VoxPopuli. On VoxPopuli, R1 Greedy (4.46%, see the full table) beats every configuration listed above. The R2 regression there is attributed to transcript normalization during training (number words → digits) while the eval manifest retains word-form numbers. This will be corrected in Round 3.

### Full Benchmark Table

| Model | CommonVoice | FLEURS | CSS10 | VoxPopuli | Avg |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Base Canary-v2 | 17.95% | 7.79% | 17.07% | 7.96% | 12.69% |
| R1 Greedy | 12.82% | 8.33% | 12.19% | **4.46%** | 9.45% |
| R1 + KenLM 5M | 5.98% | **6.48%** | 11.85% | 5.73% | **7.51%** |
| R2 Greedy | 5.41% | 8.39% | **7.03%** | 13.91% | 8.69% |
| R2 + KenLM 5M | **4.58%** | 7.75% | 12.39% | 13.23% | 9.49% |

### KenLM Impact Within R2

| Dataset | R2 Greedy | R2 + KenLM | Δ | Verdict |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 5.41% | **4.58%** | −15.3% | KenLM helps |
| FLEURS | 8.39% | **7.75%** | −7.6% | KenLM helps |
| CSS10 | **7.03%** | 12.39% | +76% | KenLM hurts — use greedy |
| VoxPopuli | 13.91% | **13.23%** | −4.9% | Marginal |
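
The Δ column is a relative change with respect to the greedy baseline, not a difference in percentage points. A short sketch that recomputes it from the WER values in the table (the helper name is illustrative):

```python
def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (R2 Greedy WER, R2 + KenLM WER), copied from the table above.
rows = {
    "Common Voice": (5.41, 4.58),
    "FLEURS": (8.39, 7.75),
    "CSS10": (7.03, 12.39),
    "VoxPopuli": (13.91, 13.23),
}

for name, (greedy, kenlm) in rows.items():
    print(f"{name}: {relative_change(greedy, kenlm):+.1f}%")
```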

> [!IMPORTANT]
> **KenLM and CSS10**: When the acoustic model is already very accurate (7% WER), the n-gram LM can override high-confidence acoustic decisions with mismatched web-text Finnish. Always benchmark KenLM on your target domain before deploying.
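
To make the override effect concrete, here is a toy sketch of shallow fusion at the hypothesis level: each candidate is ranked by its acoustic log-probability plus `alpha` times its LM log-probability. The candidates and scores are invented for illustration, and the real NeMo beam search applies the LM incrementally during decoding rather than rescoring whole hypotheses like this.

```python
def fused_score(am_logprob: float, lm_logprob: float, alpha: float) -> float:
    """Shallow-fusion score: acoustic log-prob plus weighted LM log-prob."""
    return am_logprob + alpha * lm_logprob


def pick_best(candidates, alpha: float) -> str:
    """Return the label of the candidate with the highest fused score."""
    return max(candidates, key=lambda c: fused_score(c[1], c[2], alpha))[0]


# (label, acoustic log-prob, LM log-prob); hypothetical values.
candidates = [
    ("acoustic_favorite", -1.0, -12.0),  # sounds right, unusual under the LM
    ("lm_favorite", -3.0, -4.0),         # acoustically weaker, common under the LM
]

for alpha in (0.0, 0.2, 0.5):
    print(alpha, pick_best(candidates, alpha))
```

With these numbers the acoustically preferred candidate wins at `alpha` 0.0 and 0.2, while at 0.5 the LM-preferred candidate takes over, which is the failure mode described for CSS10.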

---

## 📖 Round 2 Analysis

### What Changed in Round 2

| Change | Detail |
| :--- | :--- |
| Training corpus | 28,857 samples (+24% vs R1's 23,180) |
| TTS long-form data | 4,377 synthesized samples (mean 14.5s, max 25s) added to shift duration distribution |
| `max_duration` | 20s → 30s to include TTS segments |
| Transcript normalization | Number words → digits, en-dash → ASCII |
| Init checkpoint | Base `canary-1b-v2.nemo` (fresh start, no R1 regressions inherited) |
| New eval sets | `eval_tts` (487 entries) and `eval_long_form` (200 entries, all >20s) |

### R2 Results vs R1

| Dataset | R1 Greedy | R2 Greedy | Δ | Why |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 12.82% | **5.41%** | −57.8% | TSV contamination fixed + normalization |
| CSS10 | 12.19% | **7.03%** | −42.3% | TTS data improved read-speech alignment |
| FLEURS | 8.33% | 8.39% | ≈ flat | Clean read-speech; unchanged by TTS additions |
| VoxPopuli | **4.46%** | 13.91% | +212% | Normalization mismatch + TTS distribution shift |

### Key Lesson: Normalization Consistency

R2 normalized training transcripts (e.g. "kaksituhattaneljätoista" → "2014") but the `eval_voxpopuli.json` evaluation manifest was not updated to match. This inflates VoxPopuli WER for R2. A forthcoming Round 3 will normalize all eval manifests consistently.
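
A small sketch of why this matters: when only one side of the comparison is normalized, a correct transcription is still scored as an error. The mapping below contains just the single example from this section and is hypothetical, as are the function names; the mismatch count is a crude position-wise proxy, not a full WER.

```python
# Hypothetical mapping: only the example from the text above.
NUMBER_WORDS = {"kaksituhattaneljätoista": "2014"}


def normalize_transcript(text: str) -> str:
    """Apply the same rules to every transcript: en-dash to ASCII, number words to digits."""
    text = text.replace("\u2013", "-")
    return " ".join(NUMBER_WORDS.get(token, token) for token in text.split())


def mismatched_words(reference: str, hypothesis: str) -> int:
    """Position-wise word mismatches plus any length difference (rough proxy for errors)."""
    ref, hyp = reference.split(), hypothesis.split()
    return sum(r != h for r, h in zip(ref, hyp)) + abs(len(ref) - len(hyp))


reference = "vuonna kaksituhattaneljätoista"  # eval manifest keeps the word form
hypothesis = "vuonna 2014"                     # R2 outputs digits

print(mismatched_words(reference, hypothesis))
print(mismatched_words(normalize_transcript(reference), normalize_transcript(hypothesis)))
```

The first count is 1 even though the model heard the right number; after normalizing both sides the count drops to 0.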

---

## 🏃 Running Inference

This model requires **NVIDIA NeMo** (commit `557177a18d`, included in this repo with two patches applied).

### Short Audio (< 30s)

```python
from nemo.collections.asr.models import EncDecMultiTaskModel
from omegaconf import OmegaConf

# Load R2 model (recommended for most use cases)
model = EncDecMultiTaskModel.restore_from("models/canary-finnish-v2.nemo")
model.eval().cuda()

# Greedy decoding — best for audiobooks, read speech
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
print(result[0].text)
```

### Short Audio with KenLM (recommended for conversational / CV-style audio)

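```python
# Reuses `model` and `OmegaConf` from the previous example.
# Switch from greedy decoding to beam search with KenLM shallow fusion.
model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'beam',
        'beam': {
            'beam_size': 5,
            'ngram_lm_model': "models/kenlm_5M.nemo",
            'ngram_lm_alpha': 0.2,  # LM weight; see "KenLM Alpha Tuning"
        },
        'batch_size': 1
    })
)
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
print(result[0].text)
```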

### Long-Form Audio (podcasts, interviews, lectures)

We provide two scripts for long-form audio. The **Pyannote-based pipeline** is the recommended generalized approach as it handles speaker changes and provides the most stable transcription context for Canary.

#### 1. Diarized Pipeline (Recommended) — `inference_pyannote.py`
This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25s chunks for Canary. This provides the best results for podcasts and multi-speaker audio.

```bash
# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
  --audio long_recording.wav \
  --model models/canary-finnish-v2.nemo \
  --kenlm models/kenlm_5M.nemo \
  --output transcript.json
```
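
The merging step can be pictured with the sketch below. It is an assumption about the behaviour, not the code in `inference_pyannote.py`: it assumes adjacent segments are merged only when they share a speaker and the merged span stays within 25 seconds. The `Segment` type and `merge_segments` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str


def merge_segments(segments: list[Segment], max_len: float = 25.0) -> list[Segment]:
    """Greedily merge consecutive same-speaker segments into chunks of at most max_len seconds."""
    chunks: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = chunks[-1] if chunks else None
        if last and last.speaker == seg.speaker and seg.end - last.start <= max_len:
            # Extend the current chunk to cover this segment.
            chunks[-1] = Segment(last.start, seg.end, last.speaker)
        else:
            chunks.append(seg)
    return chunks


diarized = [
    Segment(0.0, 10.0, "A"),
    Segment(11.0, 20.0, "A"),
    Segment(21.0, 30.0, "A"),  # would exceed 25 s if merged, so it starts a new chunk
    Segment(30.0, 35.0, "B"),  # speaker change always starts a new chunk
]
for chunk in merge_segments(diarized):
    print(chunk)
```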

#### 2. VAD-only Pipeline — `inference_vad.py`
A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.

```bash
python inference_vad.py \
  --audio long_recording.wav \
  --model models/canary-finnish-v2.nemo \
  --output transcript.txt
```

#### Example Output
See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription example using the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.

---

## ⚙️ Parameter Recommendations

### By Content Type

| Content Type | `--min_silence_ms` | `--beam_size` | KenLM | Notes |
| :--- | :---: | :---: | :---: | :--- |
| **Podcast / interview** | 150 | 5 | Yes | Conversational Finnish, KenLM helps most |
| **Lecture / presentation** | 500–1000 | 5 | Yes | Longer pauses → sentence-level VAD splits |
| **Audiobook / read speech** | 150 | — | **No** | R2 greedy already at 7% WER; KenLM hurts |
| **Parliament / formal speech** | 150 | 4 | No | Use R1 model; R2 regressed on this domain |
| **Unknown / mixed** | 150 (default) | 5 | Yes | Safe default |
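
If you script these choices, the table can be encoded as presets. The sketch below builds an `inference_vad.py` command line from a content type; the `PRESETS` dictionary and `build_command` helper are hypothetical, the lecture preset uses the lower bound of the 500–1000 ms range, and `--beam_size` is passed wherever the table lists a value.

```python
# Presets transcribed from the table above (hypothetical structure).
PRESETS = {
    "podcast":    {"min_silence_ms": 150, "beam_size": 5,    "kenlm": True},
    "lecture":    {"min_silence_ms": 500, "beam_size": 5,    "kenlm": True},
    "audiobook":  {"min_silence_ms": 150, "beam_size": None, "kenlm": False},
    "parliament": {"min_silence_ms": 150, "beam_size": 4,    "kenlm": False},
    "unknown":    {"min_silence_ms": 150, "beam_size": 5,    "kenlm": True},
}


def build_command(content_type, audio, model, output, kenlm_path=None):
    """Return an argument list for inference_vad.py based on the content type."""
    preset = PRESETS.get(content_type, PRESETS["unknown"])
    cmd = [
        "python", "inference_vad.py",
        "--audio", audio,
        "--model", model,
        "--output", output,
        "--min_silence_ms", str(preset["min_silence_ms"]),
    ]
    if preset["beam_size"] is not None:
        cmd += ["--beam_size", str(preset["beam_size"])]
    if preset["kenlm"] and kenlm_path:
        cmd += ["--kenlm", kenlm_path]
    return cmd


print(" ".join(build_command(
    "podcast", "long_recording.wav", "models/canary-finnish-v2.nemo",
    "transcript.txt", kenlm_path="models/kenlm_5M.nemo",
)))
```

For parliament-style audio, pass the Round 1 checkpoint (`models/canary-finnish.nemo`) as `model`, in line with the table's note.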

### KenLM Alpha Tuning

`--alpha` controls how strongly the LM influences decoding (0 = no LM influence, higher = more LM):

| α | Effect |
| :--- | :--- |
| 0.1 | Conservative — mostly acoustic |
| **0.2** | **Recommended default** |
| 0.3 | More LM correction — good for noisy audio |
| 0.5+ | Risky — LM can override correct acoustic output |
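
Because the best value depends on the domain, a small sweep on a held-out set is usually worth the time. The sketch below shows only the selection logic: `wer_for_alpha` stands for whatever evaluation routine you have, and `toy_wer` is a made-up curve used purely so the example runs, not a measurement of this model.

```python
from typing import Callable, Iterable


def best_alpha(alphas: Iterable[float],
               wer_for_alpha: Callable[[float], float]) -> tuple[float, dict[float, float]]:
    """Evaluate each alpha and return the one with the lowest WER plus all scores."""
    scores = {alpha: wer_for_alpha(alpha) for alpha in alphas}
    return min(scores, key=scores.get), scores


def toy_wer(alpha: float) -> float:
    """Placeholder curve with a minimum at 0.2; replace with a real dev-set evaluation."""
    return 5.0 + 40.0 * (alpha - 0.2) ** 2


alpha, scores = best_alpha([0.0, 0.1, 0.2, 0.3, 0.5], toy_wer)
print(alpha, scores)
```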

### Full CLI Reference

```
inference_vad.py
  --audio           Path to input audio file (WAV, 16kHz mono)
  --model           Path to .nemo acoustic model
  --kenlm           Path to .nemo KenLM bundle (omit for greedy)
  --output          Output path (.txt); .json written alongside automatically
  --chunk_len       Max chunk duration in seconds (default: 15)
  --beam_size       Beam width for KenLM decoding (default: 5)
  --alpha           KenLM language model weight (default: 0.2)
  --min_silence_ms  Min silence to split VAD segments (default: 150)
  --min_speech_ms   Min speech duration to keep a segment (default: 250)
  --speech_pad_ms   Padding added around each speech segment (default: 400)
```
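
For orientation, the flags above could be declared with `argparse` roughly as follows. This is a sketch, not the parser in `inference_vad.py`: the defaults come from the reference above, while the argument types and which flags are required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; types and required-ness are assumed."""
    p = argparse.ArgumentParser(prog="inference_vad.py")
    p.add_argument("--audio", required=True, help="Input audio file (WAV, 16kHz mono)")
    p.add_argument("--model", required=True, help="Path to .nemo acoustic model")
    p.add_argument("--kenlm", default=None, help="Path to .nemo KenLM bundle (omit for greedy)")
    p.add_argument("--output", required=True, help="Output path (.txt)")
    p.add_argument("--chunk_len", type=float, default=15, help="Max chunk duration in seconds")
    p.add_argument("--beam_size", type=int, default=5, help="Beam width for KenLM decoding")
    p.add_argument("--alpha", type=float, default=0.2, help="KenLM language model weight")
    p.add_argument("--min_silence_ms", type=int, default=150, help="Min silence to split segments")
    p.add_argument("--min_speech_ms", type=int, default=250, help="Min speech duration to keep")
    p.add_argument("--speech_pad_ms", type=int, default=400, help="Padding around each segment")
    return p


args = build_parser().parse_args([
    "--audio", "long_recording.wav",
    "--model", "models/canary-finnish-v2.nemo",
    "--output", "transcript.txt",
])
# No --kenlm given, so this run would use greedy decoding.
print(args.kenlm, args.alpha, args.min_silence_ms)
```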

---

## 🏗️ Methodology & Architecture

### Acoustic Model

Built on NVIDIA's **Canary-v2** (Fast-Conformer AED, 1B parameters). Both rounds use `speech_to_text_finetune.py`, which restores the full model architecture from the base `.nemo` checkpoint, so only the dataloader, optimizer, and tokenizer settings (tokenizer kept frozen, `update_tokenizer: false`) need to be specified.

### KenLM Language Model

A **6-gram KenLM** trained on 5 million lines of high-quality Finnish text:

| Source | Lines |
| :--- | :---: |
| Reddit (Finnish communities) | 1.5M |
| FinePDF (Finnish documents) | 1.5M |
| Wiki-Edu (Wikipedia + educational) | 1.0M |
| ASR transcripts | ~23k |

Zero eval leakage: 1,833 sentences overlapping with evaluation sets were removed before training. The model is token-aligned with the Canary BPE tokenizer and runs on GPU via NVIDIA's **NGPU-LM** engine (binary `.nemo` bundle, loads in <10s).
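
The leakage filter can be sketched as a set lookup over normalized sentences. The matching rule here (exact match after lowercasing and stripping ASCII punctuation) is an assumption, since the criterion used to find the 1,833 overlapping sentences is not described above; the function names are illustrative.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def canonical(sentence: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace."""
    return " ".join(sentence.lower().translate(_PUNCT).split())


def remove_eval_overlap(corpus_lines: list[str], eval_sentences: list[str]):
    """Drop corpus lines whose canonical form appears in any eval set."""
    blocked = {canonical(s) for s in eval_sentences}
    kept = [line for line in corpus_lines if canonical(line) not in blocked]
    return kept, len(corpus_lines) - len(kept)


corpus = ["Hyvää huomenta!", "Tämä on uusi lause.", "hyvää  huomenta"]
kept, removed = remove_eval_overlap(corpus, ["Hyvää huomenta."])
print(kept, removed)
```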

### Training Infrastructure

- **Hardware**: RTX 6000 PRO Blackwell (96 GB VRAM), [Verda.com](https://verda.com), Finland
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`
- **NeMo**: commit `557177a18d` (r2.6.0 / v2.8.0rc0), editable install

---

## 📂 Repository Structure

```
.
├── NeMo/                              # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo         # Round 2 finetuned model (1B)
│   ├── canary-finnish.nemo            # Round 1 finetuned model (1B)
│   ├── canary-1b-v2.nemo              # Base Canary-v2 model
│   ├── kenlm_1M.nemo                  # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo                  # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo                  # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py              # Speaker-diarized inference (BEST for long audio)
├── inference_vad.py                   # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json              # 30-min podcast example (Diarized + KenLM)
├── moo_merged_greedy.json             # 30-min podcast example (Diarized, Greedy)
├── PLAN_AND_PROGRESS.md               # Detailed training & analysis log
└── README.md
```

---

## 🛠️ Setup

### Prerequisites

- NVIDIA GPU with ≥ 48 GB VRAM (tested on 96 GB RTX 6000 Pro Blackwell)
- Docker with NVIDIA Container Toolkit
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`

### Install

```bash
git clone https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2
cd Finnish-ASR-Canary-v2

# NeMo with required patches already applied
cd NeMo && pip install -e .[asr]
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
            kaldialign wandb soundfile editdistance
```

### Additional setup for long-form diarized inference (`inference_pyannote.py`)

`inference_pyannote.py` requires pyannote + transformers components on top of base NeMo:

```bash
pip install pyannote.audio transformers accelerate sentencepiece

# Required by torchaudio 2.10+ audio I/O path in this container
pip install torchcodec
```

Set your Hugging Face token before running diarization (used to download `pyannote/speaker-diarization-community-1`):

```bash
export HF_TOKEN=your_hf_token
```

Or place it in `.env` as:

```bash
HF_TOKEN=your_hf_token
```

### Critical NeMo Patches (already applied in included NeMo)

1. **OneLogger Fix** — makes proprietary telemetry optional for public containers
2. **Canary2 EOS Assertion Fix** — relaxes a strict EOS check to allow inference with placeholder transcripts

---

## 🙏 Acknowledgments

- **Foundation**: Built on NVIDIA's [Canary-v2](https://huggingface.co/nvidia/canary-1b-v2) architecture
- **Training Infrastructure**: [Verda.com](https://verda.com) GPU cloud, Finland
- **Data Sources**:
  - [Mozilla Common Voice](https://commonvoice.mozilla.org/) v24.0
  - [Google FLEURS](https://huggingface.co/datasets/google/fleurs)
  - [CSS10 Finnish](https://github.com/Kyubyong/css10)
  - [VoxPopuli](https://github.com/facebookresearch/voxpopuli) (European Parliament)

### Citations

```bibtex
@inproceedings{park2019css10,
  title={CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages},
  author={Park, Kyubyong and Mulc, Thomas},
  booktitle={Interspeech},
  year={2019}
}

@inproceedings{wang2021voxpopuli,
  title={VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning,
         Semi-Supervised Learning and Interpretation},
  author={Wang, Changhan and Rivi{\`e}re, Morgane and Lee, Ann and Wu, Anne and
          Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and
          Pino, Juan and Dupoux, Emmanuel},
  booktitle={ACL 2021},
  year={2021}
}
```