---
language:
- fi
license: mit
tags:
- automatic-speech-recognition
- asr
- speech-recognition
- canary-v2
- kenlm
- finnish
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
- facebook/voxpopuli
base_model: nvidia/canary-1b-v2
pipeline_tag: automatic-speech-recognition
library_name: nemo
model-index:
- name: Finnish ASR Canary-v2 Round 2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice v24.0
      type: mozilla-foundation/common_voice_17_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 4.58
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS Finnish
      type: google/fleurs
      config: fi_fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.75
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CSS10 Finnish
      type: asr-benchmark
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.03
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli Finnish
      type: facebook/voxpopuli
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 11.65
---

# 🇫🇮 Finnish ASR Canary-v2: State-of-the-Art Finnish Speech Recognition

A fine-tuned version of NVIDIA's **Canary-v2** (1B parameters), optimized for Finnish. The model went through two rounds of fine-tuning and can be combined with a 6-gram KenLM language model via shallow fusion.

> **Round 2 (March 2026):** Improved training corpus (28,857 samples), TTS-augmented long-form data, and transcript normalization. Best overall result on Common Voice and CSS10. See [Round 2 Analysis](#round-2-analysis) below.

---

## 🚀 Performance Benchmarks (WER %)

All numbers use jiwer normalization (lowercase, punctuation stripped). Lower is better.
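
As a rough illustration of the scoring described above, the sketch below computes WER after lowercasing and stripping punctuation. It uses only the standard library as a stand-in for jiwer; the function names `normalize` and `wer` are hypothetical, and the exact jiwer transform chain behind the published numbers is not shown in this README.

```python
import string

def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Two-row dynamic programming over the Levenshtein table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Casing and punctuation differences do not count as errors.
print(f"{100 * wer('Hyvää huomenta, Suomi!', 'hyvää huomenta suomi'):.2f}%")
# One substituted word out of three.
print(f"{100 * wer('Hyvää huomenta, Suomi!', 'hyvää iltaa suomi'):.2f}%")
```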

### Best Configuration Per Dataset

| Dataset | R1 + KenLM 5M | R2 Greedy | R2 + KenLM 5M | **Best** |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice** | 5.98% | 5.41% | **4.58%** | R2 + KenLM |
| **FLEURS** | **6.48%** | 8.39% | 7.75% | R1 + KenLM |
| **CSS10 (Audiobook)** | 11.85% | **7.03%** | 12.39% | R2 Greedy |
| **VoxPopuli (Parliament)** | **5.73%** | 13.91% | 13.23% | R1 + KenLM |
| **Global Average** | 7.51% | 8.69% | 9.49% | R1 + KenLM |

> [!NOTE]
> VoxPopuli is the domain where R1 leads by the widest margin (FLEURS also still favors R1 + KenLM). The R2 regression is attributed to transcript normalization during training (number words → digits) while the eval manifest retains word-form numbers. This will be corrected in Round 3.

### Full Benchmark Table

| Model | CommonVoice | FLEURS | CSS10 | VoxPopuli | Avg |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Base Canary-v2 | 17.95% | 7.79% | 17.07% | 7.96% | 12.69% |
| R1 Greedy | 12.82% | 8.33% | 12.19% | **4.46%** | 9.45% |
| R1 + KenLM 5M | 5.98% | **6.48%** | 11.85% | 5.73% | **7.51%** |
| R2 Greedy | 5.41% | 8.39% | **7.03%** | 13.91% | 8.69% |
| R2 + KenLM 5M | **4.58%** | 7.75% | 12.39% | 13.23% | 9.49% |
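
The Avg column is consistent with an unweighted mean of the four per-dataset WERs. A minimal sketch that reproduces it from the table values follows; the dictionary layout and the half-up rounding are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dataset WER (%) copied from the table: CommonVoice, FLEURS, CSS10, VoxPopuli.
wer_table = {
    "Base Canary-v2": ["17.95", "7.79", "17.07", "7.96"],
    "R1 Greedy": ["12.82", "8.33", "12.19", "4.46"],
    "R1 + KenLM 5M": ["5.98", "6.48", "11.85", "5.73"],
    "R2 Greedy": ["5.41", "8.39", "7.03", "13.91"],
    "R2 + KenLM 5M": ["4.58", "7.75", "12.39", "13.23"],
}

def unweighted_mean(values: list[str]) -> Decimal:
    """Mean of decimal strings, rounded half-up to two places."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

averages = {name: unweighted_mean(row) for name, row in wer_table.items()}
for name, avg in averages.items():
    print(f"{name:<16} {avg}%")
```

Because the mean is unweighted, a single outlier such as R2 on VoxPopuli moves the average noticeably, so the per-dataset columns are the better guide when choosing a configuration.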

### KenLM Impact Within R2

| Dataset | R2 Greedy | R2 + KenLM | Δ (relative) | Verdict |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 5.41% | **4.58%** | −15.3% | KenLM helps |
| FLEURS | 8.39% | **7.75%** | −7.6% | KenLM helps |
| CSS10 | **7.03%** | 12.39% | +76% | KenLM hurts; use greedy |
| VoxPopuli | 13.91% | **13.23%** | −4.9% | Marginal |

> [!IMPORTANT]
> **KenLM and CSS10**: When the acoustic model is already very accurate (7% WER), the n-gram LM can override high-confidence acoustic decisions with mismatched web-text Finnish. Always benchmark KenLM on your target domain before deploying.
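
The Δ column in the table above is the relative change in WER from greedy to KenLM decoding. The sketch below recomputes it and picks the better decoder per dataset, which is the kind of check worth running on your own held-out data. `relative_change` and `pick_decoder` are hypothetical helper names, not part of this repository.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Relative WER change in percent; negative means the candidate is better."""
    return 100.0 * (candidate - baseline) / baseline

def pick_decoder(greedy_wer: float, kenlm_wer: float) -> str:
    """Return the decoding mode with the lower WER on a held-out set."""
    return "kenlm" if kenlm_wer < greedy_wer else "greedy"

# (greedy WER, KenLM WER) for the R2 model, copied from the table above.
r2_results = {
    "Common Voice": (5.41, 4.58),
    "FLEURS": (8.39, 7.75),
    "CSS10": (7.03, 12.39),
    "VoxPopuli": (13.91, 13.23),
}

for dataset, (greedy, kenlm) in r2_results.items():
    delta = relative_change(greedy, kenlm)
    print(f"{dataset:<13} {delta:+6.1f}%  -> {pick_decoder(greedy, kenlm)}")
```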

---

## 📖 Round 2 Analysis

### What Changed in Round 2

| Change | Detail |
| :--- | :--- |
| Training corpus | 28,857 samples (+24% vs R1's 23,180) |
| TTS long-form data | 4,377 synthesized samples (mean 14.5s, max 25s) added to shift duration distribution |
| `max_duration` | 20s → 30s to include TTS segments |
| Transcript normalization | Number words → digits, en dash → ASCII hyphen |
| Init checkpoint | Base `canary-1b-v2.nemo` (fresh start, no R1 regressions inherited) |
| New eval sets | `eval_tts` (487 entries) and `eval_long_form` (200 entries, all >20s) |

### R2 Results vs R1

| Dataset | R1 Greedy | R2 Greedy | Δ (relative) | Why |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 12.82% | **5.41%** | −57.8% | TSV contamination fixed + normalization |
| CSS10 | 12.19% | **7.03%** | −42.3% | TTS data improved read-speech alignment |
| FLEURS | 8.33% | 8.39% | ≈ flat | Clean read-speech; unchanged by TTS additions |
| VoxPopuli | **4.46%** | 13.91% | +212% | Normalization mismatch + TTS distribution shift |

### Key Lesson: Normalization Consistency

R2 normalized training transcripts (e.g. "kaksituhattaneljätoista" → "2014"), but the `eval_voxpopuli.json` evaluation manifest was not updated to match. Digit-form hypotheses are therefore scored as errors against word-form references, which inflates VoxPopuli WER for R2. A forthcoming Round 3 will normalize all eval manifests consistently.
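
One way to avoid this kind of mismatch is to run a single normalization function over both training and evaluation manifests. The sketch below assumes NeMo-style JSON-lines manifests with a `text` field. The `NUMBER_WORDS` mapping holds only the one example quoted above and is a placeholder for the real normalizer, which is not shown in this README.

```python
import json

# Placeholder mapping: only the example quoted above. A real normalizer
# would need full Finnish number-word coverage.
NUMBER_WORDS = {"kaksituhattaneljätoista": "2014"}

def normalize_transcript(text: str) -> str:
    """Apply identical rules everywhere: number words to digits, en dash to '-'."""
    text = text.replace("\u2013", "-")
    return " ".join(NUMBER_WORDS.get(w.lower(), w) for w in text.split())

def normalize_manifest(lines: list[str]) -> list[str]:
    """Rewrite the `text` field of each JSON-lines manifest entry."""
    out = []
    for line in lines:
        entry = json.loads(line)
        entry["text"] = normalize_transcript(entry["text"])
        out.append(json.dumps(entry, ensure_ascii=False))
    return out

train = ['{"audio_filepath": "a.wav", "duration": 3.2, "text": "vuonna kaksituhattaneljätoista"}']
evalm = ['{"audio_filepath": "b.wav", "duration": 2.9, "text": "Kaksituhattaneljätoista \u2013 alkoi"}']

# Both splits go through the same function, so references and training targets agree.
print(normalize_manifest(train)[0])
print(normalize_manifest(evalm)[0])
```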

---

## 🏃 Running Inference

This model requires **NVIDIA NeMo** (commit `557177a18d`, included in this repo with two patches applied).

### Short Audio (< 30s)

```python
from nemo.collections.asr.models import EncDecMultiTaskModel
from omegaconf import OmegaConf  # used by the KenLM decoding config in the next snippet

# Load R2 model (recommended for most use cases)
model = EncDecMultiTaskModel.restore_from("models/canary-finnish-v2.nemo")
model.eval().cuda()

# Greedy decoding: best for audiobooks, read speech
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
print(result[0].text)
```

### Short Audio with KenLM (recommended for conversational / CV-style audio)

```python
# Continues from the snippet above: reuses `model` and the OmegaConf import.
model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'beam',
        'beam': {
            'beam_size': 5,
            'ngram_lm_model': "models/kenlm_5M.nemo",
            'ngram_lm_alpha': 0.2,
        },
        'batch_size': 1
    })
)
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes"
)
print(result[0].text)
```

### Long-Form Audio (podcasts, interviews, lectures)

We provide two scripts for long-form audio. The **Pyannote-based pipeline** is the recommended general-purpose option because it handles speaker changes and gives Canary the most stable transcription context.

#### 1. Diarized Pipeline (Recommended): `inference_pyannote.py`
This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25s chunks for Canary. This provides the best results for podcasts and multi-speaker audio.

```bash
# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
  --audio long_recording.wav \
  --model models/canary-finnish-v2.nemo \
  --kenlm models/kenlm_5M.nemo \
  --output transcript.json
```
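
The merging step can be pictured as a greedy pass over the diarizer's output. The sketch below is an assumption about how such merging might work, not the code in `inference_pyannote.py`: it joins consecutive same-speaker segments while the merged span stays within a 25-second budget. The `Segment` type and the `merge_segments` name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds

def merge_segments(segments: list[Segment], max_len: float = 25.0) -> list[Segment]:
    """Greedily merge consecutive same-speaker segments up to max_len seconds."""
    merged: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = merged[-1] if merged else None
        fits = last is not None and seg.end - last.start <= max_len
        if last is not None and last.speaker == seg.speaker and fits:
            last.end = max(last.end, seg.end)  # extend the current chunk
        else:
            # Copy so the caller's segments are never mutated.
            merged.append(Segment(seg.speaker, seg.start, seg.end))
    return merged

turns = [
    Segment("A", 0.0, 9.0),
    Segment("A", 9.5, 20.0),
    Segment("A", 20.5, 31.0),  # would push the first chunk past 25 s
    Segment("B", 31.5, 40.0),
]
for chunk in merge_segments(turns):
    print(f"{chunk.speaker}: {chunk.start:.1f}-{chunk.end:.1f}s")
```

Note that this sketch never splits a segment, so a single diarized turn longer than the budget would pass through unchanged and would need separate handling.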

#### 2. VAD-only Pipeline: `inference_vad.py`
A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.

```bash
python inference_vad.py \
  --audio long_recording.wav \
  --model models/canary-finnish-v2.nemo \
  --output transcript.txt
```

#### Example Output
See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription example using the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.

---

## ⚙️ Parameter Recommendations

### By Content Type

| Content Type | `--min_silence_ms` | `--beam_size` | KenLM | Notes |
| :--- | :---: | :---: | :---: | :--- |
| **Podcast / interview** | 150 | 5 | Yes | Conversational Finnish, KenLM helps most |
| **Lecture / presentation** | 500–1000 | 5 | Yes | Longer pauses → sentence-level VAD splits |
| **Audiobook / read speech** | 150 | n/a | **No** | R2 greedy already at 7% WER; KenLM hurts |
| **Parliament / formal speech** | 150 | 4 | No | Use R1 model; R2 regressed on this domain |
| **Unknown / mixed** | 150 (default) | 5 | Yes | Safe default |
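
For scripting, the table can be encoded as a lookup that builds an `inference_vad.py` argument list. This is a hypothetical convenience wrapper: the preset keys and the `build_args` helper are not part of the repository, the model paths are taken from the repository layout, and the lecture preset arbitrarily uses the lower end of the 500–1000 ms range.

```python
R1_MODEL = "models/canary-finnish.nemo"
R2_MODEL = "models/canary-finnish-v2.nemo"
KENLM = "models/kenlm_5M.nemo"

# (model, min_silence_ms, beam_size or None, use_kenlm), one entry per table row.
PRESETS = {
    "podcast": (R2_MODEL, 150, 5, True),
    "lecture": (R2_MODEL, 500, 5, True),  # lower end of the 500-1000 range
    "audiobook": (R2_MODEL, 150, None, False),
    "parliament": (R1_MODEL, 150, 4, False),
    "unknown": (R2_MODEL, 150, 5, True),
}

def build_args(content_type: str, audio: str, output: str) -> list[str]:
    """Build an inference_vad.py argument list for a content type."""
    model, min_silence_ms, beam_size, use_kenlm = PRESETS.get(
        content_type, PRESETS["unknown"]
    )
    args = ["python", "inference_vad.py", "--audio", audio, "--model", model,
            "--output", output, "--min_silence_ms", str(min_silence_ms)]
    if beam_size is not None:
        args += ["--beam_size", str(beam_size)]
    if use_kenlm:
        args += ["--kenlm", KENLM]
    return args

print(" ".join(build_args("audiobook", "book.wav", "book.txt")))
```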

### KenLM Alpha Tuning

`--alpha` controls how strongly the LM influences decoding (0 = acoustic scores only, higher = more LM):

| α | Effect |
| :--- | :--- |
| 0.1 | Conservative: mostly acoustic |
| **0.2** | **Recommended default** |
| 0.3 | More LM correction; good for noisy audio |
| 0.5+ | Risky: LM can override correct acoustic output |
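
In shallow fusion, each beam hypothesis is typically scored as the acoustic log-probability plus α times the LM log-probability. The toy sketch below uses made-up scores (not outputs of this model) to show how a larger α can flip the ranking toward the hypothesis the LM prefers. Real decoders may add further terms, such as a length bonus.

```python
def fused_score(am_logprob: float, lm_logprob: float, alpha: float) -> float:
    """Shallow-fusion score: acoustic log-prob plus alpha-weighted LM log-prob."""
    return am_logprob + alpha * lm_logprob

# Made-up (acoustic, LM) log-probabilities for two competing hypotheses.
hypotheses = {
    "acoustically_best": (-1.0, -9.0),
    "lm_preferred": (-3.0, -4.0),
}

def best_hypothesis(alpha: float) -> str:
    """Return the hypothesis with the highest fused score for a given alpha."""
    return max(hypotheses, key=lambda h: fused_score(*hypotheses[h], alpha))

for alpha in (0.0, 0.1, 0.2, 0.3, 0.5):
    print(f"alpha={alpha:.1f} -> {best_hypothesis(alpha)}")
```

With these toy numbers the ranking flips between α = 0.3 and α = 0.5, which mirrors the warning in the table: a high weight lets the LM outvote a confident acoustic decision.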

### Full CLI Reference

```
inference_vad.py
  --audio           Path to input audio file (WAV, 16kHz mono)
  --model           Path to .nemo acoustic model
  --kenlm           Path to .nemo KenLM bundle (omit for greedy)
  --output          Output path (.txt); .json written alongside automatically
  --chunk_len       Max chunk duration in seconds (default: 15)
  --beam_size       Beam width for KenLM decoding (default: 5)
  --alpha           KenLM language model weight (default: 0.2)
  --min_silence_ms  Min silence to split VAD segments (default: 150)
  --min_speech_ms   Min speech duration to keep a segment (default: 250)
  --speech_pad_ms   Padding added around each speech segment (default: 400)
```
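
For readers wrapping the script, the flags above map naturally onto an `argparse` parser. The sketch below mirrors the documented names and defaults. It is an illustrative reconstruction, not the parser in `inference_vad.py`, and the choice of which flags are required is an assumption.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented inference_vad.py flags and defaults."""
    p = argparse.ArgumentParser(prog="inference_vad.py")
    p.add_argument("--audio", required=True, help="input audio (WAV, 16 kHz mono)")
    p.add_argument("--model", required=True, help=".nemo acoustic model")
    p.add_argument("--kenlm", default=None, help=".nemo KenLM bundle; omit for greedy")
    p.add_argument("--output", required=True, help="output .txt path")
    p.add_argument("--chunk_len", type=float, default=15.0, help="max chunk length (s)")
    p.add_argument("--beam_size", type=int, default=5)
    p.add_argument("--alpha", type=float, default=0.2)
    p.add_argument("--min_silence_ms", type=int, default=150)
    p.add_argument("--min_speech_ms", type=int, default=250)
    p.add_argument("--speech_pad_ms", type=int, default=400)
    return p

args = build_parser().parse_args(
    ["--audio", "talk.wav", "--model", "models/canary-finnish-v2.nemo",
     "--output", "talk.txt", "--min_silence_ms", "500"]
)
decoding = "beam + KenLM" if args.kenlm else "greedy"
print(decoding, args.min_silence_ms, args.alpha)
```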

---

## 🏗️ Methodology & Architecture

### Acoustic Model

Built on NVIDIA's **Canary-v2** (Fast-Conformer AED, 1B parameters). Both rounds use `speech_to_text_finetune.py`, which restores the full model architecture from the base `.nemo` checkpoint, so only the dataloader, optimizer, and tokenizer (kept frozen, `update_tokenizer: false`) need to be specified.

### KenLM Language Model

A **6-gram KenLM** trained on 5 million lines of high-quality Finnish text:

| Source | Lines |
| :--- | :---: |
| Reddit (Finnish communities) | 1.5M |
| FinePDF (Finnish documents) | 1.5M |
| Wiki-Edu (Wikipedia + educational) | 1.0M |
| ASR transcripts | ~23k |

Zero eval leakage: 1,833 sentences overlapping with evaluation sets were removed before training. The model is token-aligned with the Canary BPE tokenizer and runs on GPU via NVIDIA's **NGPU-LM** engine (binary `.nemo` bundle, loads in <10s).
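
Leakage removal of this kind is usually a set-membership filter over normalized sentences. The sketch below shows one way to do it. The normalization rule (lowercase, collapse whitespace) and the function names are assumptions, since this README does not describe the exact matching criterion that was used.

```python
def canonical(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())

def remove_eval_overlap(
    corpus: list[str], eval_sentences: list[str]
) -> tuple[list[str], int]:
    """Drop corpus lines matching any eval sentence; return kept lines and drop count."""
    blocked = {canonical(s) for s in eval_sentences}
    kept = [line for line in corpus if canonical(line) not in blocked]
    return kept, len(corpus) - len(kept)

corpus = ["Hyvää huomenta kaikille", "Tämä on  testilause", "Sää on tänään kaunis"]
eval_set = ["tämä on testilause"]
kept, removed = remove_eval_overlap(corpus, eval_set)
print(f"removed {removed}, kept {len(kept)}")
```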

### Training Infrastructure

- **Hardware**: RTX 6000 PRO Blackwell (96 GB VRAM), [Verda.com](https://verda.com), Finland
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`
- **NeMo**: commit `557177a18d` (r2.6.0 / v2.8.0rc0), editable install

---

## 📂 Repository Structure

```
.
├── NeMo/                              # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo         # Round 2 finetuned model (1B)
│   ├── canary-finnish.nemo            # Round 1 finetuned model (1B)
│   ├── canary-1b-v2.nemo              # Base Canary-v2 model
│   ├── kenlm_1M.nemo                  # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo                  # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo                  # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py              # Speaker-diarized inference (BEST for long audio)
├── inference_vad.py                   # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json              # 30-min podcast example (Diarized + KenLM)
├── moo_merged_greedy.json             # 30-min podcast example (Diarized, Greedy)
├── PLAN_AND_PROGRESS.md               # Detailed training & analysis log
└── README.md
```

---

## 🛠️ Setup

### Prerequisites

- NVIDIA GPU with ≥ 48 GB VRAM (tested on 96 GB RTX 6000 Pro Blackwell)
- Docker with NVIDIA Container Toolkit
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`

### Install

```bash
git clone https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2
cd Finnish-ASR-Canary-v2

# NeMo with required patches already applied
cd NeMo && pip install -e .[asr]
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
            kaldialign wandb soundfile editdistance
```

### Additional setup for long-form diarized inference (`inference_pyannote.py`)

`inference_pyannote.py` requires pyannote + transformers components on top of base NeMo:

```bash
pip install pyannote.audio transformers accelerate sentencepiece

# Required by torchaudio 2.10+ audio I/O path in this container
pip install torchcodec
```

Set your Hugging Face token before running diarization (used to download `pyannote/speaker-diarization-community-1`):

```bash
export HF_TOKEN=your_hf_token
```

Or place it in `.env` as:

```bash
HF_TOKEN=your_hf_token
```
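
A script can resolve the token by checking the environment first and falling back to a `.env` file. The helper below is a minimal standard-library sketch of that lookup. `load_hf_token` is a hypothetical name, and the repository's scripts may use a different mechanism.

```python
import os
from pathlib import Path
from typing import Optional

def load_hf_token(env_path: str = ".env") -> Optional[str]:
    """Return HF_TOKEN from the environment, else from a KEY=VALUE .env file."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    path = Path(env_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip() == "HF_TOKEN":
            # Tolerate optional quotes around the value.
            return value.strip().strip("'\"") or None
    return None

token = load_hf_token()
print("HF_TOKEN found" if token else "HF_TOKEN missing")
```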

### Critical NeMo Patches (already applied in included NeMo)

1. **OneLogger Fix**: makes proprietary telemetry optional for public containers
2. **Canary2 EOS Assertion Fix**: relaxes a strict EOS check to allow inference with placeholder transcripts

---

## 🙏 Acknowledgments

- **Foundation**: Built on NVIDIA's [Canary-v2](https://huggingface.co/nvidia/canary-1b-v2) architecture
- **Training Infrastructure**: [Verda.com](https://verda.com) GPU cloud, Finland
- **Data Sources**:
  - [Mozilla Common Voice](https://commonvoice.mozilla.org/) v24.0
  - [Google FLEURS](https://huggingface.co/datasets/google/fleurs)
  - [CSS10 Finnish](https://github.com/Kyubyong/css10)
  - [VoxPopuli](https://github.com/facebookresearch/voxpopuli) (European Parliament)

### Citations

```bibtex
@inproceedings{park2019css10,
  title={CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages},
  author={Park, Kyubyong and Mulc, Thomas},
  booktitle={Interspeech},
  year={2019}
}

@inproceedings{wang2021voxpopuli,
  title={VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning,
         Semi-Supervised Learning and Interpretation},
  author={Wang, Changhan and Riviere, Morgane and Lee, Ann and Wu, Anne and
          Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and
          Pino, Juan and Dupoux, Emmanuel},
  booktitle={ACL 2021},
  year={2021}
}
```