# Finnish ASR: Canary-v2 Finetuning & Progress

This document provides a high-level overview of our Finnish ASR finetuning process, model architecture, and current progress for the Data Science team.

---

## 📊 Project Overview
Our goal is to adapt NVIDIA's **Canary-v2** (a 1-billion parameter multilingual model) for high-accuracy Finnish Automatic Speech Recognition (ASR). We leverage four diverse datasets to ensure robustness across different domains and speaking styles.

---

## 🏗️ Model Architecture
Canary-v2 is an **Attention-Encoder-Decoder (AED)** model that utilizes the **Fast-Conformer** architecture. This design allows for efficient processing of long audio sequences while maintaining high accuracy.

```mermaid
graph TD
    A[Audio Input] -->|Preprocessing| B[Mel Spectrogram]

    subgraph TrainingBlock [Finetuned Components]
        direction TB
        subgraph Encoder [Encoder: Acoustic Modeling]
            C1[Convolutional Subsampling] -->|Downsample| C2[Conformer Blocks]
            C2 -->|Latent Features| C_Out[Acoustic Latents]
        end

        subgraph Decoder [Decoder: Language Modeling]
            D1[Masked Self-Attention] --> D2[Cross-Attention]
            D2 --> D3[Feed Forward]
            D3 --> D_Out[Text Generation]
        end
    end

    B -->|Input| C1
    P[Input Prompts:<br/>Lang, Task, PnC] -->|Conditioning| D1
    C_Out -->|Acoustic Context| D2
    D_Out -->|Output| E[Finnish Text]

    %% Styling
    style TrainingBlock fill:#f0f7ff,stroke:#0052cc,stroke-width:3px,stroke-dasharray: 5 5
    style A fill:#ffffff,stroke:#333,stroke-width:2px
    style B fill:#ffffff,stroke:#333,stroke-width:2px
    style P fill:#ffffff,stroke:#333,stroke-width:2px
    style E fill:#e6ffed,stroke:#28a745,stroke-width:2px

    style Encoder fill:#ffffff,stroke:#0052cc,stroke-width:1px
    style Decoder fill:#ffffff,stroke:#0052cc,stroke-width:1px
```

### Component Roles & Finetuning:
- **Highlighted Area (Blue Dashed Box)**: This represents the core weights of the **Canary-v2** model. During our finetuning, we update the parameters in both the **Encoder** and **Decoder** to specifically recognize Finnish phonemes and grammar.
- **Mel Spectrogram**: The "Vision" stage. It turns raw audio waves into a structured 2D representation of sound frequencies over time.
- **Fast-Conformer Encoder**: The "Acoustic Processor." We finetuned this to understand the unique sounds of the Finnish language (like double vowels and consonants).
- **Input Prompts**: The "Context Injector." These are the same color as other inputs because they are part of the model's standard input pipeline, telling it: "Act as a Finnish ASR system."
- **Attention-Decoder**: The "Linguistic Brain." We finetuned this to map the Finnish sounds from the encoder into grammatically correct Finnish text, guided by the prompts.

---

## 🔄 Finetuning Workflow
Our pipeline is fully automated, from data ingestion to multi-dataset evaluation.

```mermaid
graph TD
    subgraph DataPrep [Data Preparation]
        D1[CSS10 Finnish] --> P[Unified Processing Script]
        D2[FLEURS Finnish] --> P
        D3[VoxPopuli Finnish] --> P
        D4[Common Voice v24] --> P
        P --> M1[train_manifest.json]
        P --> M2[eval_fleurs.json]
        P --> M3[eval_common_voice.json]
        P --> M4[eval_css10.json]
        P --> M5[eval_voxpopuli.json]
    end

    subgraph Training [Canary-v2 Finetuning]
        M1 --> T[NVIDIA NeMo Trainer]
        CM[nvidia/canary-1b-v2] --> T
        T --> CK[Model Checkpoints]
        M2 & M3 & M4 & M5 --> V[Multi-Validation]
        V --> W[WandB Tracking]
    end

    subgraph Inference [Post-Processing]
        CK --> Inf[Inference]
        Inf --> K[KenLM/NGPU-LM Integration]
        K --> R[Final ASR Output]
    end
```

---

## 📚 Datasets
We use a balanced mix of datasets to cover various audio qualities and transcript styles:

| Dataset | Source | Characteristics |
|---------|--------|-----------------|
| **FLEURS** | Google | High-quality, diverse speakers (Benchmark) |
| **Common Voice** | Mozilla | Crowdsourced, varied quality and accents |
| **CSS10** | Single Speaker | Clean, high-quality audiobooks |
| **VoxPopuli** | EU Parliament | European Parliament speeches (Formal) |

---

## 📊 Training Data Analysis

This section documents the composition and length distribution of our training data (from `RASMUS/canary-finnish-asr-data`, accessed 2026-02-26).

### Dataset Summary

| Dataset | Samples | Mean Duration | Max Duration | Total Hours |
|---------|---------|--------------|-------------|-------------|
| **Common Voice v24** | 9,086 | 4.5s | 10.5s | 11.2h |
| **VoxPopuli** | 8,164 | 10.1s | 50.5s | 23.0h |
| **CSS10** | 3,226 | 7.7s | 20.2s | 6.9h |
| **FLEURS** | 2,704 | 11.7s | 43.2s | 8.8h |
| **TOTAL** | **23,180** | **7.8s** | **50.5s** | **~50h** |

### Duration Distribution (Training Set)

```
 0–5s   : 33.3%  (7,725 samples)  ████████████████████████████████████████
 5–10s  : 43.7%  (10,139 samples) ████████████████████████████████████████████████████
10–15s  : 15.0%  (3,473 samples)  ██████████████████
15–20s  :  5.4%  (1,241 samples)  ██████
20–30s  :  2.4%  (562 samples)    ███
  >30s  :  0.2%  (40 samples)
```

**Key insight:** 77% of training samples are shorter than 10 seconds. The model has very little exposure to longer audio segments (only 0.2% are >30s). This has direct implications for long-form inference stability.
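
A bucket breakdown like the one above can be recomputed from any manifest with a few lines of Python. The following is a minimal sketch, assuming a JSON-lines manifest whose entries carry a `duration` field in seconds; the helper names `bucket_label` and `duration_histogram` and the three sample entries are hypothetical.

```python
import json
from collections import Counter

# Bucket edges in seconds, mirroring the distribution above; the last bucket is open-ended.
BUCKETS = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 30), (30, float("inf"))]


def bucket_label(duration):
    """Return the label of the bucket that contains `duration` (seconds)."""
    for lo, hi in BUCKETS:
        if lo <= duration < hi:
            return f"{lo}s+" if hi == float("inf") else f"{lo}-{hi}s"
    raise ValueError(f"negative duration: {duration}")


def duration_histogram(manifest_lines):
    """Count entries per bucket from an iterable of JSON-lines strings."""
    counts = Counter()
    for line in manifest_lines:
        if line.strip():  # skip blank lines
            counts[bucket_label(json.loads(line)["duration"])] += 1
    return counts


# Hypothetical three-entry manifest, for illustration only.
sample = [
    '{"audio_filepath": "a.wav", "duration": 4.5, "text": "hei"}',
    '{"audio_filepath": "b.wav", "duration": 7.7, "text": "moi"}',
    '{"audio_filepath": "c.wav", "duration": 31.0, "text": "terve"}',
]
hist = duration_histogram(sample)
total = sum(hist.values())
for label, n in hist.items():
    print(f"{label:>7}: {100 * n / total:5.1f}%  ({n} samples)")
```

Running the same function over the real training manifest and each eval manifest makes it easy to compare their length distributions side by side.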

### Evaluation Set Durations

| Eval Set | Samples | Mean Duration | Max Duration |
|----------|---------|--------------|-------------|
| FLEURS   | 918     | 13.0s        | 33.7s       |
| Common Voice | 1,554 | 5.1s       | 10.5s       |
| CSS10    | 170     | 7.5s         | 10.2s       |
| VoxPopuli | 430   | 10.6s        | 47.5s       |

---

## 🔢 Number Handling Analysis

### Live Inference Results: Base vs Finetuned (2026-02-26)

We ran both models on 5 FLEURS test samples to determine each model's number output style.

| # | Scenario | Reference | Base Canary-v2 | Our Finetuned |
|---|----------|-----------|----------------|---------------|
| 1 | Spoken "sata" (hundred) | `yli sata vuotta` | `yli 100 vuotta` ❌ | `yli 100 vuotta` ❌ |
| 2 | Spoken "seitsemäntoista" (17) | `surmaten seitsemäntoista henkeä` | `surmaten 17 henkeä` ❌ | `surmaten seitsemäntoista henkeä` ✅ |
| 3 | Digits in reference (15, 2011, 2017) | `15 metriä... 2011... 2017` | Correct ✅ | Correct ✅ |
| 4 | Abbreviation "jKr." (AD) | `400 jKr.` | `400 jälkeen Kristuksen` | `400 jälkeen Kristuksen` |
| 5 | Range "25–30" (en-dash U+2013) | `25–30 vuodella` | `25-30 vuodella` (ASCII hyphen) | `25 ⁇ 30 vuodella` ❌ UNK token |

**Key findings:**

1. **Base model outputs digits.** When the speaker says "sata" (hundred) or "seitsemäntoista" (seventeen), the base Canary-v2 outputs `100` and `17`. This is NVIDIA's built-in text normalisation: Canary always outputs digit form for numbers.

2. **Finetuning introduced inconsistency.** Our finetuning partially reversed this: for `seitsemäntoista` the finetuned model now outputs the written word (because FLEURS training transcripts used written-out numbers), but still outputs `100` for `sata`. This inconsistency is worse than either consistent policy.

3. **En-dash produces a UNK token in the finetuned model.** The character `–` (U+2013 en-dash) in `25–30` causes the finetuned model to emit `⁇` (SentencePiece UNK). The base model degrades gracefully to an ASCII hyphen `25-30`. This is a regression introduced by finetuning, likely because the en-dash was absent or inconsistently encoded in our training data.

4. **Abbreviations are expanded by both models.** `jKr.` → `jälkeen Kristuksen` in both, so this is model behaviour, not a finetuning artifact.

### Policy Decision
**We want digit output** (not written-out Finnish number words). The base model's behaviour is correct here. The finetuned model regressed on consistency because our FLEURS training transcripts used written-out numbers.

### Training Data Issues Found
- Only **2.5% (578 / 23,180)** of training samples contain digit characters at all.
- FLEURS transcripts use written-out numbers (`sata vuotta`) while VoxPopuli and Common Voice use digits. This gives the model conflicting signal.
- En-dash (`–` U+2013) may be absent or mis-encoded in training manifests, causing UNK tokens at inference time.
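
Counts like these can be reproduced with a quick audit pass over the manifest. The sketch below assumes JSON-lines entries with a `text` field; `audit_texts` and the sample entries are hypothetical, and it is not the script that produced the figures above.

```python
import json
import re

DIGIT_RE = re.compile(r"\d")
# Dash characters that the tokenizer may not know about.
DASHES = {"\u2013": "en-dash", "\u2014": "em-dash"}


def audit_texts(manifest_lines):
    """Count entries containing digits and entries containing each dash type."""
    stats = {"total": 0, "with_digits": 0, "en-dash": 0, "em-dash": 0}
    for line in manifest_lines:
        if not line.strip():
            continue
        text = json.loads(line)["text"]
        stats["total"] += 1
        if DIGIT_RE.search(text):
            stats["with_digits"] += 1
        for char, name in DASHES.items():
            if char in text:
                stats[name] += 1
    return stats


# Hypothetical sample entries, for illustration only.
sample = [
    json.dumps({"text": "yli sata vuotta"}),
    json.dumps({"text": "400 jKr."}),
    json.dumps({"text": "25\u201330 vuodella"}),
]
stats = audit_texts(sample)
print(stats)
```

Grouping the same counters by source dataset would show which corpus contributes the written-out numbers and which contributes the dashes.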

### Action Plan: Numbers & UNK Token

#### Step 1: Normalise training transcripts to digit form
Run a pre-processing pass on `train_manifest.json` before the next training run:
- Convert Finnish written-out numbers to digits, e.g. `sata` → `100`, `seitsemäntoista` → `17`. Note that the Python library `num2words` (with `lang='fi'`) only converts digits to words, so the reverse direction needs a lookup table built from its output or a dedicated parser.
- OR (simpler / safer): replace the FLEURS transcripts in the manifest with their **raw reference texts which already have digits** (FLEURS provides both `raw_transcription` and `transcription` columns; currently we use `raw_transcription` which has written numbers).
- Target: **all numeric quantities consistently in digit form** across all four datasets.

#### Step 2: Fix en-dash encoding (ROOT CAUSE CONFIRMED)

**Confirmed via tokenizer inspection (2026-02-26):**

```python
# m is the loaded Canary model (an EncDecMultiTaskModel instance).
m.tokenizer.text_to_ids("25–30")  # → [16053, 1125, 1128, 0, 1126, 1123]
# The fourth id is 0 = UNK: the en-dash is not in the vocabulary.
m.tokenizer.text_to_ids("25-30")  # → [16053, 1125, 1128, 16107, 1126, 1123]
# The ASCII hyphen tokenises correctly (id 16107).
```

- **En-dash `–` (U+2013) and em-dash `—` (U+2014) are NOT in the CanaryBPETokenizer vocabulary** (both map to UNK id 0).
- Training data contains **85 entries with en-dash** (83 FLEURS, 2 Common Voice). During training, the en-dash in the TARGET text was encoded as UNK, so the model learned to produce UNK for the corresponding speech sounds.
- **Fix: replace all `–` and `—` with ASCII hyphen `-` in all training transcripts** before the next training run. This is a one-line preprocessing step.

```python
# In manifest preprocessing:
text = text.replace('\u2013', '-').replace('\u2014', '-')
```

#### Step 3: Re-evaluate after normalisation
After normalising transcripts, re-run the 5-sample live inference test to verify:
- `sata vuotta` audio → model outputs `100 vuotta`
- `seitsemäntoista` audio → model outputs `17`
- `25–30` audio → model outputs `25-30` or `25–30` (no UNK)

---

## 🔈 Long-Form Audio: Root Cause Analysis

Our test file `moo.wav` is **30 minutes** (1,800s) of continuous Finnish speech. This reveals a core gap vs. our finetuned Whisper model.

### How Canary-v2 Handles Long Audio (Natively)
- NVIDIA's Canary-v2 uses **dynamic chunking** with 1-second overlap between chunks.
- This is automatically triggered for audio longer than **40 seconds**.
- The model was pre-trained on a 1.7M-hour multilingual corpus with this chunking strategy baked in.

### Our Current Approach (`inference_vad.py`)

1. Silero VAD detects speech segments.
2. Segments are merged into chunks up to `chunk_len` seconds (default: **15s**).
3. Each chunk is transcribed **independently**, with no shared context between chunks.
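
The merge in step 2 can be pictured as a greedy pass over the VAD output. This is a minimal sketch under stated assumptions, not the actual code in `inference_vad.py`: segments are taken to be sorted `(start, end)` tuples in seconds, and `merge_segments` is a hypothetical name.

```python
def merge_segments(segments, chunk_len=15.0):
    """Greedily merge sorted (start, end) speech segments into chunks.

    A segment joins the current chunk while the chunk span stays within
    `chunk_len` seconds; otherwise it starts a new chunk. A single segment
    longer than `chunk_len` is kept whole rather than split.
    """
    chunks = []
    for start, end in segments:
        if chunks and end - chunks[-1][0] <= chunk_len:
            chunks[-1] = (chunks[-1][0], end)  # extend the current chunk
        else:
            chunks.append((start, end))  # start a new chunk
    return chunks


# Hypothetical VAD output in seconds.
segments = [(0.0, 4.0), (4.5, 9.0), (10.0, 16.0), (17.0, 20.0), (40.0, 70.0)]
print(merge_segments(segments, chunk_len=15.0))
print(merge_segments(segments, chunk_len=8.0))
```

With a smaller `chunk_len`, fewer segments are merged and the chunks sent to the model stay closer to the lengths seen in training. Segments longer than the limit would still need a separate splitting rule, which this sketch leaves out.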

### Root Causes of Degradation on Long-Form

| Issue | Detail |
|-------|--------|
| **Training length mismatch** | 77% of fine-tuning data is <10s and about 92% is <15s, so 15s inference chunks are longer than most training examples, creating distribution shift. |
| **No cross-chunk context** | Each 15s chunk is transcribed in isolation. Canary's attention decoder has no memory of previous chunks, so topic/speaker continuity is lost at boundaries. |
| **VAD vs. native chunking** | Our VAD-based approach differs from Canary's built-in dynamic chunking. The model was not fine-tuned with this chunking strategy. |
| **Repetition / hallucination** | At chunk boundaries with silence or music, the decoder can loop. This is worsened when segments are near the edge of the model's training length distribution. |
| **No overlap** | Without overlap between chunks, words at segment boundaries can be dropped or doubled. |

### Comparison: Canary vs. Our Finetuned Whisper on Long-Form

Whisper was explicitly designed and trained for long-form audio with:
- Sliding window inference with overlap
- Previous-chunk text as conditioning (prompt-based context)
- Timestamps for alignment

Canary's AED architecture does not use previous-chunk text as input, making long-form continuity fundamentally harder to achieve without careful chunk overlap and stitching.

---

## 🚀 Progress & Results

### Current Status: **Model Released & Repository Consolidated**
We have successfully completed the finetuning, KenLM integration, and repository consolidation phases. The model and its associated language models are now hosted on Hugging Face at `RASMUS/Finnish-ASR-Canary-v2`.

- **Infrastructure:** Finetuned on an **RTX PRO 6000 Blackwell** (96 GB VRAM) on the Verda.com platform in Finland.
- **Model Suite:** Acoustic model + 3 KenLM variants (1M, 2M, 5M sentences).
- **Best Performance (with KenLM 5M):**
    - **FLEURS:** 7.86% WER
    - **Common Voice:** 4.70% WER
    - **CSS10:** 7.07% WER
    - **VoxPopuli:** 11.65% WER
- **Deployment:** Integrated Silero VAD-based inference for robust long-form audio processing.

### Next Steps:
1.  **Long-form Tuning:** Reduce default `chunk_len` to 8–10s (closer to training distribution median) and add 0.5–1s overlap between chunks to reduce boundary artifacts.
2.  **Data Quality Audit:** Fix 28 confirmed corrupted Common Voice entries where raw TSV metadata (client ID hashes, gender tags) was accidentally written into the `text` field. Audit VoxPopuli for missing capitalisation (all-lowercase transcripts despite `pnc: yes`).
3.  **Number Handling:** Add Finnish-specific training data with numeric content. Consider TTS-synthesised samples covering phone numbers, years, statistics, and measurements (both digit and written-out forms paired).
4.  **Long-form Training Data:** Incorporate longer segments, such as TTS-synthesised long-form audio (`fbc_monolog_processed`, parliament data), into the training manifest to shift the duration distribution toward 15–30s.
5.  **KenLM Refinement:** Re-train KenLM with high-quality punctuated text. Current LM trained on mixed-quality data.
6.  **Advanced Evaluation:** Implement CER evaluation on non-normalised test sets to better capture punctuation/casing accuracy.
7.  **Repetition Penalty:** Explore repetition penalty in decoding if chunk-level loops persist after chunk length tuning.
8.  **Real-world Evaluation:** Benchmark on diverse long-form samples (podcasts, meetings, call-centre audio).
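
For the CER evaluation in item 6, the metric itself is small enough to sketch in pure Python. This is an illustrative implementation based on Levenshtein distance, not the project's evaluation code; the function names are hypothetical, and a library such as `kaldialign` could replace the hand-written distance.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate on raw text, so casing and punctuation count."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate on whitespace-separated tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


print(cer("Hei, maailma.", "hei maailma"))
print(wer("yli sata vuotta", "yli 100 vuotta"))
```

Because `cer` works on the untouched strings, a lowercased or unpunctuated hypothesis is penalised, which is the behaviour wanted for the non-normalised test sets.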

---

## 🗺️ Action Plan: Next Training Run

This section details the concrete steps for the next finetuning iteration, based on the root-cause analysis above.

### Priority 1: Fix Training Data (before re-training)

#### 1a. Normalise numbers to digit form (Gemini Flash)
Finnish written-out numbers in FLEURS transcripts cause the finetuned model to output inconsistent number forms. We will use the Gemini Flash API to convert all training transcripts in a single batch pass:

```python
# Sketch: run once on train_manifest.json before the next training run.
import json
import os

import google.generativeai as genai

# The API key is read from the environment rather than hard-coded.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

SYSTEM_PROMPT = """You are a Finnish text normalizer.
Convert any written-out Finnish numbers, ordinals, or number words in the text to digit form.
Examples:
  "yli sata vuotta" → "yli 100 vuotta"
  "seitsemäntoista henkeä" → "17 henkeä"
  "vuonna tuhat yhdeksänsataa" → "vuonna 1900"
Keep all other text exactly as-is. Return only the modified text, nothing else."""

entries = []
with open('manifests/train_manifest.json', encoding='utf-8') as f:
    for line in f:
        d = json.loads(line)
        # One API call per transcript; the instructions are prepended to the text.
        response = model.generate_content(f"{SYSTEM_PROMPT}\n\n{d['text']}")
        d['text'] = response.text.strip()
        entries.append(d)

# Write the normalised manifest as JSON lines, keeping Finnish characters readable.
with open('manifests/train_manifest_normalised.json', 'w', encoding='utf-8') as f:
    for e in entries:
        f.write(json.dumps(e, ensure_ascii=False) + '\n')
```

Cost estimate: 23,180 entries × ~50 tokens average = ~1.2M tokens. At Gemini Flash pricing (~$0.075/1M tokens input) ≈ **< $0.10 total**.
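
The arithmetic behind that estimate is spelled out below as a small sketch. It uses only the figures quoted above and, as an assumption worth noting, counts transcript tokens only: the instruction prompt repeated on every call and the output tokens are not included.

```python
entries = 23_180            # training transcripts
avg_tokens = 50             # assumed average tokens per transcript
price_per_million = 0.075   # USD per 1M input tokens, as quoted above

total_tokens = entries * avg_tokens
cost_usd = total_tokens / 1_000_000 * price_per_million

print(f"{total_tokens:,} tokens -> ${cost_usd:.4f}")
```

Since the prompt is resent with every request, the real input volume is a multiple of this figure, so the result is best read as a lower bound.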

#### 1b. Fix en-dash UNK token (confirmed root cause)
The en-dash `–` (U+2013) is NOT in the tokenizer vocabulary: it maps to UNK (id 0). Replace it with ASCII hyphen before training:

```python
# Add to the manifest preprocessing step
text = text.replace('\u2013', '-').replace('\u2014', '-')
```

This affects **85 entries** in `train_manifest.json` (83 FLEURS, 2 Common Voice).

#### 1c. Fix 28 corrupted Common Voice entries
Replace entries where the `text` field contains raw TSV metadata (tabs + client_id hashes). Strip everything after the first tab character.
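
The tab-stripping rule is a one-liner in Python. A minimal sketch follows; `strip_tsv_metadata` and the corrupted sample value are hypothetical.

```python
def strip_tsv_metadata(text):
    """Keep only the part of `text` before the first tab character."""
    return text.split("\t", 1)[0].strip()


# Hypothetical corrupted value: transcript followed by leaked TSV columns.
corrupted = "Hyvää huomenta.\t0a1b2c3d4e5f\tfemale"
print(strip_tsv_metadata(corrupted))
```

Entries without a tab pass through unchanged, so the function can be applied to every transcript rather than only to the 28 known cases.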



---

### Priority 2: Add Long-Form Training Data

#### TTS Long-Form Dataset: `RASMUS/canary_asr_finetune_tts_long_data`

| Property | Value |
|----------|-------|
| Size | 8.0 GB zip |
| Format | FLAC audio + JSONL manifest |
| Mean duration | **16.5s** (vs 7.8s in current data) |
| Median duration | 15.9s |
| Max duration | 25.0s |
| Content | Finnish speech: lectures, podcasts, YouTube |
| Segments >20s | ~25% |

This dataset directly addresses the training length mismatch. Adding it will shift the duration distribution from a mean of 7.8s toward ~10–12s and significantly increase the proportion of 15–25s segments that match inference chunk lengths.

**Integration plan:**
```bash
# Download the dataset
curl -L -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/datasets/RASMUS/canary_asr_finetune_tts_long_data/resolve/main/canary_dataset.zip" \
  -o /workspace/data/tts_long_data.zip

# Extract
unzip /workspace/data/tts_long_data.zip -d /workspace/data/tts_long_data/

# Apply number normalisation and dash fix to canary_manifest.jsonl
# then merge with existing train_manifest_normalised.json
```

After applying number normalisation and dash fixes to the new manifest, concatenate with the existing training set. Expected combined size: ~23,180 + N (estimate 5,000–20,000+ entries depending on total dataset size).
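
The concatenation step could look roughly like this. It is a sketch that assumes both manifests are JSON lines with a `text` field and applies only the dash fix (number normalisation is left out); `fix_dashes`, `merge_manifests`, and the in-memory sample manifests are hypothetical.

```python
import json


def fix_dashes(text):
    """Replace en-dash and em-dash with an ASCII hyphen."""
    return text.replace("\u2013", "-").replace("\u2014", "-")


def merge_manifests(*manifests):
    """Concatenate JSON-lines manifests, applying the dash fix to each `text`."""
    merged = []
    for lines in manifests:
        for line in lines:
            if not line.strip():
                continue  # tolerate blank lines between records
            entry = json.loads(line)
            entry["text"] = fix_dashes(entry["text"])
            merged.append(entry)
    return merged


# Hypothetical in-memory manifests; real ones would be read from disk.
original = ['{"audio_filepath": "a.wav", "duration": 4.5, "text": "yli 100 vuotta"}']
tts = ['{"audio_filepath": "b.flac", "duration": 16.4, "text": "25\\u201330 vuodella"}', ""]
combined = merge_manifests(original, tts)
print(len(combined), combined[1]["text"])
```

Parsing every line with `json.loads` before writing the combined file also catches malformed records early, before they reach the trainer.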

---

### Priority 3: Inference Tuning (without re-training)

Even before re-training, we can improve `moo.wav` performance by adjusting `inference_vad.py`:

| Parameter | Current | Recommended |
|-----------|---------|-------------|
| `chunk_len` | 15s | 8–10s (close to the training mean of 7.8s) |
| chunk overlap | 0s | 0.5s (reduce boundary word drops) |
| `alpha` (KenLM) | 0.2 | Try 0.1–0.15 (current may over-constrain decoder) |
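
One simple way to introduce chunk overlap is to pad each chunk by half the desired overlap on both sides. The sketch below illustrates that idea only; `add_overlap` is a hypothetical helper, not a function in `inference_vad.py`, and the de-duplication of words transcribed twice in the overlap region is not shown.

```python
def add_overlap(chunks, overlap=0.5, audio_len=None):
    """Pad each (start, end) chunk by overlap/2 per side, clamped to the audio bounds.

    Two adjacent chunks then share `overlap` seconds of audio.
    """
    half = overlap / 2
    padded = []
    for start, end in chunks:
        new_start = max(0.0, start - half)
        new_end = end + half if audio_len is None else min(audio_len, end + half)
        padded.append((new_start, new_end))
    return padded


# Two adjacent 8-second chunks in a 16-second file.
chunks = [(0.0, 8.0), (8.0, 16.0)]
print(add_overlap(chunks, overlap=1.0, audio_len=16.0))
```

When VAD has already left a silent gap between two chunks, the padding mostly covers silence; the overlap matters most where a long speech run was cut at the `chunk_len` limit.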

---

## 🔄 Round 2: Data Pipeline & Splits

This section documents the data preparation methodology for Round 2 finetuning, including all new eval sets, the TTS integration, and the final manifest composition.

### Overview of Changes vs Round 1

| Item | Round 1 | Round 2 |
|------|---------|---------|
| Base model | `canary-1b-v2.nemo` | `canary-1b-v2.nemo` (fresh start) |
| Training samples | 23,180 | **28,858** |
| Training hours | ~50h | **75.6h** |
| Mean duration | 7.8s | **9.4s** |
| Max duration allowed | 20.0s | **30.0s** |
| Transcripts normalised | No | **Yes (digits, dashes fixed)** |
| Eval sets | 4 | **6** |

### Step 1: Transcript Normalisation (`normalize_manifests.py`)

All training transcripts were cleaned in two layers:

**Deterministic fixes (no API call needed):**
- En-dash `–` (U+2013) and em-dash `—` (U+2014) → ASCII hyphen `-` (fixes UNK token regression)
- Corrupted Common Voice entries (raw TSV metadata in `text` field) → strip everything after first tab

**Gemini 2.5 Flash API calls (2,586 of 23,180 entries needed conversion):**
- Pre-filtered with a Finnish number-word regex so only entries that actually contain written numbers are sent to the API (cost: ~$0.62)
- Written Finnish numbers converted to digit form: `sata vuotta` → `100 vuotta`, `seitsemäntoista` → `17`
- Explicit DO NOT CONVERT rules: ordinals (`ensimmäinen`, `toinen`), superlative constructions (`yksi tärkeimmistä`), and `toinen` as "another/other"
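
A regex pre-filter of this kind might look like the sketch below. The actual pattern in `normalize_manifests.py` is not shown in this document, so the stem list and the name `needs_number_normalisation` are assumptions. The filter is deliberately over-inclusive: it only decides which transcripts are sent to the API, and the DO NOT CONVERT rules are applied there.

```python
import re

# Base-form stems only. Inflected forms such as "kahden" or "viidessä" are NOT
# covered, and unrelated words sharing a stem (e.g. "satama") are false positives.
NUMBER_STEMS = [
    "nolla", "yksi", "kaksi", "kolme", "neljä", "viisi", "kuusi",
    "seitsemän", "kahdeksan", "yhdeksän", "kymmenen", "sata", "tuhat",
    "miljoona",
]
NUMBER_WORD_RE = re.compile(
    r"\b(?:" + "|".join(NUMBER_STEMS) + r")\w*", re.IGNORECASE
)


def needs_number_normalisation(text):
    """Return True if `text` contains a word starting with a number stem."""
    return NUMBER_WORD_RE.search(text) is not None


for t in ["yli sata vuotta", "surmaten seitsemäntoista henkeä", "15 metriä", "toinen mies tuli"]:
    print(f"{t!r}: {needs_number_normalisation(t)}")
```

Matching on word-initial stems also catches compounds such as `seitsemäntoista` without listing every number explicitly.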



### Step 2: TTS Long-Form Data Integration

Downloaded `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB, 6,365 entries, mean 16.4s).

Aligned to NeMo training format:
- Path rewritten to relative style: `data/tts_long_data/audio/{filename}`
- Fields mapped: `language` → `source_lang`/`target_lang`, `task: "transcription"` → `taskname: "asr"`, added `pnc: "yes"`
- Same Gemini normalisation pass applied (888 entries converted)
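
The field mapping above can be sketched as a small conversion function. The target keys follow the list; the input keys `audio_filepath`, `duration`, and `text`, the function name `to_nemo_entry`, and the sample entry are assumptions about the TTS manifest, not confirmed details.

```python
import posixpath


def to_nemo_entry(tts_entry, audio_dir="data/tts_long_data/audio"):
    """Map one TTS manifest entry to the NeMo training format described above."""
    lang = tts_entry["language"]
    task = tts_entry.get("task")
    return {
        # Rewrite the path to the relative style, keeping only the file name.
        "audio_filepath": posixpath.join(
            audio_dir, posixpath.basename(tts_entry["audio_filepath"])
        ),
        "duration": tts_entry["duration"],
        "text": tts_entry["text"],
        "source_lang": lang,
        "target_lang": lang,
        "taskname": "asr" if task == "transcription" else task,
        "pnc": "yes",
    }


# Hypothetical input entry.
tts_entry = {
    "audio_filepath": "/tmp/export/clip_0001.flac",
    "duration": 16.4,
    "text": "Tämä on esimerkki.",
    "language": "fi",
    "task": "transcription",
}
nemo_entry = to_nemo_entry(tts_entry)
print(nemo_entry)
```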

### Step 3: Eval Set Construction (TTS Data)

The 6,365 normalised TTS entries were split into train / eval / long-form-test:

```
All TTS entries (6,365)
│
├── Long-form pool (>20s): 1,501 entries
│   ├── eval_long_form (sampled): 200 entries  ← random.seed(42) shuffle → first 200
│   └── Returned to training pool: 1,301 entries
│
└── Medium pool (10–20s): 4,864 entries
    ├── eval_tts (10% hold-out): 487 entries  ← stratified by duration bucket
    └── tts_train: 4,377 entries
```

**Why eval_long_form = 200 entries?**
The original 1,501 long-form entries (>20s) had a total duration of ~9.4 hours, far too long to run as a validation set every epoch. At batch_size=32 on a single GPU, each validation pass over 1,501 entries takes ~25 minutes, adding 2.5h per epoch. 200 entries (≈75 minutes of audio) provides a representative sample of the long-form distribution at reasonable cost: ~4 minutes of eval time per epoch.

**eval_tts construction:**
487 entries were held out from the 10–20s duration range (10% stratified sample). This tests the model's ability to handle medium-length audio and is separate from the original 4 eval sets.
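
The split logic can be sketched as follows. This is an illustration under assumptions, not the script that produced the counts above: it uses a local `random.Random(42)` instead of the global `random.seed(42)`, a 2-second bucket width for stratification, and synthetic entries; `split_tts` is a hypothetical name.

```python
import random
from collections import defaultdict


def split_tts(entries, n_long_eval=200, holdout=0.10, bucket_s=2.0, seed=42):
    """Split entries into (eval_long, long_train, eval_tts, tts_train)."""
    rng = random.Random(seed)
    long_pool = [e for e in entries if e["duration"] > 20.0]
    medium_pool = [e for e in entries if e["duration"] <= 20.0]

    # Long-form: shuffle once and take the first n_long_eval entries for eval.
    rng.shuffle(long_pool)
    eval_long, long_train = long_pool[:n_long_eval], long_pool[n_long_eval:]

    # Medium: hold out a fixed fraction from every duration bucket.
    buckets = defaultdict(list)
    for e in medium_pool:
        buckets[int(e["duration"] // bucket_s)].append(e)
    eval_tts, tts_train = [], []
    for key in sorted(buckets):
        group = buckets[key]
        rng.shuffle(group)
        k = round(len(group) * holdout)
        eval_tts += group[:k]
        tts_train += group[k:]
    return eval_long, long_train, eval_tts, tts_train


# Synthetic entries with durations cycling through 10..24 seconds.
entries = [{"id": i, "duration": 10.0 + (i % 15)} for i in range(1500)]
eval_long, long_train, eval_tts, tts_train = split_tts(entries)
print(len(eval_long), len(long_train), len(eval_tts), len(tts_train))
```

Rounding the hold-out size per bucket means the total can differ slightly from exactly 10% of the pool, which is one plausible way to arrive at 487 out of 4,864.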



### Step 4: Combined Training Manifest

Final `train_manifest_combined.jsonl` composition:

| Source | Entries | Notes |
|--------|---------|-------|
| Original train (normalised) | 23,180 | Digits + dash fix applied |
| TTS train (10–20s) | 4,377 | Synthesised long-form speech |
| Long-form overflow | 1,301 | >20s entries not selected for eval_long_form |
| **Total** | **28,858** | Mean 9.4s, 75.6h |



### Final Eval Sets (Round 2)

| Set | File | Entries | Mean Duration | Purpose |
|-----|------|---------|--------------|---------|
| `eval_fleurs` | `eval_fleurs.json` | 918 | 13.0s | Primary benchmark (monitored for checkpointing) |
| `eval_common_voice` | `eval_common_voice.json` | 1,554 | 5.1s | Crowdsourced quality |
| `eval_css10` | `eval_css10.json` | 170 | 7.5s | Clean single-speaker |
| `eval_voxpopuli` | `eval_voxpopuli.json` | 430 | 10.6s | Formal/parliament speech |
| `eval_tts` | `eval_tts.jsonl` | 487 | 14.5s | Medium-length TTS (new) |
| `eval_long_form` | `eval_long_form.jsonl` | **200** | 22.5s | Long-form >20s sample (new) |



**Checkpoint monitoring:** `val_wer` tracks FLEURS (first validation set). All 6 WERs are logged independently to WandB.

### Round 2 Training Config

File: `configs/canary_finetune_finnish_v2.yaml`
Key settings:
- `init_from_nemo_model`: `/workspace/Finnish-ASR-Canary-v2/models/canary-1b-v2.nemo` (fresh start from base)
- `max_duration`: 30.0s (up from 20.0s to include TTS segments up to 25s)
- `max_steps`: 18,000 (scaled: 28,858 / 32 ≈ 902 steps/epoch × 20 epochs ≈ 18,040)
- `lr`: 1e-5, `WarmupAnnealing`, 500 warmup steps
- `precision`: bf16, single GPU, `strategy: auto`
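
The `max_steps` figure follows from a short calculation, sketched here with the numbers from the settings above. It assumes fixed-size batches of 32 and no gradient accumulation; with duration-based dynamic batching the steps per epoch would differ.

```python
import math

train_samples = 28_858
batch_size = 32   # assumed fixed batch size, as in the formula above
epochs = 20

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

The configured value of 18,000 is this result rounded down, so the run stops slightly short of 20 full epochs.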

---

## 🛠️ Workflow Status Details

### 1. Data Preparation - DONE
- [x] Identify and inventory all 4 datasets
- [x] Create unified processing script (`scripts/prepare_all_manifests.py`)
- [x] Run `scripts/prepare_all_manifests.py` on devcontainer
- [x] Verify manifest sample counts and audio file integrity

### 2. Configuration Setup - DONE
- [x] Create Hydra training config (`configs/canary_finetune_finnish.yaml`)
- [x] Configure multi-validation with 4 eval datasets
- [x] Checkpoint monitors primary eval set (FLEURS) via `val_wer`
- [x] All 4 eval WERs logged independently to WandB

### 3. Training - DONE
- [x] Run finetuning via `run_training.sh`
- [x] Monitor per-dataset WER in WandB

### 4. KenLM / NGPU-LM Language Model Integration - DONE
- [x] Install KenLM tools (`install_beamsearch_decoders.sh`)
- [x] Gather Finnish text (ASR transcripts + Wikipedia + mc4)
- [x] Train 3 variants of KenLM (1M, 2M, 5M sentences)
- [x] Evaluate with LM fusion on all 4 test sets

### 5. Repository & Long-Form Inference - IN PROGRESS
- [x] Consolidate README and model metadata for Hugging Face release
- [x] Upload model checkpoints and KenLM bundles to HF Hub
- [x] Implement Silero VAD-based chunking for long-form audio (`inference_vad.py`)
- [x] Root-cause analysis of long-form degradation vs. Whisper (see above)
- [ ] Reduce `chunk_len` to 8–10s and add chunk overlap (Current Focus)
- [ ] Optimize `alpha` for stability on `moo.wav` (30 min test file)

### 6. Data Quality & Advanced Evaluation - PARTIALLY DONE
- [x] Fix 28 corrupted Common Voice manifest entries (raw TSV data in text field): done in normalisation pass.
- [x] Fix en-dash/em-dash UNK token regression: done in normalisation pass.
- [ ] Audit VoxPopuli transcripts for all-lowercase entries (capitalisation missing).
- [ ] Re-train KenLM with high-quality punctuated text.
- [ ] Evaluate CER on non-normalized test sets.

### 7. Number Normalisation & UNK Token Fix - DONE
- [x] Replace en-dash `–` and em-dash `—` with ASCII hyphen `-` in all training manifests (85 train + 70 TTS entries fixed).
- [x] Use Gemini 2.5 Flash to normalise written-out Finnish numbers to digit form (2,586 API calls across train + TTS).
- [ ] Re-evaluate on the 5-sample number test set after Round 2 training to verify consistency.

### 8. Long-Form Data Expansion - DONE
- [x] Download `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB zip, 6,365 entries, mean 16.4s).
- [x] Align TTS manifest to NeMo training format and integrate into combined training manifest.
- [x] Round 2 training configured and ready to launch (see Round 2 section above).
- [ ] Benchmark Round 2 model against Round 1 and finetuned Whisper on `moo.wav`.

---

## 🛠️ NeMo Environment Setup

This section documents the exact steps to set up a working NeMo inference/training environment, including the fixes required for the `nvcr.io/nvidia/pytorch:25.01-py3` container.

### Installation (from scratch on pytorch:25.01-py3 base image)

```bash
# 1. Clone the HF model repo (contains NeMo source with patches applied)
#    Skip LFS to avoid downloading the 3.6 GB model during clone
GIT_LFS_SKIP_SMUDGE=1 git clone \
  "https://user:${HF_TOKEN}@huggingface.co/RASMUS/Finnish-ASR-Canary-v2" \
  /workspace/Finnish-ASR-Canary-v2

# 2. Install NeMo in editable mode from the patched source
cd /workspace/Finnish-ASR-Canary-v2/NeMo
pip install -e ".[asr]"

# 3. Install pinned dependencies
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' kaldialign wandb
```

### Required Compatibility Fixes

The pytorch:25.01-py3 container ships with packages that conflict with NeMo 2.8.0rc0:

```bash
# Fix 1: Downgrade lightning to the version NeMo requires (<=2.4.0)
# The container ships lightning 2.4.0 but pip may upgrade it, so pin it back.
pip install "lightning==2.4.0" "pytorch-lightning==2.4.0"

# Fix 2: Remove incompatible torchvision
# The container's torchvision (0.20.0a0) was built against torch 2.6.0a0 (the original
# container torch), but NeMo's install upgrades torch to ~2.10. torchvision then fails
# on import and blocks NeMo. ASR does not need torchvision.
pip uninstall -y torchvision
```

### Downloading the Finetuned Model

```bash
# Download the finetuned acoustic model (3.6 GB)
curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/canary-finnish.nemo" \
  -o /workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo

# KenLM models are also stored in LFS. Download the 5M variant (best WER):
curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/kenlm_5M.nemo" \
  -o /workspace/Finnish-ASR-Canary-v2/kenlm_5M.nemo
```

### Quick Inference Smoke Test

```python
import warnings

from nemo.collections.asr.models import EncDecMultiTaskModel

warnings.filterwarnings('ignore')

model = EncDecMultiTaskModel.restore_from(
    '/workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo',
    map_location='cuda'
)
model.eval()

results = model.transcribe(
    audio=['path/to/audio.wav'],
    task='asr', source_lang='fi', target_lang='fi', pnc='yes'
)
print(results[0].text)
```

### Loading the Base Model (for comparison)

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# Downloads ~3.6 GB on first run, cached in ~/.cache/huggingface/
model_base = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2", map_location='cuda')
```

---

## 📝 Progress Log
- **2026-01-11:** Initial project setup.
- **2026-02-08:** Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice).
- **2026-02-10:** **Finetuning complete.** Epoch 11 reached `val_wer=0.1258` on FLEURS.
- **2026-02-13:** Mermaid diagrams and project documentation for DS team.
- **2026-02-18:** **KenLM benchmarks finished.** Consolidated repository structure. Applied NeMo patches for inference stability.
- **2026-02-20:** **Model Released.** Release of `Finnish-ASR-Canary-v2` on HF. Implemented VAD-based inference pipeline. Currently tuning for long-form stability on `moo.wav` with various `alpha` settings (0.0 - 0.4 tested).
- **2026-02-26:** **Root-cause analysis complete.** Investigated long-form gap vs. Whisper and number handling. Key findings: (1) 77% of training data is <10s, creating distribution shift at inference chunk lengths; (2) no cross-chunk context in Canary's AED architecture; (3) only 2.5% of training samples contain digit characters, so numbers are a known weak point; (4) 28 corrupted Common Voice entries found (TSV metadata in text field); (5) `moo.wav` test file confirmed as 30 minutes. Action plan: shorten `chunk_len`, add chunk overlap, fix data corruption, and plan a long-form training data expansion round.

- **2026-02-26:** **Live number inference + tokenizer audit completed.** Ran base Canary-v2 vs. finetuned model on 5 FLEURS samples. Confirmed: (1) the base model always outputs digits (`100`, `17`); (2) the finetuned model regressed to mixed output (sometimes written words, sometimes digits) due to inconsistent training transcripts; (3) en-dash (`–`) produces the UNK token `⁇` in the finetuned model, whereas the base model degrades gracefully to an ASCII hyphen. Policy decision: **standardise on digit output** and fix en-dash encoding in training manifests before the next training run. NeMo environment setup documented (with fixes for `torchvision` and `lightning` version conflicts). TTS long-form dataset (`canary_asr_finetune_tts_long_data`, 8GB, mean 16.5s/segment) identified as key data source for the next training run. Action plan for next run: (1) normalise numbers to digits via Gemini Flash API, (2) fix en-dash → ASCII hyphen, (3) fix 28 corrupted CV entries, (4) add TTS long-form data.
- **2026-03-01:** **Round 2 data pipeline complete.** Ran `normalize_manifests.py`: 2,586 Gemini 2.5 Flash API calls (~$0.62), 1,137 number changes in train + 888 in TTS, 85 en-dash and 28 corrupted CV entries fixed. Downloaded and extracted TTS long-form dataset (6,365 entries, 4.8 GB). Split TTS data into train (4,377), eval_tts (487, mean 14.5s), and long-form pool (1,501 entries >20s). Sampled 200 entries into `eval_long_form.jsonl` (seed 42) and returned 1,301 to training, yielding `train_manifest_combined.jsonl` (28,858 entries, 75.6h). Round 2 training config created (`configs/canary_finetune_finnish_v2.yaml`). **Training ready to launch.**
- **2026-03-01:** **Training crash diagnosed and fixed.** Round 2 training ran 505 steps then crashed with CUDA `vectorized_gather_kernel index out of bounds`. Root cause: entry 14857 in `train_manifest_combined.jsonl` contained 11,247 chars of Python code (Gemini normalization returned a code block instead of a transcript for `voxpopuli_005371.wav`). When tokenized with the canary2 prompt format, the sequence far exceeded the decoder's `max_sequence_length=1024`, causing position-embedding OOB. Additionally, 4 entries in `eval_common_voice.json` had TSV metadata contamination (same v1 issue, not previously caught in the v2 eval set). Both manifests fixed. Config rewritten from full-architecture spec to minimal v1-style format (`tokenizer: update_tokenizer: false`) using `speech_to_text_finetune.py` (which restores the full model from the `.nemo` file). Training re-launched. Manifests synced to `canary-finnish-asr-data` HuggingFace dataset repo.
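
Both corruption types described in the last entry (a transcript replaced by thousands of characters of code, and TSV metadata leaking into the text field) could in principle be caught by a pre-flight scan of the manifest before training starts. The following is a minimal sketch under stated assumptions, not the script used in this project: the function name `find_suspect_entries` and the 1,000-character cutoff are hypothetical, and it assumes the standard NeMo JSONL manifest layout with a `text` field on each line.

```python
import json

# Hypothetical cutoff: ordinary transcripts are far shorter than this, while
# the entry that crashed training held 11,247 characters. Tune per dataset.
MAX_TEXT_CHARS = 1000


def find_suspect_entries(lines, max_chars=MAX_TEXT_CHARS):
    """Return (line_number, reason) pairs for manifest lines that look corrupted."""
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            suspects.append((lineno, "invalid JSON"))
            continue
        if not isinstance(entry, dict):
            suspects.append((lineno, "entry is not a JSON object"))
            continue
        text = entry.get("text", "")
        if len(text) > max_chars:
            suspects.append((lineno, f"text too long ({len(text)} chars)"))
        elif "\t" in text:
            suspects.append((lineno, "tab in text (possible TSV metadata)"))
    return suspects


if __name__ == "__main__":
    import sys

    # Usage: python check_manifest.py train_manifest_combined.jsonl
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            for lineno, reason in find_suspect_entries(handle):
                print(f"line {lineno}: {reason}")
```

A character-count cutoff is only a proxy for the decoder's `max_sequence_length` limit, which is measured in tokens; a stricter check would tokenize each transcript with the model's own tokenizer.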