	# Finnish ASR: Canary-v2 Finetuning & Progress

	This document provides a high-level overview of our Finnish ASR finetuning process, model architecture, and current progress for the Data Science team.

	---

	## 📊 Project Overview
	Our goal is to adapt NVIDIA's Canary-v2 (a 1-billion-parameter multilingual model) for high-accuracy Finnish Automatic Speech Recognition (ASR). We train on four datasets that cover different domains and speaking styles.

	---

	## 🏗️ Model Architecture
	Canary-v2 is an Attention-Encoder-Decoder (AED) model that utilizes the Fast-Conformer architecture. This design allows for efficient processing of long audio sequences while maintaining high accuracy.

	```mermaid
	graph TD
	A[Audio Input] -->|Preprocessing| B[Mel Spectrogram]

	subgraph TrainingBlock [Finetuned Components]
	direction TB
	subgraph Encoder [Encoder: Acoustic Modeling]
	C1[Convolutional Subsampling] -->|Downsample| C2[Conformer Blocks]
	C2 -->|Latent Features| C_Out[Acoustic Latents]
	end

	subgraph Decoder [Decoder: Language Modeling]
	D1[Masked Self-Attention] --> D2[Cross-Attention]
	D2 --> D3[Feed Forward]
	D3 --> D_Out[Text Generation]
	end
	end

	B -->|Input| C1
	P[Input Prompts:<br/>Lang, Task, PnC] -->|Conditioning| D1
	C_Out -->|Acoustic Context| D2
	D_Out -->|Output| E[Finnish Text]

	%% Styling
	style TrainingBlock fill:#f0f7ff,stroke:#0052cc,stroke-width:3px,stroke-dasharray: 5 5
	style A fill:#ffffff,stroke:#333,stroke-width:2px
	style B fill:#ffffff,stroke:#333,stroke-width:2px
	style P fill:#ffffff,stroke:#333,stroke-width:2px
	style E fill:#e6ffed,stroke:#28a745,stroke-width:2px

	style Encoder fill:#ffffff,stroke:#0052cc,stroke-width:1px
	style Decoder fill:#ffffff,stroke:#0052cc,stroke-width:1px
	```

	### Component Roles & Finetuning:
	- Highlighted Area (Blue Dashed Box): This represents the core weights of the Canary-v2 model. During our finetuning, we update the parameters in both the Encoder and Decoder to specifically recognize Finnish phonemes and grammar.
	- Mel Spectrogram: The "Vision" stage. It turns raw audio waves into a structured 2D representation of sound frequencies over time.
	- Fast-Conformer Encoder: The "Acoustic Processor." We finetuned this to understand the unique sounds of the Finnish language (like double vowels and consonants).
	- Input Prompts: The "Context Injector." These are the same color as other inputs because they are part of the model's standard input pipeline, telling it: "Act as a Finnish ASR system."
	- Attention-Decoder: The "Linguistic Brain." We finetuned this to map the Finnish sounds from the encoder into grammatically correct Finnish text, guided by the prompts.

	---

	## 🔄 Finetuning Workflow
	Our pipeline is fully automated, from data ingestion to multi-dataset evaluation.

	```mermaid
	graph TD
	subgraph DataPrep [Data Preparation]
	D1[CSS10 Finnish] --> P[Unified Processing Script]
	D2[FLEURS Finnish] --> P
	D3[VoxPopuli Finnish] --> P
	D4[Common Voice v24] --> P
	P --> M1[train_manifest.json]
	P --> M2[eval_fleurs.json]
	P --> M3[eval_common_voice.json]
	P --> M4[eval_css10.json]
	P --> M5[eval_voxpopuli.json]
	end

	subgraph Training [Canary-v2 Finetuning]
	M1 --> T[NVIDIA NeMo Trainer]
	CM[nvidia/canary-1b-v2] --> T
	T --> CK[Model Checkpoints]
	M2 & M3 & M4 & M5 --> V[Multi-Validation]
	V --> W[WandB Tracking]
	end

	subgraph Inference [Post-Processing]
	CK --> Inf[Inference]
	Inf --> K[KenLM/NGPU-LM Integration]
	K --> R[Final ASR Output]
	end
	```

	---

	## 📚 Datasets
	We use a balanced mix of datasets to cover various audio qualities and transcript styles:

	\| Dataset \| Source \| Characteristics \|
	\|---------\|--------\|-----------------\|
	\| FLEURS \| Google \| High-quality, diverse speakers (Benchmark) \|
	\| Common Voice \| Mozilla \| Crowdsourced, varied quality and accents \|
	\| CSS10 \| LibriVox (single speaker) \| Clean, high-quality audiobook recordings \|
	\| VoxPopuli \| EU Parliament \| European Parliament speeches (Formal) \|

	---

	## 📊 Training Data Analysis

	This section documents the composition and length distribution of our training data (from `RASMUS/canary-finnish-asr-data`, accessed 2026-02-26).

	### Dataset Summary

	\| Dataset \| Samples \| Mean Duration \| Max Duration \| Total Hours \|
	\|---------\|---------\|--------------\|-------------\|-------------\|
	\| Common Voice v24 \| 9,086 \| 4.5s \| 10.5s \| 11.2h \|
	\| VoxPopuli \| 8,164 \| 10.1s \| 50.5s \| 23.0h \|
	\| CSS10 \| 3,226 \| 7.7s \| 20.2s \| 6.9h \|
	\| FLEURS \| 2,704 \| 11.7s \| 43.2s \| 8.8h \|
	\| TOTAL \| 23,180 \| 7.8s \| 50.5s \| ~50h \|

	### Duration Distribution (Training Set)

	```
	0–5s : 33.3% (7,725 samples) ████████████████████████████████████████
	5–10s : 43.7% (10,139 samples) █████████████████████████████████████████████████████
	10–15s : 15.0% (3,473 samples) ██████████████████
	15–20s : 5.4% (1,241 samples) ██████
	20–30s : 2.4% (562 samples) ███
	>30s : 0.2% (40 samples)
	```

	Key insight: 77% of training samples are shorter than 10 seconds. The model has very little exposure to longer audio segments (only 0.2% are >30s). This has direct implications for long-form inference stability.
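
The bucket counts above can be recomputed from any NeMo-style JSON-lines manifest. The following is a minimal sketch, assuming each line carries a `duration` field in seconds and that buckets are half-open (a 5.0s clip falls into 5-10s). The function names are illustrative and not taken from the project scripts.

```python
import json
from collections import Counter

# Upper bucket edges in seconds; the final bucket (>30s) is open-ended.
EDGES = [5, 10, 15, 20, 30]
LABELS = ["0-5s", "5-10s", "10-15s", "15-20s", "20-30s", ">30s"]


def bucket_label(duration: float) -> str:
    """Return the label of the first bucket whose upper edge exceeds duration."""
    for edge, label in zip(EDGES, LABELS):
        if duration < edge:
            return label
    return LABELS[-1]


def duration_histogram(durations):
    """Map each bucket label to (count, percentage of all samples)."""
    counts = Counter(bucket_label(d) for d in durations)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {label: (counts[label], 100.0 * counts[label] / total) for label in LABELS}


def load_durations(manifest_path: str):
    """Read the `duration` field from every non-empty line of a JSONL manifest."""
    with open(manifest_path, encoding="utf-8") as f:
        return [json.loads(line)["duration"] for line in f if line.strip()]
```

For example, `duration_histogram(load_durations("manifests/train_manifest.json"))` would return one `(count, percent)` pair per bucket.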

	### Evaluation Set Durations

	\| Eval Set \| Samples \| Mean Duration \| Max Duration \|
	\|----------\|---------\|--------------\|-------------\|
	\| FLEURS \| 918 \| 13.0s \| 33.7s \|
	\| Common Voice \| 1,554 \| 5.1s \| 10.5s \|
	\| CSS10 \| 170 \| 7.5s \| 10.2s \|
	\| VoxPopuli \| 430 \| 10.6s \| 47.5s \|

	---

	## 🔢 Number Handling Analysis

	### Live Inference Results: Base vs Finetuned (2026-02-26)

	We ran both models on 5 FLEURS test samples to determine each model's number output style.

	\| # \| Scenario \| Reference \| Base Canary-v2 \| Our Finetuned \|
	\|---\|----------\|-----------\|----------------\|---------------\|
	\| 1 \| Spoken "sata" (hundred) \| `yli sata vuotta` \| `yli 100 vuotta` ❌ \| `yli 100 vuotta` ❌ \|
	\| 2 \| Spoken "seitsemäntoista" (17) \| `surmaten seitsemäntoista henkeä` \| `surmaten 17 henkeä` ❌ \| `surmaten seitsemäntoista henkeä` ✅ \|
	\| 3 \| Digits in reference (15, 2011, 2017) \| `15 metriä... 2011... 2017` \| Correct ✅ \| Correct ✅ \|
	\| 4 \| Abbreviation "jKr." (AD) \| `400 jKr.` \| `400 jälkeen Kristuksen` \| `400 jälkeen Kristuksen` \|
	\| 5 \| Range "25–30" (en-dash U+2013) \| `25–30 vuodella` \| `25-30 vuodella` (ASCII hyphen) \| `25 ⁇ 30 vuodella` ❌ UNK token \|

	Key findings:

	1. Base model outputs digits. When the speaker says "sata" (hundred) or "seitsemäntoista" (seventeen), the base Canary-v2 outputs `100` and `17`. This appears to be normalisation behaviour learned in NVIDIA's pre-training: in every sample we tested, the base model wrote numbers in digit form.

	2. Finetuning introduced inconsistency. Our finetuning partially reversed this: for `seitsemäntoista` the finetuned model now outputs the written word (because FLEURS training transcripts used written-out numbers), but still outputs `100` for `sata`. This inconsistency is worse than either consistent policy.

	3. En-dash produces a UNK token in the finetuned model. The character `–` (U+2013 en-dash) in `25–30` causes the finetuned model to emit `⁇` (SentencePiece UNK). The base model degrades gracefully to an ASCII hyphen `25-30`. This is a regression introduced by finetuning — likely because the en-dash was absent or inconsistently encoded in our training data.

	4. Abbreviations are expanded by both models. `jKr.` → `jälkeen Kristuksen` in both — this is model behaviour, not a finetuning artifact.

	### Policy Decision
	We want digit output (not written-out Finnish number words). The base model's behaviour is correct here. The finetuned model regressed on consistency because our FLEURS training transcripts used written-out numbers.

	### Training Data Issues Found
	- Only 2.5% (578 / 23,180) of training samples contain digit characters at all.
	- FLEURS transcripts use written-out numbers (`sata vuotta`) while VoxPopuli and Common Voice use digits. This gives the model conflicting signal.
	- En-dash (`–` U+2013) may be absent or mis-encoded in training manifests, causing UNK tokens at inference time.

	### Action Plan: Numbers & UNK Token

	#### Step 1 — Normalise training transcripts to digit form
	Run a pre-processing pass on `train_manifest.json` before the next training run:
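	- Build a converter from Finnish written-out numbers to digits: e.g. `sata` → `100`, `seitsemäntoista` → `17`. Note that the Python library `num2words` (with `lang='fi'`) only works in the opposite direction (digits → words), so it can be used to generate a reverse lookup table but not to do the conversion directly.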
	- OR (simpler / safer): replace the FLEURS transcripts in the manifest with their raw reference texts which already have digits (FLEURS provides both `raw_transcription` and `transcription` columns; currently we use `raw_transcription` which has written numbers).
	- Target: all numeric quantities consistently in digit form across all four datasets.
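
To illustrate the word-to-digit direction, here is a small hand-written sketch that does not depend on any library. It is an assumption-laden illustration, not the project's normaliser: it only understands nominative cardinals (`sata`, `seitsemäntoista`, `kaksikymmentäkolme`), ignores inflected forms such as `sadan` or `kahdentoista`, splits on single spaces, and does not handle punctuation attached to a number word.

```python
UNITS = {
    "yksi": 1, "kaksi": 2, "kolme": 3, "neljä": 4, "viisi": 5,
    "kuusi": 6, "seitsemän": 7, "kahdeksan": 8, "yhdeksän": 9,
}


def parse_cardinal(word: str):
    """Parse a nominative Finnish cardinal; return an int, or None if it is not one."""
    rest = word.lower()
    if not rest:
        return None
    thousands = below = unit = 0  # finished thousands, sub-thousand sum, pending unit
    while rest:
        if rest.startswith("tuhat"):
            size = 7 if rest.startswith("tuhatta") else 5
            thousands += ((below + unit) or 1) * 1000
            below = unit = 0
        elif rest.startswith("sata"):
            size = 5 if rest.startswith("sataa") else 4
            below += (unit or 1) * 100
            unit = 0
        elif rest.startswith("kymmenen") and not unit:
            size = 8
            below += 10
        elif rest.startswith("kymmentä") and unit:
            size = 8
            below += unit * 10
            unit = 0
        elif rest.startswith("toista") and unit:
            size = 6
            below += 10 + unit
            unit = 0
        else:
            for name, value in UNITS.items():
                if rest.startswith(name) and not unit:
                    unit, size = value, len(name)
                    break
            else:
                return None  # unknown material: not a number word
        rest = rest[size:]
    return thousands + below + unit


def normalise_numbers(text: str) -> str:
    """Replace runs of number words with digits, leaving all other tokens untouched."""
    out, run = [], []

    def flush():
        if not run:
            return
        joined = parse_cardinal("".join(run))  # "tuhat yhdeksänsataa" -> 1900
        if joined is not None:
            out.append(str(joined))
        else:
            out.extend(str(parse_cardinal(t)) for t in run)
        run.clear()

    for token in text.split(" "):
        if parse_cardinal(token) is not None:
            run.append(token)
        else:
            flush()
            out.append(token)
    flush()
    return " ".join(out)


print(normalise_numbers("yli sata vuotta"))
```

A rule-based pass like this converts every match blindly, so a phrase such as `yksi tärkeimmistä` would become `1 tärkeimmistä`. Any real implementation needs either context rules or a review step for such cases.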

	#### Step 2 — Fix en-dash encoding (ROOT CAUSE CONFIRMED)

	Confirmed via tokenizer inspection (2026-02-26):

	```python
	# m is a loaded Canary model (EncDecMultiTaskModel)
	m.tokenizer.text_to_ids("25–30") # → [16053, 1125, 1128, 0, 1126, 1123]
	# ↑ id 0 = UNK for the en-dash!
	m.tokenizer.text_to_ids("25-30") # → [16053, 1125, 1128, 16107, 1126, 1123]
	# ↑ ASCII hyphen tokenises correctly
	```

	- En-dash `–` (U+2013) and em-dash `—` (U+2014) are NOT in the CanaryBPETokenizer vocabulary (both map to UNK id 0).
	- Training data contains 85 entries with en-dash (83 FLEURS, 2 Common Voice). During training, the en-dash in the TARGET text was encoded as UNK, so the model learned to produce UNK for the corresponding speech sounds.
	- Fix: replace all `–` and `—` with ASCII hyphen `-` in all training transcripts before the next training run. This is a one-line preprocessing step.

	```python
	# In manifest preprocessing:
	text = text.replace('\u2013', '-').replace('\u2014', '-')
	```
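
The same inspection can be generalised into an audit that lists every character in the transcripts that falls outside the vocabulary, not only dashes. This is a sketch under assumptions: `find_unk_characters` is a hypothetical helper, it takes any `text_to_ids`-style callable (such as `m.tokenizer.text_to_ids` above), and it assumes that encoding a single out-of-vocabulary character yields the UNK id somewhere in the result.

```python
from collections import Counter


def find_unk_characters(texts, text_to_ids, unk_id=0):
    """Count occurrences of characters that tokenise to unk_id when encoded alone."""
    unk_chars = Counter()
    is_unk = {}  # cache: character -> True if it maps to UNK
    for text in texts:
        for ch in text:
            if ch.isspace():
                continue
            if ch not in is_unk:
                is_unk[ch] = unk_id in text_to_ids(ch)
            if is_unk[ch]:
                unk_chars[ch] += 1
    return unk_chars


# Stand-in tokenizer for demonstration only: treats en-dash and em-dash as UNK.
def fake_text_to_ids(s):
    return [0 if c in "\u2013\u2014" else ord(c) for c in s]


print(find_unk_characters(["25\u201330 vuotta", "a\u2014b"], fake_text_to_ids))
```

Running it once per manifest before training would surface other unsupported characters (curly quotes, unusual symbols) before they turn into UNK targets.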

	#### Step 3 — Re-evaluate after normalisation
	After normalising transcripts, re-run the 5-sample live inference test to verify:
	- `sata vuotta` audio → model outputs `100 vuotta`
	- `seitsemäntoista` audio → model outputs `17`
	- `25–30` audio → model outputs `25-30` or `25–30` (no UNK)

	---

	## 🔈 Long-Form Audio: Root Cause Analysis

	Our test file `moo.wav` is 30 minutes (1,800s) of continuous Finnish speech. This reveals a core gap vs. our finetuned Whisper model.

	### How Canary-v2 Handles Long Audio (Natively)
	- NVIDIA's Canary-v2 uses dynamic chunking with 1-second overlap between chunks.
	- This is automatically triggered for audio longer than 40 seconds.
	- The model was pre-trained on a 1.7M-hour multilingual corpus with this chunking strategy baked in.

	### Our Current Approach (`inference_vad.py`)
	1. Silero VAD detects speech segments.
	2. Segments are merged into chunks up to `chunk_len` seconds (default: 15s).
	3. Each chunk is transcribed independently — no shared context between chunks.
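
Step 2 can be pictured as a greedy merge. The sketch below is not the code in `inference_vad.py`. It assumes the VAD returns `(start, end)` pairs in seconds and that a chunk is closed as soon as adding the next segment would push it past `chunk_len`; `merge_segments` is a hypothetical name.

```python
def merge_segments(segments, chunk_len=15.0):
    """Greedily merge (start, end) speech segments into chunks of at most chunk_len seconds.

    A single segment that is already longer than chunk_len is kept as its own chunk.
    """
    chunks = []
    for start, end in sorted(segments):
        if chunks and end - chunks[-1][0] <= chunk_len:
            chunks[-1] = (chunks[-1][0], end)  # extend the current chunk
        else:
            chunks.append((start, end))  # start a new chunk
    return chunks


segments = [(0.0, 4.0), (5.0, 9.0), (10.0, 16.0), (17.0, 20.0)]
print(merge_segments(segments, chunk_len=15.0))
print(merge_segments(segments, chunk_len=8.0))
```

With `chunk_len=15.0` the four segments collapse into two chunks; with `chunk_len=8.0` none of them can be merged, which shows how strongly this one parameter controls the length of the audio the decoder sees.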

	### Root Causes of Degradation on Long-Form

	\| Issue \| Detail \|
	\|-------\|--------\|
	\| Training length mismatch \| 77% of fine-tuning data is <10s. Inference chunks of up to 15s are longer than about 92% of training examples, creating distribution shift. \|
	\| No cross-chunk context \| Each 15s chunk is transcribed in isolation. Canary's attention decoder has no memory of previous chunks, so topic/speaker continuity is lost at boundaries. \|
	\| VAD vs. native chunking \| Our VAD-based approach differs from Canary's built-in dynamic chunking. The model was not fine-tuned with this chunking strategy. \|
	\| Repetition / hallucination \| At chunk boundaries with silence or music, the decoder can loop. This is worsened when segments are near the edge of the model's training length distribution. \|
	\| No overlap \| Without overlap between chunks, words at segment boundaries can be dropped or doubled. \|

	### Comparison: Canary vs. Our Finetuned Whisper on Long-Form

	Whisper was explicitly designed and trained for long-form audio with:
	- Sliding window inference with overlap
	- Previous-chunk text as conditioning (prompt-based context)
	- Timestamps for alignment

	Canary's AED architecture does not use previous-chunk text as input, making long-form continuity fundamentally harder to achieve without careful chunk overlap and stitching.

	---

	## 🚀 Progress & Results

	### Current Status: Model Released & Repository Consolidated
	We have successfully completed the finetuning, KenLM integration, and repository consolidation phases. The model and its associated language models are now hosted on Hugging Face at `RASMUS/Finnish-ASR-Canary-v2`.

	- Infrastructure: Finetuned on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) on the Verda.com platform in Finland.
	- Model Suite: Acoustic model + 3 KenLM variants (1M, 2M, 5M sentences).
	- Best Performance (with KenLM 5M):
	  - FLEURS: 7.86% WER
	  - Common Voice: 4.70% WER
	  - CSS10: 7.07% WER
	  - VoxPopuli: 11.65% WER
	- Deployment: Integrated Silero VAD-based inference for robust long-form audio processing.

	### Next Steps:
	1. Long-form Tuning: Reduce default `chunk_len` to 8–10s (closer to training distribution median) and add 0.5–1s overlap between chunks to reduce boundary artifacts.
	2. Data Quality Audit: Fix 28 confirmed corrupted Common Voice entries where raw TSV metadata (client ID hashes, gender tags) was accidentally written into the `text` field. Audit VoxPopuli for missing capitalisation (all-lowercase transcripts despite `pnc: yes`).
	3. Number Handling: Add Finnish-specific training data with numeric content. Consider TTS-synthesised samples covering phone numbers, years, statistics, and measurements (both digit and written-out forms paired).
	4. Long-form Training Data: Add longer audio segments, such as TTS-synthesised long-form audio (`fbc_monolog_processed`, parliament data), to the training manifest to shift the duration distribution toward 15–30s.
	5. KenLM Refinement: Re-train KenLM with high-quality punctuated text. The current LM was trained on mixed-quality data.
	6. Advanced Evaluation: Implement CER evaluation on non-normalised test sets to better capture punctuation/casing accuracy.
	7. Repetition Penalty: Explore repetition penalty in decoding if chunk-level loops persist after chunk length tuning.
	8. Real-world Evaluation: Benchmark on diverse long-form samples (podcasts, meetings, call-centre audio).

	---

	## 🗺️ Action Plan: Next Training Run

	This section details the concrete steps for the next finetuning iteration, based on the root-cause analysis above.

	### Priority 1 — Fix Training Data (before re-training)

	#### 1a. Normalise numbers to digit form (Gemini Flash)
	Finnish written-out numbers in FLEURS transcripts cause the finetuned model to output inconsistent number forms. We will use the Gemini Flash API to convert all training transcripts in a single batch pass:

	```python
	# Sketch: run once on train_manifest.json before the next training run
	import json
	import os

	import google.generativeai as genai

	# Read the API key from the environment instead of hard-coding it
	genai.configure(api_key=os.environ["GEMINI_API_KEY"])
	model = genai.GenerativeModel("gemini-2.0-flash")

	SYSTEM_PROMPT = """You are a Finnish text normalizer.
	Convert any written-out Finnish numbers, ordinals, or number words in the text to digit form.
	Examples:
	"yli sata vuotta" → "yli 100 vuotta"
	"seitsemäntoista henkeä" → "17 henkeä"
	"vuonna tuhat yhdeksänsataa" → "vuonna 1900"
	Keep all other text exactly as-is. Return only the modified text, nothing else."""

	entries = []
	with open('manifests/train_manifest.json', encoding='utf-8') as f:
	    for line in f:
	        d = json.loads(line)
	        response = model.generate_content(f"{SYSTEM_PROMPT}\n\n{d['text']}")
	        candidate = response.text.strip()
	        # Safety check (threshold is an arbitrary choice): keep the original
	        # transcript if the reply is implausibly long for a normalised sentence.
	        if len(candidate) <= 2 * len(d['text']) + 20:
	            d['text'] = candidate
	        entries.append(d)

	# Write one JSON object per line, keeping Finnish characters unescaped
	with open('manifests/train_manifest_normalised.json', 'w', encoding='utf-8') as f:
	    for e in entries:
	        f.write(json.dumps(e, ensure_ascii=False) + '\n')
	```

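	Cost estimate: 23,180 entries × ~50 tokens average ≈ 1.2M transcript tokens. At Gemini Flash pricing (~$0.075 per 1M input tokens) this comes to under $0.10. The estimate counts transcript tokens only; the instruction prompt repeated in every request and the output tokens add to it.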
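
The estimate can be reproduced in a few lines. The inputs are the figures quoted above; the variable names are illustrative.

```python
entries = 23_180
tokens_per_entry = 50       # average transcript length assumed in the estimate
price_per_million = 0.075   # USD per 1M input tokens, as quoted above

total_tokens = entries * tokens_per_entry
cost_usd = total_tokens / 1_000_000 * price_per_million

print(f"{total_tokens:,} tokens -> ${cost_usd:.3f}")
```

Changing `tokens_per_entry` to include the length of the instruction prompt gives a more realistic figure for a setup that resends the prompt with every request.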

	#### 1b. Fix en-dash UNK token (confirmed root cause)
	The en-dash `–` (U+2013) is NOT in the tokenizer vocabulary — it maps to UNK (id 0). Replace it with ASCII hyphen before training:

	```python
	# Add to the manifest preprocessing step
	text = text.replace('\u2013', '-').replace('\u2014', '-')
	```

	This affects 85 entries in `train_manifest.json` (83 FLEURS, 2 Common Voice).

	#### 1c. Fix 28 corrupted Common Voice entries
	Replace entries where the `text` field contains raw TSV metadata (tabs + client_id hashes). Strip everything after the first tab character.
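
The tab-stripping rule can be sketched as follows, assuming manifest entries are dictionaries with a `text` field. `strip_tsv_metadata` and `clean_entries` are hypothetical helper names.

```python
def strip_tsv_metadata(text: str) -> str:
    """Keep only the part of the transcript before the first tab character."""
    return text.split("\t", 1)[0].strip()


def clean_entries(entries):
    """Return cleaned copies of the entries and the number of transcripts that changed."""
    cleaned, changed = [], 0
    for entry in entries:
        fixed = strip_tsv_metadata(entry["text"])
        changed += fixed != entry["text"]
        cleaned.append({**entry, "text": fixed})
    return cleaned, changed


sample = [
    {"text": "Hyvää päivää.\tabc123def\tmale"},  # contaminated (made-up example)
    {"text": "Kiitos."},                          # clean
]
cleaned, changed = clean_entries(sample)
print(changed, cleaned[0]["text"])
```

Comparing `changed` against the expected count of 28 is a cheap way to confirm that the pass touched exactly the known-bad entries.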

	---

	### Priority 2 — Add Long-Form Training Data

	#### TTS Long-Form Dataset: `RASMUS/canary_asr_finetune_tts_long_data`

	\| Property \| Value \|
	\|----------\|-------\|
	\| Size \| 8.0 GB zip \|
	\| Format \| FLAC audio + JSONL manifest \|
	\| Mean duration \| 16.5s (vs 7.8s in current data) \|
	\| Median duration \| 15.9s \|
	\| Max duration \| 25.0s \|
	\| Content \| Finnish speech: lectures, podcasts, YouTube \|
	\| Segments >20s \| ~25% \|

	This dataset directly addresses the training length mismatch. Adding it will shift the duration distribution from a mean of 7.8s toward ~10–12s and significantly increase the proportion of 15–25s segments that match inference chunk lengths.

	Integration plan:
	```bash
	# Download the dataset
	curl -L -H "Authorization: Bearer ${HF_TOKEN}" \
	"https://huggingface.co/datasets/RASMUS/canary_asr_finetune_tts_long_data/resolve/main/canary_dataset.zip" \
	-o /workspace/data/tts_long_data.zip

	# Extract
	unzip /workspace/data/tts_long_data.zip -d /workspace/data/tts_long_data/

	# Apply number normalisation and dash fix to canary_manifest.jsonl
	# then merge with existing train_manifest_normalised.json
	```

	After applying number normalisation and dash fixes to the new manifest, concatenate with the existing training set. Expected combined size: ~23,180 + N (estimate 5,000–20,000+ entries depending on total dataset size).

	---

	### Priority 3 — Inference Tuning (without re-training)

	Even before re-training, we can improve `moo.wav` performance by adjusting `inference_vad.py`:

	\| Parameter \| Current \| Recommended \|
	\|-----------\|---------\|-------------\|
	\| `chunk_len` \| 15s \| 8–10s (close to the training mean of 7.8s) \|
	\| chunk overlap \| 0s \| 0.5s (reduce boundary word drops) \|
	\| `alpha` (KenLM) \| 0.2 \| Try 0.1–0.15 (current may over-constrain decoder) \|
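
Chunk overlap can be approximated by padding each chunk boundary. This is a hedged sketch, assuming chunks are `(start, end)` pairs in seconds; `add_overlap` is a hypothetical helper, and merging the words that end up transcribed twice in the overlapping region is not shown.

```python
def add_overlap(chunks, overlap=0.5, total_duration=None):
    """Extend each (start, end) chunk by `overlap` seconds on both sides, clamped to the file."""
    padded = []
    for start, end in chunks:
        new_start = max(0.0, start - overlap)
        new_end = end + overlap
        if total_duration is not None:
            new_end = min(total_duration, new_end)
        padded.append((new_start, new_end))
    return padded


print(add_overlap([(0.0, 9.0), (10.0, 20.0)], overlap=0.5, total_duration=20.0))
```

Padding trades dropped boundary words for duplicated ones, so it only helps if the stitching step removes the duplicates.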

	---

	## 🔄 Round 2: Data Pipeline & Splits

	This section documents the data preparation methodology for Round 2 finetuning, including all new eval sets, the TTS integration, and the final manifest composition.

	### Overview of Changes vs Round 1

	\| Item \| Round 1 \| Round 2 \|
	\|------\|---------\|---------\|
	\| Base model \| `canary-1b-v2.nemo` \| `canary-1b-v2.nemo` (fresh start) \|
	\| Training samples \| 23,180 \| 28,858 \|
	\| Training hours \| ~50h \| 75.6h \|
	\| Mean duration \| 7.8s \| 9.4s \|
	\| Max duration allowed \| 20.0s \| 30.0s \|
	\| Transcripts normalised \| No \| Yes (digits, dashes fixed) \|
	\| Eval sets \| 4 \| 6 \|

	### Step 1 — Transcript Normalisation (`normalize_manifests.py`)

	All training transcripts were cleaned in two layers:

	Deterministic fixes (no API call needed):
	- En-dash `–` (U+2013) and em-dash `—` (U+2014) → ASCII hyphen `-` (fixes UNK token regression)
	- Corrupted Common Voice entries (raw TSV metadata in `text` field) → strip everything after first tab

	Gemini 2.5 Flash API calls (2,586 of 23,180 entries needed conversion):
	- Pre-filtered with a Finnish number-word regex so only entries that actually contain written numbers are sent to the API (cost: ~$0.62)
	- Written Finnish numbers converted to digit form: `sata vuotta` → `100 vuotta`, `seitsemäntoista` → `17`
	- Explicit DO NOT CONVERT rules: ordinals (`ensimmäinen`, `toinen`), superlative constructions (`yksi tärkeimmistä`), and `toinen` as "another/other"
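
A pre-filter of this kind might look like the sketch below. The stem list and the name `needs_normalisation` are assumptions for illustration; the actual regex in `normalize_manifests.py` is not reproduced here.

```python
import re

# Stems of common Finnish cardinals; matching on stems also catches compounds
# such as "seitsemäntoista" and "kaksikymmentä".
NUMBER_STEMS = (
    "yksi", "kaksi", "kolme", "neljä", "viisi", "kuusi", "seitsemän",
    "kahdeksan", "yhdeksän", "kymmen", "sata", "tuhat", "miljoona",
)
NUMBER_WORD_RE = re.compile(r"\b(?:" + "|".join(NUMBER_STEMS) + r")", re.IGNORECASE)


def needs_normalisation(text: str) -> bool:
    """Return True if the transcript contains something that looks like a number word."""
    return NUMBER_WORD_RE.search(text) is not None


for sample in ["yli sata vuotta", "yli 100 vuotta", "surmaten seitsemäntoista henkeä"]:
    print(sample, "->", needs_normalisation(sample))
```

The filter is deliberately loose: it also flags phrases like `yksi tärkeimmistä`, which is acceptable because the API prompt, not the regex, decides whether a match is actually converted.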

	### Step 2 — TTS Long-Form Data Integration

	Downloaded `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB, 6,365 entries, mean 16.4s).

	Aligned to NeMo training format:
	- Path rewritten to relative style: `data/tts_long_data/audio/{filename}`
	- Fields mapped: `language` → `source_lang`/`target_lang`, `task: "transcription"` → `taskname: "asr"`, added `pnc: "yes"`
	- Same Gemini normalisation pass applied (888 entries converted)
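
The field mapping above can be sketched as a single conversion function. The source-side key for the audio path is not documented here, so `audio_filepath` on the input is an assumption, as is the helper name `to_nemo_entry`.

```python
import posixpath


def to_nemo_entry(entry: dict, audio_dir: str = "data/tts_long_data/audio") -> dict:
    """Convert one TTS manifest entry to the NeMo-style fields used for training."""
    filename = posixpath.basename(entry["audio_filepath"])  # input key name is assumed
    return {
        "audio_filepath": f"{audio_dir}/{filename}",  # relative path
        "duration": entry["duration"],
        "text": entry["text"],
        "source_lang": entry["language"],  # language -> source_lang
        "target_lang": entry["language"],  # language -> target_lang
        "taskname": "asr",                 # replaces task: "transcription"
        "pnc": "yes",                      # added field
    }


src = {
    "audio_filepath": "/tmp/export/clip_0001.flac",  # made-up example path
    "duration": 16.4,
    "text": "Hei.",
    "language": "fi",
    "task": "transcription",
}
print(to_nemo_entry(src))
```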

	### Step 3 — Eval Set Construction (TTS Data)

	The 6,365 normalised TTS entries were split into train / eval / long-form-test:

	```
	All TTS entries (6,365)
	│
	├── Long-form pool (>20s): 1,501 entries
	│   ├── eval_long_form (sampled): 200 entries ← random.seed(42) shuffle → first 200
	│   └── Returned to training pool: 1,301 entries
	│
	└── Medium pool (10–20s): 4,864 entries
	    ├── eval_tts (10% hold-out): 487 entries ← stratified by duration bucket
	    └── tts_train: 4,377 entries
	```

	Why eval_long_form = 200 entries?
	The original 1,501 long-form entries (>20s) had a total duration of ~9.4 hours, far too long to run as a validation set every epoch. At batch_size=32 on a single GPU, each validation pass over 1,501 entries takes ~25 minutes, which adds up to more than 8 hours over 20 epochs. 200 entries (≈75 minutes of audio) provide a representative sample of the long-form distribution at reasonable cost: ~4 minutes of eval time per epoch.

	eval_tts construction:
	487 entries were held out from the 10–20s duration range (10% stratified sample). This tests the model's ability to handle medium-length audio and is separate from the original 4 eval sets.
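
The split logic can be sketched as below. It follows the description above (seeded shuffle for the long-form sample, per-bucket hold-out for the medium pool), but the 2-second bucket width, the rounding rule, and the function name `split_tts` are assumptions, so it will not reproduce the exact published split.

```python
import random
from collections import defaultdict


def split_tts(entries, long_threshold=20.0, n_long_eval=200, holdout=0.10, seed=42):
    """Split TTS entries into (eval_long_form, long_train, eval_tts, tts_train)."""
    rng = random.Random(seed)
    long_pool = [e for e in entries if e["duration"] > long_threshold]
    medium_pool = [e for e in entries if e["duration"] <= long_threshold]

    # Long-form: shuffle with a fixed seed, take the first n for evaluation,
    # and return the remainder to the training pool.
    rng.shuffle(long_pool)
    eval_long_form, long_train = long_pool[:n_long_eval], long_pool[n_long_eval:]

    # Medium: hold out a fixed fraction from every duration bucket.
    buckets = defaultdict(list)
    for e in medium_pool:
        buckets[int(e["duration"] // 2)].append(e)  # 2-second buckets (assumed)
    eval_tts, tts_train = [], []
    for key in sorted(buckets):
        group = buckets[key]
        rng.shuffle(group)
        k = round(len(group) * holdout)
        eval_tts += group[:k]
        tts_train += group[k:]
    return eval_long_form, long_train, eval_tts, tts_train


# Synthetic durations between 10s and 24s, for demonstration only.
demo = [{"duration": 10 + (i % 15)} for i in range(100)]
parts = split_tts(demo, n_long_eval=10)
print([len(p) for p in parts])
```

Stratifying by bucket keeps the duration profile of `eval_tts` close to that of the training portion, which a plain random 10% sample would only do on average.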

	### Step 4 — Combined Training Manifest

	Final `train_manifest_combined.jsonl` composition:

	\| Source \| Entries \| Notes \|
	\|--------\|---------\|-------\|
	\| Original train (normalised) \| 23,180 \| Digits + dash fix applied \|
	\| TTS train (10–20s) \| 4,377 \| Synthesised long-form speech \|
	\| Long-form overflow \| 1,301 \| >20s entries not selected for eval_long_form \|
	\| Total \| 28,858 \| Mean 9.4s, 75.6h \|

	### Final Eval Sets (Round 2)

	\| Set \| File \| Entries \| Mean Duration \| Purpose \|
	\|-----\|------\|---------\|--------------\|---------\|
	\| `eval_fleurs` \| `eval_fleurs.json` \| 918 \| 13.0s \| Primary benchmark (monitored for checkpointing) \|
	\| `eval_common_voice` \| `eval_common_voice.json` \| 1,554 \| 5.1s \| Crowdsourced quality \|
	\| `eval_css10` \| `eval_css10.json` \| 170 \| 7.5s \| Clean single-speaker \|
	\| `eval_voxpopuli` \| `eval_voxpopuli.json` \| 430 \| 10.6s \| Formal/parliament speech \|
	\| `eval_tts` \| `eval_tts.jsonl` \| 487 \| 14.5s \| Medium-length TTS (new) \|
	\| `eval_long_form` \| `eval_long_form.jsonl` \| 200 \| 22.5s \| Long-form >20s sample (new) \|

	Checkpoint monitoring: `val_wer` tracks FLEURS (first validation set). All 6 WERs are logged independently to WandB.

	### Round 2 Training Config

	File: `configs/canary_finetune_finnish_v2.yaml`
	Key settings:
	- `init_from_nemo_model`: `/workspace/Finnish-ASR-Canary-v2/models/canary-1b-v2.nemo` (fresh start from base)
	- `max_duration`: 30.0s (up from 20.0s to include TTS segments up to 25s)
	- `max_steps`: 18,000 (scaled: 28,858 / 32 ≈ 902 steps/epoch × 20 epochs ≈ 18,040)
	- `lr`: 1e-5, `WarmupAnnealing`, 500 warmup steps
	- `precision`: bf16, single GPU, `strategy: auto`
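
The `max_steps` value follows from simple arithmetic, shown here with the numbers quoted above (the variable names are illustrative):

```python
import math

train_entries = 28_858
batch_size = 32
epochs = 20

steps_per_epoch = math.ceil(train_entries / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

The result is slightly above the configured 18,000, so the run stops a little short of 20 full epochs.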

	---

	## 🛠️ Workflow Status Details

	### 1. Data Preparation - DONE
	- [x] Identify and inventory all 4 datasets
	- [x] Create unified processing script (`scripts/prepare_all_manifests.py`)
	- [x] Run `scripts/prepare_all_manifests.py` on devcontainer
	- [x] Verify manifest sample counts and audio file integrity

	### 2. Configuration Setup - DONE
	- [x] Create Hydra training config (`configs/canary_finetune_finnish.yaml`)
	- [x] Configure multi-validation with 4 eval datasets
	- [x] Checkpoint monitors primary eval set (FLEURS) via `val_wer`
	- [x] All 4 eval WERs logged independently to WandB

	### 3. Training - DONE
	- [x] Run finetuning via `run_training.sh`
	- [x] Monitor per-dataset WER in WandB

	### 4. KenLM / NGPU-LM Language Model Integration - DONE
	- [x] Install KenLM tools (`install_beamsearch_decoders.sh`)
	- [x] Gather Finnish text (ASR transcripts + Wikipedia + mc4)
	- [x] Train 3 variants of KenLM (1M, 2M, 5M sentences)
	- [x] Evaluate with LM fusion on all 4 test sets

	### 5. Repository & Long-Form Inference - IN PROGRESS
	- [x] Consolidate README and model metadata for Hugging Face release
	- [x] Upload model checkpoints and KenLM bundles to HF Hub
	- [x] Implement Silero VAD-based chunking for long-form audio (`inference_vad.py`)
	- [x] Root-cause analysis of long-form degradation vs. Whisper (see above)
	- [ ] Reduce `chunk_len` to 8–10s and add chunk overlap (Current Focus)
	- [ ] Optimize `alpha` for stability on `moo.wav` (30 min test file)

	### 6. Data Quality & Advanced Evaluation - PARTIALLY DONE
	- [x] Fix 28 corrupted Common Voice manifest entries (raw TSV data in text field) — done in normalisation pass.
	- [x] Fix en-dash/em-dash UNK token regression — done in normalisation pass.
	- [ ] Audit VoxPopuli transcripts for all-lowercase entries (capitalisation missing).
	- [ ] Re-train KenLM with high-quality punctuated text.
	- [ ] Evaluate CER on non-normalized test sets.

	### 7. Number Normalisation & UNK Token Fix - DONE
	- [x] Replace en-dash `–` and em-dash `—` with ASCII hyphen `-` in all training manifests (85 train + 70 TTS entries fixed).
	- [x] Use Gemini 2.5 Flash to normalise written-out Finnish numbers to digit form (2,586 API calls across train + TTS).
	- [ ] Re-evaluate on the 5-sample number test set after Round 2 training to verify consistency.

	### 8. Long-Form Data Expansion - DONE
	- [x] Download `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB zip, 6,365 entries, mean 16.4s).
	- [x] Align TTS manifest to NeMo training format and integrate into combined training manifest.
	- [x] Round 2 training configured and ready to launch (see Round 2 section above).
	- [ ] Benchmark Round 2 model against Round 1 and finetuned Whisper on `moo.wav`.

	---

	## 🛠️ NeMo Environment Setup

	This section documents the exact steps to set up a working NeMo inference/training environment, including the fixes required for the `nvcr.io/nvidia/pytorch:25.01-py3` container.

	### Installation (from scratch on pytorch:25.01-py3 base image)

	```bash
	# 1. Clone the HF model repo (contains NeMo source with patches applied)
	# Skip LFS to avoid downloading the 3.6 GB model during clone
	GIT_LFS_SKIP_SMUDGE=1 git clone \
	"https://user:${HF_TOKEN}@huggingface.co/RASMUS/Finnish-ASR-Canary-v2" \
	/workspace/Finnish-ASR-Canary-v2

	# 2. Install NeMo in editable mode from the patched source
	cd /workspace/Finnish-ASR-Canary-v2/NeMo
	pip install -e ".[asr]"

	# 3. Install pinned dependencies
	pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' kaldialign wandb
	```

	### Required Compatibility Fixes

	The pytorch:25.01-py3 container ships with packages that conflict with NeMo 2.8.0rc0:

	```bash
	# Fix 1: Downgrade lightning to the version NeMo requires (<=2.4.0)
	# The container ships lightning 2.4.0 but pip may upgrade it — pin it back.
	pip install "lightning==2.4.0" "pytorch-lightning==2.4.0"

	# Fix 2: Remove incompatible torchvision
	# The container's torchvision (0.20.0a0) was built against torch 2.6.0a0 (the original
	# container torch), but NeMo's install upgrades torch to ~2.10. torchvision then fails
	# on import and blocks NeMo. ASR does not need torchvision.
	pip uninstall -y torchvision
	```

	### Downloading the Finetuned Model

	```bash
	# Download the finetuned acoustic model (3.6 GB)
	curl -L \
	-H "Authorization: Bearer ${HF_TOKEN}" \
	"https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/canary-finnish.nemo" \
	-o /workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo

	# KenLM models are also LFS — download the 5M variant (best WER):
	curl -L \
	-H "Authorization: Bearer ${HF_TOKEN}" \
	"https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/kenlm_5M.nemo" \
	-o /workspace/Finnish-ASR-Canary-v2/kenlm_5M.nemo
	```

	### Quick Inference Smoke Test

	```python
	import warnings; warnings.filterwarnings('ignore')
	from nemo.collections.asr.models import EncDecMultiTaskModel

	model = EncDecMultiTaskModel.restore_from(
	    '/workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo',
	    map_location='cuda'
	)
	model.eval()

	results = model.transcribe(
	    audio=['path/to/audio.wav'],
	    task='asr', source_lang='fi', target_lang='fi', pnc='yes'
	)
	print(results[0].text)
	```

	### Loading the Base Model (for comparison)

	```python
	# Downloads ~3.6 GB on first run, cached in ~/.cache/huggingface/
	model_base = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2", map_location='cuda')
	```

	---

	## 📝 Progress Log
	- 2026-01-11: Initial project setup.
	- 2026-02-08: Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice).
	- 2026-02-10: Finetuning complete. Epoch 11 reached `val_wer=0.1258` on FLEURS.
	- 2026-02-13: Mermaid diagrams and project documentation for DS team.
	- 2026-02-18: KenLM benchmarks finished. Consolidated repository structure. Applied NeMo patches for inference stability.
	- 2026-02-20: Model released. Published `Finnish-ASR-Canary-v2` on HF and implemented the VAD-based inference pipeline. Currently tuning for long-form stability on `moo.wav` with various `alpha` settings (0.0 - 0.4 tested).
	- 2026-02-26: Root-cause analysis complete. Investigated long-form gap vs. Whisper and number handling. Key findings: (1) 77% of training data is <10s, creating distribution shift at inference chunk lengths; (2) No cross-chunk context in Canary's AED architecture; (3) Only 2.5% of training samples contain digit characters — numbers are a known weak point; (4) 28 corrupted Common Voice entries found (TSV metadata in text field); (5) `moo.wav` test file confirmed as 30 minutes. Action plan: shorten chunk_len, add chunk overlap, fix data corruption, and plan a long-form training data expansion round.
	- 2026-02-26: Live number inference + tokenizer audit completed. Ran base Canary-v2 vs. finetuned model on 5 FLEURS samples. Confirmed: (1) base model always outputs digits (`100`, `17`); (2) finetuned model regressed to mixed output — sometimes written words, sometimes digits — due to inconsistent training transcripts; (3) en-dash (`–`) produces UNK token `⁇` in finetuned model, base model degrades gracefully to ASCII hyphen. Policy decision: standardise on digit output and fix en-dash encoding in training manifests before next training run. NeMo environment setup documented (with fixes for `torchvision` and `lightning` version conflicts). TTS long-form dataset (`canary_asr_finetune_tts_long_data`, 8GB, mean 16.5s/segment) identified as key data source for next training run. Action plan for next run: (1) normalise numbers to digits via Gemini Flash API, (2) fix en-dash → ASCII hyphen, (3) fix 28 corrupted CV entries, (4) add TTS long-form data.
	- 2026-03-01: Round 2 data pipeline complete. Ran `normalize_manifests.py`: 2,586 Gemini 2.5 Flash API calls (~$0.62), 1,137 number changes in train + 888 in TTS, 85 en-dash and 28 corrupted CV entries fixed. Downloaded and extracted TTS long-form dataset (6,365 entries, 4.8 GB). Split TTS data into train (4,377), eval_tts (487, mean 14.5s), and long-form pool (1,501 entries >20s). Sampled 200 entries into `eval_long_form.jsonl` (seed 42) and returned 1,301 to training, yielding `train_manifest_combined.jsonl` (28,858 entries, 75.6h). Round 2 training config created (`configs/canary_finetune_finnish_v2.yaml`). Training ready to launch.
	- 2026-03-01: Training crash diagnosed and fixed. Round 2 training ran 505 steps then crashed with CUDA `vectorized_gather_kernel index out of bounds`. Root cause: entry 14857 in `train_manifest_combined.jsonl` contained 11,247 chars of Python code (Gemini normalization returned a code block instead of a transcript for `voxpopuli_005371.wav`). When tokenized with the canary2 prompt format, the sequence far exceeded the decoder's `max_sequence_length=1024`, causing position-embedding OOB. Additionally, 4 entries in `eval_common_voice.json` had TSV metadata contamination (same v1 issue, not previously caught in the v2 eval set). Both manifests fixed. Config rewritten from full-architecture spec to minimal v1-style format (`tokenizer: update_tokenizer: false`) using `speech_to_text_finetune.py` (which restores the full model from the `.nemo` file). Training re-launched. Manifests synced to `canary-finnish-asr-data` HuggingFace dataset repo.
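
A sanity check over the manifest before launching training could catch entries like the one behind this crash. This is a minimal sketch, assuming `text` and `duration` fields; the 40 characters-per-second ceiling is an arbitrary threshold chosen for illustration, and `suspicious_entries` is a hypothetical helper.

```python
CODE_FENCE = "`" * 3  # three backticks, as produced by LLM code-block formatting


def suspicious_entries(entries, max_chars_per_second=40.0):
    """Yield (index, reason) for transcripts that do not look like plain speech."""
    for i, entry in enumerate(entries):
        text, duration = entry["text"], entry["duration"]
        if "\t" in text:
            yield i, "tab character (possible TSV metadata)"
        elif CODE_FENCE in text:
            yield i, "code fence (possible LLM formatting)"
        elif len(text) > max_chars_per_second * max(duration, 0.1):
            yield i, "too long for its audio duration"


demo_entries = [
    {"text": "Hei maailma.", "duration": 1.5},
    {"text": "x" * 11247, "duration": 12.0},  # same length as the entry that crashed
    {"text": "Moi\tabc123", "duration": 1.0},
]
for index, reason in suspicious_entries(demo_entries):
    print(index, reason)
```

Running such a check on both training and eval manifests would cover the oversized transcript and the TSV contamination in one pass.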