| # Finnish ASR: Canary-v2 Finetuning & Progress | |
| This document provides a high-level overview of our Finnish ASR finetuning process, model architecture, and current progress for the Data Science team. | |
| --- | |
| ## Project Overview | |
| Our goal is to adapt NVIDIA's **Canary-v2** (a 1-billion parameter multilingual model) for high-accuracy Finnish Automatic Speech Recognition (ASR). We leverage four diverse datasets to ensure robustness across different domains and speaking styles. | |
| --- | |
| ## Model Architecture | |
| Canary-v2 is an **attention-based encoder-decoder (AED)** model that pairs a **Fast-Conformer** encoder with a Transformer decoder. This design allows for efficient processing of long audio sequences while maintaining high accuracy. | |
| ```mermaid | |
| graph TD | |
| A[Audio Input] -->|Preprocessing| B[Mel Spectrogram] | |
| subgraph TrainingBlock [Finetuned Components] | |
| direction TB | |
| subgraph Encoder [Encoder: Acoustic Modeling] | |
| C1[Convolutional Subsampling] -->|Downsample| C2[Conformer Blocks] | |
| C2 -->|Latent Features| C_Out[Acoustic Latents] | |
| end | |
| subgraph Decoder [Decoder: Language Modeling] | |
| D1[Masked Self-Attention] --> D2[Cross-Attention] | |
| D2 --> D3[Feed Forward] | |
| D3 --> D_Out[Text Generation] | |
| end | |
| end | |
| B -->|Input| C1 | |
| P[Input Prompts:<br/>Lang, Task, PnC] -->|Conditioning| D1 | |
| C_Out -->|Acoustic Context| D2 | |
| D_Out -->|Output| E[Finnish Text] | |
| %% Styling | |
| style TrainingBlock fill:#f0f7ff,stroke:#0052cc,stroke-width:3px,stroke-dasharray: 5 5 | |
| style A fill:#ffffff,stroke:#333,stroke-width:2px | |
| style B fill:#ffffff,stroke:#333,stroke-width:2px | |
| style P fill:#ffffff,stroke:#333,stroke-width:2px | |
| style E fill:#e6ffed,stroke:#28a745,stroke-width:2px | |
| style Encoder fill:#ffffff,stroke:#0052cc,stroke-width:1px | |
| style Decoder fill:#ffffff,stroke:#0052cc,stroke-width:1px | |
| ``` | |
| ### Component Roles & Finetuning: | |
| - **Highlighted Area (Blue Dashed Box)**: This represents the core weights of the **Canary-v2** model. During our finetuning, we update the parameters in both the **Encoder** and **Decoder** to specifically recognize Finnish phonemes and grammar. | |
| - **Mel Spectrogram**: The "Vision" stage. It turns raw audio waves into a structured 2D representation of sound frequencies over time. | |
| - **Fast-Conformer Encoder**: The "Acoustic Processor." We finetuned this to understand the unique sounds of the Finnish language (like double vowels and consonants). | |
| - **Input Prompts**: The "Context Injector." These are the same color as other inputs because they are part of the model's standard input pipeline, telling it: "Act as a Finnish ASR system." | |
| - **Attention-Decoder**: The "Linguistic Brain." We finetuned this to map the Finnish sounds from the encoder into grammatically correct Finnish text, guided by the prompts. | |
| --- | |
| ## Finetuning Workflow | |
| Our pipeline is fully automated, from data ingestion to multi-dataset evaluation. | |
| ```mermaid | |
| graph TD | |
| subgraph DataPrep [Data Preparation] | |
| D1[CSS10 Finnish] --> P[Unified Processing Script] | |
| D2[FLEURS Finnish] --> P | |
| D3[VoxPopuli Finnish] --> P | |
| D4[Common Voice v24] --> P | |
| P --> M1[train_manifest.json] | |
| P --> M2[eval_fleurs.json] | |
| P --> M3[eval_common_voice.json] | |
| P --> M4[eval_css10.json] | |
| P --> M5[eval_voxpopuli.json] | |
| end | |
| subgraph Training [Canary-v2 Finetuning] | |
| M1 --> T[NVIDIA NeMo Trainer] | |
| CM[nvidia/canary-1b-v2] --> T | |
| T --> CK[Model Checkpoints] | |
| M2 & M3 & M4 & M5 --> V[Multi-Validation] | |
| V --> W[WandB Tracking] | |
| end | |
| subgraph Inference [Post-Processing] | |
| CK --> Inf[Inference] | |
| Inf --> K[KenLM/NGPU-LM Integration] | |
| K --> R[Final ASR Output] | |
| end | |
| ``` | |
| --- | |
| ## Datasets | |
| We use a balanced mix of datasets to cover various audio qualities and transcript styles: | |
| | Dataset | Source | Characteristics | | |
| |---------|--------|-----------------| | |
| | **FLEURS** | Google | High-quality, diverse speakers (Benchmark) | | |
| | **Common Voice** | Mozilla | Crowdsourced, varied quality and accents | | |
| | **CSS10** | Single Speaker | Clean, high-quality audio books | | |
| | **VoxPopuli** | EU Parliament | European Parliament speeches (Formal) | | |
| --- | |
| ## Training Data Analysis | |
| This section documents the composition and length distribution of our training data (from `RASMUS/canary-finnish-asr-data`, accessed 2026-02-26). | |
| ### Dataset Summary | |
| | Dataset | Samples | Mean Duration | Max Duration | Total Hours | | |
| |---------|---------|--------------|-------------|-------------| | |
| | **Common Voice v24** | 9,086 | 4.5s | 10.5s | 11.2h | | |
| | **VoxPopuli** | 8,164 | 10.1s | 50.5s | 23.0h | | |
| | **CSS10** | 3,226 | 7.7s | 20.2s | 6.9h | | |
| | **FLEURS** | 2,704 | 11.7s | 43.2s | 8.8h | | |
| | **TOTAL** | **23,180** | **7.8s** | **50.5s** | **~50h** | | |
| ### Duration Distribution (Training Set) | |
| ``` | |
| 0–5s : 33.3% (7,725 samples) ████████████████████████████████████████ | |
| 5–10s : 43.7% (10,139 samples) ████████████████████████████████████████████████████ | |
| 10–15s : 15.0% (3,473 samples) ██████████████████ | |
| 15–20s : 5.4% (1,241 samples) ██████ | |
| 20–30s : 2.4% (562 samples) ███ | |
| >30s : 0.2% (40 samples) | |
| ``` | |
| **Key insight:** 77% of training samples are shorter than 10 seconds. The model has very little exposure to longer audio segments (only 0.2% are >30s). This has direct implications for long-form inference stability. | |
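The distribution above can be recomputed from any NeMo-style JSONL manifest. The following is a minimal sketch, assuming each manifest line carries a `duration` field in seconds; the bucket edges mirror the table, and the handling of values that fall exactly on an edge is an assumption.

```python
import json
from collections import Counter

# Bucket edges in seconds (lower bound inclusive, upper bound exclusive).
BUCKETS = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 30), (30, float("inf"))]


def bucket_label(duration):
    """Return the label of the bucket that contains `duration` (seconds)."""
    for low, high in BUCKETS:
        if low <= duration < high:
            return f"{low}s+" if high == float("inf") else f"{low}-{high}s"
    raise ValueError(f"negative duration: {duration}")


def duration_histogram(manifest_lines):
    """Count manifest entries per duration bucket.

    `manifest_lines` is any iterable of JSON strings, e.g. an open file.
    """
    counts = Counter()
    for line in manifest_lines:
        if line.strip():
            counts[bucket_label(json.loads(line)["duration"])] += 1
    return counts


if __name__ == "__main__":
    sample = ['{"duration": 4.5}', '{"duration": 9.9}', '{"duration": 31.0}']
    hist = duration_histogram(sample)
    total = sum(hist.values())
    for low, _ in BUCKETS:
        label = bucket_label(low)
        print(f"{label:>7}: {100 * hist[label] / total:5.1f}% ({hist[label]} samples)")
```

Passing an open manifest file instead of `sample` gives the per-bucket counts for the real training set.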
| ### Evaluation Set Durations | |
| | Eval Set | Samples | Mean Duration | Max Duration | | |
| |----------|---------|--------------|-------------| | |
| | FLEURS | 918 | 13.0s | 33.7s | | |
| | Common Voice | 1,554 | 5.1s | 10.5s | | |
| | CSS10 | 170 | 7.5s | 10.2s | | |
| | VoxPopuli | 430 | 10.6s | 47.5s | | |
| --- | |
| ## Number Handling Analysis | |
| ### Live Inference Results: Base vs Finetuned (2026-02-26) | |
| We ran both models on 5 FLEURS test samples to determine each model's number output style. | |
| | # | Scenario | Reference | Base Canary-v2 | Our Finetuned | | |
| |---|----------|-----------|----------------|---------------| | |
| | 1 | Spoken "sata" (hundred) | `yli sata vuotta` | `yli 100 vuotta` | `yli 100 vuotta` | | |
| | 2 | Spoken "seitsemäntoista" (17) | `surmaten seitsemäntoista henkeä` | `surmaten 17 henkeä` | `surmaten seitsemäntoista henkeä` | | |
| | 3 | Digits in reference (15, 2011, 2017) | `15 metriä... 2011... 2017` | Correct | Correct | | |
| | 4 | Abbreviation "jKr." (AD) | `400 jKr.` | `400 jälkeen Kristuksen` | `400 jälkeen Kristuksen` | | |
| | 5 | Range "25–30" (en-dash U+2013) | `25–30 vuodella` | `25-30 vuodella` (ASCII hyphen) | `25 ⁇ 30 vuodella` (UNK token) | | |
| **Key findings:** | |
| 1. **Base model outputs digits.** When the speaker says "sata" (hundred) or "seitsemäntoista" (seventeen), the base Canary-v2 outputs `100` and `17`. This is NVIDIA's built-in text normalisation: in these samples Canary outputs digit form for numbers. | |
| 2. **Finetuning introduced inconsistency.** Our finetuning partially reversed this: for `seitsemäntoista` the finetuned model now outputs the written word (because FLEURS training transcripts used written-out numbers), but still outputs `100` for `sata`. This inconsistency is worse than either consistent policy. | |
| 3. **En-dash produces a UNK token in the finetuned model.** The character `–` (U+2013 en-dash) in `25–30` causes the finetuned model to emit `⁇` (SentencePiece UNK). The base model degrades gracefully to an ASCII hyphen `25-30`. This is a regression introduced by finetuning, likely because the en-dash was absent or inconsistently encoded in our training data. | |
| 4. **Abbreviations are expanded by both models.** `jKr.` → `jälkeen Kristuksen` in both; this is model behaviour, not a finetuning artifact. | |
| ### Policy Decision | |
| **We want digit output** (not written-out Finnish number words). The base model's behaviour is correct here. The finetuned model regressed on consistency because our FLEURS training transcripts used written-out numbers. | |
| ### Training Data Issues Found | |
| - Only **2.5% (578 / 23,180)** of training samples contain digit characters at all. | |
| - FLEURS transcripts use written-out numbers (`sata vuotta`) while VoxPopuli and Common Voice use digits. This gives the model conflicting signal. | |
| - En-dash (`–` U+2013) may be absent or mis-encoded in training manifests, causing UNK tokens at inference time. | |
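Counts like the ones in this list can be reproduced with a short manifest scan. This is a sketch only, assuming JSONL lines with a `text` field; the function name is hypothetical and is not taken from the project's scripts.

```python
import json
import re

DIGIT_RE = re.compile(r"\d")
DASHES = ("\u2013", "\u2014")  # en-dash, em-dash


def audit_transcripts(manifest_lines):
    """Return (total, with_digits, with_dashes) for an iterable of JSONL lines."""
    total = with_digits = with_dashes = 0
    for line in manifest_lines:
        if not line.strip():
            continue
        text = json.loads(line)["text"]
        total += 1
        with_digits += bool(DIGIT_RE.search(text))
        with_dashes += any(dash in text for dash in DASHES)
    return total, with_digits, with_dashes


if __name__ == "__main__":
    sample = [
        json.dumps({"text": "yli sata vuotta"}),
        json.dumps({"text": "15 metriä"}),
        json.dumps({"text": "25\u201330 vuodella"}),
    ]
    total, digits, dashes = audit_transcripts(sample)
    print(f"{digits}/{total} with digits, {dashes}/{total} with en/em-dash")
```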
| ### Action Plan: Numbers & UNK Token | |
| #### Step 1: Normalise training transcripts to digit form | |
| Run a pre-processing pass on `train_manifest.json` before the next training run: | |
| - Convert Finnish written-out numbers to digits, e.g. `sata` → `100`, `seitsemäntoista` → `17`. The Python library `num2words` (with `lang="fi"`) only converts digits to words, so it would have to be used in reverse, for example by generating a word-to-digit lookup table. | |
| - OR (simpler / safer): replace the FLEURS transcripts in the manifest with their **raw reference texts which already have digits** (FLEURS provides both `raw_transcription` and `transcription` columns; currently we use `raw_transcription` which has written numbers). | |
| - Target: **all numeric quantities consistently in digit form** across all four datasets. | |
| #### Step 2: Fix en-dash encoding (ROOT CAUSE CONFIRMED) | |
| **Confirmed via tokenizer inspection (2026-02-26):** | |
| ```python | |
| m.tokenizer.text_to_ids("25–30") # → [16053, 1125, 1128, 0, 1126, 1123] | |
| # ^ id 0 = UNK for the en-dash! | |
| m.tokenizer.text_to_ids("25-30") # → [16053, 1125, 1128, 16107, 1126, 1123] | |
| # ^ ASCII hyphen tokenises correctly | |
| ``` | |
| - **En-dash `–` (U+2013) and em-dash (U+2014) are NOT in the CanaryBPETokenizer vocabulary** (both map to UNK id 0). | |
| - Training data contains **85 entries with en-dash** (83 FLEURS, 2 Common Voice). During training, the en-dash in the TARGET text was encoded as UNK, so the model learned to produce UNK for the corresponding speech sounds. | |
| - **Fix: replace every en-dash (U+2013) and em-dash (U+2014) with ASCII hyphen `-` in all training transcripts** before the next training run. This is a one-line preprocessing step. | |
| ```python | |
| # In manifest preprocessing: | |
| text = text.replace('\u2013', '-').replace('\u2014', '-') | |
| ``` | |
| #### Step 3: Re-evaluate after normalisation | |
| After normalising transcripts, re-run the 5-sample live inference test to verify: | |
| - `sata vuotta` audio → model outputs `100 vuotta` | |
| - `seitsemäntoista` audio → model outputs `17` | |
| - `25–30` audio → model outputs `25-30` (no UNK; the en-dash itself is not in the vocabulary, so the ASCII hyphen is the only clean form) | |
| --- | |
| ## Long-Form Audio: Root Cause Analysis | |
| Our test file `moo.wav` is **30 minutes** (1,800s) of continuous Finnish speech. This reveals a core gap vs. our finetuned Whisper model. | |
| ### How Canary-v2 Handles Long Audio (Natively) | |
| - NVIDIA's Canary-v2 uses **dynamic chunking** with 1-second overlap between chunks. | |
| - This is automatically triggered for audio longer than **40 seconds**. | |
| - The model was pre-trained on a 1.7M-hour multilingual corpus with this chunking strategy baked in. | |
| ### Our Current Approach (`inference_vad.py`) | |
| 1. Silero VAD detects speech segments. | |
| 2. Segments are merged into chunks up to `chunk_len` seconds (default: **15s**). | |
| 3. Each chunk is transcribed **independently** β no shared context between chunks. | |
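The merging step can be sketched as a greedy pass over the VAD output. This is an illustration under stated assumptions, not the actual contents of `inference_vad.py`: segments are taken to be `(start, end)` tuples in seconds, and `merge_segments` is a hypothetical name.

```python
def merge_segments(segments, chunk_len=15.0):
    """Greedily merge (start, end) speech segments into chunks of at most
    `chunk_len` seconds, measured from chunk start to chunk end.

    A single segment longer than `chunk_len` is kept as its own chunk
    rather than being split.
    """
    chunks = []
    cur_start = cur_end = None
    for start, end in sorted(segments):
        if cur_start is None:
            cur_start, cur_end = start, end
        elif max(cur_end, end) - cur_start <= chunk_len:
            cur_end = max(cur_end, end)  # still fits: extend the current chunk
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


if __name__ == "__main__":
    vad = [(0, 4), (5, 9), (10, 14), (16, 20), (40, 45)]
    print(merge_segments(vad, chunk_len=15.0))
    print(merge_segments(vad, chunk_len=8.0))
```

Lowering `chunk_len` yields more, shorter chunks, which is the trade-off discussed in the root-cause table that follows.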
| ### Root Causes of Degradation on Long-Form | |
| | Issue | Detail | | |
| |-------|--------| | |
| | **Training length mismatch** | 77% of fine-tuning data is <10s and only about 8% is 15s or longer. Inference chunks of up to 15s therefore sit in the upper tail of the training length distribution, creating distribution shift. | | |
| | **No cross-chunk context** | Each 15s chunk is transcribed in isolation. Canary's attention decoder has no memory of previous chunks, so topic/speaker continuity is lost at boundaries. | | |
| | **VAD vs. native chunking** | Our VAD-based approach differs from Canary's built-in dynamic chunking. The model was not fine-tuned with this chunking strategy. | | |
| | **Repetition / hallucination** | At chunk boundaries with silence or music, the decoder can loop. This is worsened when segments are near the edge of the model's training length distribution. | | |
| | **No overlap** | Without overlap between chunks, words at segment boundaries can be dropped or doubled. | | |
| ### Comparison: Canary vs. Our Finetuned Whisper on Long-Form | |
| Whisper was explicitly designed and trained for long-form audio with: | |
| - Sliding window inference with overlap | |
| - Previous-chunk text as conditioning (prompt-based context) | |
| - Timestamps for alignment | |
| Canary's AED architecture does not use previous-chunk text as input, making long-form continuity fundamentally harder to achieve without careful chunk overlap and stitching. | |
| --- | |
| ## Progress & Results | |
| ### Current Status: **Model Released & Repository Consolidated** | |
| We have successfully completed the finetuning, KenLM integration, and repository consolidation phases. The model and its associated language models are now hosted on Hugging Face at `RASMUS/Finnish-ASR-Canary-v2`. | |
| - **Infrastructure:** Finetuned on an **NVIDIA RTX PRO 6000 Blackwell** (96 GB VRAM) on the Verda.com platform in Finland. | |
| - **Model Suite:** Acoustic model + 3 KenLM variants (1M, 2M, 5M sentences). | |
| - **Best Performance (with KenLM 5M):** | |
| - **FLEURS:** 7.86% WER | |
| - **Common Voice:** 4.70% WER | |
| - **CSS10:** 7.07% WER | |
| - **VoxPopuli:** 11.65% WER | |
| - **Deployment:** Integrated Silero VAD-based inference for robust long-form audio processing. | |
| ### Next Steps: | |
| 1. **Long-form Tuning:** Reduce default `chunk_len` to 8–10s (closer to training distribution median) and add 0.5–1s overlap between chunks to reduce boundary artifacts. | |
| 2. **Data Quality Audit:** Fix 28 confirmed corrupted Common Voice entries where raw TSV metadata (client ID hashes, gender tags) was accidentally written into the `text` field. Audit VoxPopuli for missing capitalisation (all-lowercase transcripts despite `pnc: yes`). | |
| 3. **Number Handling:** Add Finnish-specific training data with numeric content. Consider TTS-synthesised samples covering phone numbers, years, statistics, and measurements (both digit and written-out forms paired). | |
| 4. **Long-form Training Data:** Add longer audio segments, such as TTS synthetic long-form audio (`fbc_monolog_processed`, parliament data), to the training manifest to shift the duration distribution toward 15–30s. | |
| 5. **KenLM Refinement:** Re-train KenLM with high-quality punctuated text. Current LM trained on mixed-quality data. | |
| 6. **Advanced Evaluation:** Implement CER evaluation on non-normalised test sets to better capture punctuation/casing accuracy. | |
| 7. **Repetition Penalty:** Explore repetition penalty in decoding if chunk-level loops persist after chunk length tuning. | |
| 8. **Real-world Evaluation:** Benchmark on diverse long-form samples (podcasts, meetings, call-centre audio). | |
| --- | |
| ## Action Plan: Next Training Run | |
| This section details the concrete steps for the next finetuning iteration, based on the root-cause analysis above. | |
| ### Priority 1: Fix Training Data (before re-training) | |
| #### 1a. Normalise numbers to digit form (Gemini Flash) | |
| Finnish written-out numbers in FLEURS transcripts cause the finetuned model to output inconsistent number forms. We will use the Gemini Flash API to convert all training transcripts in a single batch pass: | |
| ```python | |
| # Sketch: run once on train_manifest.json before the next training run | |
| import json | |
| import os | |
| import google.generativeai as genai | |
| # Read the API key from the environment instead of hard-coding it | |
| genai.configure(api_key=os.environ["GEMINI_API_KEY"]) | |
| model = genai.GenerativeModel("gemini-2.0-flash") | |
| SYSTEM_PROMPT = """You are a Finnish text normalizer. | |
| Convert any written-out Finnish numbers, ordinals, or number words in the text to digit form. | |
| Examples: | |
| "yli sata vuotta" → "yli 100 vuotta" | |
| "seitsemäntoista henkeä" → "17 henkeä" | |
| "vuonna tuhat yhdeksänsataa" → "vuonna 1900" | |
| Keep all other text exactly as-is. Return only the modified text, nothing else.""" | |
| entries = [] | |
| with open('manifests/train_manifest.json', encoding='utf-8') as f: | |
|     for line in f: | |
|         d = json.loads(line) | |
|         # One request per transcript; the reply replaces the text field | |
|         response = model.generate_content(f"{SYSTEM_PROMPT}\n\n{d['text']}") | |
|         d['text'] = response.text.strip() | |
|         entries.append(d) | |
| # Write the normalised manifest as JSONL, keeping non-ASCII characters readable | |
| with open('manifests/train_manifest_normalised.json', 'w', encoding='utf-8') as f: | |
|     for e in entries: | |
|         f.write(json.dumps(e, ensure_ascii=False) + '\n') | |
| ``` | |
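| Cost estimate: 23,180 entries × ~50 tokens average ≈ 1.2M transcript tokens. At Gemini Flash pricing (~$0.075/1M input tokens) that is **< $0.10 total**. The instruction prompt that is resent with every request, and the output tokens, add to this figure. | |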
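The estimate can be checked with a few lines of arithmetic. The entry count, token average, and price are the figures quoted above; the 100-token prompt overhead in the second calculation is a purely hypothetical value, added only to show how sensitive the total is to the per-request prompt.

```python
ENTRIES = 23_180
TOKENS_PER_TRANSCRIPT = 50        # average quoted in the estimate
PRICE_PER_MILLION_INPUT = 0.075   # USD, approximate rate quoted in the estimate


def input_cost(entries, tokens_per_entry, price_per_million):
    """Input-token cost in USD for one request per entry."""
    return entries * tokens_per_entry * price_per_million / 1_000_000


transcript_only = input_cost(ENTRIES, TOKENS_PER_TRANSCRIPT, PRICE_PER_MILLION_INPUT)

# Hypothetical: assume the instruction prompt adds ~100 tokens to every request.
with_prompt = input_cost(ENTRIES, TOKENS_PER_TRANSCRIPT + 100, PRICE_PER_MILLION_INPUT)

print(f"transcripts only: ${transcript_only:.3f}")
print(f"with prompt overhead: ${with_prompt:.3f}")
```

Output tokens are billed separately and are not included in either figure.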
| #### 1b. Fix en-dash UNK token (confirmed root cause) | |
| The en-dash `–` (U+2013) is NOT in the tokenizer vocabulary; it maps to UNK (id 0). Replace it with ASCII hyphen before training: | |
| ```python | |
| # Add to the manifest preprocessing step | |
| text = text.replace('\u2013', '-').replace('\u2014', '-') | |
| ``` | |
| This affects **85 entries** in `train_manifest.json` (83 FLEURS, 2 Common Voice). | |
| #### 1c. Fix 28 corrupted Common Voice entries | |
| Replace entries where the `text` field contains raw TSV metadata (tabs + client_id hashes). Strip everything after the first tab character. | |
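The tab-stripping rule is a one-liner. The sketch below uses a made-up contaminated string for illustration; the hash and gender tag are placeholders, not values from the real manifest.

```python
def strip_tsv_metadata(text):
    """Keep only the part of `text` before the first tab character."""
    return text.split("\t", 1)[0].strip()


if __name__ == "__main__":
    # Placeholder example of a transcript followed by leaked TSV columns.
    dirty = "Hyvää huomenta.\t0123abcd\tmale"
    print(strip_tsv_metadata(dirty))
```

Entries without a tab pass through unchanged, so the function is safe to apply to every line of the manifest.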
| --- | |
| ### Priority 2: Add Long-Form Training Data | |
| #### TTS Long-Form Dataset: `RASMUS/canary_asr_finetune_tts_long_data` | |
| | Property | Value | | |
| |----------|-------| | |
| | Size | 8.0 GB zip | | |
| | Format | FLAC audio + JSONL manifest | | |
| | Mean duration | **16.5s** (vs 7.8s in current data) | | |
| | Median duration | 15.9s | | |
| | Max duration | 25.0s | | |
| | Content | Finnish speech: lectures, podcasts, YouTube | | |
| | Segments >20s | ~25% | | |
| This dataset directly addresses the training length mismatch. Adding it will shift the duration distribution from a mean of 7.8s toward ~10–12s and significantly increase the proportion of 15–25s segments that match inference chunk lengths. | |
| **Integration plan:** | |
| ```bash | |
| # Download the dataset | |
| curl -L -H "Authorization: Bearer ${HF_TOKEN}" \ | |
| "https://huggingface.co/datasets/RASMUS/canary_asr_finetune_tts_long_data/resolve/main/canary_dataset.zip" \ | |
| -o /workspace/data/tts_long_data.zip | |
| # Extract | |
| unzip /workspace/data/tts_long_data.zip -d /workspace/data/tts_long_data/ | |
| # Apply number normalisation and dash fix to canary_manifest.jsonl | |
| # then merge with existing train_manifest_normalised.json | |
| ``` | |
| After applying number normalisation and dash fixes to the new manifest, concatenate with the existing training set. Expected combined size: ~23,180 + N (estimate 5,000–20,000+ entries depending on total dataset size). | |
| --- | |
| ### Priority 3: Inference Tuning (without re-training) | |
| Even before re-training, we can improve `moo.wav` performance by adjusting `inference_vad.py`: | |
| | Parameter | Current | Recommended | | |
| |-----------|---------|-------------| | |
| | `chunk_len` | 15s | 8–10s (closer to the training mean of 7.8s) | | |
| | chunk overlap | 0s | 0.5s (reduce boundary word drops) | | |
| | `alpha` (KenLM) | 0.2 | Try 0.1–0.15 (current may over-constrain decoder) | | |
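One way the overlap recommendation could be implemented is sketched here: each chunk after the first is extended backwards, and the transcripts are then joined while dropping words repeated across the boundary. Both function names are hypothetical, and the exact-word matching is a simplifying assumption; real outputs may differ in punctuation or casing at the seam.

```python
def add_overlap(chunks, overlap=0.5):
    """Extend every chunk except the first backwards by `overlap` seconds."""
    out = []
    for i, (start, end) in enumerate(chunks):
        out.append((start if i == 0 else max(0.0, start - overlap), end))
    return out


def stitch(left, right, max_words=5):
    """Join two transcripts, keeping words repeated across the boundary once.

    Looks for the longest run of up to `max_words` words that ends `left`
    and also starts `right` (case-insensitive).
    """
    lw, rw = left.split(), right.split()
    for n in range(min(max_words, len(lw), len(rw)), 0, -1):
        if [w.lower() for w in lw[-n:]] == [w.lower() for w in rw[:n]]:
            return " ".join(lw + rw[n:])
    return " ".join(lw + rw)


if __name__ == "__main__":
    print(add_overlap([(0.0, 8.0), (8.0, 16.0)], overlap=0.5))
    print(stitch("tämä on pitkä lause", "pitkä lause joka jatkuu"))
```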
| --- | |
| ## Round 2: Data Pipeline & Splits | |
| This section documents the data preparation methodology for Round 2 finetuning, including all new eval sets, the TTS integration, and the final manifest composition. | |
| ### Overview of Changes vs Round 1 | |
| | Item | Round 1 | Round 2 | | |
| |------|---------|---------| | |
| | Base model | `canary-1b-v2.nemo` | `canary-1b-v2.nemo` (fresh start) | | |
| | Training samples | 23,180 | **28,858** | | |
| | Training hours | ~50h | **75.6h** | | |
| | Mean duration | 7.8s | **9.4s** | | |
| | Max duration allowed | 20.0s | **30.0s** | | |
| | Transcripts normalised | No | **Yes (digits, dashes fixed)** | | |
| | Eval sets | 4 | **6** | | |
| ### Step 1: Transcript Normalisation (`normalize_manifests.py`) | |
| All training transcripts were cleaned in two layers: | |
| **Deterministic fixes (no API call needed):** | |
| - En-dash `–` (U+2013) and em-dash (U+2014) → ASCII hyphen `-` (fixes UNK token regression) | |
| - Corrupted Common Voice entries (raw TSV metadata in `text` field) → strip everything after first tab | |
| **Gemini 2.5 Flash API calls (2,586 calls across train + TTS; 1,137 number changes in the training transcripts):** | |
| - Pre-filtered with a Finnish number-word regex so only entries that actually contain written numbers are sent to the API (cost: ~$0.62) | |
| - Written Finnish numbers converted to digit form: `sata vuotta` → `100 vuotta`, `seitsemäntoista` → `17` | |
| - Explicit DO NOT CONVERT rules: ordinals (`ensimmäinen`, `toinen`), superlative constructions (`yksi tärkeimmistä`), and `toinen` as "another/other" | |
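A pre-filter of this kind could look like the sketch below. The regex is an illustrative assumption, not the one used in `normalize_manifests.py`: it covers nominative cardinals and their common compounds only, ignores inflected forms such as `sadan`, and will still flag ambiguous words such as `kuusi` (which also means "spruce"), leaving the final decision to the API prompt.

```python
import re

UNITS = "yksi|kaksi|kolme|neljä|viisi|kuusi|seitsemän|kahdeksan|yhdeksän"

# One building block of a number word: a unit with an optional suffix
# (e.g. seitsemän+toista, yhdeksän+sataa) or a standalone magnitude.
PIECE = (
    rf"(?:(?:{UNITS})(?:toista|kymmentä|sataa|tuhatta)?"
    r"|nolla|kymmenen|sata|tuhat|miljoonaa?|miljardia?)"
)

NUMBER_WORD_RE = re.compile(rf"\b(?:{PIECE})+\b", re.IGNORECASE)


def has_number_word(text):
    """Return True if `text` contains a candidate written-out Finnish number."""
    return NUMBER_WORD_RE.search(text) is not None


if __name__ == "__main__":
    for sample in ["yli sata vuotta", "surmaten seitsemäntoista henkeä", "hyvää päivää"]:
        print(sample, "->", has_number_word(sample))
```

Only transcripts for which `has_number_word` returns `True` would then be sent for conversion.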
| ### Step 2: TTS Long-Form Data Integration | |
| Downloaded `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB, 6,365 entries, mean 16.4s). | |
| Aligned to NeMo training format: | |
| - Path rewritten to relative style: `data/tts_long_data/audio/{filename}` | |
| - Fields mapped: `language` → `source_lang`/`target_lang`, `task: "transcription"` → `taskname: "asr"`, added `pnc: "yes"` | |
| - Same Gemini normalisation pass applied (888 entries converted) | |
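The field mapping can be sketched as a small conversion function. The input keys `audio_filepath`, `duration`, and `text` are assumptions about the TTS manifest layout; only `language` and `task` are named in the list above, and `to_nemo_entry` is a hypothetical helper.

```python
import os


def to_nemo_entry(entry, audio_dir="data/tts_long_data/audio"):
    """Map one TTS manifest entry onto the Canary/NeMo manifest fields."""
    lang = entry["language"]
    filename = os.path.basename(entry["audio_filepath"])
    return {
        "audio_filepath": f"{audio_dir}/{filename}",  # relative-style path
        "duration": entry["duration"],
        "text": entry["text"],
        "source_lang": lang,
        "target_lang": lang,
        "taskname": "asr",  # replaces task: "transcription"
        "pnc": "yes",
    }


if __name__ == "__main__":
    src = {
        "audio_filepath": "/tmp/export/clip_0001.flac",
        "duration": 16.4,
        "text": "hei",
        "language": "fi",
        "task": "transcription",
    }
    print(to_nemo_entry(src))
```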
| ### Step 3: Eval Set Construction (TTS Data) | |
| The 6,365 normalised TTS entries were split into train / eval / long-form-test: | |
| ``` | |
| All TTS entries (6,365) | |
| │ | |
| ├── Long-form pool (>20s): 1,501 entries | |
| │   ├── eval_long_form (sampled): 200 entries ← random.seed(42) shuffle → first 200 | |
| │   └── Returned to training pool: 1,301 entries | |
| │ | |
| └── Medium pool (10–20s): 4,864 entries | |
|     ├── eval_tts (10% hold-out): 487 entries ← stratified by duration bucket | |
|     └── tts_train: 4,377 entries | |
| ``` | |
| **Why eval_long_form = 200 entries?** | |
| The original 1,501 long-form entries (>20s) had a total duration of ~9.4 hours, far too long to run as a validation set every epoch. At batch_size=32 on a single GPU, each validation pass over 1,501 entries takes ~25 minutes, adding 2.5h per epoch. 200 entries (≈75 minutes of audio) provides a representative sample of the long-form distribution at reasonable cost: ~4 minutes of eval time per epoch. | |
| **eval_tts construction:** | |
| 487 entries were held out from the 10–20s duration range (10% stratified sample). This tests the model's ability to handle medium-length audio and is separate from the original 4 eval sets. | |
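The split described in this step can be sketched as follows. This is not the project's script: it uses a local `random.Random(seed)` instead of the global `random.seed(42)`, so it will not reproduce the exact same sample, and the 2-second bucket width used for stratification is an assumption.

```python
import random


def split_tts(entries, long_threshold=20.0, n_long_eval=200, holdout=0.10, seed=42):
    """Split TTS entries into (train, eval_tts, eval_long_form)."""
    rng = random.Random(seed)
    long_pool = [e for e in entries if e["duration"] > long_threshold]
    medium_pool = [e for e in entries if e["duration"] <= long_threshold]

    # Long-form pool: shuffle, take the first n for eval, return the rest to training.
    rng.shuffle(long_pool)
    eval_long_form, train = long_pool[:n_long_eval], long_pool[n_long_eval:]

    # Medium pool: hold out a fixed fraction from every duration bucket.
    buckets = {}
    for e in medium_pool:
        buckets.setdefault(int(e["duration"] // 2), []).append(e)
    eval_tts = []
    for key in sorted(buckets):
        group = buckets[key]
        rng.shuffle(group)
        k = round(len(group) * holdout)
        eval_tts.extend(group[:k])
        train.extend(group[k:])
    return train, eval_tts, eval_long_form


if __name__ == "__main__":
    demo = [{"duration": 22.0} for _ in range(30)]
    demo += [{"duration": 10.0 + (i % 10)} for i in range(70)]
    train, eval_tts, eval_long = split_tts(demo, n_long_eval=10)
    print(len(train), len(eval_tts), len(eval_long))
```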
| ### Step 4: Combined Training Manifest | |
| Final `train_manifest_combined.jsonl` composition: | |
| | Source | Entries | Notes | | |
| |--------|---------|-------| | |
| | Original train (normalised) | 23,180 | Digits + dash fix applied | | |
| | TTS train (10–20s) | 4,377 | Synthesised long-form speech | | |
| | Long-form overflow | 1,301 | >20s entries not selected for eval_long_form | | |
| | **Total** | **28,858** | Mean 9.4s, 75.6h | | |
| ### Final Eval Sets (Round 2) | |
| | Set | File | Entries | Mean Duration | Purpose | | |
| |-----|------|---------|--------------|---------| | |
| | `eval_fleurs` | `eval_fleurs.json` | 918 | 13.0s | Primary benchmark (monitored for checkpointing) | | |
| | `eval_common_voice` | `eval_common_voice.json` | 1,554 | 5.1s | Crowdsourced quality | | |
| | `eval_css10` | `eval_css10.json` | 170 | 7.5s | Clean single-speaker | | |
| | `eval_voxpopuli` | `eval_voxpopuli.json` | 430 | 10.6s | Formal/parliament speech | | |
| | `eval_tts` | `eval_tts.jsonl` | 487 | 14.5s | Medium-length TTS (new) | | |
| | `eval_long_form` | `eval_long_form.jsonl` | **200** | 22.5s | Long-form >20s sample (new) | | |
| **Checkpoint monitoring:** `val_wer` tracks FLEURS (first validation set). All 6 WERs are logged independently to WandB. | |
| ### Round 2 Training Config | |
| File: `configs/canary_finetune_finnish_v2.yaml` | |
| Key settings: | |
| - `init_from_nemo_model`: `/workspace/Finnish-ASR-Canary-v2/models/canary-1b-v2.nemo` (fresh start from base) | |
| - `max_duration`: 30.0s (up from 20.0s to include TTS segments up to 25s) | |
| - `max_steps`: 18,000 (scaled: 28,858 / 32 ≈ 902 steps/epoch × 20 epochs ≈ 18,040) | |
| - `lr`: 1e-5, `WarmupAnnealing`, 500 warmup steps | |
| - `precision`: bf16, single GPU, `strategy: auto` | |
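The `max_steps` figure follows from simple arithmetic, sketched below. It assumes a fixed batch size of 32 on one GPU with no gradient accumulation, and that no entries are dropped by the `max_duration` filter; both are assumptions rather than facts stated in the config.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Optimizer steps per epoch on one GPU without gradient accumulation."""
    return math.ceil(num_samples / batch_size)


per_epoch = steps_per_epoch(28_858, 32)
total_steps = per_epoch * 20  # 20 epochs, as in the scaling note

print(per_epoch, total_steps)
```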
| --- | |
| ## Workflow Status Details | |
| ### 1. Data Preparation - DONE | |
| - [x] Identify and inventory all 4 datasets | |
| - [x] Create unified processing script (`scripts/prepare_all_manifests.py`) | |
| - [x] Run `scripts/prepare_all_manifests.py` on devcontainer | |
| - [x] Verify manifest sample counts and audio file integrity | |
| ### 2. Configuration Setup - DONE | |
| - [x] Create Hydra training config (`configs/canary_finetune_finnish.yaml`) | |
| - [x] Configure multi-validation with 4 eval datasets | |
| - [x] Checkpoint monitors primary eval set (FLEURS) via `val_wer` | |
| - [x] All 4 eval WERs logged independently to WandB | |
| ### 3. Training - DONE | |
| - [x] Run finetuning via `run_training.sh` | |
| - [x] Monitor per-dataset WER in WandB | |
| ### 4. KenLM / NGPU-LM Language Model Integration - DONE | |
| - [x] Install KenLM tools (`install_beamsearch_decoders.sh`) | |
| - [x] Gather Finnish text (ASR transcripts + Wikipedia + mc4) | |
| - [x] Train 3 variants of KenLM (1M, 2M, 5M sentences) | |
| - [x] Evaluate with LM fusion on all 4 test sets | |
| ### 5. Repository & Long-Form Inference - IN PROGRESS | |
| - [x] Consolidate README and model metadata for Hugging Face release | |
| - [x] Upload model checkpoints and KenLM bundles to HF Hub | |
| - [x] Implement Silero VAD-based chunking for long-form audio (`inference_vad.py`) | |
| - [x] Root-cause analysis of long-form degradation vs. Whisper (see above) | |
| - [ ] Reduce `chunk_len` to 8–10s and add chunk overlap (Current Focus) | |
| - [ ] Optimize `alpha` for stability on `moo.wav` (30 min test file) | |
| ### 6. Data Quality & Advanced Evaluation - PARTIALLY DONE | |
| - [x] Fix 28 corrupted Common Voice manifest entries (raw TSV data in text field); done in normalisation pass. | |
| - [x] Fix en-dash/em-dash UNK token regression; done in normalisation pass. | |
| - [ ] Audit VoxPopuli transcripts for all-lowercase entries (capitalisation missing). | |
| - [ ] Re-train KenLM with high-quality punctuated text. | |
| - [ ] Evaluate CER on non-normalized test sets. | |
| ### 7. Number Normalisation & UNK Token Fix - DONE | |
| - [x] Replace en-dash `–` (U+2013) and em-dash (U+2014) with ASCII hyphen `-` in all training manifests (85 train + 70 TTS entries fixed). | |
| - [x] Use Gemini 2.5 Flash to normalise written-out Finnish numbers to digit form (2,586 API calls across train + TTS). | |
| - [ ] Re-evaluate on the 5-sample number test set after Round 2 training to verify consistency. | |
| ### 8. Long-Form Data Expansion - DONE | |
| - [x] Download `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB zip, 6,365 entries, mean 16.4s). | |
| - [x] Align TTS manifest to NeMo training format and integrate into combined training manifest. | |
| - [x] Round 2 training configured and ready to launch (see Round 2 section above). | |
| - [ ] Benchmark Round 2 model against Round 1 and finetuned Whisper on `moo.wav`. | |
| --- | |
| ## NeMo Environment Setup | |
| This section documents the exact steps to set up a working NeMo inference/training environment, including the fixes required for the `nvcr.io/nvidia/pytorch:25.01-py3` container. | |
| ### Installation (from scratch on pytorch:25.01-py3 base image) | |
| ```bash | |
| # 1. Clone the HF model repo (contains NeMo source with patches applied) | |
| # Skip LFS to avoid downloading the 3.6 GB model during clone | |
| GIT_LFS_SKIP_SMUDGE=1 git clone \ | |
| "https://user:${HF_TOKEN}@huggingface.co/RASMUS/Finnish-ASR-Canary-v2" \ | |
| /workspace/Finnish-ASR-Canary-v2 | |
| # 2. Install NeMo in editable mode from the patched source | |
| cd /workspace/Finnish-ASR-Canary-v2/NeMo | |
| pip install -e ".[asr]" | |
| # 3. Install pinned dependencies | |
| pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' kaldialign wandb | |
| ``` | |
| ### Required Compatibility Fixes | |
| The pytorch:25.01-py3 container ships with packages that conflict with NeMo 2.8.0rc0: | |
| ```bash | |
| # Fix 1: Downgrade lightning to the version NeMo requires (<=2.4.0) | |
| # The container ships lightning 2.4.0 but pip may upgrade it; pin it back. | |
| pip install "lightning==2.4.0" "pytorch-lightning==2.4.0" | |
| # Fix 2: Remove incompatible torchvision | |
| # The container's torchvision (0.20.0a0) was built against torch 2.6.0a0 (the original | |
| # container torch), but NeMo's install upgrades torch to ~2.10. torchvision then fails | |
| # on import and blocks NeMo. ASR does not need torchvision. | |
| pip uninstall -y torchvision | |
| ``` | |
| ### Downloading the Finetuned Model | |
| ```bash | |
| # Download the finetuned acoustic model (3.6 GB) | |
| curl -L \ | |
| -H "Authorization: Bearer ${HF_TOKEN}" \ | |
| "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/canary-finnish.nemo" \ | |
| -o /workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo | |
| # KenLM models are also LFS β download the 5M variant (best WER): | |
| curl -L \ | |
| -H "Authorization: Bearer ${HF_TOKEN}" \ | |
| "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/kenlm_5M.nemo" \ | |
| -o /workspace/Finnish-ASR-Canary-v2/kenlm_5M.nemo | |
| ``` | |
| ### Quick Inference Smoke Test | |
| ```python | |
| import warnings; warnings.filterwarnings('ignore') | |
| from nemo.collections.asr.models import EncDecMultiTaskModel | |
| model = EncDecMultiTaskModel.restore_from( | |
| '/workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo', | |
| map_location='cuda' | |
| ) | |
| model.eval() | |
| results = model.transcribe( | |
| audio=['path/to/audio.wav'], | |
| task='asr', source_lang='fi', target_lang='fi', pnc='yes' | |
| ) | |
| print(results[0].text) | |
| ``` | |
| ### Loading the Base Model (for comparison) | |
| ```python | |
| # Downloads ~3.6 GB on first run, cached in ~/.cache/huggingface/ | |
| from nemo.collections.asr.models import EncDecMultiTaskModel | |
| model_base = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2", map_location='cuda') | |
| ``` | |
| --- | |
| ## Progress Log | |
| - **2026-01-11:** Initial project setup. | |
| - **2026-02-08:** Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice). | |
| - **2026-02-10:** **Finetuning complete.** Epoch 11 reached `val_wer=0.1258` on FLEURS. | |
| - **2026-02-13:** Mermaid diagrams and project documentation for DS team. | |
| - **2026-02-18:** **KenLM benchmarks finished.** Consolidated repository structure. Applied NeMo patches for inference stability. | |
| - **2026-02-20:** **Model Released.** Release of `Finnish-ASR-Canary-v2` on HF. Implemented VAD-based inference pipeline. Currently tuning for long-form stability on `moo.wav` with various `alpha` settings (0.0 - 0.4 tested). | |
| - **2026-02-26:** **Root-cause analysis complete.** Investigated long-form gap vs. Whisper and number handling. Key findings: (1) 77% of training data is <10s, creating distribution shift at inference chunk lengths; (2) No cross-chunk context in Canary's AED architecture; (3) Only 2.5% of training samples contain digit characters, so numbers are a known weak point; (4) 28 corrupted Common Voice entries found (TSV metadata in text field); (5) `moo.wav` test file confirmed as 30 minutes. Action plan: shorten chunk_len, add chunk overlap, fix data corruption, and plan a long-form training data expansion round. | |
| - **2026-02-26:** **Live number inference + tokenizer audit completed.** Ran base Canary-v2 vs. finetuned model on 5 FLEURS samples. Confirmed: (1) base model always outputs digits (`100`, `17`); (2) finetuned model regressed to mixed output (sometimes written words, sometimes digits) due to inconsistent training transcripts; (3) en-dash (`–`) produces UNK token `⁇` in finetuned model, base model degrades gracefully to ASCII hyphen. Policy decision: **standardise on digit output** and fix en-dash encoding in training manifests before next training run. NeMo environment setup documented (with fixes for `torchvision` and `lightning` version conflicts). TTS long-form dataset (`canary_asr_finetune_tts_long_data`, 8GB, mean 16.5s/segment) identified as key data source for next training run. Action plan for next run: (1) normalise numbers to digits via Gemini Flash API, (2) fix en-dash → ASCII hyphen, (3) fix 28 corrupted CV entries, (4) add TTS long-form data. | |
| - **2026-03-01:** **Round 2 data pipeline complete.** Ran `normalize_manifests.py`: 2,586 Gemini 2.5 Flash API calls (~$0.62), 1,137 number changes in train + 888 in TTS, 85 en-dash and 28 corrupted CV entries fixed. Downloaded and extracted TTS long-form dataset (6,365 entries, 4.8 GB). Split TTS data into train (4,377), eval_tts (487, mean 14.5s), and long-form pool (1,501 entries >20s). Sampled 200 entries into `eval_long_form.jsonl` (seed 42) and returned 1,301 to training, yielding `train_manifest_combined.jsonl` (28,858 entries, 75.6h). Round 2 training config created (`configs/canary_finetune_finnish_v2.yaml`). **Training ready to launch.** | |
| - **2026-03-01:** **Training crash diagnosed and fixed.** Round 2 training ran 505 steps then crashed with CUDA `vectorized_gather_kernel index out of bounds`. Root cause: entry 14857 in `train_manifest_combined.jsonl` contained 11,247 chars of Python code (Gemini normalization returned a code block instead of a transcript for `voxpopuli_005371.wav`). When tokenized with the canary2 prompt format, the sequence far exceeded the decoder's `max_sequence_length=1024`, causing position-embedding OOB. Additionally, 4 entries in `eval_common_voice.json` had TSV metadata contamination (same v1 issue, not previously caught in the v2 eval set). Both manifests fixed. Config rewritten from full-architecture spec to minimal v1-style format (`tokenizer: update_tokenizer: false`) using `speech_to_text_finetune.py` (which restores the full model from the `.nemo` file). Training re-launched. Manifests synced to `canary-finnish-asr-data` HuggingFace dataset repo. | |
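A crash like the one in the last entry could be caught before training with a manifest sanity check. The sketch below flags transcripts that contain a tab or are implausibly long for their audio duration; both thresholds (`max_chars_per_sec`, `max_chars`) are illustrative assumptions, not tuned values, and `suspicious_entries` is a hypothetical helper.

```python
import json


def suspicious_entries(manifest_lines, max_chars_per_sec=40.0, max_chars=1000):
    """Yield (line_number, reason) for transcripts that look implausible."""
    for lineno, line in enumerate(manifest_lines, start=1):
        if not line.strip():
            continue
        entry = json.loads(line)
        text, duration = entry["text"], entry["duration"]
        if "\t" in text:
            yield lineno, "tab character (possible TSV metadata)"
        elif len(text) > max_chars or len(text) > max_chars_per_sec * duration:
            yield lineno, "transcript too long for its duration"


if __name__ == "__main__":
    sample = [
        json.dumps({"text": "hei", "duration": 1.0}),
        json.dumps({"text": "a\tb", "duration": 1.0}),
        json.dumps({"text": "x" * 11247, "duration": 12.0}),
    ]
    for lineno, reason in suspicious_entries(sample):
        print(lineno, reason)
```

Running such a check on both the training and the eval manifests after every automated rewrite pass would surface over-long or contaminated entries before they reach the tokenizer.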