---
language:
- fi
license: mit
tags:
- automatic-speech-recognition
- asr
- speech-recognition
- canary-v2
- kenlm
- finnish
datasets:
- mozilla-foundation/common_voice_17_0
- google/fleurs
- facebook/voxpopuli
base_model: nvidia/canary-1b-v2
pipeline_tag: automatic-speech-recognition
library_name: nemo
model-index:
- name: Finnish ASR Canary-v2 Round 2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice v24.0
      type: mozilla-foundation/common_voice_17_0
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 4.58
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS Finnish
      type: google/fleurs
      config: fi_fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.75
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CSS10 Finnish
      type: asr-benchmark
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 7.03
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: VoxPopuli Finnish
      type: facebook/voxpopuli
      config: fi
      split: test
    metrics:
    - name: Word Error Rate (WER)
      type: wer
      value: 11.65
---
# 🇫🇮 Finnish ASR Canary-v2: State-of-the-Art Finnish Speech Recognition

A high-performance fine-tuned version of NVIDIA's **Canary-v2** (1B-parameter) model, optimized for Finnish. This project provides a robust Finnish ASR solution through two rounds of fine-tuning, combined with a 6-gram KenLM language model for shallow fusion.

> **Round 2 (March 2026):** improved training corpus (28,857 samples), TTS-augmented long-form data, and transcript normalization. Best overall results on Common Voice and CSS10. See [Round 2 Analysis](#round-2-analysis) below.

---
|
## 🚀 Performance Benchmarks (WER %)

All numbers use jiwer normalization (lowercase, punctuation stripped). Lower is better.
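For reference, the scoring convention can be reproduced in a few lines of Python. This is a minimal re-implementation of the normalization plus word-level edit distance (the actual benchmarks used the `jiwer` package; `normalize` and `wer` here are illustrative helpers):

```python
import string

def normalize(text: str) -> str:
    # Benchmark convention: lowercase and strip punctuation before scoring
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

def wer(reference: str, hypothesis: str) -> float:
    # Word-level Levenshtein distance divided by reference word count
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j - 1] + 1, d[j] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

print(wer("Hyvää huomenta, Suomi!", "hyvää huomenta suomi"))  # 0.0
```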
|
### Best Configuration Per Dataset

| Dataset | R1 + KenLM 5M | R2 Greedy | R2 + KenLM 5M | **Best** |
| :--- | :---: | :---: | :---: | :---: |
| **Common Voice** | 5.98% | 5.41% | **4.58%** | R2 + KenLM |
| **FLEURS** | **6.48%** | 8.39% | 7.75% | R1 + KenLM |
| **CSS10 (Audiobook)** | 11.85% | **7.03%** | 12.39% | R2 Greedy |
| **VoxPopuli (Parliament)** | **5.73%** | 13.91% | 13.23% | R1 + KenLM |
| **Global Average** | 7.51% | 8.69% | 9.49% | R1 + KenLM |
|
> [!NOTE]
> VoxPopuli is the one domain where R1 still leads. The R2 regression is caused by transcript normalization during training (number words → digits) while the eval manifest retains word-form numbers. This will be corrected in Round 3.
|
### Full Benchmark Table

| Model | CommonVoice | FLEURS | CSS10 | VoxPopuli | Avg |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Base Canary-v2 | 17.95% | 7.79% | 17.07% | 7.96% | 12.69% |
| R1 Greedy | 12.82% | 8.33% | 12.19% | 4.46% | 9.45% |
| R1 + KenLM 5M | 5.98% | 6.48% | 11.85% | 5.73% | **7.51%** |
| R2 Greedy | 5.41% | 8.39% | **7.03%** | 13.91% | 8.69% |
| R2 + KenLM 5M | **4.58%** | **7.75%** | 12.39% | 13.23% | 9.49% |
|
### KenLM Impact Within R2

| Dataset | R2 Greedy | R2 + KenLM | Δ | Verdict |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 5.41% | **4.58%** | −15.3% | KenLM helps |
| FLEURS | 8.39% | **7.75%** | −7.6% | KenLM helps |
| CSS10 | **7.03%** | 12.39% | +76% | KenLM hurts; use greedy |
| VoxPopuli | 13.91% | **13.23%** | −4.9% | Marginal |

> [!IMPORTANT]
> **KenLM and CSS10**: when the acoustic model is already highly accurate (≈7% WER), the n-gram LM can override high-confidence acoustic decisions with mismatched web-text Finnish. Always benchmark KenLM on your target domain before deploying.
|
---
|
## 📖 Round 2 Analysis

### What Changed in Round 2

| Change | Detail |
| :--- | :--- |
| Training corpus | 28,857 samples (+24% vs. R1's 23,180) |
| TTS long-form data | 4,377 synthesized samples (mean 14.5 s, max 25 s) added to shift the duration distribution |
| `max_duration` | 20 s → 30 s to include TTS segments |
| Transcript normalization | Number words → digits, en-dash → ASCII |
| Init checkpoint | Base `canary-1b-v2.nemo` (fresh start, no R1 regressions inherited) |
| New eval sets | `eval_tts` (487 entries) and `eval_long_form` (200 entries, all >20 s) |
|
### R2 Results vs. R1

| Dataset | R1 Greedy | R2 Greedy | Δ | Why |
| :--- | :---: | :---: | :---: | :--- |
| Common Voice | 12.82% | **5.41%** | −57.8% | TSV contamination fixed + normalization |
| CSS10 | 12.19% | **7.03%** | −42.3% | TTS data improved read-speech alignment |
| FLEURS | 8.33% | 8.39% | ≈ flat | Clean read speech; unchanged by TTS additions |
| VoxPopuli | **4.46%** | 13.91% | +212% | Normalization mismatch + TTS distribution shift |
|
### Key Lesson: Normalization Consistency

R2 normalized training transcripts (e.g. "kaksituhattaneljätoista" → "2014"), but the `eval_voxpopuli.json` evaluation manifest was not updated to match. This inflates R2's VoxPopuli WER. A forthcoming Round 3 will normalize all eval manifests consistently.
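The fix is mechanical: run one shared normalization function over every manifest, train and eval alike. A minimal sketch (the single-entry number-word mapping and the manifest line below are illustrative, not the full R2 normalizer):

```python
import json

# Illustrative subset of the mapping; a real normalizer covers
# Finnish number words generally.
NUMBER_WORDS = {"kaksituhattaneljätoista": "2014"}

def normalize_text(text: str) -> str:
    # Apply the SAME rules used on training transcripts:
    # number words -> digits, en-dash -> ASCII hyphen
    text = text.replace("–", "-")
    return " ".join(NUMBER_WORDS.get(w, w) for w in text.split())

# A NeMo manifest is one JSON object per line with a "text" field
line = '{"audio_filepath": "clip_001.wav", "text": "vuonna kaksituhattaneljätoista"}'
entry = json.loads(line)
entry["text"] = normalize_text(entry["text"])
print(json.dumps(entry, ensure_ascii=False))
```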
|
---
|
## 🏃 Running Inference

This model requires **NVIDIA NeMo** (commit `557177a18d`, included in this repo with two patches applied).

### Short Audio (< 30 s)

```python
from nemo.collections.asr.models import EncDecMultiTaskModel

# Load the R2 model (recommended for most use cases)
model = EncDecMultiTaskModel.restore_from("models/canary-finnish-v2.nemo")
model.eval().cuda()

# Greedy decoding: best for audiobooks and read speech
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes",
)
print(result[0].text)
```
|
### Short Audio with KenLM (recommended for conversational / CV-style audio)

```python
from omegaconf import OmegaConf

# Reuse the model loaded above; switch to beam search with the n-gram LM
model.change_decoding_strategy(
    decoding_cfg=OmegaConf.create({
        'strategy': 'beam',
        'beam': {
            'beam_size': 5,
            'ngram_lm_model': "models/kenlm_5M.nemo",
            'ngram_lm_alpha': 0.2,
        },
        'batch_size': 1,
    })
)
result = model.transcribe(
    audio=["sample.wav"],
    taskname="asr",
    source_lang="fi",
    target_lang="fi",
    pnc="yes",
)
```
|
### Long-Form Audio (podcasts, interviews, lectures)

We provide two scripts for long-form audio. The **pyannote-based pipeline** is the recommended general-purpose approach: it handles speaker changes and provides the most stable transcription context for Canary.

#### 1. Diarized Pipeline (Recommended): `inference_pyannote.py`

This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25 s chunks for Canary. It gives the best results for podcasts and multi-speaker audio.

```bash
# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --kenlm models/kenlm_5M.nemo \
    --output transcript.json
```
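The merge step described above (turning diarized segments into Canary-sized chunks) can be sketched roughly as follows. The segment tuples and the 25 s budget are illustrative, not the script's exact logic:

```python
def merge_segments(segments, max_len=25.0):
    """Greedily merge consecutive same-speaker segments into chunks <= max_len seconds.

    segments: list of (start, end, speaker) tuples sorted by start time.
    """
    chunks = []
    for start, end, speaker in segments:
        # Extend the current chunk only for the same speaker and
        # while the total span stays within the duration budget
        if chunks and chunks[-1][2] == speaker and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end, speaker)
        else:
            chunks.append((start, end, speaker))
    return chunks

segs = [(0.0, 8.0, "A"), (8.5, 17.0, "A"), (17.5, 30.0, "A"), (30.5, 34.0, "B")]
print(merge_segments(segs))  # [(0.0, 17.0, 'A'), (17.5, 30.0, 'A'), (30.5, 34.0, 'B')]
```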

#### 2. VAD-only Pipeline: `inference_vad.py`

A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.

```bash
python inference_vad.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --output transcript.txt
```

#### Example Output

See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription produced with the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.

---

## ⚙️ Parameter Recommendations

### By Content Type

| Content Type | `--min_silence_ms` | `--beam_size` | KenLM | Notes |
| :--- | :---: | :---: | :---: | :--- |
| **Podcast / interview** | 150 | 5 | Yes | Conversational Finnish; KenLM helps most |
| **Lecture / presentation** | 500–1000 | 5 | Yes | Longer pauses → sentence-level VAD splits |
| **Audiobook / read speech** | 150 | — | **No** | R2 greedy is already at 7% WER; KenLM hurts |
| **Parliament / formal speech** | 150 | 4 | No | Use the R1 model; R2 regressed on this domain |
| **Unknown / mixed** | 150 (default) | 5 | Yes | Safe default |

### KenLM Alpha Tuning

`--alpha` controls how strongly the LM influences decoding (0 = greedy; higher = more LM):

| α | Effect |
| :--- | :--- |
| 0.1 | Conservative: mostly acoustic |
| **0.2** | **Recommended default** |
| 0.3 | More LM correction; good for noisy audio |
| 0.5+ | Risky: the LM can override correct acoustic output |
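Under the hood, α is the standard shallow-fusion weight: at each decoding step the beam score is the acoustic log-probability plus α times the LM log-probability. A toy illustration (the probabilities are made up for the example):

```python
import math

def fused_score(acoustic_logp: float, lm_logp: float, alpha: float = 0.2) -> float:
    # Shallow fusion: add the alpha-weighted LM log-probability
    # to the acoustic log-probability at each decoding step
    return acoustic_logp + alpha * lm_logp

# Toy case: the acoustic model slightly prefers a misspelling,
# while the LM strongly prefers the real word "vuonna"
acoustic = {"vuonna": math.log(0.4), "vuona": math.log(0.6)}
lm = {"vuonna": math.log(0.9), "vuona": math.log(0.01)}

best_greedy = max(acoustic, key=lambda w: fused_score(acoustic[w], lm[w], alpha=0.0))
best_fused = max(acoustic, key=lambda w: fused_score(acoustic[w], lm[w], alpha=0.2))
print(best_greedy, best_fused)  # vuona vuonna
```

With α = 0 the misspelling wins; at α = 0.2 the LM tips the decision, which is exactly how it repairs conversational Common Voice output, and also why a too-large α can overrule correct acoustics on CSS10.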
### Full CLI Reference

```
inference_vad.py
  --audio           Path to input audio file (WAV, 16 kHz mono)
  --model           Path to .nemo acoustic model
  --kenlm           Path to .nemo KenLM bundle (omit for greedy decoding)
  --output          Output path (.txt); a .json is written alongside automatically
  --chunk_len       Max chunk duration in seconds (default: 15)
  --beam_size       Beam width for KenLM decoding (default: 5)
  --alpha           KenLM language-model weight (default: 0.2)
  --min_silence_ms  Min silence to split VAD segments (default: 150)
  --min_speech_ms   Min speech duration to keep a segment (default: 250)
  --speech_pad_ms   Padding added around each speech segment (default: 400)
```

---

## 🏗️ Methodology & Architecture

### Acoustic Model

Built on NVIDIA's **Canary-v2** (Fast-Conformer AED, 1B parameters). Both rounds use `speech_to_text_finetune.py`, which restores the full model architecture from the base `.nemo` checkpoint; only the dataloader, optimizer, and tokenizer (kept frozen via `update_tokenizer: false`) need to be specified.
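For orientation, a launch looks roughly like the following. This is a hypothetical sketch, not the exact R2 command: the override names follow standard NeMo Hydra conventions, and the config path and manifest path are illustrative.

```shell
# Hypothetical sketch of a fine-tune launch (paths and overrides illustrative)
python NeMo/examples/asr/speech_to_text_finetune.py \
    --config-path=conf --config-name=speech_to_text_finetune \
    +init_from_nemo_model=models/canary-1b-v2.nemo \
    model.tokenizer.update_tokenizer=false \
    model.train_ds.manifest_filepath=data/train_manifest.json \
    model.train_ds.max_duration=30
```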
### KenLM Language Model

A **6-gram KenLM** trained on 5 million lines of high-quality Finnish text:

| Source | Lines |
| :--- | :---: |
| Reddit (Finnish communities) | 1.5M |
| FinePDF (Finnish documents) | 1.5M |
| Wiki-Edu (Wikipedia + educational) | 1.0M |
| ASR transcripts | ~23k |

Zero eval leakage: 1,833 sentences overlapping with the evaluation sets were removed before training. The model is token-aligned with the Canary BPE tokenizer and runs on GPU via NVIDIA's **NGPU-LM** engine (binary `.nemo` bundle, loads in under 10 s).
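The leakage filter amounts to a normalized set-difference between the LM corpus and all eval transcripts. A minimal sketch (the normalization here mirrors the scoring convention; the real filter may differ in detail):

```python
import string

def norm(s: str) -> str:
    # Compare after lowercasing and stripping punctuation so trivial
    # variants of an eval sentence are also caught
    return s.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def remove_eval_overlap(corpus_lines, eval_sentences):
    # Drop any LM training line that matches an eval transcript
    banned = {norm(s) for s in eval_sentences}
    return [line for line in corpus_lines if norm(line) not in banned]

corpus = ["Tämä on testi.", "Uutinen päivältä", "tämä on testi"]
evals = ["Tämä on testi."]
print(remove_eval_overlap(corpus, evals))  # ['Uutinen päivältä']
```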
### Training Infrastructure

- **Hardware**: RTX 6000 PRO Blackwell (96 GB VRAM), [Verda.com](https://verda.com), Finland
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`
- **NeMo**: commit `557177a18d` (r2.6.0 / v2.8.0rc0), editable install

---

## 📂 Repository Structure

```
.
├── NeMo/                        # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo   # Round 2 fine-tuned model (1B)
│   ├── canary-finnish.nemo      # Round 1 fine-tuned model (1B)
│   ├── canary-1b-v2.nemo        # Base Canary-v2 model
│   ├── kenlm_1M.nemo            # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo            # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo            # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py        # Speaker-diarized inference (best for long audio)
├── inference_vad.py             # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json        # 30-min podcast example (diarized + KenLM)
├── moo_merged_greedy.json       # 30-min podcast example (diarized, greedy)
├── PLAN_AND_PROGRESS.md         # Detailed training & analysis log
└── README.md
```

---

## 🛠️ Setup

### Prerequisites

- NVIDIA GPU with ≥ 48 GB VRAM (tested on a 96 GB RTX 6000 PRO Blackwell)
- Docker with NVIDIA Container Toolkit
- **Container**: `nvcr.io/nvidia/pytorch:25.01-py3`

### Install

```bash
git clone https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2
cd Finnish-ASR-Canary-v2

# NeMo with required patches already applied
cd NeMo && pip install -e '.[asr]'
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
    kaldialign wandb soundfile editdistance
```

### Additional setup for long-form diarized inference (`inference_pyannote.py`)

`inference_pyannote.py` requires pyannote and transformers components on top of base NeMo:

```bash
pip install pyannote.audio transformers accelerate sentencepiece

# Required by the torchaudio 2.10+ audio I/O path in this container
pip install torchcodec
```

Set your Hugging Face token before running diarization (used to download `pyannote/speaker-diarization-community-1`):

```bash
export HF_TOKEN=your_hf_token
```

Or place it in `.env` as:

```bash
HF_TOKEN=your_hf_token
```

### Critical NeMo Patches (already applied in the included NeMo)

1. **OneLogger Fix**: makes proprietary telemetry optional for public containers
2. **Canary2 EOS Assertion Fix**: relaxes a strict EOS check to allow inference with placeholder transcripts

---

## 🙏 Acknowledgments

- **Foundation**: Built on NVIDIA's [Canary-v2](https://huggingface.co/nvidia/canary-1b-v2) architecture
- **Training Infrastructure**: [Verda.com](https://verda.com) GPU cloud, Finland
- **Data Sources**:
  - [Mozilla Common Voice](https://commonvoice.mozilla.org/) v24.0
  - [Google FLEURS](https://huggingface.co/datasets/google/fleurs)
  - [CSS10 Finnish](https://github.com/Kyubyong/css10)
  - [VoxPopuli](https://github.com/facebookresearch/voxpopuli) (European Parliament)

### Citations

```bibtex
@inproceedings{park2019css10,
  title     = {CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages},
  author    = {Park, Kyubyong and Mulc, Thomas},
  booktitle = {Interspeech},
  year      = {2019}
}

@inproceedings{wang2021voxpopuli,
  title     = {VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning,
               Semi-Supervised Learning and Interpretation},
  author    = {Wang, Changhan and Rivi{\`e}re, Morgane and Lee, Ann and Wu, Anne and
               Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and
               Pino, Juan and Dupoux, Emmanuel},
  booktitle = {Proceedings of ACL},
  year      = {2021}
}
```
|