RASMUS committed · Commit 3700d96 · verified · 1 Parent(s): 67ac13f

Upload README.md with huggingface_hub

Files changed (1): README.md (+48 -28)
README.md CHANGED
@@ -195,34 +195,32 @@ result = model.transcribe(
 
 ### Long-Form Audio (podcasts, interviews, lectures)
 
-Use the included `inference_vad.py` script, which combines Silero VAD chunking with the Canary model:
 
 ```bash
-# Greedy — best for audiobooks, studio speech
-python inference_vad.py \
     --audio long_recording.wav \
     --model models/canary-finnish-v2.nemo \
-    --output transcript.txt
 
-# KenLM — best for conversational / podcast audio
 python inference_vad.py \
     --audio long_recording.wav \
     --model models/canary-finnish-v2.nemo \
-    --kenlm models/kenlm_5M.nemo \
     --output transcript.txt
 ```
 
-The script writes both a plain-text transcript (`.txt`) and a Whisper-compatible JSON (`.json`) with segment-level timestamps:
-
-```json
-{
-  "segments": [
-    { "start": 8.07, "end": 15.26, "text": "Hei armas kuulija ja tervetuloa linjoille." },
-    { "start": 15.94, "end": 25.69, "text": "Tämän podcastin tarkoitus on tarjota..." }
-  ],
-  "text": "<full transcript>"
-}
-```
 
 ---
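The Whisper-compatible `segments` JSON removed above is straightforward to post-process. As an illustration (not part of the repo), here is a minimal sketch that converts such a result to SRT captions; the `segments_to_srt` helper and the sample data are ours, with only the schema taken from the README:

```python
import json


def srt_time(seconds: float) -> str:
    # SRT timestamps use the HH:MM:SS,mmm format.
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(result: dict) -> str:
    # Build numbered SRT blocks from a Whisper-style "segments" list.
    blocks = []
    for i, seg in enumerate(result["segments"], start=1):
        blocks.append(
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Sample data mirroring the schema documented in the README.
result = {
    "segments": [
        {"start": 8.07, "end": 15.26,
         "text": "Hei armas kuulija ja tervetuloa linjoille."},
    ],
    "text": "...",
}
print(segments_to_srt(result))
```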
@@ -300,18 +298,17 @@ Zero eval leakage: 1,833 sentences overlapping with evaluation sets were removed
 .
 ├── NeMo/                          # NeMo toolkit (with patches applied)
 ├── models/
 │   ├── canary-finnish.nemo       # Round 1 finetuned model (1B)
-│   ├── canary-finnish-v2.nemo    # Round 2 finetuned model (1B)  ← new
-│   ├── kenlm_5M.nemo             # 6-gram KenLM, 5M corpus (recommended)
-│   ├── kenlm_2M.nemo             # 6-gram KenLM, 2M corpus
-│   └── kenlm_1M.nemo             # 6-gram KenLM, 1M corpus
-├── results/
-│   ├── r2_benchmark_results.json # R2 greedy + KenLM WER/CER per dataset  ← new
-│   ├── details_1M_CommonVoice.jsonl
-│   ├── details_1M_CSS10.jsonl
-│   ├── details_1M_FLEURS.jsonl
-│   └── details_1M_VoxPopuli.jsonl
-├── inference_vad.py              # Long-form VAD inference script  ← new
 └── README.md
 ```
@@ -337,6 +334,29 @@ pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
     kaldialign wandb soundfile editdistance
 ```
 
 ### Critical NeMo Patches (already applied in included NeMo)
 
 1. **OneLogger Fix** — makes proprietary telemetry optional for public containers
 
 
 ### Long-Form Audio (podcasts, interviews, lectures)
 
+We provide two scripts for long-form audio. The **Pyannote-based pipeline** is the recommended general-purpose approach: it handles speaker changes and provides the most stable transcription context for Canary.
+
+#### 1. Diarized Pipeline (Recommended) — `inference_pyannote.py`
+This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25 s chunks for Canary. This gives the best results for podcasts and multi-speaker audio.
 
 ```bash
+# Optimized for podcasts/interviews (includes diarization + KenLM)
+python inference_pyannote.py \
     --audio long_recording.wav \
     --model models/canary-finnish-v2.nemo \
+    --kenlm models/kenlm_5M.nemo \
+    --output transcript.json
+```
+
+#### 2. VAD-only Pipeline — `inference_vad.py`
+A simpler pipeline that uses Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.
 
+```bash
 python inference_vad.py \
     --audio long_recording.wav \
     --model models/canary-finnish-v2.nemo \
     --output transcript.txt
 ```
 
+#### Example Output
+See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription example produced by the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.
 
 ---
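The ~25 s chunk-merging step that the diarized pipeline's description mentions can be sketched roughly as follows. This is an illustrative assumption about the merge logic (greedy, speaker-aware), not the actual code in `inference_pyannote.py`; the `merge_segments` helper and sample segments are ours:

```python
def merge_segments(segments, max_len=25.0):
    # Greedily merge consecutive diarized segments into chunks of at most
    # max_len seconds, starting a new chunk whenever the speaker changes
    # or the chunk would grow past the limit.
    chunks = []
    cur = None
    for seg in segments:
        if (cur is not None
                and seg["speaker"] == cur["speaker"]
                and seg["end"] - cur["start"] <= max_len):
            cur["end"] = seg["end"]       # extend the current chunk
        else:
            cur = dict(seg)               # start a new chunk
            chunks.append(cur)
    return chunks


# Hypothetical diarization output: three turns by speaker A, one by B.
segs = [
    {"start": 0.0,  "end": 10.0, "speaker": "A"},
    {"start": 10.5, "end": 20.0, "speaker": "A"},
    {"start": 20.5, "end": 30.0, "speaker": "A"},
    {"start": 30.5, "end": 35.0, "speaker": "B"},
]
print(merge_segments(segs))
```

The third A-segment starts a new chunk because merging it would exceed 25 s, and the B-segment starts one because of the speaker change.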
 
 .
 ├── NeMo/                          # NeMo toolkit (with patches applied)
 ├── models/
+│   ├── canary-finnish-v2.nemo    # Round 2 finetuned model (1B)
 │   ├── canary-finnish.nemo       # Round 1 finetuned model (1B)
+│   ├── canary-1b-v2.nemo         # Base Canary-v2 model
+│   ├── kenlm_1M.nemo             # 6-gram KenLM (1M corpus)
+│   ├── kenlm_2M.nemo             # 6-gram KenLM (2M corpus)
+│   └── kenlm_5M.nemo             # 6-gram KenLM (5M corpus, recommended default)
+├── inference_pyannote.py         # Speaker-diarized inference (BEST for long audio)
+├── inference_vad.py              # VAD-based inference (fast, single speaker)
+├── moo_merged_kenlm.json         # 30-min podcast example (Diarized + KenLM)
+├── moo_merged_greedy.json        # 30-min podcast example (Diarized, Greedy)
+├── PLAN_AND_PROGRESS.md          # Detailed training & analysis log
 └── README.md
 ```
 
 
     kaldialign wandb soundfile editdistance
 ```
 
+### Additional setup for long-form diarized inference (`inference_pyannote.py`)
+
+`inference_pyannote.py` requires pyannote and transformers components on top of the base NeMo install:
+
+```bash
+pip install pyannote.audio transformers accelerate sentencepiece
+
+# Required by the torchaudio 2.10+ audio I/O path in this container
+pip install torchcodec
+```
+
+Set your Hugging Face token before running diarization (it is used to download `pyannote/speaker-diarization-community-1`):
+
+```bash
+export HF_TOKEN=your_hf_token
+```
+
+Or place it in `.env` as:
+
+```bash
+HF_TOKEN=your_hf_token
+```
+
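The token lookup described above (environment variable first, then `.env`) can be sketched like this; `load_hf_token` is an illustrative helper, not something the repo ships:

```python
import os


def load_hf_token(env_file=".env"):
    # Prefer the process environment; fall back to a simple
    # KEY=VALUE .env file like the one the README describes.
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    if os.path.exists(env_file):
        with open(env_file) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith("HF_TOKEN="):
                    return line.split("=", 1)[1]
    return None
```

A pipeline script would then pass the result to whatever downloads the diarization model, and can fail early with a clear error when no token is found.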
 ### Critical NeMo Patches (already applied in included NeMo)
 
 1. **OneLogger Fix** — makes proprietary telemetry optional for public containers