RASMUS committed · Commit 3700d96 · verified · 1 Parent(s): 67ac13f

Upload README.md with huggingface_hub

Files changed (1): README.md (+48 -28)
README.md CHANGED
@@ -195,34 +195,32 @@ result = model.transcribe(
 
 ### Long-Form Audio (podcasts, interviews, lectures)
 
-Use the included `inference_vad.py` script, which combines Silero VAD chunking with the Canary model:
 
 ```bash
-# Greedy — best for audiobooks, studio speech
-python inference_vad.py \
     --audio long_recording.wav \
     --model models/canary-finnish-v2.nemo \
-    --output transcript.txt
 
-# KenLM — best for conversational / podcast audio
 python inference_vad.py \
     --audio long_recording.wav \
     --model models/canary-finnish-v2.nemo \
-    --kenlm models/kenlm_5M.nemo \
     --output transcript.txt
 ```
 
-The script writes both a plain-text transcript (`.txt`) and a Whisper-compatible JSON (`.json`) with segment-level timestamps:
-
-```json
-{
-  "segments": [
-    { "start": 8.07, "end": 15.26, "text": "Hei armas kuulija ja tervetuloa linjoille." },
-    { "start": 15.94, "end": 25.69, "text": "Tämän podcastin tarkoitus on tarjota..." }
-  ],
-  "text": "<full transcript>"
-}
-```
 
 ---
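The Whisper-compatible `segments` JSON removed above is straightforward to post-process. As an illustration (not part of the repo), here is a minimal sketch that converts such a result to SRT captions; the `segments_to_srt` helper and the sample data are ours, with only the schema taken from the README:

```python
import json


def srt_time(seconds: float) -> str:
    # SRT timestamps use the HH:MM:SS,mmm format.
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(result: dict) -> str:
    # Build numbered SRT blocks from a Whisper-style "segments" list.
    blocks = []
    for i, seg in enumerate(result["segments"], start=1):
        blocks.append(
            f"{i}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Sample data mirroring the schema documented in the README.
result = {
    "segments": [
        {"start": 8.07, "end": 15.26,
         "text": "Hei armas kuulija ja tervetuloa linjoille."},
    ],
    "text": "...",
}
print(segments_to_srt(result))
```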
@@ -300,18 +298,17 @@ Zero eval leakage: 1,833 sentences overlapping with evaluation sets were removed
 .
 ├── NeMo/                          # NeMo toolkit (with patches applied)
 ├── models/
 │   ├── canary-finnish.nemo       # Round 1 finetuned model (1B)
-│   ├── canary-finnish-v2.nemo    # Round 2 finetuned model (1B)  ← new
-│   ├── kenlm_5M.nemo             # 6-gram KenLM, 5M corpus (recommended)
-│   ├── kenlm_2M.nemo             # 6-gram KenLM, 2M corpus
-│   └── kenlm_1M.nemo             # 6-gram KenLM, 1M corpus
-├── results/
-│   ├── r2_benchmark_results.json # R2 greedy + KenLM WER/CER per dataset  ← new
-│   ├── details_1M_CommonVoice.jsonl
-│   ├── details_1M_CSS10.jsonl
-│   ├── details_1M_FLEURS.jsonl
-│   └── details_1M_VoxPopuli.jsonl
-├── inference_vad.py              # Long-form VAD inference script  ← new
 └── README.md
 ```
@@ -337,6 +334,29 @@ pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' \
     kaldialign wandb soundfile editdistance
 ```
 
 ### Critical NeMo Patches (already applied in included NeMo)
 
 1. **OneLogger Fix** — makes proprietary telemetry optional for public containers
 
 
 ### Long-Form Audio (podcasts, interviews, lectures)
 
+We provide two scripts for long-form audio. The **Pyannote-based pipeline** is the recommended general-purpose approach: it handles speaker changes and provides the most stable transcription context for Canary.
+
+#### 1. Diarized Pipeline (Recommended) — `inference_pyannote.py`
+This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25 s chunks for Canary. This gives the best results for podcasts and multi-speaker audio.
 
 ```bash
+# Optimized for podcasts/interviews (includes diarization + KenLM)
+python inference_pyannote.py \
     --audio long_recording.wav \
     --model models/canary-finnish-v2.nemo \
+    --kenlm models/kenlm_5M.nemo \
+    --output transcript.json
+```
+
+#### 2. VAD-only Pipeline — `inference_vad.py`
+A simpler pipeline that uses Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.
 
+```bash
 python inference_vad.py \
     --audio long_recording.wav \
     --model models/canary-finnish-v2.nemo \
     --output transcript.txt
 ```
 
+#### Example Output
+See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription example produced by the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.
 
 ---
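The ~25 s chunk-merging step that the diarized pipeline's description mentions can be sketched roughly as follows. This is an illustrative assumption about the merge logic (greedy, speaker-aware), not the actual code in `inference_pyannote.py`; the `merge_segments` helper and sample segments are ours:

```python
def merge_segments(segments, max_len=25.0):
    # Greedily merge consecutive diarized segments into chunks of at most
    # max_len seconds, starting a new chunk whenever the speaker changes
    # or the chunk would grow past the limit.
    chunks = []
    cur = None
    for seg in segments:
        if (cur is not None
                and seg["speaker"] == cur["speaker"]
                and seg["end"] - cur["start"] <= max_len):
            cur["end"] = seg["end"]       # extend the current chunk
        else:
            cur = dict(seg)               # start a new chunk
            chunks.append(cur)
    return chunks


# Hypothetical diarization output: three turns by speaker A, one by B.
segs = [
    {"start": 0.0,  "end": 10.0, "speaker": "A"},
    {"start": 10.5, "end": 20.0, "speaker": "A"},
    {"start": 20.5, "end": 30.0, "speaker": "A"},
    {"start": 30.5, "end": 35.0, "speaker": "B"},
]
print(merge_segments(segs))
```

The third A-segment starts a new chunk because merging it would exceed 25 s, and the B-segment starts one because of the speaker change.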
 
 .
 ├── NeMo/                          # NeMo toolkit (with patches applied)
 ├── models/
+│   ├── canary-finnish-v2.nemo    # Round 2 finetuned model (1B)
 │   ├── canary-finnish.nemo       # Round 1 finetuned model (1B)
+│   ├── canary-1b-v2.nemo         # Base Canary-v2 model
+│   ├── kenlm_1M.nemo             # 6-gram KenLM (1M corpus)
+│   ├── kenlm_2M.nemo             # 6-gram KenLM (2M corpus)
+│   └── kenlm_5M.nemo             # 6-gram KenLM (5M corpus, recommended default)
+├── inference_pyannote.py         # Speaker-diarized inference (BEST for long audio)
+├── inference_vad.py              # VAD-based inference (fast, single speaker)
+├── moo_merged_kenlm.json         # 30-min podcast example (Diarized + KenLM)
+├── moo_merged_greedy.json        # 30-min podcast example (Diarized, Greedy)
+├── PLAN_AND_PROGRESS.md          # Detailed training & analysis log
 └── README.md
 ```
 
 
     kaldialign wandb soundfile editdistance
 ```
 
+### Additional setup for long-form diarized inference (`inference_pyannote.py`)
+
+`inference_pyannote.py` requires pyannote and transformers components on top of the base NeMo install:
+
+```bash
+pip install pyannote.audio transformers accelerate sentencepiece
+
+# Required by the torchaudio 2.10+ audio I/O path in this container
+pip install torchcodec
+```
+
+Set your Hugging Face token before running diarization (it is used to download `pyannote/speaker-diarization-community-1`):
+
+```bash
+export HF_TOKEN=your_hf_token
+```
+
+Or place it in `.env` as:
+
+```bash
+HF_TOKEN=your_hf_token
+```
+
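The token lookup described above (environment variable first, then `.env`) can be sketched like this; `load_hf_token` is an illustrative helper, not something the repo ships:

```python
import os


def load_hf_token(env_file=".env"):
    # Prefer the process environment; fall back to a simple
    # KEY=VALUE .env file like the one the README describes.
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    if os.path.exists(env_file):
        with open(env_file) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith("HF_TOKEN="):
                    return line.split("=", 1)[1]
    return None
```

A pipeline script would then pass the result to whatever downloads the diarization model, and can fail early with a clear error when no token is found.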
 ### Critical NeMo Patches (already applied in included NeMo)
 
 1. **OneLogger Fix** — makes proprietary telemetry optional for public containers