### Long-Form Audio (podcasts, interviews, lectures)

We provide two scripts for long-form audio. The **Pyannote-based pipeline** is the recommended general-purpose approach, as it handles speaker changes and provides the most stable transcription context for Canary.

#### 1. Diarized Pipeline (Recommended) – `inference_pyannote.py`

This script uses `pyannote/speaker-diarization-community-1` to segment audio by speaker, then merges segments into ~25s chunks for Canary. This provides the best results for podcasts and multi-speaker audio.

```bash
# Optimized for podcasts/interviews (includes diarization + KenLM)
python inference_pyannote.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --kenlm models/kenlm_5M.nemo \
    --output transcript.json
```
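The segment-merging idea described above can be sketched as follows. This is a simplified illustration only, not the actual logic in `inference_pyannote.py`; the tuple format and the exact 25 s rule are assumptions for the sketch.

```python
# Sketch: merge consecutive same-speaker diarization segments into chunks
# of at most ~25 s, so each chunk fits Canary's comfortable context window.
# Illustration only -- not the code shipped in inference_pyannote.py.

def merge_segments(segments, max_dur=25.0):
    """segments: list of (start, end, speaker) tuples, sorted by start time.
    Returns merged (start, end, speaker) chunks that never exceed max_dur
    and never mix two speakers in one chunk."""
    chunks = []
    for start, end, speaker in segments:
        if (chunks
                and chunks[-1][2] == speaker           # same speaker as current chunk
                and end - chunks[-1][0] <= max_dur):   # merged chunk stays under the cap
            prev_start, _, _ = chunks[-1]
            chunks[-1] = (prev_start, end, speaker)    # extend the current chunk
        else:
            chunks.append((start, end, speaker))
    return chunks

print(merge_segments([(0.0, 10.0, "A"), (10.5, 20.0, "A"),
                      (20.5, 32.0, "A"), (32.5, 40.0, "B")]))
# -> [(0.0, 20.0, 'A'), (20.5, 32.0, 'A'), (32.5, 40.0, 'B')]
```

Capping chunk length matters because very long inputs degrade attention-based ASR decoding; splitting on speaker changes also keeps each chunk acoustically homogeneous.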
#### 2. VAD-only Pipeline – `inference_vad.py`

A simpler pipeline using Silero VAD for basic speech-activity detection. Useful if you don't need speaker labels or have a single-speaker recording.

```bash
python inference_vad.py \
    --audio long_recording.wav \
    --model models/canary-finnish-v2.nemo \
    --output transcript.txt
```

#### Example Output

See [`moo_merged_kenlm.json`](moo_merged_kenlm.json) for a full 30-minute podcast transcription example using the diarized pipeline. It includes segment-level speaker labels and word-level timestamps.
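To post-process such a transcript, one option is to render the segments as timestamped lines. The JSON schema assumed below (a `segments` list with `start`, `end`, `speaker`, and `text` fields) is an illustration; check the actual example file for the real layout.

```python
import json  # included so the sketch extends naturally to json.load(open(path))

def to_lines(transcript):
    """Render an assumed {'segments': [{'start','end','speaker','text'}, ...]}
    dict as '[MM:SS-MM:SS] SPEAKER: text' lines. Schema is hypothetical."""
    def mmss(t):
        return f"{int(t) // 60:02d}:{int(t) % 60:02d}"
    return [
        f"[{mmss(s['start'])}-{mmss(s['end'])}] {s.get('speaker', '?')}: {s['text']}"
        for s in transcript["segments"]
    ]

example = {"segments": [
    {"start": 8.07, "end": 15.26, "speaker": "SPEAKER_00",
     "text": "Hei armas kuulija ja tervetuloa linjoille."},
]}
print("\n".join(to_lines(example)))
# -> [00:08-00:15] SPEAKER_00: Hei armas kuulija ja tervetuloa linjoille.
```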

---

.
├── NeMo/                       # NeMo toolkit (with patches applied)
├── models/
│   ├── canary-finnish-v2.nemo  # Round 2 finetuned model (1B)
│   ├── canary-finnish.nemo     # Round 1 finetuned model (1B)
│   ├── canary-1b-v2.nemo       # Base Canary-v2 model
│   ├── kenlm_1M.nemo           # 6-gram KenLM (1M corpus)
│   ├── kenlm_2M.nemo           # 6-gram KenLM (2M corpus)
│   └── kenlm_5M.nemo           # 6-gram KenLM (5M corpus, recommended default)
├── inference_pyannote.py       # Speaker-diarized inference (BEST for long audio)
├── inference_vad.py            # VAD-based inference (fast, single speaker)
├── moo_merged_kenlm.json       # 30-min podcast example (Diarized + KenLM)
├── moo_merged_greedy.json      # 30-min podcast example (Diarized, Greedy)
├── PLAN_AND_PROGRESS.md        # Detailed training & analysis log
└── README.md
```
    kaldialign wandb soundfile editdistance
```

### Additional setup for long-form diarized inference (`inference_pyannote.py`)

`inference_pyannote.py` requires pyannote and transformers components on top of the base NeMo install:

```bash
pip install pyannote.audio transformers accelerate sentencepiece

# Required by the torchaudio 2.10+ audio I/O path in this container
pip install torchcodec
```

Set your Hugging Face token before running diarization (used to download `pyannote/speaker-diarization-community-1`):

```bash
export HF_TOKEN=your_hf_token
```

Or place it in `.env` as:

```bash
HF_TOKEN=your_hf_token
```
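If you drive the pipeline from your own Python code rather than the provided script, a minimal way to load that `.env` file looks like the sketch below. This assumes plain `KEY=value` lines; `inference_pyannote.py` itself may load the file differently (e.g. via `python-dotenv`).

```python
import os

def load_dotenv_minimal(path=".env"):
    """Minimal KEY=value loader: skips blank lines and '#' comments, and
    does not overwrite variables already set in the environment."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# After calling this with a .env containing 'HF_TOKEN=your_hf_token',
# os.environ['HF_TOKEN'] is set unless the shell already exported one.
```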

### Critical NeMo Patches (already applied in included NeMo)

1. **OneLogger Fix** – makes proprietary telemetry optional for public containers