---
title: Audio Splitter with Diarization
emoji: 🎙️
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 6.10.0
python_version: 3.10.8
app_file: app.py
full_width: true
pinned: false
license: mit
short_description: Separate each isolated voice for RVC training.
---

# Audio Splitter with Speaker Diarization

Splits audio into segments organized by speaker. Uses faster-whisper for ASR, with two speaker-identification modes.

## Speaker ID Modes

### 1. Diarization (Default)

Works with any number of speakers. Labels who speaks when, using a clustering-based approach.

```
Audio → ASR (full audio) → Diarization → Assign speakers to segments
```
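The final assignment step can be sketched as picking, for each ASR segment, the diarization speaker whose turns overlap it the most. This is a minimal illustration; the function name and data shapes are assumptions, not the app's actual code:

```python
# Hypothetical sketch: label each ASR segment with the diarization
# speaker that overlaps it the most in time.

def assign_speakers(asr_segments, diar_turns):
    """asr_segments: [(start, end, text)]; diar_turns: [(start, end, speaker)]."""
    labeled = []
    for seg_start, seg_end, text in asr_segments:
        # Sum overlap duration per speaker across all diarization turns.
        overlap_by_speaker = {}
        for t_start, t_end, speaker in diar_turns:
            overlap = min(seg_end, t_end) - max(seg_start, t_start)
            if overlap > 0:
                overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap
        # Fall back to an "unknown" label if no turn overlaps the segment.
        speaker = (max(overlap_by_speaker, key=overlap_by_speaker.get)
                   if overlap_by_speaker else "SPEAKER_UNK")
        labeled.append((seg_start, seg_end, speaker, text))
    return labeled
```

Maximal-overlap voting is robust to small timestamp disagreements between the ASR and diarization outputs, since a segment straddling a turn boundary still gets the speaker who dominates it.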

### 2. Speech Separation (2 speakers only)

Physically separates the audio into two clean speaker tracks using a Dual-Path-RNN neural network.

```
Audio → Dual-Path-RNN → [Speaker 1 track, Speaker 2 track] → ASR on each
```

## Features

- **No HuggingFace token required** - pure ONNX inference
- **Fast** - optimized for CPU
- **Auto speaker detection** - diarization mode detects the speaker count automatically
- **Clean separation** - speech-separation mode provides isolated speaker tracks
- **Speaker folders** - output organized by speaker (`SPEAKER_00/`, `SPEAKER_01/`, etc.)
- **Paired output** - each segment has `.wav` + `.txt` files

## Output Structure

```
output/
├── SPEAKER_00/
│   ├── 0001.wav
│   ├── 0001.txt
│   └── ...
├── SPEAKER_01/
│   └── ...
└── transcript.txt
```
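Producing this layout from labeled segments can be sketched as follows. `write_segments` and the `save_clip` callback are hypothetical names for illustration; the app's actual implementation may differ:

```python
# Illustrative sketch: write per-speaker folders with paired .wav/.txt
# files plus a combined transcript, assuming labeled segments of the
# form (start, end, speaker, text).
import os

def write_segments(labeled_segments, out_dir, save_clip):
    """save_clip(start, end, wav_path) is a caller-supplied function
    that writes the audio slice for one segment."""
    counters = {}          # per-speaker running index for 0001, 0002, ...
    transcript_lines = []
    for start, end, speaker, text in labeled_segments:
        counters[speaker] = counters.get(speaker, 0) + 1
        speaker_dir = os.path.join(out_dir, speaker)
        os.makedirs(speaker_dir, exist_ok=True)
        stem = os.path.join(speaker_dir, f"{counters[speaker]:04d}")
        save_clip(start, end, stem + ".wav")
        with open(stem + ".txt", "w", encoding="utf-8") as f:
            f.write(text)
        transcript_lines.append(f"[{start:.2f}-{end:.2f}] {speaker}: {text}")
    with open(os.path.join(out_dir, "transcript.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(transcript_lines))
```

Keeping the `.wav` and `.txt` names identical per segment is what makes the output drop-in usable for training pipelines that expect audio/transcript pairs.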

## Notes

- **Diarization mode** - best for multi-speaker conversations with varying speaker counts
- **Separation mode** - best for 2-speaker audio where clean tracks are needed (e.g., interviews)
- **faster-whisper** - uses the large-v3 model for best accuracy

## Usage

1. Upload an audio file (MP3, WAV, etc.)
2. Select a Whisper model (large-v3 recommended)
3. Enable speaker diarization
4. Set the duration filter (Min/Max); segments outside the range are discarded
5. Click "Process"
6. Download a ZIP with the organized segments

## Duration Filter

- **Min (s)**: discard segments shorter than this (filters out noise and single words)
- **Max (s)**: discard segments longer than this (useful for training data)
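The filter itself amounts to a one-line duration check. This is an illustrative sketch; the segment shape and default bounds are assumptions:

```python
# Minimal sketch of the Min/Max duration filter, assuming segments are
# (start, end, ...) tuples with times in seconds.

def filter_by_duration(segments, min_s=1.0, max_s=15.0):
    """Keep only segments whose length falls within [min_s, max_s]."""
    return [seg for seg in segments if min_s <= (seg[1] - seg[0]) <= max_s]
```

For RVC training data, trimming very short clips removes single-word noise while a max cap keeps clips at a length the trainer handles well.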

## Models Used

| Component | Model | Purpose |
|---|---|---|
| ASR | faster-whisper | Speech-to-text with timestamps |
| Segmentation | pyannote-segmentation-3.0 | Detect speech regions |
| Embeddings | 3dspeaker | Speaker identity vectors |
| Clustering | sklearn `AgglomerativeClustering` | Group same speakers |
| Separation | Dual-Path-RNN | Speech separation (2 speakers) |

## Credits