---
title: Audio Splitter with Diarization
emoji: 🎙️
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 6.10.0
python_version: 3.10.8
app_file: app.py
full_width: true
pinned: false
license: mit
short_description: Separate each isolated voice for RVC training.
---
# Audio Splitter with Speaker Diarization

Split audio into segments organized by speaker. Uses faster-whisper for ASR, with two speaker-identification modes.
## Speaker ID Modes

### 1. Diarization (Default)

Works with any number of speakers. Labels who speaks when, using a clustering-based approach.

```
Audio → ASR (full audio) → Diarization → Assign speakers to segments
```

### 2. Speech Separation (2 speakers only)

Physically separates the audio into two clean speaker tracks using a Dual-Path-RNN neural network.

```
Audio → Dual-Path-RNN → [Speaker1 track, Speaker2 track] → ASR on each
```
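The final step of the diarization pipeline, assigning a speaker label to each ASR segment, can be sketched as picking the speaker whose diarized turns overlap the segment the most in time. This is a minimal illustration, not the app's actual code; the function name and data shapes are hypothetical.

```python
def assign_speakers(asr_segments, speaker_turns):
    """Assign each ASR segment (start, end, text) to the speaker whose
    diarized turns (speaker, start, end) overlap it most in time.
    Hypothetical helper illustrating the 'assign speakers' step."""
    labeled = []
    for seg_start, seg_end, text in asr_segments:
        overlap = {}
        for spk, turn_start, turn_end in speaker_turns:
            # Overlap between [seg_start, seg_end] and [turn_start, turn_end]
            o = min(seg_end, turn_end) - max(seg_start, turn_start)
            if o > 0:
                overlap[spk] = overlap.get(spk, 0.0) + o
        # Fall back to a default label if no turn overlaps the segment
        speaker = max(overlap, key=overlap.get) if overlap else "SPEAKER_00"
        labeled.append((speaker, seg_start, seg_end, text))
    return labeled

asr = [(0.0, 2.5, "Hello there."), (2.6, 5.0, "Hi, welcome back.")]
turns = [("SPEAKER_00", 0.0, 2.4), ("SPEAKER_01", 2.4, 5.2)]
print(assign_speakers(asr, turns))
```

Summing overlap per speaker (rather than taking the single largest turn) keeps the assignment stable when a segment straddles several short diarized turns from the same speaker.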
## Features
- No HuggingFace token required - Pure ONNX inference
- Fast - Optimized for CPU
- Auto speaker detection - Diarization mode detects speaker count automatically
- Clean separation - Speech separation mode provides isolated speaker tracks
- Speaker Folders - Output organized by speaker (SPEAKER_00/, SPEAKER_01/, etc.)
- Paired Output - Each segment has .wav + .txt files
## Output Structure

```
output/
├── SPEAKER_00/
│   ├── 0001.wav
│   ├── 0001.txt
│   └── ...
├── SPEAKER_01/
│   └── ...
└── transcript.txt
```
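The per-speaker layout with zero-padded, paired filenames can be sketched as below. This is a simplified illustration (the helper name and segment shape are hypothetical, and audio writing is omitted); the real app also writes a matching `.wav` for each `.txt`.

```python
import os
import tempfile

def write_segments(out_dir, segments):
    """Write each (speaker, text) segment as NNNN.txt under its speaker's
    folder, mirroring the output layout. Hypothetical sketch: the matching
    NNNN.wav audio file is omitted for brevity."""
    counters = {}
    for speaker, text in segments:
        spk_dir = os.path.join(out_dir, speaker)
        os.makedirs(spk_dir, exist_ok=True)
        # Per-speaker counter gives the zero-padded stem shared by .wav/.txt
        counters[speaker] = counters.get(speaker, 0) + 1
        stem = f"{counters[speaker]:04d}"
        with open(os.path.join(spk_dir, stem + ".txt"), "w", encoding="utf-8") as f:
            f.write(text)

out = tempfile.mkdtemp()
write_segments(out, [("SPEAKER_00", "Hello."),
                     ("SPEAKER_00", "Again."),
                     ("SPEAKER_01", "Hi.")])
print(sorted(os.listdir(os.path.join(out, "SPEAKER_00"))))  # ['0001.txt', '0002.txt']
```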
## Notes
- Diarization mode - Best for multi-speaker conversations with varying speaker counts
- Separation mode - Best for 2-speaker audio where clean tracks are needed (e.g., interviews)
- faster-whisper - Uses large-v3 model for best accuracy
## Usage

1. Upload an audio file (MP3, WAV, etc.)
2. Select a Whisper model (large-v3 recommended)
3. Enable speaker diarization
4. Set the duration filter (Min/Max) - segments outside the range are discarded
5. Click "Process"
6. Download a ZIP with the organized segments
## Duration Filter
- Min (s): Discard segments shorter than this (filters noise/single words)
- Max (s): Discard segments longer than this (useful for training data)
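The filter amounts to keeping only segments whose duration falls inside the range. A minimal sketch (function and segment shape are hypothetical, not the app's actual code):

```python
def filter_by_duration(segments, min_s=1.0, max_s=15.0):
    """Keep only (start, end, text) segments whose duration in seconds
    lies within [min_s, max_s]. Hypothetical helper."""
    return [(start, end, text)
            for start, end, text in segments
            if min_s <= (end - start) <= max_s]

segs = [(0.0, 0.4, "Uh"),                 # too short: dropped as noise
        (1.0, 4.2, "A full sentence."),   # 3.2 s: kept
        (5.0, 25.0, "A very long ramble")]  # too long: dropped
print(filter_by_duration(segs))
```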
## Models Used
| Component | Model | Purpose |
|---|---|---|
| ASR | faster-whisper | Speech-to-text with timestamps |
| Segmentation | pyannote-segmentation-3.0 | Detect speech regions |
| Embeddings | 3dspeaker | Speaker identity vectors |
| Clustering | sklearn AgglomerativeClustering | Group same speakers |
| Separation | Dual-Path-RNN | Speech separation (2 speakers) |
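The clustering step groups speaker-embedding vectors so that segments with similar voices share a label. The app uses sklearn's AgglomerativeClustering; the dependency-free sketch below substitutes a greedy centroid-threshold pass to show the idea, and its function names and threshold are illustrative only.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy single-pass clustering: join an embedding to the first
    cluster whose centroid is similar enough, else start a new cluster.
    A simplified stand-in for sklearn AgglomerativeClustering."""
    clusters = []  # list of (centroid, member_vectors)
    labels = []
    for emb in embeddings:
        for i, (centroid, members) in enumerate(clusters):
            if cosine(emb, centroid) >= threshold:
                labels.append(f"SPEAKER_{i:02d}")
                members.append(list(emb))
                # Recompute centroid as the element-wise mean of members
                clusters[i] = ([sum(v) / len(members) for v in zip(*members)],
                               members)
                break
        else:
            clusters.append((list(emb), [list(emb)]))
            labels.append(f"SPEAKER_{len(clusters) - 1:02d}")
    return labels

# Two similar voices and one distinct one (toy 2-D embeddings)
print(cluster_embeddings([(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]))
```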
## Credits
- faster-whisper - CTranslate2 Whisper
- pyannote-segmentation-3.0 - ONNX segmentation
- 3D-Speaker - Speaker embeddings
- Dual-Path-RNN - Speech separation
- Original Audio Splitter by JarodMica