---
title: Audio Splitter with Diarization
emoji: 🎙️
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 6.10.0
python_version: 3.10.8
app_file: app.py
full_width: true
pinned: false
license: mit
short_description: Separate each isolated voice for RVC training.
---

# Audio Splitter with Speaker Diarization

Splits audio into segments organized by speaker. Uses faster-whisper for ASR, with two speaker-identification modes.

## Speaker ID Modes

### 1. Diarization (Default)

Works with any number of speakers. Labels who speaks when, using a clustering-based approach.

```
Audio → ASR (full audio) → Diarization → Assign speakers to segments
```
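The final assignment step can be sketched as picking, for each ASR segment, the diarization speaker whose turns overlap it the most. This is a minimal illustration; the function name and data shapes are assumptions, not the app's actual code:

```python
# Hypothetical sketch: label each ASR segment with the diarization
# speaker that overlaps it the most in time.

def assign_speakers(asr_segments, diar_turns):
    """asr_segments: [(start, end, text)]; diar_turns: [(start, end, speaker)]."""
    labeled = []
    for seg_start, seg_end, text in asr_segments:
        # Sum overlap duration per speaker across all diarization turns.
        overlap_by_speaker = {}
        for t_start, t_end, speaker in diar_turns:
            overlap = min(seg_end, t_end) - max(seg_start, t_start)
            if overlap > 0:
                overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap
        # Fall back to an "unknown" label if no turn overlaps the segment.
        speaker = (max(overlap_by_speaker, key=overlap_by_speaker.get)
                   if overlap_by_speaker else "SPEAKER_UNK")
        labeled.append((seg_start, seg_end, speaker, text))
    return labeled
```

Maximal-overlap voting is robust to small timestamp disagreements between the ASR and diarization outputs, since a segment straddling a turn boundary still gets the speaker who dominates it.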

### 2. Speech Separation (2 speakers only)

Physically separates the audio into two clean speaker tracks using a Dual-Path-RNN neural network.

```
Audio → Dual-Path-RNN → [Speaker 1 track, Speaker 2 track] → ASR on each
```

## Features

- **No HuggingFace token required** - pure ONNX inference
- **Fast** - optimized for CPU
- **Auto speaker detection** - diarization mode detects the speaker count automatically
- **Clean separation** - speech-separation mode provides isolated speaker tracks
- **Speaker folders** - output organized by speaker (`SPEAKER_00/`, `SPEAKER_01/`, etc.)
- **Paired output** - each segment has `.wav` + `.txt` files

## Output Structure

```
output/
├── SPEAKER_00/
│   ├── 0001.wav
│   ├── 0001.txt
│   └── ...
├── SPEAKER_01/
│   └── ...
└── transcript.txt
```
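Producing this layout from labeled segments can be sketched as follows. `write_segments` and the `save_clip` callback are hypothetical names for illustration; the app's actual implementation may differ:

```python
# Illustrative sketch: write per-speaker folders with paired .wav/.txt
# files plus a combined transcript, assuming labeled segments of the
# form (start, end, speaker, text).
import os

def write_segments(labeled_segments, out_dir, save_clip):
    """save_clip(start, end, wav_path) is a caller-supplied function
    that writes the audio slice for one segment."""
    counters = {}          # per-speaker running index for 0001, 0002, ...
    transcript_lines = []
    for start, end, speaker, text in labeled_segments:
        counters[speaker] = counters.get(speaker, 0) + 1
        speaker_dir = os.path.join(out_dir, speaker)
        os.makedirs(speaker_dir, exist_ok=True)
        stem = os.path.join(speaker_dir, f"{counters[speaker]:04d}")
        save_clip(start, end, stem + ".wav")
        with open(stem + ".txt", "w", encoding="utf-8") as f:
            f.write(text)
        transcript_lines.append(f"[{start:.2f}-{end:.2f}] {speaker}: {text}")
    with open(os.path.join(out_dir, "transcript.txt"), "w", encoding="utf-8") as f:
        f.write("\n".join(transcript_lines))
```

Keeping the `.wav` and `.txt` names identical per segment is what makes the output drop-in usable for training pipelines that expect audio/transcript pairs.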

## Notes

- **Diarization mode** - best for multi-speaker conversations with varying speaker counts
- **Separation mode** - best for 2-speaker audio where clean tracks are needed (e.g., interviews)
- **faster-whisper** - uses the large-v3 model for best accuracy

## Usage

1. Upload an audio file (MP3, WAV, etc.)
2. Select a Whisper model (large-v3 recommended)
3. Enable speaker diarization
4. Set the duration filter (Min/Max); segments outside the range are discarded
5. Click "Process"
6. Download a ZIP with the organized segments

## Duration Filter

- **Min (s)**: discard segments shorter than this (filters out noise and single words)
- **Max (s)**: discard segments longer than this (useful for training data)
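The filter itself amounts to a one-line duration check. This is an illustrative sketch; the segment shape and default bounds are assumptions:

```python
# Minimal sketch of the Min/Max duration filter, assuming segments are
# (start, end, ...) tuples with times in seconds.

def filter_by_duration(segments, min_s=1.0, max_s=15.0):
    """Keep only segments whose length falls within [min_s, max_s]."""
    return [seg for seg in segments if min_s <= (seg[1] - seg[0]) <= max_s]
```

For RVC training data, trimming very short clips removes single-word noise while a max cap keeps clips at a length the trainer handles well.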

## Models Used

| Component | Model | Purpose |
|---|---|---|
| ASR | faster-whisper | Speech-to-text with timestamps |
| Segmentation | pyannote-segmentation-3.0 | Detect speech regions |
| Embeddings | 3dspeaker | Speaker identity vectors |
| Clustering | sklearn `AgglomerativeClustering` | Group same speakers |
| Separation | Dual-Path-RNN | Speech separation (2 speakers) |

## Credits