batispeak-diarize

Korean telephone (8 kHz) speaker diarization: a fine-tuned segmentation model for who-spoke-when on Korean phone calls.

batispeak-diarize fine-tunes only the segmentation stage of the pyannote diarization pipeline. The embedding (pyannote/wespeaker-voxceleb-resnet34-LM) and clustering (AgglomerativeClustering) stages are used as-is from open source.

Base: pyannote/segmentation-3.0 (MIT, 1.47M params)
Target: Korean telephone audio, 8 kHz
Checkpoint: seg-ft-final.ckpt (17.7 MB)

⭐ Real-call DER results

Measured on a Mac. Only the segmentation model was swapped for the fine-tuned checkpoint; the rest of the pipeline is identical, with num_speakers=2 and collar 0.5, scored against the same Clova RTTM reference.

Call	Length	Base (seg-3.0) DER	batispeak-diarize DER
call_03	3.5 min	24.9%	14.2%
call_05	7.8 min	17.5%	5.6%
call_06	2.3 min	35.4%	19.7%
Average		25.9%	13.2%

−12.8 pp absolute improvement in average DER (25.9% → 13.2%), reaching the level of English meeting SOTA (~15%).
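The averages and the improvement figure can be re-derived from the per-call rows. The sketch below assumes the "Average" row is an unweighted mean over the three calls (not weighted by call length), which is what reproduces the published numbers; it also shows the same gain as a fraction of the base DER.

```python
# Per-call DER (%) copied from the results table.
base_der = {"call_03": 24.9, "call_05": 17.5, "call_06": 35.4}
ft_der = {"call_03": 14.2, "call_05": 5.6, "call_06": 19.7}


def mean(values):
    """Unweighted arithmetic mean (assumption: calls are not length-weighted)."""
    values = list(values)
    return sum(values) / len(values)


base_avg = mean(base_der.values())
ft_avg = mean(ft_der.values())
absolute_gain = base_avg - ft_avg          # in percentage points
relative_gain = absolute_gain / base_avg   # as a fraction of the base DER

print(f"base {base_avg:.1f}%  fine-tuned {ft_avg:.1f}%")
print(f"absolute {absolute_gain:.1f} pp, relative {relative_gain:.0%}")
```

The 12.8 pp figure comes from the unrounded averages; subtracting the rounded table values (25.9 − 13.2) gives 12.7.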

How it works & our contribution

Why synthetic data

Labeled Korean speaker-diarization data is effectively nonexistent. We work around this with synthetic simulation, the standard approach for telephone diarization.

Pipeline

Segmentation: fine-tuned (batispeak-diarize) ← our contribution
Embedding: pyannote/wespeaker-voxceleb-resnet34-LM (open source, as-is)
Clustering: AgglomerativeClustering (open source, as-is)

We fine-tuned segmentation only; the embedding and clustering stages are unchanged open-source components.

Usage

from pyannote.audio import Model
from pyannote.audio.pipelines import SpeakerDiarization

ft = Model.from_pretrained("batiai/batispeak-diarize")
pipeline = SpeakerDiarization(
    segmentation=ft,
    embedding="pyannote/wespeaker-voxceleb-resnet34-LM",
    clustering="AgglomerativeClustering",
)
pipeline.instantiate({
    # Bridge short non-speech gaps (Mac-verified: avg DER 13.16% → 13.10%, no per-call regression).
    "segmentation": {"min_duration_off": 0.25},
    "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": 0.7045},
})
diarization = pipeline("call.wav")

min_duration_off: 0.25 bridges short non-speech gaps, which recovers a small amount of missed speech on call-domain audio. On the Mac real-call evaluation it lowered average DER from 13.16% to 13.10% with no call regressing; 0.0 also works. Tuning threshold has no effect when num_speakers=2 is set, because the pipeline then re-searches for a 2-cluster solution.
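To make the threshold remark concrete, here is a minimal sketch of centroid-linkage agglomerative clustering in plain NumPy. It is a simplified stand-in, not pyannote's implementation: it ignores min_cluster_size, and the function name and arguments are illustrative. When a fixed cluster count is requested, merging continues until that count is reached and the distance threshold is never consulted.

```python
import numpy as np


def centroid_agglomerate(embeddings, threshold=None, num_clusters=None):
    """Greedy centroid-linkage clustering on L2-normalised embeddings.

    Stops when the closest pair of centroids is farther apart than `threshold`,
    or, if `num_clusters` is given, as soon as that many clusters remain.
    """
    if threshold is None and num_clusters is None:
        raise ValueError("give a threshold or a number of clusters")
    x = np.asarray(embeddings, dtype=np.float64)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    clusters = [[i] for i in range(len(x))]           # member indices per cluster
    centroids = [x[i].copy() for i in range(len(x))]

    while len(clusters) > 1:
        if num_clusters is not None and len(clusters) <= num_clusters:
            break
        # Find the closest pair of centroids (Euclidean distance).
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(centroids[a] - centroids[b])
                if d < best:
                    best, pair = d, (a, b)
        # The threshold only matters when no cluster count was requested.
        if num_clusters is None and best > threshold:
            break
        a, b = pair
        na, nb = len(clusters[a]), len(clusters[b])
        centroids[a] = (na * centroids[a] + nb * centroids[b]) / (na + nb)
        clusters[a] += clusters[b]
        del clusters[b], centroids[b]

    labels = np.empty(len(x), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


# Toy embeddings for two synthetic "speakers" (made-up data).
rng = np.random.default_rng(0)
emb = np.vstack([
    rng.normal([1, 0, 0], 0.05, size=(5, 3)),
    rng.normal([0, 1, 0], 0.05, size=(5, 3)),
])
for t in (0.3, 0.7045, 1.2):
    print(t, centroid_agglomerate(emb, threshold=t, num_clusters=2))
```

Every threshold in the loop yields the same labels, because the fixed cluster count decides where merging stops.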

ONNX (native / Swift / on-device)

For framework-free or on-device deployment (e.g. Swift + onnxruntime), the pipeline's two neural stages are provided as ONNX under onnx/:

File	Input → Output	Notes
`onnx/segmentation.onnx`	waveform (1, 1, 80000) → powerset (1, 293, 7)	fine-tuned segmentation (= `seg-ft-final.ckpt`), batch fixed
`onnx/segmentation-dynamic.onnx`	waveform (B, 1, 80000) → powerset (B, 293, 7)	batch-dynamic (window batching), pytorch parity 1e-4 — recommended for speed
`onnx/resnet34lm-feats.onnx`	feats (B, frames, 80) + weights (B, frames) → embedding (B, 256)	base `wespeaker-resnet34-LM` (VoxCeleb 16kHz), cosine; bit-exact
`onnx/resnet34lm-feats-8k-ko.onnx`	feats (B, frames, 80) + weights (B, frames) → embedding (B, 256)	8kHz Korean telephone fine-tuned — recommended for calls. Same I/O, drop-in swap.
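As a rough illustration of window batching with the batch-dynamic segmentation model, the sketch below slices a mono waveform into 80000-sample windows and runs them in one call. The helper names and the `step` argument are hypothetical (the card does not state the hop between windows), and the ONNX input name is queried from the session rather than assumed.

```python
import numpy as np

WINDOW = 80000  # samples per window, taken from the input shape in the table


def make_windows(waveform, step):
    """Slice a mono waveform into zero-padded float32 windows of shape (B, 1, WINDOW)."""
    wav = np.asarray(waveform, dtype=np.float32).reshape(-1)
    # Number of windows needed so that the last one reaches the end of the audio.
    n = int(np.ceil(max(len(wav) - WINDOW, 0) / step)) + 1
    padded = np.zeros((n - 1) * step + WINDOW, dtype=np.float32)
    padded[: len(wav)] = wav
    windows = [padded[i * step : i * step + WINDOW] for i in range(n)]
    return np.stack(windows)[:, None, :]


def run_segmentation(onnx_path, windows):
    """Run the segmentation model on a batch of windows; returns (B, 293, 7) scores."""
    import onnxruntime as ort  # imported lazily so make_windows has no extra dependency

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: windows})[0]


# Example with a hypothetical hop of 8000 samples:
# windows = make_windows(waveform, step=8000)
# scores = run_segmentation("onnx/segmentation-dynamic.onnx", windows)
```

With the fixed-batch `onnx/segmentation.onnx`, the same windows would have to be fed one at a time.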

⭐ 8kHz Korean telephone embedding (resnet34lm-feats-8k-ko.onnx): the wespeaker-resnet34-LM embedding fine-tuned for 8 kHz Korean telephone audio (AAM-Softmax, with augmentation including resample/codec). Held-out 8 kHz speaker EER drops from 16.98% to 7.68% (−9.3 pp); this is the robust gain, measured by the model team. Real-call DER: under identical (ffmpeg) preprocessing it matches the baseline embedding (both 13.2%). Under the app's on-device resampling (AVFoundation), the fine-tuned embedding is more robust: the app-measured DER is 14.4% with the baseline and 13.2% with the fine-tuned embedding. That ~1 pp reflects resampling robustness (the baseline loses it to the resampler), not a clean-pipeline DER gain. Recommended for on-device deployment. Same I/O (weights = 293-frame seg mask), so it is a drop-in replacement for resnet34lm-feats.onnx.

The embedding ONNX takes kaldi fbank feats (not raw waveform), since the kaldi fbank front-end is not ONNX-exportable. Compute feats as: waveform × 32768 → kaldi.fbank(num_mel=80, frame_length=25ms, frame_shift=10ms, window=hamming, dither=0, use_energy=False) → CMN (subtract per-utterance frame mean).
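A possible Python reference for this front-end, useful for checking a native port, is sketched below using `torchaudio.compliance.kaldi.fbank`. Treat it as an assumption-laden sketch: the function names are illustrative, the input is assumed to be a mono float waveform in [-1, 1], and the sampling rate is left as an argument because the card does not state which rate the features are computed at.

```python
import numpy as np


def cmn(feats):
    """Mean normalisation: subtract each mel bin's per-utterance mean over frames."""
    feats = np.asarray(feats, dtype=np.float32)
    return feats - feats.mean(axis=0, keepdims=True)


def kaldi_fbank_feats(waveform, sample_rate):
    """Mono float waveform in [-1, 1] -> (1, frames, 80) features for the embedding ONNX."""
    import torch
    import torchaudio.compliance.kaldi as kaldi

    # Scale to the 16-bit integer range that kaldi-style fbank expects.
    wav = torch.as_tensor(np.asarray(waveform), dtype=torch.float32).reshape(1, -1) * 32768.0
    feats = kaldi.fbank(
        wav,
        num_mel_bins=80,
        frame_length=25.0,   # milliseconds
        frame_shift=10.0,    # milliseconds
        window_type="hamming",
        dither=0.0,
        use_energy=False,
        sample_frequency=float(sample_rate),
    )
    return cmn(feats.numpy())[None]  # add the batch dimension
```

A native implementation can be validated by comparing its output against this function on the same audio.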

The clustering / binarization / stitching stages (binarize → embedding → AgglomerativeClustering @ threshold 0.7045 → reconstruct) are deterministic pipeline logic to be re-implemented natively. The ONNX files cover only the neural stages.
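For the binarize step, the 7 output classes are consistent with a powerset over 3 local speakers with at most 2 active at once (1 + 3 + 3 classes). The sketch below decodes such scores into per-speaker activity by per-frame argmax. The class ordering (empty set, singles, then pairs) is an assumption here and should be checked against the PyTorch pipeline's output before relying on it.

```python
from itertools import combinations

import numpy as np


def powerset_mapping(num_speakers=3, max_simultaneous=2):
    """Rows are powerset classes, columns are speakers; 1 marks an active speaker."""
    sets = [
        members
        for size in range(max_simultaneous + 1)
        for members in combinations(range(num_speakers), size)
    ]
    mapping = np.zeros((len(sets), num_speakers), dtype=np.float32)
    for row, members in enumerate(sets):
        for speaker in members:
            mapping[row, speaker] = 1.0
    return mapping


def binarize(powerset_scores):
    """(..., frames, 7) scores -> (..., frames, 3) hard 0/1 speaker activity."""
    mapping = powerset_mapping()
    best_class = np.argmax(powerset_scores, axis=-1)
    return mapping[best_class]


# Toy scores: frame 0 favours class 5, frame 1 favours class 2.
scores = np.full((2, 7), -10.0)
scores[0, 5] = 0.0
scores[1, 2] = 0.0
print(binarize(scores))
```

Because the decoding is an argmax, it gives the same result whether the model emits log-probabilities or raw logits.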

Training details

Data (synthetic)

Source: KconfSpeech (Korean meeting speech) → per-speaker clips (366 speakers, 5,549 clips)
Synthesis: multi-speaker 8 kHz telephone calls, with overlap / gap / turn control, 8 kHz band-limiting, and added noise
Scale: ~300 sessions / 24.1 h / 13,908 RTTM turns
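The turn-scheduling part of such a simulation can be pictured with the toy sketch below. It is not the generator used for this model: every name, probability and duration is hypothetical, and band-limiting and noise are left out. It lays per-speaker clips on a timeline with random gaps or overlaps and writes the result as standard RTTM lines.

```python
import random


def simulate_turns(clip_durations, speakers=("spk_a", "spk_b"),
                   overlap_prob=0.2, max_overlap=0.5, max_gap=1.0, seed=0):
    """Place clips on a timeline, alternating speakers, with random gaps or overlaps.

    Returns a list of (speaker, start_seconds, duration_seconds) tuples.
    """
    rng = random.Random(seed)
    turns, prev_end, prev_dur = [], 0.0, 0.0
    for i, dur in enumerate(clip_durations):
        start = prev_end
        if i > 0:
            if rng.random() < overlap_prob:
                # Start before the previous turn ends, but never before it began.
                start -= rng.uniform(0.0, min(max_overlap, prev_dur, dur))
            else:
                start += rng.uniform(0.0, max_gap)
        turns.append((speakers[i % len(speakers)], start, dur))
        prev_end, prev_dur = start + dur, dur
    return turns


def to_rttm(turns, uri="session_0001"):
    """Format turns as RTTM 'SPEAKER' lines (10 whitespace-separated fields)."""
    return [
        f"SPEAKER {uri} 1 {start:.3f} {dur:.3f} <NA> <NA> {spk} <NA> <NA>"
        for spk, start, dur in turns
    ]


for line in to_rttm(simulate_turns([2.0, 1.5, 3.0, 1.0])):
    print(line)
```

In a full generator the same schedule would also drive the audio mixing, so that the RTTM labels and the waveform stay aligned.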

Fine-tuning

Fine-tuned segmentation-3.0 for 12 epochs
Synthetic dev DER: 0.049 → 0.029

Environment

Training: pyannote.audio 3.1.1 + torch 2.5.1+cu124
Compatibility: loading verified with Model.from_pretrained on pyannote 4.0.4 (seg-3.0 architecture compatible)

Limitations & notes

Reference label noise: the Clova RTTM reference has 1-second granularity, so absolute DER values are coarse. Because the same RTTM and collar are used for both the base and the fine-tuned model, the 12.8 pp improvement of one over the other is still reliable.
Synthetic gap: the training data is meeting speech converted to 8 kHz, which differs from real telephone audio in timbre and background noise.
Evaluation note: an increased-overlap second iteration and off-the-shelf embedding swaps (ECAPA, CAM++ multilingual/English) were all A/B-evaluated against this v1, and none improved on it for real Korean 8 kHz calls. The released configuration (fine-tuned segmentation-3.0 + wespeaker-resnet34-LM embedding) is the best of the configurations validated for this domain, at 13.2% DER. Further improvement is expected to come from training a Korean 8 kHz embedding (a separate project).

License

📥 Public release: free to download with no gate or login (token-free distribution for the BatiFlow app). The BatiAI Community License v2.0 below still applies in full. 💼 Commercial use (external SaaS, resale, revenue of 1 billion (10억) or more) requires prior agreement via support@bati.ai.

Base model: pyannote/segmentation-3.0, MIT (commercial use and derivative redistribution permitted; attribution retained).
This model: distributed under BatiAI Community License v2.0 (Tier 2: public, ungated, commercial use by agreement).
- Revenue under 1 billion (10억) / under 24 months / non-commercial use: free. External SaaS with revenue of 1 billion or more: contact support@bati.ai.
- License: https://github.com/batiai/batiai-models/blob/main/LICENSE-BATIAI-COMMUNITY.md

This model is a derivative of pyannote/segmentation-3.0 (MIT); the original author's attribution is retained.

Pairing

Combine with batisay-ko-turbo (STT) for per-speaker transcription: diarization assigns who, STT provides what.
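One simple way to combine the two outputs is to give each transcribed segment the speaker whose turns overlap it the most. The sketch below assumes plain tuples for both inputs; the tuple layouts and the function name are illustrative, not an API of either model.

```python
def assign_speakers(stt_segments, speaker_turns):
    """Label each STT segment with the speaker that overlaps it the longest.

    stt_segments:  iterable of (start, end, text)
    speaker_turns: iterable of (start, end, speaker)
    Returns a list of (speaker or None, start, end, text).
    """
    labelled = []
    for seg_start, seg_end, text in stt_segments:
        overlap = {}
        for turn_start, turn_end, speaker in speaker_turns:
            shared = min(seg_end, turn_end) - max(seg_start, turn_start)
            if shared > 0:
                overlap[speaker] = overlap.get(speaker, 0.0) + shared
        speaker = max(overlap, key=overlap.get) if overlap else None
        labelled.append((speaker, seg_start, seg_end, text))
    return labelled


# Made-up example data.
stt = [(0.0, 2.0, "hello"), (2.1, 4.0, "hi there")]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.0, 4.5, "SPEAKER_01")]
for speaker, start, end, text in assign_speakers(stt, turns):
    print(f"[{start:.1f}-{end:.1f}] {speaker}: {text}")
```

Segments with no overlapping turn are labelled None, leaving it to the caller to drop them or attach them to a neighbouring speaker.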

Bundled in the HF Collection "한국어 음성 스위트" (Korean Speech Suite).
