batispeak-diarize
Korean telephone (8 kHz) speaker diarization: a fine-tuned segmentation model for who-spoke-when on Korean phone calls.
batispeak-diarize fine-tunes only the segmentation stage of the pyannote diarization pipeline. The embedding (pyannote/wespeaker-voxceleb-resnet34-LM) and clustering (AgglomerativeClustering) stages are used as-is from open source.
- Base: `pyannote/segmentation-3.0` (MIT, 1.47M params)
- Target: Korean telephone audio, 8 kHz
- Checkpoint: `seg-ft-final.ckpt` (17.7 MB)
⭐ Real-call DER results
Measured on a Mac. Only the segmentation model was swapped for the fine-tuned checkpoint (the rest of the pipeline is identical), with num_speakers=2 and a collar of 0.5, scored against the same Clova RTTM reference.
| Call | Length | Base (seg-3.0) DER | batispeak-diarize DER |
|---|---|---|---|
| call_03 | 3.5 min | 24.9% | 14.2% |
| call_05 | 7.8 min | 17.5% | 5.6% |
| call_06 | 2.3 min | 35.4% | 19.7% |
| Average | | 25.9% | 13.2% |
An absolute improvement of 12.8 percentage points, reaching the level of English meeting SOTA (15%).
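The averages and the 12.8-point figure can be re-derived from the per-call numbers. The check below uses only the values in the table; it assumes the average is an unweighted mean over the three calls, which does reproduce the reported 25.9% and 13.2%. The relative-reduction line is an extra derived quantity, not a figure reported by the authors.

```python
# Per-call DER (%) copied from the table: (base, fine-tuned).
der = {
    "call_03": (24.9, 14.2),
    "call_05": (17.5, 5.6),
    "call_06": (35.4, 19.7),
}

# Unweighted mean over calls (assumption: the average is not duration-weighted).
base_avg = sum(base for base, _ in der.values()) / len(der)
ft_avg = sum(ft for _, ft in der.values()) / len(der)

# Absolute gain in percentage points, and the same gain relative to the base DER.
abs_gain_pp = base_avg - ft_avg
rel_reduction = abs_gain_pp / base_avg

print(f"base {base_avg:.1f}%  fine-tuned {ft_avg:.1f}%")
print(f"absolute gain {abs_gain_pp:.1f} pp, relative reduction {rel_reduction:.0%}")
```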
How it works & our contribution
Why synthetic data
Labeled Korean speaker-diarization data is effectively nonexistent. We work around this with synthetic simulation, the standard approach for telephone diarization.
Pipeline
- Segmentation: fine-tuned (`batispeak-diarize`), our contribution
- Embedding: `pyannote/wespeaker-voxceleb-resnet34-LM` (open source, as-is)
- Clustering: `AgglomerativeClustering` (open source, as-is)
We fine-tuned segmentation only; the embedding and clustering stages are unchanged open-source components.
Usage
```python
from pyannote.audio import Model
from pyannote.audio.pipelines import SpeakerDiarization

# Load the fine-tuned segmentation model; embedding and clustering stay stock.
ft = Model.from_pretrained("batiai/batispeak-diarize")
pipeline = SpeakerDiarization(
    segmentation=ft,
    embedding="pyannote/wespeaker-voxceleb-resnet34-LM",
    clustering="AgglomerativeClustering",
)
pipeline.instantiate({
    # Gap bridging: Mac-verified avg DER 13.16% -> 13.10%, no per-call regression.
    "segmentation": {"min_duration_off": 0.25},
    "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": 0.7045},
})
diarization = pipeline("call.wav")
```
`min_duration_off: 0.25` bridges short non-speech gaps, recovering a small amount of missed speech on call-domain audio. On the real calls (Mac-verified) it is a free win: average DER goes from 13.16% to 13.10% and no call regresses; `0.0` is also fine. Tuning `threshold` has no effect under `num_speakers=2`, because the clustering is then re-searched for exactly two clusters.
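To make the gap-bridging idea concrete, here is a minimal sketch of what a `min_duration_off`-style rule does to one speaker's segments. It is an illustration only, not pyannote's implementation; the function name `bridge_gaps` and the toy timestamps are hypothetical.

```python
def bridge_gaps(segments, min_duration_off=0.25):
    """Merge consecutive (start, end) speech segments of ONE speaker when the
    silence between them is shorter than min_duration_off seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < min_duration_off:
            # Gap too short to count as non-speech: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy example: a 0.15 s gap is bridged, a 0.6 s gap is kept.
turns = [(0.00, 1.20), (1.35, 2.00), (2.60, 3.10)]
print(bridge_gaps(turns))
```

With `min_duration_off=0.0` the segments come back unchanged, which matches the note above that `0.0` is also a valid setting.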
ONNX (native / Swift / on-device)
For framework-free or on-device deployment (e.g. Swift + onnxruntime), the pipeline's two neural stages are provided as ONNX under onnx/:
| File | Input → Output | Notes |
|---|---|---|
| `onnx/segmentation.onnx` | waveform (1, 1, 80000) → powerset (1, 293, 7) | fine-tuned segmentation (= `seg-ft-final.ckpt`), batch fixed |
| `onnx/segmentation-dynamic.onnx` | waveform (B, 1, 80000) → powerset (B, 293, 7) | batch-dynamic (window batching), PyTorch parity 1e-4; recommended for speed |
| `onnx/resnet34lm-feats.onnx` | feats (B, frames, 80) + weights (B, frames) → embedding (B, 256) | base wespeaker-resnet34-LM (VoxCeleb 16 kHz), cosine; bit-exact |
| `onnx/resnet34lm-feats-8k-ko.onnx` | feats (B, frames, 80) + weights (B, frames) → embedding (B, 256) | 8 kHz Korean telephone fine-tuned; recommended for calls. Same I/O, drop-in swap. |
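The segmentation output has 7 classes per frame, which a native implementation has to map back to per-speaker activity. The sketch below assumes the usual `segmentation-3.0` powerset convention: up to 3 local speakers per window, at most 2 active at once (1 + 3 + 3 = 7 classes), ordered by increasing set size. Both the class ordering and the helper name `powerset_to_multilabel` are assumptions to verify against the checkpoint.

```python
from itertools import combinations

NUM_SPEAKERS = 3   # assumed number of local speakers per window
MAX_OVERLAP = 2    # assumed maximum number of simultaneous speakers

# Class index -> active speakers: (), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)
POWERSET = [
    combo
    for size in range(MAX_OVERLAP + 1)
    for combo in combinations(range(NUM_SPEAKERS), size)
]


def powerset_to_multilabel(frame_scores):
    """Turn per-frame powerset scores (frames x 7) into 0/1 activity (frames x 3)."""
    activity = []
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax class
        active = POWERSET[best]
        activity.append([1 if spk in active else 0 for spk in range(NUM_SPEAKERS)])
    return activity


# Two toy frames: class 1 (speaker 0 alone), then class 5 (speakers 0 and 2).
frames = [
    [0.1, 0.7, 0.1, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.1, 0.1, 0.6, 0.1],
]
print(powerset_to_multilabel(frames))
```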
⭐ 8 kHz Korean telephone embedding (`resnet34lm-feats-8k-ko.onnx`): the `wespeaker-resnet34-LM` embedding fine-tuned for 8 kHz Korean telephone audio (AAM-Softmax plus augmentation, including resample/codec). Held-out 8 kHz speaker EER drops from 16.98% to 7.68% (9.3 pp lower); this is the robust gain measured by the model team.

Real-call DER: under identical (ffmpeg) preprocessing, the fine-tuned embedding matches the baseline embedding (both 13.2%). Under the app's on-device resampling (AVFoundation) it is more robust: app-measured DER is 14.4% for the baseline and 13.2% for the fine-tuned embedding. That roughly 1 pp difference is resampling robustness (the baseline loses it to the resampler), not a DER gain on a clean pipeline. Recommended for on-device deployment. It has the same I/O (weights = the 293-frame segmentation mask) and is a drop-in replacement for `resnet34lm-feats.onnx`.
The embedding ONNX takes kaldi fbank feats (not raw waveform), since the kaldi fbank front-end is not ONNX-exportable. Compute feats as: waveform × 32768 → kaldi.fbank(num_mel=80, frame_length=25ms, frame_shift=10ms, window=hamming, dither=0, use_energy=False) → CMN (subtract the per-utterance frame mean).
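The feature recipe can be sketched in Python as follows. This assumes `kaldi.fbank` refers to `torchaudio.compliance.kaldi.fbank`, whose keyword names (`num_mel_bins`, `window_type`) differ slightly from the shorthand in the recipe. The sample rate is left as a parameter because the rate the model expects is not restated here. `cmn` is a dependency-free illustration of the mean-subtraction step; both function names are hypothetical.

```python
def cmn(feats):
    """Mean normalization: subtract the per-utterance mean of each mel bin.

    feats is a list of frames, each a list of mel-bin values."""
    num_frames = len(feats)
    means = [sum(column) / num_frames for column in zip(*feats)]
    return [[value - mean for value, mean in zip(frame, means)] for frame in feats]


def embedding_feats(waveform, sample_rate):
    """waveform: float torch.Tensor of shape (1, num_samples), values in [-1, 1].

    Returns a (frames, 80) tensor ready to be batched for the embedding ONNX."""
    # Imported lazily so the pure-Python helper above works without torch.
    import torchaudio.compliance.kaldi as kaldi

    feats = kaldi.fbank(
        waveform * 32768,              # scale to the 16-bit integer range
        num_mel_bins=80,
        frame_length=25.0,             # milliseconds
        frame_shift=10.0,              # milliseconds
        window_type="hamming",
        dither=0.0,
        use_energy=False,
        sample_frequency=sample_rate,  # assumption: supplied by the caller
    )
    return feats - feats.mean(dim=0, keepdim=True)  # CMN over frames


print(cmn([[1.0, 4.0], [3.0, 8.0]]))
```

A native (Swift) port would need to reproduce the same framing, window, and mel filterbank to keep parity with the exported model.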
The clustering / binarization / stitching stages (binarize → embedding → AgglomerativeClustering @ threshold 0.7045 → reconstruct) are deterministic pipeline logic to be re-implemented natively. The ONNX files cover only the neural stages.
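For the clustering step, a native port needs some form of centroid-linkage agglomerative clustering over the embeddings. The sketch below is a simplified greedy version under stated assumptions: embeddings are L2-normalized, centroids are compared by Euclidean distance, and merging stops once the closest pair exceeds the threshold. It omits `min_cluster_size` handling and the forced two-cluster search used with `num_speakers=2`, so treat it as a starting point rather than a drop-in equivalent of pyannote's `AgglomerativeClustering`.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_embeddings(embeddings, threshold=0.7045):
    """Greedy centroid-linkage clustering; returns one label per embedding."""
    points = [l2_normalize(e) for e in embeddings]
    clusters = [[i] for i in range(len(points))]  # member indices per cluster
    while len(clusters) > 1:
        cents = [centroid([points[i] for i in members]) for members in clusters]
        # Closest pair of cluster centroids.
        dist, a, b = min(
            (euclidean(cents[a], cents[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are all too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 2-D "embeddings": two tight groups pointing in different directions.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(cluster_embeddings(toy))
```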
Training details
Data (synthetic)
- Source: KconfSpeech (Korean meeting speech), cut into per-speaker clips (366 speakers, 5,549 clips)
- Synthesis: multi-speaker 8 kHz telephone calls, with overlap / gap / turn control, 8 kHz band-limiting, and added noise
- Scale: 300 sessions / 24.1 h / 13,908 RTTM turns
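As an illustration of the overlap / gap / turn control described above, the sketch below lays per-speaker clips on a timeline and writes the matching RTTM reference lines. It is a hypothetical simplification, not the authors' simulator: the function name `build_rttm`, the session ID, and the offset convention are all assumptions, and audio mixing, band-limiting, and noise are left out.

```python
def build_rttm(session_id, clips):
    """clips: list of (speaker, duration_s, offset_s) in speaking order.

    offset_s is measured from the end of the latest turn so far: a positive
    value inserts a gap, a negative value makes the new turn overlap."""
    lines, prev_end = [], 0.0
    for speaker, duration, offset in clips:
        start = max(0.0, prev_end + offset)
        lines.append(
            f"SPEAKER {session_id} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>"
        )
        prev_end = max(prev_end, start + duration)
    return lines


session = [
    ("spk_A", 2.0, 0.0),    # starts at 0.0 s
    ("spk_B", 1.5, 0.3),    # 0.3 s gap after spk_A
    ("spk_A", 1.0, -0.5),   # overlaps the last 0.5 s of spk_B
]
print("\n".join(build_rttm("sess_0001", session)))
```

Sampling `offset_s` from a distribution with a chosen share of negative values is one simple way to control how much overlap the synthetic calls contain.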
Fine-tuning
- Fine-tuned `segmentation-3.0` for 12 epochs
- Synthetic dev DER: 0.049 → 0.029
Environment
- Training: `pyannote.audio` 3.1.1 + `torch` 2.5.1+cu124
- Compatibility: loading verified with `Model.from_pretrained` on `pyannote` 4.0.4 (seg-3.0 architecture compatible)
Limitations & notes
- Reference label noise: the Clova RTTM uses 1-second granularity, so absolute DER values are coarse. However, because the same RTTM and collar are used for both the base and the fine-tuned model, the 12.8 percentage-point improvement between them is reliable.
- Synthetic gap: the training data is meeting speech converted to 8 kHz, which differs from real telephone audio in timbre and background noise.
- Evaluation note: a second iteration with increased overlap and off-the-shelf embedding swaps (ECAPA, CAM++ multilingual/English) were all A/B-evaluated against this v1; none improved on it for real Korean 8 kHz calls. The released configuration (fine-tuned `segmentation-3.0` + `wespeaker-resnet34-LM` embedding) is the best validated configuration for this domain, at 13.2%. Further improvement falls to training a Korean 8 kHz embedding (a separate project).
License
Public distribution: free download with no gate or login (token-free distribution for the BatiFlow app). The BatiAI Community License v2.0 below still applies. Commercial use (external SaaS, resale, or revenue of KRW 1 billion or more) requires consultation via support@bati.ai.
- Base model: `pyannote/segmentation-3.0`, MIT (commercial use and derivative redistribution permitted; attribution retained).
- This model: distributed under BatiAI Community License v2.0 (Tier 2: public, ungated, commercial use by consultation).
- Revenue under KRW 1 billion, under 24 months, or non-commercial use: free. External SaaS at KRW 1 billion or more: consult support@bati.ai.
- License: https://github.com/batiai/batiai-models/blob/main/LICENSE-BATIAI-COMMUNITY.md
This model is a derivative of pyannote/segmentation-3.0 (MIT); the original author's attribution is retained.
Pairing
Combine with batisay-ko-turbo (STT) for per-speaker transcription: diarization assigns who, STT provides what.
Bundled in the HF Collection "한국어 음성 스위트" (Korean Speech Suite).
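One simple way to combine the two outputs is to give each transcribed segment the speaker whose diarization turns overlap it the most. The sketch below assumes both models' outputs have already been converted to plain tuples; the tuple layouts, speaker labels, and the helper name `assign_speakers` are hypothetical and not the actual output format of either model.

```python
def assign_speakers(turns, stt_segments):
    """turns: list of (start, end, speaker) from diarization.
    stt_segments: list of (start, end, text) from an STT model.

    Each text segment is labeled with the speaker whose turns overlap it most;
    segments with no overlapping turn get None."""
    labeled = []
    for seg_start, seg_end, text in stt_segments:
        overlap = {}
        for turn_start, turn_end, speaker in turns:
            shared = min(seg_end, turn_end) - max(seg_start, turn_start)
            if shared > 0:
                overlap[speaker] = overlap.get(speaker, 0.0) + shared
        speaker = max(overlap, key=overlap.get) if overlap else None
        labeled.append((speaker, text))
    return labeled


turns = [(0.0, 2.1, "SPEAKER_00"), (2.1, 5.0, "SPEAKER_01")]
stt = [(0.2, 1.9, "hello"), (2.0, 4.5, "hi, how can I help?")]
print(assign_speakers(turns, stt))
```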