batispeak-diarize

Korean telephone (8 kHz) speaker diarization: a fine-tuned segmentation model for who-spoke-when on Korean phone calls.

batispeak-diarize fine-tunes only the segmentation stage of the pyannote diarization pipeline. The embedding (pyannote/wespeaker-voxceleb-resnet34-LM) and clustering (AgglomerativeClustering) stages are used as-is from open source.

  • Base: pyannote/segmentation-3.0 (MIT, 1.47M params)
  • Target: Korean telephone audio, 8 kHz
  • Checkpoint: seg-ft-final.ckpt (17.7 MB)

⭐ Real-call DER results

Measured on Mac. Only the segmentation model was swapped to the fine-tuned checkpoint (rest of the pipeline identical), with num_speakers=2, collar 0.5, against the same Clova RTTM reference.

| Call | Length | Base (seg-3.0) DER | batispeak-diarize DER |
|---|---|---|---|
| call_03 | 3.5 min | 24.9% | 14.2% |
| call_05 | 7.8 min | 17.5% | 5.6% |
| call_06 | 2.3 min | 35.4% | 19.7% |
| Average | | 25.9% | 13.2% |

−12.8 pp absolute improvement in average DER, reaching the level cited for English meeting SOTA (15%).
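The averages and the headline figure can be re-derived from the per-call numbers. The snippet below is a small sketch with illustrative variable names; it uses an unweighted mean over calls and starts from the rounded per-call values, so it reproduces the published figures only to one decimal place.

```python
# Per-call DER (%) from the results table: (base seg-3.0, batispeak-diarize).
der = {
    "call_03": (24.9, 14.2),
    "call_05": (17.5, 5.6),
    "call_06": (35.4, 19.7),
}

# Unweighted mean over calls (call length is not taken into account).
base_avg = sum(base for base, _ in der.values()) / len(der)
ft_avg = sum(ft for _, ft in der.values()) / len(der)
improvement_pp = base_avg - ft_avg  # absolute difference in percentage points

print(f"base {base_avg:.1f}%  fine-tuned {ft_avg:.1f}%  improvement {improvement_pp:.1f} pp")
# prints: base 25.9%  fine-tuned 13.2%  improvement 12.8 pp
```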


How it works & our contribution

Why synthetic data

Labeled Korean speaker-diarization data is effectively nonexistent. We work around this with synthetic simulation, the standard approach for telephone diarization.

Pipeline

  1. Segmentation: fine-tuned (batispeak-diarize) ← our contribution
  2. Embedding: pyannote/wespeaker-voxceleb-resnet34-LM (open source, as-is)
  3. Clustering: AgglomerativeClustering (open source, as-is)

We fine-tuned segmentation only; the embedding and clustering stages are unchanged open-source components.


Usage

from pyannote.audio import Model
from pyannote.audio.pipelines import SpeakerDiarization

ft = Model.from_pretrained("batiai/batispeak-diarize")
pipeline = SpeakerDiarization(
    segmentation=ft,
    embedding="pyannote/wespeaker-voxceleb-resnet34-LM",
    clustering="AgglomerativeClustering",
)
pipeline.instantiate({
    "segmentation": {"min_duration_off": 0.25},   # gap-bridging free win (Mac-verified: avg DER 13.16 → 13.10, no per-call regression)
    "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": 0.7045},
})
diarization = pipeline("call.wav")

min_duration_off: 0.25 bridges short non-speech gaps, recovering a little Miss on call-domain audio. It is a Mac-verified free win on real calls (avg DER 13.16% → 13.10%, no call regressing; 0.0 also works). Tuning threshold has no effect under num_speakers=2, because the clustering then re-searches for exactly two clusters.
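To make the effect of min_duration_off concrete, here is a minimal sketch of gap bridging on one speaker's segments. It is an illustration of the idea, not pyannote's own implementation; the function name and the toy segments are hypothetical.

```python
def bridge_gaps(segments, min_duration_off=0.25):
    """Merge (start, end) speech segments separated by a gap shorter than min_duration_off seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < min_duration_off:
            # Gap is short enough: extend the previous segment instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# One speaker's speech segments in seconds: a 0.15 s gap, then a 0.60 s gap.
speech = [(0.00, 1.20), (1.35, 2.00), (2.60, 3.10)]
print(bridge_gaps(speech))  # prints: [(0.0, 2.0), (2.6, 3.1)]
```

The 0.15 s pause is absorbed into one turn, which is how a small amount of missed speech is recovered, while the 0.60 s pause is left as a real gap.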

ONNX (native / Swift / on-device)

For framework-free or on-device deployment (e.g. Swift + onnxruntime), the pipeline's two neural stages are provided as ONNX under onnx/:

| File | Input → Output | Notes |
|---|---|---|
| onnx/segmentation.onnx | waveform (1, 1, 80000) → powerset (1, 293, 7) | fine-tuned segmentation (= seg-ft-final.ckpt), batch fixed |
| onnx/segmentation-dynamic.onnx | waveform (B, 1, 80000) → powerset (B, 293, 7) | batch-dynamic (window batching), PyTorch parity 1e-4; recommended for speed |
| onnx/resnet34lm-feats.onnx | feats (B, frames, 80) + weights (B, frames) → embedding (B, 256) | base wespeaker-resnet34-LM (VoxCeleb 16 kHz), cosine; bit-exact |
| onnx/resnet34lm-feats-8k-ko.onnx | feats (B, frames, 80) + weights (B, frames) → embedding (B, 256) | 8 kHz Korean telephone fine-tuned; recommended for calls. Same I/O, drop-in swap. |
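The 7 powerset classes in the segmentation output correspond, in segmentation-3.0, to every combination of up to 2 active speakers out of 3 local speakers. The sketch below decodes per-frame scores into per-speaker activity. It assumes the class order used by pyannote's powerset helper (empty set, then single speakers, then pairs); check that order against the exported model before relying on it. The toy frames are made up for illustration.

```python
from itertools import combinations

NUM_SPEAKERS = 3   # local speakers per window in segmentation-3.0
MAX_SET_SIZE = 2   # at most two speakers active in the same frame

# Assumed class order: (), (0,), (1,), (2,), (0, 1), (0, 2), (1, 2)
POWERSET = [
    combo
    for size in range(MAX_SET_SIZE + 1)
    for combo in combinations(range(NUM_SPEAKERS), size)
]

def decode_frame(scores):
    """Turn one frame of 7 powerset scores into a 0/1 activity flag per speaker."""
    best = max(range(len(scores)), key=scores.__getitem__)  # argmax over classes
    return [1 if spk in POWERSET[best] else 0 for spk in range(NUM_SPEAKERS)]

# Toy frames; only the argmax matters, so logits or log-probabilities both work.
frames = [
    [0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],   # silence
    [0.1, 0.8, 0.0, 0.0, 0.1, 0.0, 0.0],   # speaker 0 only
    [0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.8],   # speakers 1 and 2 overlap
]
print([decode_frame(f) for f in frames])  # prints: [[0, 0, 0], [1, 0, 0], [0, 1, 1]]
```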

⭐ 8 kHz Korean telephone embedding (resnet34lm-feats-8k-ko.onnx): the wespeaker-resnet34-LM embedding fine-tuned for 8 kHz Korean telephone audio (AAM-Softmax, with augmentation including resample/codec). Speaker EER on held-out 8 kHz data drops from 16.98% to 7.68% (−9.3 pp); this is the robust gain, measured by the model team.

Real-call DER: under identical (ffmpeg) preprocessing, it matches the baseline embedding (both 13.2%). Under the app's on-device resampling (AVFoundation), the fine-tuned embedding is more robust: app-measured baseline 14.4% → fine-tuned 13.2%. That ~1 pp reflects resample robustness (the baseline loses it to the resampler), not a DER gain on a clean pipeline. Recommended for on-device deployment. It has the same I/O (weights = 293-frame segmentation mask) and is a drop-in replacement for resnet34lm-feats.onnx.

The embedding ONNX takes kaldi fbank feats (not raw waveform), since the kaldi fbank front-end is not ONNX-exportable. Compute feats as: waveform × 32768 → kaldi.fbank(num_mel=80, frame_length=25ms, frame_shift=10ms, window=hamming, dither=0, use_energy=False) → CMN (subtract per-utterance frame mean).

The clustering / binarization / stitching stages (binarize → embedding → AgglomerativeClustering @ threshold 0.7045 → reconstruct) are deterministic pipeline logic to be re-implemented natively. The ONNX files cover only the neural stages.
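As a starting point for a native port of the clustering step, the sketch below shows greedy centroid-linkage agglomerative clustering on unit-length embeddings with a distance threshold. It is a simplified illustration, not the pipeline's exact logic: it ignores min_cluster_size reassignment and the forced cluster count used with num_speakers, and it may differ from a dendrogram-based cut in edge cases. All function names are hypothetical.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def agglomerate(embeddings, threshold=0.7045):
    """Greedy centroid-linkage clustering. Returns one cluster label per embedding."""
    points = [l2_normalize(e) for e in embeddings]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        cents = [centroid([points[i] for i in members]) for members in clusters]
        # Closest pair of clusters by Euclidean distance between centroids.
        dist, a, b = min(
            (math.dist(cents[a], cents[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # every remaining pair is too far apart to merge
        clusters[a].extend(clusters.pop(b))
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy 2-D "embeddings": two near the x axis, two near the y axis.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerate(embeddings))  # prints: [0, 0, 1, 1]
```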


Training details

Data (synthetic)

  • Source: KconfSpeech (Korean meeting speech) → per-speaker clips (366 speakers, 5,549 clips)
  • Synthesis: multi-speaker 8 kHz telephone calls, built with overlap / gap / turn control, 8 kHz band-limiting, and added noise
  • Scale: 300 sessions / 24.1 h / 13,908 RTTM turns
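The turn-layout part of such a synthesis can be sketched as follows. This is a hypothetical illustration of overlap / gap / turn control that emits RTTM reference lines; the probabilities, ranges, and names are placeholders and not the values used to build the training set, and the audio-level steps (mixing, band-limiting, noise) are not covered.

```python
import random

def simulate_session(clips, session_id="sim_0001", overlap_prob=0.2,
                     max_overlap=0.5, max_gap=1.0, seed=0):
    """Lay out clips as consecutive turns and return RTTM lines.

    clips: list of (speaker_id, duration_sec) in speaking order.
    All defaults are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    lines, cursor = [], 0.0  # cursor = end time of the latest speech so far
    for speaker, duration in clips:
        if lines and rng.random() < overlap_prob:
            # Start before the previous turn has ended (overlapped speech).
            onset = max(0.0, cursor - rng.uniform(0.0, max_overlap))
        else:
            # Leave a silent gap before the next turn.
            onset = cursor + rng.uniform(0.0, max_gap)
        lines.append(
            f"SPEAKER {session_id} 1 {onset:.3f} {duration:.3f} <NA> <NA> {speaker} <NA> <NA>"
        )
        cursor = max(cursor, onset + duration)
    return lines

turns = [("spk_A", 2.4), ("spk_B", 1.1), ("spk_A", 3.0)]
for line in simulate_session(turns):
    print(line)
```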

Fine-tuning

  • Fine-tuned segmentation-3.0 for 12 epochs
  • Synthetic dev DER: 0.049 → 0.029

Environment

  • Training: pyannote.audio 3.1.1 + torch 2.5.1+cu124
  • Compatibility: load verified with Model.from_pretrained on pyannote.audio 4.0.4 (seg-3.0 architecture compatible)

Limitations & notes

  • Reference label noise: Clova RTTM uses 1-second granularity, so absolute DER values are coarse. However, because the same RTTM and collar are used for both the base and fine-tuned models, the 12.8 pp improvement is reliable as a relative comparison.
  • Synthetic gap: training data is meeting speech converted to 8 kHz, which differs from real telephone timbre and background noise.
  • Evaluation note: a second iteration with increased overlap and off-the-shelf embedding swaps (ECAPA, CAM++ multilingual/English) were all A/B-evaluated against this v1; none improved over it on real Korean 8 kHz calls. The released configuration (fine-tuned segmentation-3.0 + wespeaker-resnet34-LM embedding) is the best validated configuration for this domain, at 13.2% DER. Further improvement is a matter of training a Korean 8 kHz embedding (a separate project).

License

📥 Public release: free download with no gate or login (token-free distribution for the BatiFlow app). The BatiAI Community License v2.0 below still applies in full. 💼 Commercial use (external SaaS, resale, revenue of 1 billion+, presumably KRW) requires prior agreement via support@bati.ai.

  • Base model: pyannote/segmentation-3.0, MIT (commercial use and derivative redistribution permitted; attribution retained).
  • This model: distributed under BatiAI Community License v2.0 (Tier 2: public, ungated, commercial use by agreement).

This model is a derivative of pyannote/segmentation-3.0 (MIT); the original author's attribution is retained.


Pairing

Combine with batisay-ko-turbo (STT) for per-speaker transcription: diarization assigns who, STT provides what.
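One simple way to combine the two outputs is to give each transcript segment the speaker whose diarization turns overlap it the most. The sketch below assumes plain tuples for both inputs; the STT segment format is hypothetical and not taken from batisay-ko-turbo. With pyannote.audio 3.x, the diarization tuples can be built from `diarization.itertracks(yield_label=True)`.

```python
def assign_speakers(turns, stt_segments):
    """Label each STT segment with the speaker that overlaps it most.

    turns: list of (start, end, speaker) from diarization.
    stt_segments: list of (start, end, text) from STT (hypothetical format).
    """
    labeled = []
    for s_start, s_end, text in stt_segments:
        overlap = {}
        for t_start, t_end, speaker in turns:
            shared = min(s_end, t_end) - max(s_start, t_start)
            if shared > 0:
                overlap[speaker] = overlap.get(speaker, 0.0) + shared
        # None marks a segment that no diarization turn covers.
        speaker = max(overlap, key=overlap.get) if overlap else None
        labeled.append((speaker, text))
    return labeled

turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
stt = [(0.2, 1.8, "hello"), (1.9, 4.5, "hi, how can I help?")]
print(assign_speakers(turns, stt))
# prints: [('SPEAKER_00', 'hello'), ('SPEAKER_01', 'hi, how can I help?')]
```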

Bundled in the HF Collection "한국어 음성 스위트" (Korean Speech Suite).
