Upload folder using huggingface_hub
This view is limited to 50 files because the commit contains too many changes. See the raw diff for the full listing.
- .gitattributes +30 -0
- .gitignore +11 -0
- 03-finetune-pipecat-pt/01_download_pipecat.py +176 -0
- 03-finetune-pipecat-pt/02_generate_labels.py +474 -0
- 03-finetune-pipecat-pt/03_generate_audio.py +697 -0
- 03-finetune-pipecat-pt/04_finetune.py +1202 -0
- 03-finetune-pipecat-pt/05_evaluate.py +511 -0
- 03-finetune-pipecat-pt/06_inference.py +560 -0
- 03-finetune-pipecat-pt/README.md +489 -0
- 03-finetune-pipecat-pt/modal_run.py +267 -0
- 03-finetune-pipecat-pt/references/blogs/assemblyai_turn_detection.md +137 -0
- 03-finetune-pipecat-pt/references/blogs/deepgram_eot_evaluation.md +58 -0
- 03-finetune-pipecat-pt/references/blogs/krisp_turn_taking_v1.md +78 -0
- 03-finetune-pipecat-pt/references/blogs/krisp_turn_taking_v2.md +54 -0
- 03-finetune-pipecat-pt/references/blogs/livekit_eot_v0_4.md +104 -0
- 03-finetune-pipecat-pt/references/blogs/livekit_transformer_turn_detection.md +87 -0
- 03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3.md +75 -0
- 03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3_1.md +69 -0
- 03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3_2.md +41 -0
- 03-finetune-pipecat-pt/references/blogs/speechmatics_semantic_turn.md +103 -0
- 03-finetune-pipecat-pt/references/guides/focal_loss_calibration_emnlp_2022.md +209 -0
- 03-finetune-pipecat-pt/references/guides/livekit_turn_detector.md +170 -0
- 03-finetune-pipecat-pt/references/guides/pipecat_data_generation_guide.md +33 -0
- 03-finetune-pipecat-pt/references/guides/pipecat_dataset_v3_2.md +93 -0
- 03-finetune-pipecat-pt/references/guides/pipecat_smart_turn_readme.md +115 -0
- 03-finetune-pipecat-pt/references/guides/pipecat_train_py.md +179 -0
- 03-finetune-pipecat-pt/references/guides/vogent_turn_80m.md +171 -0
- 03-finetune-pipecat-pt/references/hesitation_turn_taking_l2_review.md +453 -0
- 03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.md +158 -0
- 03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.md +103 -0
- 03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.md +74 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.md +68 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.md +77 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.md +76 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/05-kosmala-2023-filled-pauses-pragmatic-markers-gaze-gesture.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.md +76 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/07-ekstedt-2023-turn-taking-cues-speech-synthesis.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/08-inoue-2024-realtime-turn-taking-vap.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/09-inoue-2024-multilingual-turn-taking-vap.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/10-hutin-2024-filled-pauses-conversational-convergence.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.md +67 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.pdf +3 -0
- 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/12-lingref-code-switching-bilinguals.pdf +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,33 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/05-kosmala-2023-filled-pauses-pragmatic-markers-gaze-gesture.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/07-ekstedt-2023-turn-taking-cues-speech-synthesis.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/08-inoue-2024-realtime-turn-taking-vap.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/09-inoue-2024-multilingual-turn-taking-vap.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/10-hutin-2024-filled-pauses-conversational-convergence.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/12-lingref-code-switching-bilinguals.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/13-christodoulides-avanzi-2014-DisMo-disfluency-french.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/14-jouvet-2019-speech-processing-prosody.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/15-filler-particles-cross-linguistic-heritage-2024.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/16-hal-peters-2017-L2-fluency-french.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/17-christodoulides-2015-automatic-disfluency-detection-french.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/19-hal-multimodal-self-other-repair-L2.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/conversar_mixed_reality_l2.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/hesitation_tagging_l2_whisper_lora.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/iwsds2025_survey_turn_taking.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/multilingual_vap_2024.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/speak_improve_corpus_2025.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/soda_dialog_distillation_2023.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/speculative_etd_2025.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/turn_taking_review_skantze_2021.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/vap_turn_taking_ekstedt_2024.pdf filter=lfs diff=lfs merge=lfs -text
+previous-experiments/01-benchmarks/report/figures/radar_chart.png filter=lfs diff=lfs merge=lfs -text
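The added `.gitattributes` lines route every reference PDF (and the radar-chart PNG) through Git LFS. As a rough illustration of how those patterns select paths, here is a small sketch; `tracked_by_lfs` is a hypothetical helper, and real gitattributes matching has more rules (negation, `**` semantics, attribute stacking) than this approximation, which only distinguishes basename patterns from full-path patterns:

```python
from fnmatch import fnmatch

# A small subset of the patterns added in the .gitattributes diff above.
LFS_PATTERNS = [
    "*.zip",
    "*.zst",
    "*tfevents*",
    "03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.pdf",
]

def tracked_by_lfs(path: str) -> bool:
    """Approximate gitattributes matching: a pattern without a slash is
    matched against the basename, a pattern with a slash against the
    full path."""
    name = path.rsplit("/", 1)[-1]
    return any(
        fnmatch(path if "/" in pat else name, pat)
        for pat in LFS_PATTERNS
    )
```

For example, `runs/events.out.tfevents.1700000000` matches `*tfevents*` by basename, while the focal-loss PDF matches only via its full path.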
.gitignore
ADDED
@@ -0,0 +1,11 @@
+data/
+checkpoints/
+__pycache__/
+*.pyc
+.deploy_state.json
+benchmark.log
+*.egg-info/
+hf_cache/
+results/
+*.pt
+*.onnx
03-finetune-pipecat-pt/01_download_pipecat.py
ADDED
@@ -0,0 +1,176 @@
"""Download the Pipecat Smart Turn v3 dataset + model architecture.

The Pipecat model is only published as ONNX (no PyTorch weights),
so we initialize from openai/whisper-tiny and train on their dataset,
which already includes Portuguese samples.

This script:
1. Downloads Pipecat's training dataset (270K samples, 23 languages)
2. Filters the Portuguese samples
3. Downloads the ONNX model for reference/benchmarking
4. Saves the Portuguese subset locally
"""

from __future__ import annotations

import json
import logging
from pathlib import Path

log = logging.getLogger(__name__)

DATA_DIR = Path(__file__).parent / "data"
CACHE_DIR = Path("/workspace/hf_cache") if Path("/workspace").exists() else DATA_DIR / "hf_cache"


def download_pipecat_dataset(max_pt_samples: int = 5000) -> Path:
    """Download Pipecat v3.2 training data and extract Portuguese samples."""
    from datasets import load_dataset

    DATA_DIR.mkdir(parents=True, exist_ok=True)

    # Stream the dataset to avoid the full ~37GB download
    log.info("Loading Pipecat Smart Turn v3.2 training data (streaming)...")
    ds = load_dataset(
        "pipecat-ai/smart-turn-data-v3.2-train",
        split="train",
        streaming=True,
        cache_dir=str(CACHE_DIR),
    )

    # Filter Portuguese samples
    pt_samples = []
    other_count = 0
    for row in ds:
        lang = row.get("language", "")
        if lang == "por":
            pt_samples.append({
                "id": row["id"],
                "language": lang,
                "endpoint_bool": row["endpoint_bool"],
                "midfiller": row.get("midfiller", False),
                "endfiller": row.get("endfiller", False),
                "synthetic": row.get("synthetic", True),
                "dataset": row.get("dataset", ""),
                "spoken_text": row.get("spoken_text", ""),
                "audio": row["audio"],
            })
            if len(pt_samples) % 100 == 0:
                log.info("  Found %d Portuguese samples (scanned %d others)",
                         len(pt_samples), other_count)
            if len(pt_samples) >= max_pt_samples:
                break
        else:
            other_count += 1

    log.info("Total: %d Portuguese samples found (scanned %d other languages)",
             len(pt_samples), other_count)

    # Save metadata (without audio)
    meta_path = DATA_DIR / "pipecat_pt_metadata.json"
    meta = [{k: v for k, v in s.items() if k != "audio"} for s in pt_samples]
    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2, ensure_ascii=False)
    log.info("Metadata saved to %s", meta_path)

    # Save audio files
    audio_dir = DATA_DIR / "pipecat_pt_audio"
    audio_dir.mkdir(exist_ok=True)

    import soundfile as sf
    import numpy as np

    complete = 0
    incomplete = 0
    for s in pt_samples:
        audio_data = s["audio"]
        audio = np.array(audio_data["array"], dtype=np.float32)
        sr = audio_data["sampling_rate"]

        label = "complete" if s["endpoint_bool"] else "incomplete"
        if s["endpoint_bool"]:
            complete += 1
        else:
            incomplete += 1

        out_path = audio_dir / f"{s['id']}_{label}.wav"
        sf.write(str(out_path), audio, sr)

    log.info("Audio saved: %d complete, %d incomplete → %s",
             complete, incomplete, audio_dir)

    return audio_dir


def download_pipecat_test_data(max_pt_samples: int = 2000) -> Path:
    """Download Pipecat test data for evaluation."""
    from datasets import load_dataset

    log.info("Loading Pipecat Smart Turn v3.2 test data (streaming)...")
    ds = load_dataset(
        "pipecat-ai/smart-turn-data-v3.2-test",
        split="train",
        streaming=True,
        cache_dir=str(CACHE_DIR),
    )

    pt_samples = []
    for row in ds:
        if row.get("language", "") == "por":
            pt_samples.append(row)
            if len(pt_samples) >= max_pt_samples:
                break

    log.info("Found %d Portuguese test samples", len(pt_samples))

    # Save
    test_dir = DATA_DIR / "pipecat_pt_test"
    test_dir.mkdir(exist_ok=True)

    import soundfile as sf
    import numpy as np

    for s in pt_samples:
        audio = np.array(s["audio"]["array"], dtype=np.float32)
        sr = s["audio"]["sampling_rate"]
        label = "complete" if s["endpoint_bool"] else "incomplete"
        sf.write(str(test_dir / f"{s['id']}_{label}.wav"), audio, sr)

    log.info("Test audio saved → %s", test_dir)
    return test_dir


def download_onnx_model() -> Path:
    """Download the Pipecat ONNX model for benchmarking."""
    from huggingface_hub import hf_hub_download

    model_dir = DATA_DIR / "pipecat_model"
    model_dir.mkdir(exist_ok=True)

    for fname in ["smart-turn-v3.2-cpu.onnx", "smart-turn-v3.2-gpu.onnx"]:
        path = hf_hub_download(
            "pipecat-ai/smart-turn-v3",
            fname,
            local_dir=str(model_dir),
        )
        log.info("Downloaded %s → %s", fname, path)

    return model_dir


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )

    log.info("=== Step 1: Download Pipecat Portuguese training data ===")
    download_pipecat_dataset(max_pt_samples=5000)

    log.info("\n=== Step 2: Download Pipecat Portuguese test data ===")
    download_pipecat_test_data(max_pt_samples=2000)

    log.info("\n=== Step 3: Download ONNX model for benchmarking ===")
    download_onnx_model()

    log.info("\nDone!")
03-finetune-pipecat-pt/02_generate_labels.py
ADDED
|
@@ -0,0 +1,474 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Generate high-quality turn-taking labels using Claude API.
|
| 2 |
+
|
| 3 |
+
Based on the Pipecat v3.1 pipeline:
|
| 4 |
+
1. Load Portuguese transcripts from CORAA MuPe + NURC-SP
|
| 5 |
+
2. Claude filters bad/ambiguous sentences (Pipecat: Gemini removed 50-80%)
|
| 6 |
+
3. Claude classifies: COMPLETE vs INCOMPLETE (semantic, not punctuation-based)
|
| 7 |
+
4. Claude inserts Brazilian Portuguese fillers
|
| 8 |
+
5. Claude generates French-accented Portuguese variants
|
| 9 |
+
|
| 10 |
+
Uses Claude Haiku for cost efficiency (~$0.001 per 20 sentences).
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
from __future__ import annotations
|
| 14 |
+
|
| 15 |
+
import json
|
| 16 |
+
import logging
|
| 17 |
+
import os
|
| 18 |
+
import re
|
| 19 |
+
import time
|
| 20 |
+
from dataclasses import dataclass, field
|
| 21 |
+
from pathlib import Path
|
| 22 |
+
|
| 23 |
+
log = logging.getLogger(__name__)
|
| 24 |
+
|
| 25 |
+
DATA_DIR = Path(__file__).parent / "data"
|
| 26 |
+
CACHE_DIR = Path("/workspace/hf_cache") if Path("/workspace").exists() else DATA_DIR / "hf_cache"
|
| 27 |
+
|
| 28 |
+
# Brazilian Portuguese fillers (from research: Claude + GPT generate these for Pipecat)
|
| 29 |
+
PT_BR_FILLERS = [
|
| 30 |
+
"hum", "eh", "ah", "tipo", "ne", "entao", "assim", "quer dizer",
|
| 31 |
+
"como e que eu falo", "deixa eu pensar", "olha", "bom", "pois e",
|
| 32 |
+
"sabe", "veja bem", "na verdade", "digamos", "e...",
|
| 33 |
+
]
|
| 34 |
+
|
| 35 |
+
# French speaker fillers when speaking Portuguese
|
| 36 |
+
FR_PT_FILLERS = [
|
| 37 |
+
"euh", "alors", "comment dire", "como se diz", "enfin",
|
| 38 |
+
"c'est-a-dire", "voila", "bon", "donc",
|
| 39 |
+
]
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
@dataclass
|
| 43 |
+
class LabeledSentence:
|
| 44 |
+
text: str
|
| 45 |
+
label: str # "complete" or "incomplete"
|
| 46 |
+
confidence: float # 0-1
|
| 47 |
+
source: str # "coraa", "mupe", "claude_generated"
|
| 48 |
+
has_filler: bool = False
|
| 49 |
+
filler_type: str = "" # "pt_br" or "fr_pt"
|
| 50 |
+
original_text: str = ""
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
def load_coraa_transcripts(max_samples: int = 10000) -> list[dict]:
|
| 54 |
+
"""Load transcripts from CORAA MuPe (365h interviews, diarization kappa 0.947)."""
|
| 55 |
+
from datasets import load_dataset
|
| 56 |
+
|
| 57 |
+
log.info("Loading CORAA MuPe transcripts (streaming)...")
|
| 58 |
+
ds = load_dataset(
|
| 59 |
+
"nilc-nlp/CORAA-MUPE-ASR",
|
| 60 |
+
split="train",
|
| 61 |
+
streaming=True,
|
| 62 |
+
cache_dir=str(CACHE_DIR),
|
| 63 |
+
)
|
| 64 |
+
|
| 65 |
+
transcripts = []
|
| 66 |
+
for i, row in enumerate(ds):
|
| 67 |
+
text = str(row.get("text", row.get("normalized_text", "")))
|
| 68 |
+
if not text or len(text.split()) < 3:
|
| 69 |
+
continue
|
| 70 |
+
|
| 71 |
+
transcripts.append({
|
| 72 |
+
"text": text,
|
| 73 |
+
"speaker_type": row.get("speaker_type", ""),
|
| 74 |
+
"speaker_code": str(row.get("speaker_code", f"mupe_{i}")),
|
| 75 |
+
"source": "mupe",
|
| 76 |
+
})
|
| 77 |
+
|
| 78 |
+
if len(transcripts) >= max_samples:
|
| 79 |
+
break
|
| 80 |
+
|
| 81 |
+
if i % 5000 == 0 and i > 0:
|
| 82 |
+
log.info(" Scanned %d rows, collected %d transcripts", i, len(transcripts))
|
| 83 |
+
|
| 84 |
+
log.info("CORAA MuPe: %d transcripts loaded", len(transcripts))
|
| 85 |
+
return transcripts
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
def load_nurc_transcripts(max_samples: int = 10000) -> list[dict]:
|
| 89 |
+
"""Load transcripts from CORAA NURC-SP (239h dialogues, speaker IDs)."""
|
| 90 |
+
from datasets import load_dataset
|
| 91 |
+
|
| 92 |
+
log.info("Loading CORAA NURC-SP transcripts (streaming)...")
|
| 93 |
+
try:
|
| 94 |
+
ds = load_dataset(
|
| 95 |
+
"nilc-nlp/CORAA-NURC-SP-Audio-Corpus",
|
| 96 |
+
split="train",
|
| 97 |
+
streaming=True,
|
| 98 |
+
cache_dir=str(CACHE_DIR),
|
| 99 |
+
)
|
| 100 |
+
except Exception as e:
|
| 101 |
+
log.warning("Failed to load NURC-SP: %s", e)
|
| 102 |
+
return []
|
| 103 |
+
|
| 104 |
+
transcripts = []
|
| 105 |
+
for i, row in enumerate(ds):
|
| 106 |
+
text = str(row.get("text", row.get("sentence", "")))
|
| 107 |
+
if not text or len(text.split()) < 3:
|
| 108 |
+
continue
|
| 109 |
+
|
| 110 |
+
transcripts.append({
|
| 111 |
+
"text": text,
|
| 112 |
+
"speaker_type": row.get("speech_genre", ""),
|
| 113 |
+
"speaker_code": str(row.get("speaker_id", f"nurc_{i}")),
|
| 114 |
+
"source": "nurc_sp",
|
| 115 |
+
})
|
| 116 |
+
|
| 117 |
+
if len(transcripts) >= max_samples:
|
| 118 |
+
break
|
| 119 |
+
|
| 120 |
+
log.info("NURC-SP: %d transcripts loaded", len(transcripts))
|
| 121 |
+
return transcripts
|
| 122 |
+
|
| 123 |
+
|
| 124 |
+
def classify_with_claude(
|
| 125 |
+
sentences: list[str],
|
| 126 |
+
batch_size: int = 20,
|
| 127 |
+
model: str = "claude-haiku-4-5-20251001",
|
| 128 |
+
) -> list[dict]:
|
| 129 |
+
"""Use Claude to classify sentences as complete/incomplete.
|
| 130 |
+
|
| 131 |
+
Based on Pipecat's approach: Gemini 2.5 Flash filtered 50-80% of sentences.
|
| 132 |
+
We use Claude Haiku for cost efficiency.
|
| 133 |
+
|
| 134 |
+
Returns list of {text, label, confidence, keep}.
|
| 135 |
+
"""
|
| 136 |
+
import anthropic
|
| 137 |
+
|
| 138 |
+
api_key = os.environ.get("ANTHROPIC_API_KEY")
|
| 139 |
+
if not api_key:
|
| 140 |
+
log.warning("ANTHROPIC_API_KEY not set — using rule-based fallback")
|
| 141 |
+
return _rule_based_classify(sentences)
|
| 142 |
+
|
| 143 |
+
client = anthropic.Anthropic(api_key=api_key)
|
| 144 |
+
results = []
|
| 145 |
+
|
| 146 |
+
for batch_start in range(0, len(sentences), batch_size):
|
| 147 |
+
batch = sentences[batch_start:batch_start + batch_size]
|
| 148 |
+
numbered = "\n".join(f"{i+1}. {s}" for i, s in enumerate(batch))
|
| 149 |
+
|
| 150 |
+
prompt = f"""Voce e um anotador de turn-taking para portugues brasileiro.
|
| 151 |
+
Para cada frase abaixo, classifique:
|
| 152 |
+
- COMPLETO: o falante terminou o pensamento (pode comecar a traduzir)
|
| 153 |
+
- INCOMPLETO: o falante vai continuar falando (nao traduzir ainda)
|
| 154 |
+
- RUIM: frase com erro, ambigua, ou inutilizavel (descartar)
|
| 155 |
+
|
| 156 |
+
Responda APENAS em JSON, sem explicacao:
|
| 157 |
+
[{{"n": 1, "label": "COMPLETO", "confidence": 0.95}}, ...]
|
| 158 |
+
|
| 159 |
+
Frases:
|
| 160 |
+
{numbered}"""
|
| 161 |
+
|
| 162 |
+
try:
|
| 163 |
+
response = client.messages.create(
|
| 164 |
+
model=model,
|
| 165 |
+
max_tokens=2000,
|
| 166 |
+
messages=[{"role": "user", "content": prompt}],
|
| 167 |
+
)
|
| 168 |
+
text = response.content[0].text
|
| 169 |
+
|
| 170 |
+
# Parse JSON from response
|
| 171 |
+
json_match = re.search(r'\[.*\]', text, re.DOTALL)
|
| 172 |
+
if json_match:
|
| 173 |
+
batch_results = json.loads(json_match.group())
|
| 174 |
+
for item in batch_results:
|
| 175 |
+
idx = item["n"] - 1
|
| 176 |
+
if 0 <= idx < len(batch):
|
| 177 |
+
results.append({
|
| 178 |
+
"text": batch[idx],
|
| 179 |
+
"label": item["label"].lower(),
|
| 180 |
+
"confidence": item.get("confidence", 0.5),
|
| 181 |
+
"keep": item["label"].upper() != "RUIM",
|
| 182 |
+
})
|
| 183 |
+
else:
|
| 184 |
+
log.warning("No JSON in Claude response for batch %d", batch_start)
|
| 185 |
+
results.extend(_rule_based_classify(batch))
|
| 186 |
+
|
| 187 |
+
except Exception as e:
|
| 188 |
+
log.warning("Claude API error at batch %d: %s — falling back to rules", batch_start, e)
|
| 189 |
+
results.extend(_rule_based_classify(batch))
|
| 190 |
+
|
| 191 |
+
# Rate limiting
|
| 192 |
+
if batch_start % 100 == 0 and batch_start > 0:
|
| 193 |
+
log.info(" Classified %d/%d sentences", batch_start, len(sentences))
|
| 194 |
+
time.sleep(0.5)
|
| 195 |
+
|
| 196 |
+
return results
|
| 197 |
+
|
| 198 |
+
|
| 199 |
+
def _rule_based_classify(sentences: list[str]) -> list[dict]:
|
| 200 |
+
"""Fallback rule-based classification (less accurate than Claude)."""
|
| 201 |
+
results = []
|
| 202 |
+
for s in sentences:
|
| 203 |
+
text = s.strip()
|
| 204 |
+
if not text or len(text) < 5:
|
| 205 |
+
results.append({"text": text, "label": "ruim", "confidence": 0.5, "keep": False})
|
| 206 |
+
continue
|
| 207 |
+
|
| 208 |
+
if re.search(r'[.!?]+\s*$', text):
|
| 209 |
+
results.append({"text": text, "label": "completo", "confidence": 0.7, "keep": True})
|
| 210 |
+
elif re.search(r'[,;:\-]\s*$', text):
|
| 211 |
+
results.append({"text": text, "label": "incompleto", "confidence": 0.6, "keep": True})
|
| 212 |
+
elif re.search(r'\b(e|que|mas|porque|quando|se|como|pra|para)\s*$', text, re.I):
|
| 213 |
+
results.append({"text": text, "label": "incompleto", "confidence": 0.8, "keep": True})
|
| 214 |
+
else:
|
| 215 |
+
results.append({"text": text, "label": "completo", "confidence": 0.5, "keep": True})
|
| 216 |
+
|
| 217 |
+
return results
|
| 218 |
+
|
| 219 |
+
|
| 220 |
+
def insert_fillers_with_claude(
|
| 221 |
+
sentences: list[str],
|
| 222 |
+
filler_type: str = "pt_br",
|
| 223 |
+
model: str = "claude-haiku-4-5-20251001",
|
| 224 |
+
) -> list[dict]:
|
| 225 |
+
"""Use Claude to insert fillers at natural points.
|
| 226 |
+
|
| 227 |
+
Based on Pipecat: Gemini Flash inserted fillers at natural break points.
|
| 228 |
+
|
| 229 |
+
filler_type: "pt_br" for native Brazilian, "fr_pt" for French-accented
|
| 230 |
+
"""
|
| 231 |
+
import anthropic
|
| 232 |
+
|
| 233 |
+
api_key = os.environ.get("ANTHROPIC_API_KEY")
|
    if not api_key:
        log.warning("ANTHROPIC_API_KEY not set — using simple filler insertion")
        return _simple_filler_insert(sentences, filler_type)

    client = anthropic.Anthropic(api_key=api_key)
    fillers = PT_BR_FILLERS if filler_type == "pt_br" else FR_PT_FILLERS

    filler_list = ", ".join(f'"{f}"' for f in fillers)

    if filler_type == "fr_pt":
        context = """O falante e um frances de nivel B1 falando portugues.
Ele hesita ao conjugar verbos, busca palavras, e as vezes usa fillers em frances.
Insira hesitacoes naturais como: """ + filler_list
    else:
        context = """O falante e brasileiro nativo.
Insira fillers naturais do portugues brasileiro como: """ + filler_list

    results = []
    batch_size = 15

    for batch_start in range(0, len(sentences), batch_size):
        batch = sentences[batch_start:batch_start + batch_size]
        numbered = "\n".join(f"{i+1}. {s}" for i, s in enumerate(batch))

        prompt = f"""{context}

Para cada frase, crie uma versao com 1-2 fillers inseridos em pontos naturais.
A frase deve parecer INCOMPLETA (como se o falante fosse continuar).

Responda APENAS em JSON:
[{{"n": 1, "original": "frase original", "with_filler": "frase com filler"}}]

Frases:
{numbered}"""

        try:
            response = client.messages.create(
                model=model,
                max_tokens=3000,
                messages=[{"role": "user", "content": prompt}],
            )
            text = response.content[0].text
            json_match = re.search(r'\[.*\]', text, re.DOTALL)
            if json_match:
                batch_results = json.loads(json_match.group())
                for item in batch_results:
                    results.append({
                        "original": item.get("original", ""),
                        "with_filler": item.get("with_filler", ""),
                        "filler_type": filler_type,
                    })

        except Exception as e:
            log.warning("Claude filler API error: %s", e)
            results.extend(_simple_filler_insert(batch, filler_type))

        time.sleep(0.3)

    return results


def _simple_filler_insert(sentences: list[str], filler_type: str) -> list[dict]:
    """Simple rule-based filler insertion fallback."""
    import random

    fillers = PT_BR_FILLERS if filler_type == "pt_br" else FR_PT_FILLERS
    results = []

    for s in sentences:
        words = s.split()
        if len(words) < 4:
            results.append({"original": s, "with_filler": s, "filler_type": filler_type})
            continue

        # Insert filler at ~40% of sentence
        pos = max(1, len(words) * 2 // 5)
        filler = random.choice(fillers)
        words.insert(pos, f"{filler}...")
        results.append({
            "original": s,
            "with_filler": " ".join(words),
            "filler_type": filler_type,
        })

    return results
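The fallback's insertion rule can be checked in isolation. A minimal sketch of the same position logic, with a hypothetical `demo_insert` helper and a fixed filler instead of a random choice:

```python
# Standalone sketch of the rule-based fallback: the filler lands at
# ~40% of the word count, never before the second word, and sentences
# shorter than 4 words are left unchanged.
# `demo_insert` is illustrative only, not part of the pipeline.
def demo_insert(sentence: str, filler: str = "eh") -> str:
    words = sentence.split()
    if len(words) < 4:
        return sentence
    pos = max(1, len(words) * 2 // 5)
    words.insert(pos, f"{filler}...")
    return " ".join(words)

print(demo_insert("eu fui ao mercado ontem de manha"))
# → eu fui eh... ao mercado ontem de manha
```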


def generate_french_portuguese_sentences(
    n_sentences: int = 500,
    model: str = "claude-haiku-4-5-20251001",
) -> list[dict]:
    """Generate sentences typical of French speakers learning Portuguese.

    Based on Pipecat method: use LLM to generate diverse training text.
    """
    import anthropic

    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        log.warning("ANTHROPIC_API_KEY not set — cannot generate French-PT sentences")
        return []

    client = anthropic.Anthropic(api_key=api_key)
    results = []

    contexts = [
        "reuniao de trabalho",
        "conversa informal com amigos",
        "apresentacao de projeto",
        "negociacao comercial",
        "aula de portugues",
        "pedindo informacoes na rua",
        "restaurante pedindo comida",
        "entrevista de emprego",
        "ligacao telefonica profissional",
        "discussao tecnica sobre software",
    ]

    for ctx in contexts:
        prompt = f"""Gere {n_sentences // len(contexts)} frases que um FRANCES de nivel B1-B2 diria em portugues durante: {ctx}

Inclua variedade:
- Frases COMPLETAS (pensamento terminado, pode traduzir)
- Frases INCOMPLETAS (vai continuar falando, nao traduzir)
- Com hesitacoes tipicas (euh, alors, como se diz, tipo)
- Com erros de conjugacao comuns de franceses
- Com code-switching involuntario (palavra em frances no meio)

Responda em JSON:
[{{"text": "frase", "label": "completo" ou "incompleto", "notes": "o que torna dificil classificar"}}]"""

        try:
            response = client.messages.create(
                model=model,
                max_tokens=4000,
                messages=[{"role": "user", "content": prompt}],
            )
            text = response.content[0].text
            json_match = re.search(r'\[.*\]', text, re.DOTALL)
            if json_match:
                batch = json.loads(json_match.group())
                for item in batch:
                    results.append({
                        "text": item["text"],
                        "label": item["label"],
                        "source": "claude_fr_pt",
                        "context": ctx,
                        "notes": item.get("notes", ""),
                    })
                log.info("  Generated %d sentences for context: %s", len(batch), ctx)

        except Exception as e:
            log.warning("Error generating for %s: %s", ctx, e)

        time.sleep(1)

    log.info("Total French-PT sentences generated: %d", len(results))
    return results


def run_full_pipeline(
    max_transcripts: int = 5000,
    max_fr_sentences: int = 500,
) -> Path:
    """Run the full label generation pipeline."""
    output_dir = DATA_DIR / "claude_labeled"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: Load transcripts
    log.info("=== Step 1: Loading Portuguese transcripts ===")
    mupe = load_coraa_transcripts(max_samples=max_transcripts)
    nurc = load_nurc_transcripts(max_samples=max_transcripts)
    all_transcripts = mupe + nurc
    log.info("Total transcripts: %d (MuPe=%d, NURC=%d)", len(all_transcripts), len(mupe), len(nurc))

    # Step 2: Claude classifies
    log.info("\n=== Step 2: Claude classifies complete/incomplete ===")
    sentences = [t["text"] for t in all_transcripts]
    classified = classify_with_claude(sentences)

    # Filter bad sentences (Pipecat removed 50-80%)
    kept = [c for c in classified if c["keep"]]
    removed = len(classified) - len(kept)
    log.info("Classified: %d total, %d kept, %d removed (%.0f%%)",
             len(classified), len(kept), removed, 100 * removed / max(len(classified), 1))

    complete = [c for c in kept if c["label"] == "completo"]
    incomplete = [c for c in kept if c["label"] == "incompleto"]
    log.info("  Complete: %d, Incomplete: %d", len(complete), len(incomplete))

    # Save classified
    with open(output_dir / "classified_pt.json", "w") as f:
        json.dump(kept, f, indent=2, ensure_ascii=False)

    # Step 3: Insert fillers (creates INCOMPLETE variants)
    log.info("\n=== Step 3: Insert fillers (PT-BR + FR-PT) ===")
    complete_texts = [c["text"] for c in complete[:1000]]

    pt_fillers = insert_fillers_with_claude(complete_texts, filler_type="pt_br")
    fr_fillers = insert_fillers_with_claude(complete_texts[:500], filler_type="fr_pt")

    with open(output_dir / "fillers_pt_br.json", "w") as f:
        json.dump(pt_fillers, f, indent=2, ensure_ascii=False)
    with open(output_dir / "fillers_fr_pt.json", "w") as f:
        json.dump(fr_fillers, f, indent=2, ensure_ascii=False)

    log.info("Fillers: %d PT-BR, %d FR-PT", len(pt_fillers), len(fr_fillers))

    # Step 4: Generate French-Portuguese sentences
    log.info("\n=== Step 4: Generate French-Portuguese sentences ===")
    fr_sentences = generate_french_portuguese_sentences(n_sentences=max_fr_sentences)
    with open(output_dir / "french_portuguese.json", "w") as f:
        json.dump(fr_sentences, f, indent=2, ensure_ascii=False)

    # Summary
    summary = {
        "total_transcripts": len(all_transcripts),
        "classified_kept": len(kept),
        "classified_removed": removed,
        "complete": len(complete),
        "incomplete": len(incomplete),
        "fillers_pt_br": len(pt_fillers),
        "fillers_fr_pt": len(fr_fillers),
        "french_portuguese": len(fr_sentences),
    }
    with open(output_dir / "summary.json", "w") as f:
        json.dump(summary, f, indent=2)

    log.info("\n=== Summary ===")
    for k, v in summary.items():
        log.info("  %s: %d", k, v)

    return output_dir


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )
    run_full_pipeline(max_transcripts=5000, max_fr_sentences=500)
03-finetune-pipecat-pt/03_generate_audio.py
ADDED
@@ -0,0 +1,697 @@
"""Generate TTS audio for the turn-taking dataset.

Based on the Pipecat v3.1 pipeline:
1. Native PT-BR voices via Kokoro (pf_dora, pm_alex, pm_santa)
2. French-accented Portuguese via XTTS v2 voice cloning
3. Speed/pitch variation + noise augmentation
4. Hesitation pause injection (1.5-3s silence after fillers) — SpeculativeETD V3
5. Short utterance dataset ("sim", "nao", "ok") — Pipecat v3.2 (-40% errors)

Pipecat used Google Chirp3 TTS; we use Kokoro (open-source, runs locally)
+ XTTS v2 (voice cloning for accent simulation).

Run locally or on Modal GPU:
    python 03_generate_audio.py
"""

from __future__ import annotations

import json
import logging
import random
import re
from dataclasses import dataclass
from pathlib import Path

import numpy as np

log = logging.getLogger(__name__)

DATA_DIR = Path(__file__).parent / "data"
SAMPLE_RATE = 16000  # Pipecat model expects 16kHz

@dataclass
class AudioSample:
    """A single audio sample for the dataset."""
    audio: np.ndarray  # float32, 16kHz
    text: str
    label: str  # "complete" or "incomplete"
    voice: str  # e.g. "pf_dora", "fr_clone_1"
    accent: str  # "native_pt_br" or "french_pt"
    source: str  # "kokoro", "xtts", "pipecat"
    speed: float = 1.0

| 45 |
+
|
| 46 |
+
# ---------------------------------------------------------------------------
|
| 47 |
+
# Kokoro TTS — native PT-BR voices
|
| 48 |
+
# ---------------------------------------------------------------------------
|
| 49 |
+
|
| 50 |
+
def generate_kokoro_audio(
|
| 51 |
+
sentences: list[dict],
|
| 52 |
+
voices: list[str] | None = None,
|
| 53 |
+
speed_range: tuple[float, float] = (0.85, 1.15),
|
| 54 |
+
) -> list[AudioSample]:
|
| 55 |
+
"""Generate audio using Kokoro TTS (PT-BR voices).
|
| 56 |
+
|
| 57 |
+
sentences: list of {"text": str, "label": "complete"|"incomplete"}
|
| 58 |
+
voices: Kokoro voice IDs (default: PT-BR voices)
|
| 59 |
+
"""
|
| 60 |
+
try:
|
| 61 |
+
from kokoro import KPipeline
|
| 62 |
+
except ImportError:
|
| 63 |
+
log.error("kokoro not installed — pip install kokoro")
|
| 64 |
+
return []
|
| 65 |
+
|
| 66 |
+
if voices is None:
|
| 67 |
+
voices = ["pf_dora", "pm_alex", "pm_santa"]
|
| 68 |
+
|
| 69 |
+
log.info("Initializing Kokoro pipeline (lang=pt)...")
|
| 70 |
+
pipeline = KPipeline(lang_code="p") # Portuguese
|
| 71 |
+
|
| 72 |
+
samples = []
|
| 73 |
+
errors = 0
|
| 74 |
+
|
| 75 |
+
for i, sent in enumerate(sentences):
|
| 76 |
+
text = sent["text"]
|
| 77 |
+
label = sent["label"]
|
| 78 |
+
voice = random.choice(voices)
|
| 79 |
+
speed = random.uniform(*speed_range)
|
| 80 |
+
|
| 81 |
+
try:
|
| 82 |
+
# Generate audio
|
| 83 |
+
generator = pipeline(text, voice=voice, speed=speed)
|
| 84 |
+
audio_chunks = []
|
| 85 |
+
for _, _, chunk in generator:
|
| 86 |
+
audio_chunks.append(chunk.numpy() if hasattr(chunk, 'numpy') else np.array(chunk))
|
| 87 |
+
|
| 88 |
+
if not audio_chunks:
|
| 89 |
+
errors += 1
|
| 90 |
+
continue
|
| 91 |
+
|
| 92 |
+
audio = np.concatenate(audio_chunks).astype(np.float32)
|
| 93 |
+
|
| 94 |
+
# Resample to 16kHz if needed (Kokoro outputs 24kHz)
|
| 95 |
+
if hasattr(pipeline, 'sample_rate') and pipeline.sample_rate != SAMPLE_RATE:
|
| 96 |
+
import torchaudio
|
| 97 |
+
import torch
|
| 98 |
+
tensor = torch.from_numpy(audio).float().unsqueeze(0)
|
| 99 |
+
audio = torchaudio.functional.resample(
|
| 100 |
+
tensor, pipeline.sample_rate, SAMPLE_RATE
|
| 101 |
+
).squeeze().numpy()
|
| 102 |
+
|
| 103 |
+
# Normalize
|
| 104 |
+
peak = np.max(np.abs(audio))
|
| 105 |
+
if peak > 0:
|
| 106 |
+
audio = audio / peak * 0.9
|
| 107 |
+
|
| 108 |
+
samples.append(AudioSample(
|
| 109 |
+
audio=audio,
|
| 110 |
+
text=text,
|
| 111 |
+
label=label,
|
| 112 |
+
voice=voice,
|
| 113 |
+
accent="native_pt_br",
|
| 114 |
+
source="kokoro",
|
| 115 |
+
speed=speed,
|
| 116 |
+
))
|
| 117 |
+
|
| 118 |
+
except Exception as e:
|
| 119 |
+
errors += 1
|
| 120 |
+
if errors <= 5:
|
| 121 |
+
log.warning("Kokoro error on sentence %d: %s", i, e)
|
| 122 |
+
|
| 123 |
+
if (i + 1) % 100 == 0:
|
| 124 |
+
log.info(" Kokoro: %d/%d generated (%d errors)", len(samples), i + 1, errors)
|
| 125 |
+
|
| 126 |
+
log.info("Kokoro: %d samples generated (%d errors)", len(samples), errors)
|
| 127 |
+
return samples
|
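The "Normalize" step above appears after every TTS call. As a standalone sketch (the `normalize_peak` helper name is illustrative, not from the pipeline), it scales audio so the loudest sample sits at 0.9 full scale, and leaves all-zero audio alone to avoid dividing by zero:

```python
import numpy as np

# Peak-normalize to 0.9 full scale, mirroring the inline snippet used
# after each Kokoro/XTTS synthesis call; silent input passes through.
def normalize_peak(audio: np.ndarray, target: float = 0.9) -> np.ndarray:
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak * target
    return audio.astype(np.float32)

x = np.array([0.1, -0.5, 0.25], dtype=np.float32)
y = normalize_peak(x)  # peak now 0.9, relative levels preserved
```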


# ---------------------------------------------------------------------------
# XTTS v2 — French-accented Portuguese via voice cloning
# ---------------------------------------------------------------------------

def generate_xtts_audio(
    sentences: list[dict],
    reference_audio_dir: Path | None = None,
    speed_range: tuple[float, float] = (0.9, 1.1),
) -> list[AudioSample]:
    """Generate audio with French accent using XTTS v2 voice cloning.

    Uses short French speech samples as reference to clone the accent,
    then synthesizes Portuguese text with that voice.

    sentences: list of {"text": str, "label": "complete"|"incomplete"}
    reference_audio_dir: directory with .wav files of French speakers
    """
    try:
        from TTS.api import TTS
    except ImportError:
        log.error("TTS (coqui) not installed — pip install TTS")
        return []

    log.info("Initializing XTTS v2...")
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    # Find reference audio files (French speakers)
    ref_files = []
    if reference_audio_dir and reference_audio_dir.exists():
        ref_files = list(reference_audio_dir.glob("*.wav"))

    if not ref_files:
        log.warning("No French reference audio found in %s — using default voice", reference_audio_dir)
        # Fall back to generating without cloning (still works, but no accent)
        return _generate_xtts_no_clone(tts, sentences, speed_range)

    log.info("Found %d French reference voices", len(ref_files))

    samples = []
    errors = 0

    for i, sent in enumerate(sentences):
        text = sent["text"]
        label = sent["label"]
        ref = random.choice(ref_files)

        try:
            audio = tts.tts(
                text=text,
                speaker_wav=str(ref),
                language="pt",
            )
            audio = np.array(audio, dtype=np.float32)

            # Resample to 16kHz if needed
            if hasattr(tts, 'synthesizer') and tts.synthesizer.output_sample_rate != SAMPLE_RATE:
                import torchaudio
                import torch
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(
                    tensor, tts.synthesizer.output_sample_rate, SAMPLE_RATE
                ).squeeze().numpy()

            # Normalize
            peak = np.max(np.abs(audio))
            if peak > 0:
                audio = audio / peak * 0.9

            # Random speed variation
            speed = random.uniform(*speed_range)
            if abs(speed - 1.0) > 0.05:
                indices = np.arange(0, len(audio), speed).astype(int)
                indices = indices[indices < len(audio)]
                audio = audio[indices]

            samples.append(AudioSample(
                audio=audio,
                text=text,
                label=label,
                voice=f"fr_clone_{ref.stem}",
                accent="french_pt",
                source="xtts",
                speed=speed,
            ))

        except Exception as e:
            errors += 1
            if errors <= 5:
                log.warning("XTTS error on sentence %d: %s", i, e)

        if (i + 1) % 50 == 0:
            log.info("  XTTS: %d/%d generated (%d errors)", len(samples), i + 1, errors)

    log.info("XTTS: %d samples generated (%d errors)", len(samples), errors)
    return samples


def _generate_xtts_no_clone(
    tts,
    sentences: list[dict],
    speed_range: tuple[float, float],
) -> list[AudioSample]:
    """XTTS generation without voice cloning (fallback)."""
    samples = []
    errors = 0

    for i, sent in enumerate(sentences):
        try:
            audio = tts.tts(text=sent["text"], language="pt")
            audio = np.array(audio, dtype=np.float32)

            peak = np.max(np.abs(audio))
            if peak > 0:
                audio = audio / peak * 0.9

            samples.append(AudioSample(
                audio=audio,
                text=sent["text"],
                label=sent["label"],
                voice="xtts_default",
                accent="french_pt",
                source="xtts",
                speed=1.0,
            ))
        except Exception as e:
            errors += 1
            if errors <= 3:
                log.warning("XTTS fallback error: %s", e)

    log.info("XTTS (no clone): %d samples (%d errors)", len(samples), errors)
    return samples


# ---------------------------------------------------------------------------
# Hesitation pause injection (SpeculativeETD V3 — best data variant)
# ---------------------------------------------------------------------------

def inject_hesitation_pause(
    audio: np.ndarray,
    pause_duration_range: tuple[float, float] = (1.5, 3.0),
    position: str = "end",
) -> np.ndarray:
    """Inject a realistic hesitation pause into audio.

    SpeculativeETD V3 showed that inserting 1.5-3.0s pauses after fillers
    was the most effective data variant for training ETD models.

    For French speakers learning Portuguese, these long pauses happen when:
    - Searching for a word ("Eu preciso de... [2s pause] ...uma tesoura")
    - Thinking about conjugation ("Ontem eu... [2s pause] ...fui ao mercado")
    - Code-switching hesitation ("Eu gosto de... [1.5s] ...euh... praia")

    position: "end" = pause at end (simulates mid-utterance stop)
              "mid" = pause in the middle (simulates hesitation)
    """
    pause_s = random.uniform(*pause_duration_range)
    pause_samples = int(pause_s * SAMPLE_RATE)

    # Add very low-level noise to the pause (not pure silence — more realistic)
    pause = np.random.randn(pause_samples).astype(np.float32) * 0.001

    if position == "end":
        # Pause at end: speaker stops mid-sentence (INCOMPLETE)
        return np.concatenate([audio, pause])
    else:
        # Pause in middle: split audio and insert pause
        if len(audio) < SAMPLE_RATE:  # too short to split
            return np.concatenate([audio, pause])
        split_point = random.randint(len(audio) // 3, 2 * len(audio) // 3)
        return np.concatenate([audio[:split_point], pause, audio[split_point:]])
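The "end" branch can be sanity-checked with synthetic audio. A self-contained sketch with a fixed 2.0s pause instead of the random duration (the `add_end_pause` helper and `SR` constant are illustrative, mirroring `SAMPLE_RATE` above):

```python
import numpy as np

SR = 16000  # matches SAMPLE_RATE in the module

# Minimal re-creation of the "end" branch: append near-silence
# (low-level Gaussian noise at 0.001 amplitude) after the speech.
def add_end_pause(audio: np.ndarray, pause_s: float = 2.0) -> np.ndarray:
    pause = np.random.randn(int(pause_s * SR)).astype(np.float32) * 0.001
    return np.concatenate([audio, pause])

speech = np.zeros(SR, dtype=np.float32)  # 1s stand-in for "speech"
out = add_end_pause(speech)              # 1s speech + 2s near-silent pause
```

The appended segment stays far below any speech amplitude, so a VAD front-end still sees it as silence while the waveform avoids the unnatural flatness of exact zeros.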


def create_hesitation_variants(
    samples: list[AudioSample],
    fraction: float = 0.3,
) -> list[AudioSample]:
    """Create hesitation-pause variants from existing INCOMPLETE samples.

    Takes a fraction of incomplete samples and adds 1.5-3s pauses,
    simulating French speakers hesitating in Portuguese.
    """
    incomplete = [s for s in samples if s.label == "incomplete"]
    n_variants = int(len(incomplete) * fraction)

    variants = []
    for s in random.sample(incomplete, min(n_variants, len(incomplete))):
        # Variant 1: long pause at end (speaker stopped to think)
        audio_end = inject_hesitation_pause(s.audio, position="end")
        variants.append(AudioSample(
            audio=audio_end,
            text=s.text,
            label="incomplete",  # still incomplete — speaker will continue
            voice=s.voice,
            accent=s.accent,
            source=f"{s.source}_hesitation_end",
            speed=s.speed,
        ))

        # Variant 2: pause in the middle (thinking mid-sentence)
        if random.random() < 0.5:
            audio_mid = inject_hesitation_pause(s.audio, position="mid")
            variants.append(AudioSample(
                audio=audio_mid,
                text=s.text,
                label="incomplete",
                voice=s.voice,
                accent=s.accent,
                source=f"{s.source}_hesitation_mid",
                speed=s.speed,
            ))

    log.info("Created %d hesitation-pause variants from %d incomplete samples",
             len(variants), len(incomplete))
    return variants


# ---------------------------------------------------------------------------
# Short utterance generation (Pipecat v3.2: -40% errors on short responses)
# ---------------------------------------------------------------------------

# Common short Portuguese responses in meetings
SHORT_UTTERANCES_COMPLETE = [
    "Sim.", "Não.", "Ok.", "Tá.", "Pode.", "Beleza.", "Certo.", "Claro.",
    "Entendi.", "Combinado.", "Perfeito.", "Exato.", "Isso.", "Verdade.",
    "Com certeza.", "Sem dúvida.", "Tá bom.", "Pode ser.", "Vamos lá.",
    "Concordo.", "Fechado.", "Ótimo.", "Legal.", "Tranquilo.", "Valeu.",
    "Obrigado.", "De nada.", "Até logo.", "Tchau.", "Bom dia.", "Boa tarde.",
    # French speaker variants
    "Oui... quer dizer, sim.", "Sim, euh, concordo.", "Ok, d'accord.",
    "Bon, tá bom.", "Voilà, é isso.", "Exactement, exato.",
]

SHORT_UTTERANCES_INCOMPLETE = [
    "Sim, mas...", "Não, porque...", "Então...", "Tipo...", "Olha...",
    "Bom...", "Na verdade...", "Quer dizer...", "Pois é...", "Sabe...",
    "É que...", "O problema é...", "A questão é...", "Eu acho que...",
    # French speaker variants (hesitating after short start)
    "Oui... euh...", "Sim, mas... comment dire...", "Alors...",
    "Bon, eu acho que... euh...", "C'est... quer dizer...",
    "Não, enfin... na verdade...", "Sim, mas... como se diz...",
]


def generate_short_utterances(
    voices: list[str] | None = None,
    n_per_utterance: int = 3,
) -> list[AudioSample]:
    """Generate audio for short utterances using Kokoro TTS.

    Pipecat v3.2 showed that a dedicated short utterance dataset
    reduced misclassification by 40%. Short responses like "sim", "não"
    are very common in Portuguese meetings.
    """
    try:
        from kokoro import KPipeline
    except ImportError:
        log.warning("kokoro not installed — skipping short utterances")
        return []

    if voices is None:
        voices = ["pf_dora", "pm_alex", "pm_santa"]

    pipeline = KPipeline(lang_code="p")
    samples = []
    errors = 0

    all_short = (
        [(text, "complete") for text in SHORT_UTTERANCES_COMPLETE]
        + [(text, "incomplete") for text in SHORT_UTTERANCES_INCOMPLETE]
    )

    for text, label in all_short:
        for _ in range(n_per_utterance):
            voice = random.choice(voices)
            speed = random.uniform(0.85, 1.15)

            try:
                generator = pipeline(text, voice=voice, speed=speed)
                audio_chunks = []
                for _, _, chunk in generator:
                    audio_chunks.append(chunk.numpy() if hasattr(chunk, 'numpy') else np.array(chunk))

                if not audio_chunks:
                    errors += 1
                    continue

                audio = np.concatenate(audio_chunks).astype(np.float32)

                # Resample to 16kHz if needed
                if hasattr(pipeline, 'sample_rate') and pipeline.sample_rate != SAMPLE_RATE:
                    import torchaudio
                    import torch
                    tensor = torch.from_numpy(audio).float().unsqueeze(0)
                    audio = torchaudio.functional.resample(
                        tensor, pipeline.sample_rate, SAMPLE_RATE
                    ).squeeze().numpy()

                # Normalize
                peak = np.max(np.abs(audio))
                if peak > 0:
                    audio = audio / peak * 0.9

                # For incomplete short utterances, add hesitation pause
                if label == "incomplete" and random.random() < 0.7:
                    audio = inject_hesitation_pause(audio, pause_duration_range=(1.0, 2.5), position="end")

                samples.append(AudioSample(
                    audio=audio,
                    text=text,
                    label=label,
                    voice=voice,
                    accent="native_pt_br",
                    source="short_utterance",
                    speed=speed,
                ))

            except Exception as e:
                errors += 1
                if errors <= 5:
                    log.warning("Short utterance error: %s", e)

    n_c = sum(1 for s in samples if s.label == "complete")
    n_i = sum(1 for s in samples if s.label == "incomplete")
    log.info("Short utterances: %d samples (%d complete, %d incomplete, %d errors)",
             len(samples), n_c, n_i, errors)
    return samples


# ---------------------------------------------------------------------------
# Audio augmentation
# ---------------------------------------------------------------------------

def augment_sample(audio: np.ndarray) -> np.ndarray:
    """Apply augmentation to diversify training data.

    Based on Pipecat/SpeculativeETD augmentation strategies.
    """
    aug = audio.copy()

    # Add background noise (simulates meeting room)
    if random.random() < 0.4:
        noise_level = random.uniform(0.002, 0.015)
        aug += np.random.randn(len(aug)).astype(np.float32) * noise_level

    # Volume variation (simulates different mic distances)
    if random.random() < 0.5:
        scale = random.uniform(0.6, 1.4)
        aug *= scale

    # Speed perturbation (simulates speaking rate variation)
    if random.random() < 0.3:
        speed = random.uniform(0.92, 1.08)
        indices = np.arange(0, len(aug), speed).astype(int)
        indices = indices[indices < len(aug)]
        aug = aug[indices]

    return np.clip(aug, -1.0, 1.0).astype(np.float32)
| 487 |
+
|
| 488 |
+
|
| 489 |
+
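The speed-perturbation branch of `augment_sample` is a deliberately naive trick: it resamples by striding over indices rather than doing true time-stretching, so pitch shifts along with speed. A minimal standalone sketch (NumPy only) showing the effect on sample count:

```python
import numpy as np

def stride_resample(audio: np.ndarray, speed: float) -> np.ndarray:
    """Naive speed perturbation by index striding (pitch is not preserved)."""
    indices = np.arange(0, len(audio), speed).astype(int)
    indices = indices[indices < len(audio)]
    return audio[indices]

audio = np.zeros(16000, dtype=np.float32)  # 1 s at 16 kHz
fast = stride_resample(audio, 1.08)        # ~8% faster -> fewer samples
slow = stride_resample(audio, 0.92)        # ~8% slower -> more samples
print(len(fast) < 16000 < len(slow))       # → True
```

This is cheap and good enough for data diversification; a production pipeline would more likely use `torchaudio`'s resampling or SoX-style tempo effects to keep pitch intact.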
# ---------------------------------------------------------------------------
# Save dataset
# ---------------------------------------------------------------------------

def save_dataset(
    samples: list[AudioSample],
    output_dir: Path,
    augment_copies: int = 2,
) -> Path:
    """Save audio samples as WAV files + metadata JSON.

    augment_copies: number of augmented copies per sample (0 = no augmentation)
    """
    import soundfile as sf

    output_dir.mkdir(parents=True, exist_ok=True)
    audio_dir = output_dir / "audio"
    audio_dir.mkdir(exist_ok=True)

    metadata = []
    total_saved = 0

    for i, sample in enumerate(samples):
        # Save original
        fname = f"{i:05d}_{sample.label}_{sample.accent}_{sample.voice}.wav"
        sf.write(str(audio_dir / fname), sample.audio, SAMPLE_RATE)
        metadata.append({
            "file": fname,
            "text": sample.text,
            "label": sample.label,
            "voice": sample.voice,
            "accent": sample.accent,
            "source": sample.source,
            "speed": sample.speed,
            "augmented": False,
            "duration_s": round(len(sample.audio) / SAMPLE_RATE, 2),
        })
        total_saved += 1

        # Save augmented copies
        for aug_idx in range(augment_copies):
            aug_audio = augment_sample(sample.audio)
            aug_fname = f"{i:05d}_{sample.label}_{sample.accent}_{sample.voice}_aug{aug_idx}.wav"
            sf.write(str(audio_dir / aug_fname), aug_audio, SAMPLE_RATE)
            metadata.append({
                "file": aug_fname,
                "text": sample.text,
                "label": sample.label,
                "voice": sample.voice,
                "accent": sample.accent,
                "source": sample.source,
                "speed": sample.speed,
                "augmented": True,
                "duration_s": round(len(aug_audio) / SAMPLE_RATE, 2),
            })
            total_saved += 1

    # Save metadata
    meta_path = output_dir / "metadata.json"
    with open(meta_path, "w") as f:
        json.dump(metadata, f, indent=2, ensure_ascii=False)

    # Stats
    n_complete = sum(1 for m in metadata if m["label"] == "complete")
    n_incomplete = sum(1 for m in metadata if m["label"] == "incomplete")
    n_native = sum(1 for m in metadata if m["accent"] == "native_pt_br")
    n_french = sum(1 for m in metadata if m["accent"] == "french_pt")

    log.info("Dataset saved to %s:", output_dir)
    log.info("  Total: %d samples (%d original + %d augmented)",
             total_saved, len(samples), total_saved - len(samples))
    log.info("  Complete: %d, Incomplete: %d", n_complete, n_incomplete)
    log.info("  Native PT-BR: %d, French-accented: %d", n_native, n_french)

    return output_dir

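Downstream consumers only need `metadata.json` to recover the class balance. A small sketch of reading it back (the records below are hypothetical examples following the schema `save_dataset` writes, not real files):

```python
from collections import Counter

# Hypothetical records matching the metadata schema written by save_dataset.
metadata = [
    {"file": "00000_complete_native_pt_br_af.wav", "label": "complete", "augmented": False},
    {"file": "00000_complete_native_pt_br_af_aug0.wav", "label": "complete", "augmented": True},
    {"file": "00001_incomplete_french_pt_sp1.wav", "label": "incomplete", "augmented": False},
]

counts = Counter(m["label"] for m in metadata)
n_augmented = sum(1 for m in metadata if m["augmented"])
print(counts["complete"], counts["incomplete"], n_augmented)  # → 2 1 1
```

In the real pipeline the list would come from `json.load(open(output_dir / "metadata.json"))`.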
# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------

def run_audio_generation(
    max_native_sentences: int = 3000,
    max_french_sentences: int = 1000,
    augment_copies: int = 2,
) -> Path:
    """Run the full audio generation pipeline.

    Loads labeled sentences from 02_generate_labels.py output,
    generates audio via Kokoro + XTTS, saves dataset.
    """
    labeled_dir = DATA_DIR / "claude_labeled"

    # Load labeled sentences
    all_sentences = []

    # 1. Classified sentences (from CORAA transcripts)
    classified_path = labeled_dir / "classified_pt.json"
    if classified_path.exists():
        with open(classified_path) as f:
            classified = json.load(f)
        for c in classified:
            if c.get("label") in ("completo", "incompleto"):
                all_sentences.append({
                    "text": c["text"],
                    "label": "complete" if c["label"] == "completo" else "incomplete",
                    "source": "classified",
                })
        log.info("Loaded %d classified sentences", len(all_sentences))

    # 2. Filler sentences (all are INCOMPLETE)
    for filler_file, filler_type in [
        ("fillers_pt_br.json", "pt_br"),
        ("fillers_fr_pt.json", "fr_pt"),
    ]:
        fpath = labeled_dir / filler_file
        if fpath.exists():
            with open(fpath) as f:
                fillers = json.load(f)
            for fl in fillers:
                if fl.get("with_filler"):
                    all_sentences.append({
                        "text": fl["with_filler"],
                        "label": "incomplete",
                        "source": f"filler_{filler_type}",
                    })
            log.info("Loaded %d %s filler sentences", len(fillers), filler_type)

    # 3. French-Portuguese sentences
    frpt_path = labeled_dir / "french_portuguese.json"
    if frpt_path.exists():
        with open(frpt_path) as f:
            frpt = json.load(f)
        for s in frpt:
            label = "complete" if s.get("label", "") == "completo" else "incomplete"
            all_sentences.append({
                "text": s["text"],
                "label": label,
                "source": "claude_fr_pt",
            })
        log.info("Loaded %d French-Portuguese sentences", len(frpt))

    if not all_sentences:
        log.error("No labeled sentences found in %s — run 02_generate_labels.py first", labeled_dir)
        return DATA_DIR

    log.info("Total sentences to synthesize: %d", len(all_sentences))

    # Split: native PT-BR vs French-accented
    native_sentences = [s for s in all_sentences if s["source"] != "claude_fr_pt"]
    french_sentences = [s for s in all_sentences if s["source"] == "claude_fr_pt"]

    # Also use some filler_fr_pt sentences with XTTS
    fr_filler = [s for s in all_sentences if s["source"] == "filler_fr_pt"]
    french_sentences.extend(fr_filler)

    random.shuffle(native_sentences)
    random.shuffle(french_sentences)
    native_sentences = native_sentences[:max_native_sentences]
    french_sentences = french_sentences[:max_french_sentences]

    # Generate audio
    log.info("\n=== Generating native PT-BR audio (Kokoro) ===")
    native_samples = generate_kokoro_audio(native_sentences)

    log.info("\n=== Generating French-accented audio (XTTS) ===")
    ref_dir = DATA_DIR / "french_reference_audio"
    french_samples = generate_xtts_audio(french_sentences, reference_audio_dir=ref_dir)

    # NEW: Short utterances (Pipecat v3.2: -40% errors)
    log.info("\n=== Generating short utterance dataset ===")
    short_samples = generate_short_utterances(n_per_utterance=3)

    # NEW: Hesitation pause variants (SpeculativeETD V3: best data variant)
    # Critical for French speakers: long pauses mid-sentence are NOT end-of-turn
    log.info("\n=== Creating hesitation-pause variants ===")
    all_pre_hesitation = native_samples + french_samples + short_samples
    hesitation_variants = create_hesitation_variants(all_pre_hesitation, fraction=0.3)

    all_samples = native_samples + french_samples + short_samples + hesitation_variants
    random.shuffle(all_samples)

    log.info("Total samples before augmentation: %d", len(all_samples))
    log.info("  Native PT-BR: %d", len(native_samples))
    log.info("  French-accented: %d", len(french_samples))
    log.info("  Short utterances: %d", len(short_samples))
    log.info("  Hesitation variants: %d", len(hesitation_variants))

    # Save
    log.info("\n=== Saving dataset ===")
    output_dir = DATA_DIR / "tts_dataset"
    save_dataset(all_samples, output_dir, augment_copies=augment_copies)

    return output_dir

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )
    random.seed(42)
    np.random.seed(42)

    run_audio_generation(
        max_native_sentences=3000,
        max_french_sentences=1000,
        augment_copies=2,
    )
03-finetune-pipecat-pt/04_finetune.py
ADDED
@@ -0,0 +1,1202 @@
"""Fine-tune Pipecat Smart Turn v3 for Portuguese + French-accented Portuguese.

Architecture: exact SmartTurnV3Model from Pipecat's train.py
- WhisperEncoder (openai/whisper-tiny) with max_source_positions=400 (8s)
- Attention pooling: Linear(384→256) → Tanh → Linear(256→1)
- Classifier: Linear(384→256) → LayerNorm → GELU → Dropout(0.1)
  → Linear(256→64) → GELU → Linear(64→1)

Training strategy:
- Initialize from openai/whisper-tiny (no Pipecat PyTorch weights available)
- Use Pipecat's Portuguese data as primary training data
- Add our Claude-labeled + TTS-generated data for PT-BR + French accent
- Pipecat hyperparams: lr=5e-5, warmup=0.2, cosine schedule, epochs=4

Changes vs initial plan (based on reference analysis — see references/):
- alpha=0.25 (not 0.6) per Lin et al. 2017 optimal for gamma=2
- batch_size=128 (not 32) per Pipecat train.py (they use 384)
- Removed label smoothing (double-regularization with focal loss, per EMNLP 2022)
- Added BCEWithLogitsLoss option for comparison (Pipecat's original loss)
- Added real noise augmentation (not just Gaussian) per Pipecat v3.2
- Added short utterance dataset loading per Pipecat v3.2 findings
- ONNX opset 18 (not 17) per Pipecat train.py
- Added INT8 static quantization step per Pipecat's deployment pipeline

Language-learning-specific improvements (March 2026 research):
- fp_penalty=2.0: asymmetric cost — interrupting a learner costs 2x more than
  waiting too long (ConversAR 2025, Praktika approach)
- Speak & Improve L2 corpus: real L2 learner speech with disfluencies (340h)
- Dual threshold evaluation: eager (speculative) + final (confirmed) per Deepgram Flux

Data sources:
1. Pipecat v3.2 Portuguese subset (~5K samples from 270K)
2. Our custom TTS dataset (03_generate_audio.py output)
3. CORAA real audio (01_download_pipecat.py output)
4. Speak & Improve L2 corpus (~2K samples, cross-lingual hesitation patterns)
"""

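The "dual threshold evaluation" mentioned in the docstring reduces to a small decision rule: one permissive threshold triggers a speculative end-of-turn, a stricter one confirms it. The threshold values and names below are illustrative assumptions, not Deepgram Flux's actual numbers:

```python
# Illustrative thresholds for the eager/final split — not Deepgram's values.
EAGER_THRESHOLD = 0.7   # speculative end-of-turn: start preparing a reply
FINAL_THRESHOLD = 0.9   # confirmed end-of-turn: safe to start speaking

def classify_turn(p_complete: float) -> str:
    """Map the model's completion probability to an eager/final/continue decision."""
    if p_complete >= FINAL_THRESHOLD:
        return "final"
    if p_complete >= EAGER_THRESHOLD:
        return "eager"
    return "continue"

print([classify_turn(p) for p in (0.95, 0.75, 0.30)])
# → ['final', 'eager', 'continue']
```

The eager branch lets the agent pre-generate a response that is cancelled if the learner resumes speaking before the final threshold is crossed.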
from __future__ import annotations

import gc
import json
import logging
import random
import time
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
from transformers import WhisperFeatureExtractor, WhisperPreTrainedModel, WhisperConfig

log = logging.getLogger(__name__)

SAMPLE_RATE = 16000
WINDOW_SECONDS = 8
WINDOW_SAMPLES = WINDOW_SECONDS * SAMPLE_RATE

_workspace = Path("/workspace") if Path("/workspace").exists() else Path(".")
OUTPUT_DIR = _workspace / "results"
CACHE_DIR = _workspace / "hf_cache"
DATA_DIR = Path(__file__).parent / "data"

# Label smoothing REMOVED — focal loss already provides implicit calibration
# via entropy regularization (EMNLP 2022: focal_loss_calibration_emnlp_2022.md)
# Double-regularization (FL + LS) reduces discriminative power.
LABEL_SMOOTH = 0.0

# ---------------------------------------------------------------------------
# Model — SmartTurnV3Model (exact Pipecat architecture)
# ---------------------------------------------------------------------------

class SmartTurnV3Model(nn.Module):
    """Pipecat Smart Turn v3 model architecture.

    Matches the exact architecture from:
    https://github.com/pipecat-ai/smart-turn/blob/main/train/train.py

    WhisperEncoder (whisper-tiny, 384-dim) → attention pooling → classifier
    """

    def __init__(self, whisper_model: str = "openai/whisper-tiny"):
        super().__init__()
        from transformers import WhisperModel

        # Load pretrained encoder
        whisper = WhisperModel.from_pretrained(
            whisper_model, cache_dir=str(CACHE_DIR)
        )
        self.encoder = whisper.encoder

        # Resize position embeddings: 1500 (30s) → 400 (8s)
        max_pos = 400
        old_embed = self.encoder.embed_positions.weight.data
        new_embed = old_embed[:max_pos, :]
        self.encoder.embed_positions = nn.Embedding(max_pos, old_embed.shape[1])
        self.encoder.embed_positions.weight.data = new_embed
        self.encoder.config.max_source_positions = max_pos

        hidden_size = self.encoder.config.d_model  # 384 for whisper-tiny

        # Attention pooling (exact Pipecat architecture)
        self.attention = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.Tanh(),
            nn.Linear(256, 1),
        )

        # Classifier head (exact Pipecat architecture)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.LayerNorm(256),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(256, 64),
            nn.GELU(),
            nn.Linear(64, 1),
        )

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        encoder_output = self.encoder(input_features).last_hidden_state
        attn_weights = self.attention(encoder_output)
        attn_weights = torch.softmax(attn_weights, dim=1)
        pooled = (encoder_output * attn_weights).sum(dim=1)
        logits = self.classifier(pooled)
        return logits.squeeze(-1)

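The attention-pooling step in `forward` collapses the (batch, time, hidden) encoder output into one vector per clip: softmax over the time axis, then a weighted sum. A NumPy sketch of the same math, runnable without downloading the Whisper weights (shapes chosen to match the 400-frame / 384-dim configuration above):

```python
import numpy as np

def attention_pool(encoder_output: np.ndarray, attn_logits: np.ndarray) -> np.ndarray:
    """encoder_output: (batch, time, hidden); attn_logits: (batch, time, 1)."""
    w = np.exp(attn_logits - attn_logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)          # softmax over the time axis
    return (encoder_output * w).sum(axis=1)       # weighted sum -> (batch, hidden)

rng = np.random.default_rng(0)
enc = rng.normal(size=(2, 400, 384))              # 400 frames ≈ the 8 s window
logits = rng.normal(size=(2, 400, 1))
pooled = attention_pool(enc, logits)
print(pooled.shape)  # → (2, 384)
```

With uniform (all-zero) logits this reduces exactly to mean pooling over time, which is a useful sanity check when porting the model to ONNX.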
# ---------------------------------------------------------------------------
# Focal Loss
# ---------------------------------------------------------------------------

class FocalLoss(nn.Module):
    """Focal Loss (Lin et al. 2017) — focuses on hard boundary cases.

    alpha=0.25 is optimal for gamma=2.0 per the original paper (Table 1a).
    Our initial alpha=0.6 over-weighted positives and hurt calibration.

    fp_penalty: extra multiplier on false-positive loss (model says "complete"
    when speaker is still talking). For a language-learning avatar, interrupting
    the learner mid-thought is much worse than waiting too long.
    - ConversAR (2025) gives learners "infinite thinking period"
    - Praktika uses extended silence tolerance for L2 speech
    - Default 2.0 means FP errors cost 2x more than FN errors
    """

    def __init__(self, gamma: float = 2.0, alpha: float = 0.25, fp_penalty: float = 2.0):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.fp_penalty = fp_penalty

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce = nn.functional.binary_cross_entropy_with_logits(
            logits, targets, reduction="none",
        )
        probs = torch.sigmoid(logits)
        p_t = probs * targets + (1 - probs) * (1 - targets)
        focal_weight = (1 - p_t) ** self.gamma
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        loss = alpha_t * focal_weight * bce

        # Asymmetric cost: penalize FP (interrupting learner) more than FN (waiting)
        # FP = model predicts 1 (complete) when target is 0 (incomplete)
        if self.fp_penalty != 1.0:
            is_fp = (probs > 0.5).float() * (1 - targets)  # predicted complete, actually incomplete
            penalty = 1.0 + (self.fp_penalty - 1.0) * is_fp
            loss = loss * penalty

        return loss.mean()

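Plugging numbers into this loss shows just how asymmetric it is: with alpha=0.25 the negative (incomplete) class already carries 3x the weight of the positive class, and fp_penalty doubles that for confident false positives, so two equally confident mistakes end up 6x apart. A scalar NumPy re-derivation of the same math (a sketch mirroring `FocalLoss.forward`, not the training code):

```python
import numpy as np

def focal_bce(logit: float, target: float,
              gamma: float = 2.0, alpha: float = 0.25, fp_penalty: float = 2.0) -> float:
    """Per-sample focal loss mirroring FocalLoss.forward above, for scalars."""
    p = 1.0 / (1.0 + np.exp(-logit))
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    loss = alpha_t * (1 - p_t) ** gamma * bce
    if p > 0.5 and target == 0.0:  # false positive: model would interrupt the learner
        loss *= fp_penalty
    return float(loss)

fp = focal_bce(logit=2.0, target=0.0)   # "complete" while the learner is mid-thought
fn = focal_bce(logit=-2.0, target=1.0)  # "incomplete" though the turn is done
print(round(fp / fn, 2))  # → 6.0  (alpha gives 3x, fp_penalty doubles it)
```

The ratio is exactly (1-alpha)·fp_penalty / alpha = 0.75·2 / 0.25 = 6 because both mistakes share the same p_t and BCE term; only the alpha weight and the penalty differ.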
# ---------------------------------------------------------------------------
# Data loading
# ---------------------------------------------------------------------------

@dataclass
class AudioSample:
    audio: np.ndarray
    label: float  # 1.0 = complete, 0.0 = incomplete
    source: str
    speaker_id: str = ""


def load_pipecat_portuguese(max_samples: int = 5000) -> list[AudioSample]:
    """Load Pipecat v3.2 Portuguese data (from 01_download_pipecat.py)."""
    audio_dir = DATA_DIR / "pipecat_pt_audio"
    if not audio_dir.exists():
        log.warning("Pipecat PT audio not found at %s — run 01_download_pipecat.py first", audio_dir)
        return []

    import soundfile as sf

    samples = []
    for wav_path in sorted(audio_dir.glob("*.wav"))[:max_samples]:
        try:
            audio, sr = sf.read(str(wav_path))
            audio = np.array(audio, dtype=np.float32)

            if sr != SAMPLE_RATE:
                import torchaudio
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()

            label = 1.0 if "_complete" in wav_path.name else 0.0
            audio = _extract_window(audio)
            if audio is not None:
                samples.append(AudioSample(
                    audio=audio, label=label,
                    source="pipecat_v3.2",
                    speaker_id=f"pipecat_{wav_path.stem.split('_')[0]}",
                ))
        except Exception as e:
            log.warning("Error loading %s: %s", wav_path.name, e)

    n_c = sum(1 for s in samples if s.label == 1.0)
    n_i = sum(1 for s in samples if s.label == 0.0)
    log.info("Pipecat PT: %d samples (%d complete, %d incomplete)", len(samples), n_c, n_i)
    return samples

def load_tts_dataset(max_samples: int = 10000) -> list[AudioSample]:
    """Load TTS-generated audio (from 03_generate_audio.py)."""
    tts_dir = DATA_DIR / "tts_dataset"
    meta_path = tts_dir / "metadata.json"

    if not meta_path.exists():
        log.warning("TTS dataset not found at %s — run 03_generate_audio.py first", tts_dir)
        return []

    import soundfile as sf

    with open(meta_path) as f:
        metadata = json.load(f)

    audio_dir = tts_dir / "audio"
    samples = []

    for meta in metadata[:max_samples]:
        wav_path = audio_dir / meta["file"]
        if not wav_path.exists():
            continue

        try:
            audio, sr = sf.read(str(wav_path))
            audio = np.array(audio, dtype=np.float32)

            if sr != SAMPLE_RATE:
                import torchaudio
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()

            label = 1.0 if meta["label"] == "complete" else 0.0
            audio = _extract_window(audio)
            if audio is not None:
                samples.append(AudioSample(
                    audio=audio, label=label,
                    source=f"tts_{meta['accent']}",
                    speaker_id=meta["voice"],
                ))
        except Exception as e:
            log.warning("Error loading %s: %s", meta["file"], e)

    n_c = sum(1 for s in samples if s.label == 1.0)
    n_i = sum(1 for s in samples if s.label == 0.0)
    log.info("TTS dataset: %d samples (%d complete, %d incomplete)", len(samples), n_c, n_i)
    return samples

def load_coraa_real_audio(max_samples: int = 3000) -> list[AudioSample]:
|
| 273 |
+
"""Load REAL conversational audio from CORAA (not TTS).
|
| 274 |
+
|
| 275 |
+
SpeculativeETD showed synthetic-only training has a devastating gap:
|
| 276 |
+
F1 drops from 94.7% → 30.3% on real data. Mixing real audio is critical.
|
| 277 |
+
|
| 278 |
+
CORAA MUPE has 365h of real PT-BR interviews with speaker diarization.
|
| 279 |
+
We use these as COMPLETE samples (full utterances with natural prosody).
|
| 280 |
+
For INCOMPLETE, we truncate at natural pause points.
|
| 281 |
+
"""
|
| 282 |
+
from datasets import load_dataset
|
| 283 |
+
import re
|
| 284 |
+
|
| 285 |
+
log.info("Loading CORAA MUPE real audio (streaming)...")
|
| 286 |
+
try:
|
| 287 |
+
ds = load_dataset(
|
| 288 |
+
"nilc-nlp/CORAA-MUPE-ASR",
|
| 289 |
+
split="train",
|
| 290 |
+
streaming=True,
|
| 291 |
+
cache_dir=str(CACHE_DIR),
|
| 292 |
+
)
|
| 293 |
+
except Exception as e:
|
| 294 |
+
log.warning("Failed to load CORAA MUPE: %s", e)
|
| 295 |
+
return []
|
| 296 |
+
|
| 297 |
+
samples = []
|
| 298 |
+
complete_count = 0
|
| 299 |
+
incomplete_count = 0
|
| 300 |
+
target_per_class = max_samples // 2
|
| 301 |
+
|
| 302 |
+
for i, row in enumerate(ds):
|
| 303 |
+
        if complete_count >= target_per_class and incomplete_count >= target_per_class:
            break

        try:
            audio_data = row.get("audio", {})
            if not audio_data:
                continue

            audio = np.array(audio_data["array"], dtype=np.float32)
            sr = audio_data["sampling_rate"]

            if sr != SAMPLE_RATE:
                import torchaudio
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()

            duration = len(audio) / SAMPLE_RATE
            if duration < 1.0:
                continue

            text = str(row.get("text", ""))
            speaker_id = f"coraa_real_{i // 50}"

            # COMPLETE: full utterances ending with sentence punctuation
            if complete_count < target_per_class and re.search(r'[.!?]+\s*$', text):
                window = _extract_window(audio)
                if window is not None:
                    samples.append(AudioSample(
                        audio=window, label=1.0,
                        source="coraa_real",
                        speaker_id=speaker_id,
                    ))
                    complete_count += 1

            # INCOMPLETE: truncate real audio at 40-70% (natural mid-utterance)
            if incomplete_count < target_per_class and duration >= 2.0:
                cut_frac = random.uniform(0.4, 0.7)
                truncated = audio[:int(len(audio) * cut_frac)]
                window = _extract_window(truncated)
                if window is not None:
                    samples.append(AudioSample(
                        audio=window, label=0.0,
                        source="coraa_real",
                        speaker_id=speaker_id,
                    ))
                    incomplete_count += 1

        except Exception:
            continue

        if i % 5000 == 0 and i > 0:
            log.info("  CORAA real: scanned %d, %d complete, %d incomplete", i, complete_count, incomplete_count)

    log.info("CORAA real audio: %d complete + %d incomplete = %d samples",
             complete_count, incomplete_count, len(samples))
    return samples

def load_pipecat_test_data() -> list[AudioSample]:
    """Load Pipecat v3.2 Portuguese test data for evaluation."""
    test_dir = DATA_DIR / "pipecat_pt_test"
    if not test_dir.exists():
        log.warning("Pipecat PT test data not found — run 01_download_pipecat.py first")
        return []

    import soundfile as sf

    samples = []
    for wav_path in sorted(test_dir.glob("*.wav")):
        try:
            audio, sr = sf.read(str(wav_path))
            audio = np.array(audio, dtype=np.float32)

            if sr != SAMPLE_RATE:
                import torchaudio
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()

            label = 1.0 if "_complete" in wav_path.name else 0.0
            audio = _extract_window(audio)
            if audio is not None:
                samples.append(AudioSample(
                    audio=audio, label=label,
                    source="pipecat_test",
                    speaker_id=f"pipecat_test_{wav_path.stem.split('_')[0]}",
                ))
        except Exception as e:
            log.warning("Error loading test %s: %s", wav_path.name, e)

    log.info("Pipecat test: %d samples", len(samples))
    return samples

def _extract_window(audio: np.ndarray) -> np.ndarray | None:
    """Extract 8-second window from end of audio, pad if needed."""
    if len(audio) < SAMPLE_RATE:  # minimum 1s
        return None

    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak * 0.9

    if len(audio) > WINDOW_SAMPLES:
        audio = audio[-WINDOW_SAMPLES:]
    elif len(audio) < WINDOW_SAMPLES:
        padding = WINDOW_SAMPLES - len(audio)
        audio = np.pad(audio, (padding, 0), mode="constant")

    # ~200ms silence at end (VAD behavior)
    silence_samples = int(0.2 * SAMPLE_RATE)
    audio[-silence_samples:] = 0.0

    return audio.astype(np.float32)

# ---------------------------------------------------------------------------
# Audio augmentation
# ---------------------------------------------------------------------------

# Cache for background noise samples (loaded once)
_noise_cache: list[np.ndarray] = []


def _load_noise_samples() -> list[np.ndarray]:
    """Load real background noise samples for augmentation.

    Pipecat v3.2 used CC-0 Freesound.org cafe/office noise and saw
    40% fewer short-utterance misclassifications (pipecat_smart_turn_v3_2.md).
    """
    global _noise_cache
    if _noise_cache:
        return _noise_cache

    noise_dir = DATA_DIR / "noise_samples"
    if not noise_dir.exists():
        return []

    import soundfile as sf
    for wav_path in noise_dir.glob("*.wav"):
        try:
            audio, sr = sf.read(str(wav_path))
            audio = np.array(audio, dtype=np.float32)
            if sr != SAMPLE_RATE:
                import torchaudio
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
            _noise_cache.append(audio)
        except Exception:
            pass

    if _noise_cache:
        log.info("Loaded %d background noise samples for augmentation", len(_noise_cache))
    return _noise_cache


def augment_audio(audio: np.ndarray) -> np.ndarray:
    """Data augmentation for training.

    Per reference analysis:
    - Real background noise (Pipecat v3.2: 40% fewer errors with cafe/office noise)
    - Speed perturbation (standard in speech ML)
    - Volume variation (simulates mic distance)
    - Time shift
    """
    aug = audio.copy()

    # Speed perturbation
    if random.random() < 0.5:
        speed = random.uniform(0.9, 1.1)
        indices = np.arange(0, len(aug), speed).astype(int)
        indices = indices[indices < len(aug)]
        aug = aug[indices]
        if len(aug) > WINDOW_SAMPLES:
            aug = aug[:WINDOW_SAMPLES]
        elif len(aug) < WINDOW_SAMPLES:
            aug = np.pad(aug, (WINDOW_SAMPLES - len(aug), 0), mode="constant")

    # Volume variation
    if random.random() < 0.5:
        aug *= random.uniform(0.6, 1.4)

    # Real background noise (preferred) or Gaussian fallback
    noise_samples = _load_noise_samples()
    if noise_samples and random.random() < 0.5:
        noise = random.choice(noise_samples)
        # Loop or trim noise to match audio length
        if len(noise) < len(aug):
            repeats = len(aug) // len(noise) + 1
            noise = np.tile(noise, repeats)[:len(aug)]
        else:
            start = random.randint(0, len(noise) - len(aug))
            noise = noise[start:start + len(aug)]
        snr_db = random.uniform(10, 25)  # 10-25 dB SNR
        noise_scale = np.sqrt(np.mean(aug ** 2)) / (np.sqrt(np.mean(noise ** 2)) * 10 ** (snr_db / 20) + 1e-8)
        aug += noise * noise_scale
    elif random.random() < 0.4:
        # Gaussian noise fallback
        noise_level = random.uniform(0.002, 0.02)
        aug += np.random.randn(len(aug)).astype(np.float32) * noise_level

    # Time shift
    if random.random() < 0.3:
        shift = random.randint(-int(0.3 * SAMPLE_RATE), int(0.3 * SAMPLE_RATE))
        aug = np.roll(aug, shift)
        if shift > 0:
            aug[:shift] = 0.0
        elif shift < 0:
            aug[shift:] = 0.0

    return np.clip(aug, -1.0, 1.0).astype(np.float32)

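A quick numerical check of the SNR scaling formula used in the noise-mixing branch above: scaling the noise by `rms(signal) / (rms(noise) * 10**(snr_db/20))` should land the mixed audio at the target SNR. The variable names here are local to this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000).astype(np.float32)
noise = rng.standard_normal(16000).astype(np.float32) * 3.0  # deliberately loud noise

snr_db = 15.0
# Same scale factor as in augment_audio's noise branch
scale = np.sqrt(np.mean(signal ** 2)) / (np.sqrt(np.mean(noise ** 2)) * 10 ** (snr_db / 20) + 1e-8)
scaled = noise * scale

# Achieved SNR in dB: 10*log10(P_signal / P_noise)
achieved = 10 * np.log10(np.mean(signal ** 2) / np.mean(scaled ** 2))
assert abs(achieved - snr_db) < 0.1
```

The `1e-8` in the denominator only matters for silent noise clips; for any real recording the achieved SNR matches the sampled target almost exactly.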
# ---------------------------------------------------------------------------
# PyTorch Dataset
# ---------------------------------------------------------------------------

class SmartTurnDataset(Dataset):
    def __init__(
        self,
        samples: list[AudioSample],
        feature_extractor: WhisperFeatureExtractor,
        augment: bool = False,
    ):
        self.samples = samples
        self.feature_extractor = feature_extractor
        self.augment = augment

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> dict:
        sample = self.samples[idx]
        audio = sample.audio

        if self.augment:
            audio = augment_audio(audio)

        inputs = self.feature_extractor(
            audio,
            sampling_rate=SAMPLE_RATE,
            return_tensors="np",
            padding="max_length",
            max_length=WINDOW_SAMPLES,
            truncation=True,
            do_normalize=True,
        )

        features = inputs.input_features.squeeze(0).astype(np.float32)

        return {
            "input_features": torch.from_numpy(features),
            "labels": torch.tensor(sample.label, dtype=torch.float32),
        }


# ---------------------------------------------------------------------------
# Train/val split
# ---------------------------------------------------------------------------

def split_by_speaker(
    samples: list[AudioSample],
    val_frac: float = 0.1,
    test_frac: float = 0.1,
) -> tuple[list[AudioSample], list[AudioSample], list[AudioSample]]:
    """Split by speaker ID to prevent data leakage."""
    speaker_map: dict[str, list[AudioSample]] = {}
    for s in samples:
        speaker_map.setdefault(s.speaker_id or "unknown", []).append(s)

    speakers = list(speaker_map.keys())
    random.shuffle(speakers)

    n_val = max(1, int(len(speakers) * val_frac))
    n_test = max(1, int(len(speakers) * test_frac))

    test_spk = set(speakers[:n_test])
    val_spk = set(speakers[n_test:n_test + n_val])
    train_spk = set(speakers[n_test + n_val:])

    train = [s for sp in train_spk for s in speaker_map[sp]]
    val = [s for sp in val_spk for s in speaker_map[sp]]
    test = [s for sp in test_spk for s in speaker_map[sp]]

    random.shuffle(train)
    random.shuffle(val)
    random.shuffle(test)

    for name, split in [("Train", train), ("Val", val), ("Test", test)]:
        n_c = sum(1 for s in split if s.label == 1.0)
        n_i = sum(1 for s in split if s.label == 0.0)
        log.info("  %s: %d samples (%d complete, %d incomplete, %d speakers)",
                 name, len(split), n_c, n_i, len(set(s.speaker_id for s in split)))

    return train, val, test

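A standalone sketch of the speaker-level split above, using plain dicts instead of `AudioSample` so it runs on its own. The grouping and disjointness properties are the same: every sample lands in exactly one split, and no speaker appears in two splits. The name `split_speakers_sketch` is illustrative only.

```python
import random


def split_speakers_sketch(samples, val_frac=0.1, test_frac=0.1, seed=0):
    # Group samples by speaker, then assign whole speakers to splits
    groups = {}
    for s in samples:
        groups.setdefault(s["speaker_id"], []).append(s)
    speakers = sorted(groups)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_frac))
    n_val = max(1, int(len(speakers) * val_frac))
    test_spk = speakers[:n_test]
    val_spk = speakers[n_test:n_test + n_val]
    train_spk = speakers[n_test + n_val:]
    pick = lambda spks: [s for sp in spks for s in groups[sp]]
    return pick(train_spk), pick(val_spk), pick(test_spk)


samples = [{"speaker_id": f"spk_{i // 4}", "label": i % 2} for i in range(40)]
tr, va, te = split_speakers_sketch(samples)
ids = lambda xs: {s["speaker_id"] for s in xs}
assert ids(tr).isdisjoint(ids(va)) and ids(tr).isdisjoint(ids(te)) and ids(va).isdisjoint(ids(te))
assert len(tr) + len(va) + len(te) == len(samples)  # no sample lost or duplicated
```

Splitting at the speaker level rather than the sample level is what prevents the model from memorizing voice characteristics that reappear in validation.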
# ---------------------------------------------------------------------------
# Training
# ---------------------------------------------------------------------------

def load_speak_improve_l2(max_samples: int = 2000) -> list[AudioSample]:
    """Load L2 learner speech from Speak & Improve Corpus 2025.

    340 hours of L2 English learner speech with disfluency annotations and
    CEFR proficiency scores (A2-C1, majority B1-B2).

    Even though this is English L2 (not Portuguese), L2 hesitation patterns
    transfer across languages (Cenoz 2000). The pauses, false starts, and
    disfluencies of L2 speakers share universal characteristics regardless
    of target language.

    Knill et al. (2025). Speak & Improve Corpus 2025: an L2 English Speech
    Corpus for Language Assessment and Feedback. arXiv:2412.11986
    """
    from datasets import load_dataset

    log.info("Loading Speak & Improve L2 corpus (streaming)...")
    try:
        ds = load_dataset(
            "CambridgeEnglish/SpeakAndImprove2025",
            split="train",
            streaming=True,
            cache_dir=str(CACHE_DIR),
            trust_remote_code=True,
        )
    except Exception as e:
        log.warning("Failed to load Speak & Improve corpus: %s — trying fallback name", e)
        try:
            ds = load_dataset(
                "cambridgeenglishtests/speak-and-improve-2025",
                split="train",
                streaming=True,
                cache_dir=str(CACHE_DIR),
                trust_remote_code=True,
            )
        except Exception as e2:
            log.warning("Speak & Improve corpus not available: %s — skipping", e2)
            return []

    samples = []
    complete_count = 0
    incomplete_count = 0
    target_per_class = max_samples // 2

    for i, row in enumerate(ds):
        if complete_count >= target_per_class and incomplete_count >= target_per_class:
            break

        try:
            audio_data = row.get("audio", {})
            if not audio_data:
                continue

            audio = np.array(audio_data["array"], dtype=np.float32)
            sr = audio_data["sampling_rate"]

            if sr != SAMPLE_RATE:
                import torchaudio
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()

            duration = len(audio) / SAMPLE_RATE
            if duration < 1.0:
                continue

            text = str(row.get("text", row.get("transcript", "")))
            speaker_id = f"si_l2_{i // 20}"

            # COMPLETE: full utterances (use entire audio)
            if complete_count < target_per_class and duration >= 2.0:
                window = _extract_window(audio)
                if window is not None:
                    samples.append(AudioSample(
                        audio=window, label=1.0,
                        source="speak_improve_l2",
                        speaker_id=speaker_id,
                    ))
                    complete_count += 1

            # INCOMPLETE: truncate at random point (simulates mid-utterance)
            if incomplete_count < target_per_class and duration >= 3.0:
                cut_frac = random.uniform(0.3, 0.6)
                truncated = audio[:int(len(audio) * cut_frac)]
                # Add hesitation-like pause at end (L2 speakers pause 524ms+ per Kosmala 2022)
                pause_ms = random.uniform(500, 2000)
                pause_samples = int(pause_ms / 1000 * SAMPLE_RATE)
                pause = np.random.randn(pause_samples).astype(np.float32) * 0.001
                truncated = np.concatenate([truncated, pause])
                window = _extract_window(truncated)
                if window is not None:
                    samples.append(AudioSample(
                        audio=window, label=0.0,
                        source="speak_improve_l2",
                        speaker_id=speaker_id,
                    ))
                    incomplete_count += 1

        except Exception:
            continue

        if i % 2000 == 0 and i > 0:
            log.info("  S&I L2: scanned %d, %d complete, %d incomplete", i, complete_count, incomplete_count)

    log.info("Speak & Improve L2: %d complete + %d incomplete = %d samples",
             complete_count, incomplete_count, len(samples))
    return samples

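The training code below recommends an eager and a final threshold (Deepgram Flux-inspired) but leaves the runtime decision logic to the inference side. A hedged sketch of such a dual-threshold policy; `TurnAction` and `decide_turn` are illustrative names, not part of this codebase.

```python
from enum import Enum


class TurnAction(Enum):
    WAIT = "wait"            # keep listening
    PREPARE = "prepare"      # speculatively start LLM generation
    TAKE_TURN = "take_turn"  # commit: take the turn and speak


def decide_turn(p_complete: float, eager: float = 0.4, final: float = 0.7) -> TurnAction:
    """Map a turn-completion probability to an action using two thresholds."""
    if p_complete >= final:
        return TurnAction.TAKE_TURN
    if p_complete >= eager:
        return TurnAction.PREPARE
    return TurnAction.WAIT


assert decide_turn(0.2) is TurnAction.WAIT
assert decide_turn(0.5) is TurnAction.PREPARE
assert decide_turn(0.9) is TurnAction.TAKE_TURN
```

The gap between `eager` and `final` is where speculative generation buys back latency without committing the avatar to interrupting a hesitating learner.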
def train(
    epochs: int = 6,
    batch_size: int = 128,
    lr: float = 5e-5,
    warmup_ratio: float = 0.2,
    max_pipecat_samples: int = 5000,
    max_tts_samples: int = 10000,
    max_l2_samples: int = 2000,
    whisper_model: str = "openai/whisper-tiny",
    loss_fn: str = "focal",  # "focal" or "bce" (Pipecat's original)
    fp_penalty: float = 2.0,  # asymmetric cost: FP costs 2x more than FN
) -> Path:
    """Fine-tune SmartTurnV3Model on Portuguese data.

    Default hyperparams from Pipecat's train.py:
    - lr=5e-5, warmup_ratio=0.2, cosine schedule
    - Pipecat uses epochs=4 on 270K samples; we use 6 on 5-15K (more passes needed)
    - Pipecat uses batch_size=384; we use 128 (A10G has 24GB, enough for 128)
    - loss_fn="focal" (our improvement) or "bce" (Pipecat's original) for comparison

    Language-learning-specific improvements (March 2026 research):
    - fp_penalty=2.0: asymmetric cost — interrupting a learner mid-thought is
      much worse than waiting too long (ConversAR 2025, Praktika approach)
    - Speak & Improve L2 corpus: real L2 learner speech with hesitations/disfluencies
    - Dual threshold in evaluation: eager (0.3-0.5) for speculative prep,
      final (0.7+) for actual turn transition (inspired by Deepgram Flux)
    """

    device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
    log.info("Training on device: %s", device)
    if device == "cuda":
        log.info("GPU: %s (%d MB)", torch.cuda.get_device_name(),
                 torch.cuda.get_device_properties(0).total_memory // 1024 // 1024)

    # ----- Load all data sources -----
    t0 = time.time()
    all_samples: list[AudioSample] = []

    # 1. Pipecat Portuguese (primary — these have LLM-curated labels)
    pipecat = load_pipecat_portuguese(max_samples=max_pipecat_samples)
    all_samples.extend(pipecat)
    del pipecat
    gc.collect()

    # 2. Our TTS-generated data (native PT-BR + French accent + short utterances)
    tts = load_tts_dataset(max_samples=max_tts_samples)
    all_samples.extend(tts)
    del tts
    gc.collect()

    # 3. CORAA real audio (critical: SpeculativeETD showed synthetic→real gap of 94.7%→30.3%)
    coraa = load_coraa_real_audio(max_samples=3000)
    all_samples.extend(coraa)
    del coraa
    gc.collect()

    # 4. Speak & Improve L2 corpus — real L2 learner speech with disfluencies
    #    L2 hesitation patterns transfer across languages (Cenoz 2000)
    #    340h of L2 English learners, CEFR A2-C1 (arXiv:2412.11986)
    si_l2 = load_speak_improve_l2(max_samples=max_l2_samples)
    all_samples.extend(si_l2)
    del si_l2
    gc.collect()

    if not all_samples:
        raise RuntimeError("No training samples loaded! Run 01/03 scripts first.")

    load_time = time.time() - t0
    n_c = sum(1 for s in all_samples if s.label == 1.0)
    n_i = sum(1 for s in all_samples if s.label == 0.0)
    sources = {}
    for s in all_samples:
        sources[s.source] = sources.get(s.source, 0) + 1

    log.info("Total: %d samples (%d complete, %d incomplete) in %.0fs", len(all_samples), n_c, n_i, load_time)
    for src, cnt in sorted(sources.items()):
        log.info("  %s: %d", src, cnt)

    # ----- Split -----
    log.info("=== Splitting by speaker ===")
    train_samples, val_samples, internal_test = split_by_speaker(all_samples)

    # Also load Pipecat test data (held-out, never seen during training)
    pipecat_test = load_pipecat_test_data()

    # ----- Datasets -----
    feature_extractor = WhisperFeatureExtractor(chunk_length=8)
    train_ds = SmartTurnDataset(train_samples, feature_extractor, augment=True)
    val_ds = SmartTurnDataset(val_samples, feature_extractor, augment=False)

    # Balanced sampling
    train_labels = [s.label for s in train_samples]
    n_pos = sum(1 for l in train_labels if l == 1.0)
    n_neg = len(train_labels) - n_pos
    weights = [1.0 / n_neg if l == 0.0 else 1.0 / n_pos for l in train_labels]
    sampler = WeightedRandomSampler(weights, len(weights))

    use_pin = device == "cuda"
    n_workers = 4 if device == "cuda" else 0
    train_loader = DataLoader(train_ds, batch_size=batch_size, sampler=sampler,
                              num_workers=n_workers, pin_memory=use_pin)
    val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False,
                            num_workers=0, pin_memory=use_pin)

    # ----- Model -----
    model = SmartTurnV3Model(whisper_model=whisper_model).to(device)
    total_params = sum(p.numel() for p in model.parameters())
    log.info("Model: %d params (%.1f MB)", total_params, total_params * 4 / 1024 / 1024)

    # ----- Loss -----
    # Pipecat uses BCEWithLogitsLoss with dynamic pos_weight (clamped 0.1-10.0).
    # We default to Focal Loss (alpha=0.25, gamma=2.0 per Lin et al. 2017)
    # but support BCE for comparison.
    pos_weight_val = min(max(n_neg / max(n_pos, 1), 0.1), 10.0)
    pos_weight = torch.tensor([pos_weight_val], device=device)

    if loss_fn == "bce":
        criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
        log.info("Loss: BCEWithLogitsLoss (Pipecat original), pos_weight=%.2f", pos_weight_val)
    else:
        criterion = FocalLoss(gamma=2.0, alpha=0.25, fp_penalty=fp_penalty)
        log.info("Loss: FocalLoss (gamma=2.0, alpha=0.25, fp_penalty=%.1f)", fp_penalty)
        log.info("  Asymmetric cost: FP (interrupting learner) penalized %.1fx more than FN", fp_penalty)

    # ----- Optimizer: uniform LR (Pipecat uses single lr=5e-5 for all params) -----
    # Pipecat's train.py does NOT use differential LR — all params at same rate.
    # Since we initialize from whisper-tiny (same as Pipecat), uniform LR is correct.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)

    # Cosine schedule with warmup (Pipecat: warmup_ratio=0.2)
    total_steps = epochs * len(train_loader)
    warmup_steps = int(total_steps * warmup_ratio)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(warmup_steps, 1)
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        return 0.5 * (1 + np.cos(np.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # ----- Training loop -----
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    best_f1 = 0.0
    best_path = OUTPUT_DIR / "best_model.pt"
    resume_path = OUTPUT_DIR / "resume_checkpoint.pt"
    patience = 5
    patience_counter = 0
    history = []
    start_epoch = 0

    # Resume support
    if resume_path.exists():
        log.info("=== Resuming from checkpoint ===")
        ckpt = torch.load(resume_path, map_location=device, weights_only=False)
        model.load_state_dict(ckpt["model_state_dict"])
        optimizer.load_state_dict(ckpt["optimizer_state_dict"])
        scheduler.load_state_dict(ckpt["scheduler_state_dict"])
        start_epoch = ckpt["epoch"]
        best_f1 = ckpt.get("best_f1", 0.0)
        patience_counter = ckpt.get("patience_counter", 0)
        history = ckpt.get("history", [])
        log.info("  Resumed at epoch %d, best_f1=%.4f", start_epoch, best_f1)

    log.info("=== Training: %d epochs, batch=%d, lr=%.1e, warmup_ratio=%.1f ===",
             epochs, batch_size, lr, warmup_ratio)

    for epoch in range(start_epoch, epochs):
        model.train()
        train_loss = 0.0
        train_correct = 0
        train_total = 0
        t_epoch = time.time()

        for batch_idx, batch in enumerate(train_loader):
            features = batch["input_features"].to(device)
            labels = batch["labels"].to(device)

            logits = model(features)
            loss = criterion(logits, labels)

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()

            train_loss += loss.item() * len(labels)
            preds = (torch.sigmoid(logits) > 0.5).float()
            hard_labels = (labels > 0.5).float()
            train_correct += (preds == hard_labels).sum().item()
            train_total += len(labels)

            if batch_idx % 50 == 0 and batch_idx > 0:
                log.info("  batch %d/%d loss=%.4f", batch_idx, len(train_loader), loss.item())

        # Validate
        model.eval()
        val_metrics = _evaluate(model, val_loader, device, criterion)
        train_acc = train_correct / max(train_total, 1)
        epoch_time = time.time() - t_epoch

        log.info(
            "Epoch %d/%d (%.0fs): train_loss=%.4f train_acc=%.3f | "
            "val_acc=%.3f val_f1=%.3f prec=%.3f rec=%.3f",
            epoch + 1, epochs, epoch_time,
            train_loss / max(train_total, 1), train_acc,
            val_metrics["accuracy"], val_metrics["f1"],
            val_metrics["precision"], val_metrics["recall"],
        )

        history.append({
            "epoch": epoch + 1,
            "train_loss": train_loss / max(train_total, 1),
            "train_acc": train_acc,
            **{f"val_{k}": v for k, v in val_metrics.items()},
        })

        # Save best
        if val_metrics["f1"] > best_f1:
            best_f1 = val_metrics["f1"]
            torch.save({
                "model_state_dict": model.state_dict(),
                "epoch": epoch + 1,
                "val_f1": best_f1,
                "val_metrics": val_metrics,
                "whisper_model": whisper_model,
            }, best_path)
            log.info("  -> New best model (val_f1=%.4f)", best_f1)
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                log.info("Early stopping at epoch %d", epoch + 1)
                break

        # Resume checkpoint
        torch.save({
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "scheduler_state_dict": scheduler.state_dict(),
            "epoch": epoch + 1,
            "best_f1": best_f1,
            "patience_counter": patience_counter,
            "history": history,
        }, resume_path)

    if resume_path.exists():
        resume_path.unlink()

    # ----- Final evaluation on internal test split -----
    log.info("\n=== Internal Test Evaluation ===")
    checkpoint = torch.load(best_path, map_location=device, weights_only=True)
    model.load_state_dict(checkpoint["model_state_dict"])
    model.eval()

    if internal_test:
        test_ds = SmartTurnDataset(internal_test, feature_extractor, augment=False)
        test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
        test_metrics = _evaluate(model, test_loader, device, criterion)
        _log_metrics("Internal test", test_metrics)
    else:
        test_metrics = {}

    # ----- Pipecat test set (baseline comparison) -----
    pipecat_test_metrics = {}
    if pipecat_test:
        log.info("\n=== Pipecat PT Test Evaluation (baseline comparison) ===")
        pipecat_ds = SmartTurnDataset(pipecat_test, feature_extractor, augment=False)
        pipecat_loader = DataLoader(pipecat_ds, batch_size=batch_size, shuffle=False)
        pipecat_test_metrics = _evaluate(model, pipecat_loader, device, criterion)
        _log_metrics("Pipecat PT test", pipecat_test_metrics)
        log.info("  Pipecat baseline: 95.42%% accuracy, 2.79%% FP, 1.79%% FN")

    # ----- Threshold sweep -----
    log.info("\n=== Threshold Sweep ===")
    threshold_results = {}
    eval_loader = test_loader if internal_test else val_loader
    for thresh in [0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85]:
        t_metrics = _evaluate(model, eval_loader, device, criterion, threshold=thresh)
        threshold_results[str(thresh)] = t_metrics
        log.info("  threshold=%.2f: prec=%.3f rec=%.3f f1=%.3f acc=%.3f FP=%.3f",
                 thresh, t_metrics["precision"], t_metrics["recall"],
                 t_metrics["f1"], t_metrics["accuracy"], t_metrics["fp_rate"])

    # ----- Dual threshold recommendation (inspired by Deepgram Flux) -----
    # For a language-learning avatar:
    # - eager_threshold (0.3-0.5): start preparing response speculatively
    #   (e.g., begin LLM generation) — reduces perceived latency
    # - final_threshold (0.7+): actually take the turn and speak
    # Higher final threshold = fewer interruptions (critical for L2 learners)
    log.info("\n=== Dual Threshold Recommendation (Deepgram Flux-inspired) ===")
    # Find final threshold: maximize precision with recall >= 85%
    # (we want very few interruptions, even at the cost of some missed turns)
    final_candidates = [(k, v) for k, v in threshold_results.items()
                        if v["recall"] >= 0.85 and float(k) >= 0.6]
    if final_candidates:
        best_final = min(final_candidates, key=lambda x: x[1]["fp_rate"])
        log.info("  Recommended final_threshold: %.2f (FP=%.1f%%, prec=%.3f, rec=%.3f)",
                 float(best_final[0]), best_final[1]["fp_rate"] * 100,
                 best_final[1]["precision"], best_final[1]["recall"])
    else:
        best_final = ("0.7", threshold_results.get("0.7", {}))
        log.info("  Default final_threshold: 0.70")

    # Eager threshold: lower confidence where we start speculative processing
    eager_thresh = max(0.3, float(best_final[0]) - 0.3)
    log.info("  Recommended eager_threshold: %.2f (start LLM generation speculatively)",
             eager_thresh)
    log.info("  Latency savings: ~150-250ms earlier response start (Deepgram Flux benchmark)")

    # ----- ONNX export (FP32 → INT8, per Pipecat's deploy pipeline) -----
    model = model.to("cpu")
    onnx_fp32_path = OUTPUT_DIR / "smart_turn_pt_v3_fp32.onnx"
    onnx_int8_path = OUTPUT_DIR / "smart_turn_pt_v3.onnx"
    dummy = torch.randn(1, 80, 800)
    try:
        # Step 1: Export FP32 ONNX (opset 18, per Pipecat train.py)
        torch.onnx.export(
            model, dummy, str(onnx_fp32_path),
            input_names=["input_features"],
            output_names=["logits"],
            dynamic_axes={"input_features": {0: "batch"}, "logits": {0: "batch"}},
            opset_version=18,
        )
        log.info("ONNX FP32 exported: %s (%.1f MB)", onnx_fp32_path,
                 onnx_fp32_path.stat().st_size / 1024 / 1024)

        # Step 2: INT8 static quantization (entropy calibration, per Pipecat)
        # Pipecat uses: 1024 calibration samples, QDQ format, Entropy method
        try:
            from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantFormat

            class SmartTurnCalibrationReader(CalibrationDataReader):
                def __init__(self, dataset, n_samples=1024):
                    self.data = []
                    for i, sample in enumerate(dataset):
                        if i >= n_samples:
                            break
                        self.data.append({"input_features": sample["input_features"].unsqueeze(0).numpy()})
                    self.idx = 0

                def get_next(self):
                    if self.idx >= len(self.data):
                        return None
                    result = self.data[self.idx]
                    self.idx += 1
                    return result

            # Use validation data for calibration
            calib_reader = SmartTurnCalibrationReader(val_ds, n_samples=1024)
            quantize_static(
                str(onnx_fp32_path),
                str(onnx_int8_path),
                calib_reader,
                quant_format=QuantFormat.QDQ,
            )
            log.info("ONNX INT8 exported: %s (%.1f MB)", onnx_int8_path,
                     onnx_int8_path.stat().st_size / 1024 / 1024)
        except ImportError:
            log.warning("onnxruntime.quantization not available — saving FP32 only")
            import shutil
            shutil.copy2(onnx_fp32_path, onnx_int8_path)
        except Exception as e:
            log.warning("INT8 quantization failed: %s — saving FP32 only", e)
            import shutil
            shutil.copy2(onnx_fp32_path, onnx_int8_path)

    except Exception as e:
        log.warning("ONNX export failed: %s", e)

    # ----- Save results -----
    results = {
        "model": "smart_turn_pt_v3_finetuned",
        "whisper_model": whisper_model,
        "architecture": "SmartTurnV3Model (exact Pipecat)",
        "target_domain": "language_learning_avatar",
        "learner_profile": "francophone_learning_portuguese",
        "total_samples": len(all_samples),
        "sources": sources,
        "best_epoch": checkpoint["epoch"],
        "best_val_f1": best_f1,
        "test_metrics": test_metrics,
        "pipecat_test_metrics": pipecat_test_metrics,
        "pipecat_baseline": {"accuracy": 0.9542, "fp_rate": 0.0279, "fn_rate": 0.0179},
        "threshold_sweep": threshold_results,
        "dual_threshold": {
            "eager_threshold": eager_thresh,
            "final_threshold": float(best_final[0]),
            "rationale": "For L2 language learning: higher final threshold reduces "
                         "interruptions. Eager threshold enables speculative LLM "
                         "generation 150-250ms earlier (Deepgram Flux approach).",
        },
        "history": history,
        "config": {
            "epochs": epochs,
            "batch_size": batch_size,
            "lr": lr,
            "warmup_ratio": warmup_ratio,
            "label_smoothing": LABEL_SMOOTH,
            "focal_loss_gamma": 2.0,
+
"focal_loss_alpha": 0.25,
|
| 1113 |
+
"fp_penalty": fp_penalty,
|
| 1114 |
+
"loss_fn": loss_fn,
|
| 1115 |
+
},
|
| 1116 |
+
}
|
| 1117 |
+
|
| 1118 |
+
results_path = OUTPUT_DIR / "training_results.json"
|
| 1119 |
+
with open(results_path, "w") as f:
|
| 1120 |
+
json.dump(results, f, indent=2)
|
| 1121 |
+
log.info("Results saved to %s", results_path)
|
| 1122 |
+
|
| 1123 |
+
return onnx_int8_path
|
| 1124 |
+
|
| 1125 |
+
|
| 1126 |
+
def _evaluate(
|
| 1127 |
+
model: nn.Module,
|
| 1128 |
+
loader: DataLoader,
|
| 1129 |
+
device: str,
|
| 1130 |
+
criterion: nn.Module,
|
| 1131 |
+
threshold: float = 0.5,
|
| 1132 |
+
) -> dict:
|
| 1133 |
+
"""Evaluate model at a given threshold."""
|
| 1134 |
+
correct = total = tp = fp = fn = tn = 0
|
| 1135 |
+
total_loss = 0.0
|
| 1136 |
+
|
| 1137 |
+
with torch.no_grad():
|
| 1138 |
+
for batch in loader:
|
| 1139 |
+
features = batch["input_features"].to(device)
|
| 1140 |
+
labels = batch["labels"].to(device)
|
| 1141 |
+
logits = model(features)
|
| 1142 |
+
loss = criterion(logits, labels)
|
| 1143 |
+
total_loss += loss.item() * len(labels)
|
| 1144 |
+
|
| 1145 |
+
preds = (torch.sigmoid(logits) > threshold).float()
|
| 1146 |
+
hard_labels = (labels > 0.5).float()
|
| 1147 |
+
correct += (preds == hard_labels).sum().item()
|
| 1148 |
+
total += len(labels)
|
| 1149 |
+
tp += ((preds == 1) & (hard_labels == 1)).sum().item()
|
| 1150 |
+
fp += ((preds == 1) & (hard_labels == 0)).sum().item()
|
| 1151 |
+
fn += ((preds == 0) & (hard_labels == 1)).sum().item()
|
| 1152 |
+
tn += ((preds == 0) & (hard_labels == 0)).sum().item()
|
| 1153 |
+
|
| 1154 |
+
accuracy = correct / max(total, 1)
|
| 1155 |
+
precision = tp / max(tp + fp, 1)
|
| 1156 |
+
recall = tp / max(tp + fn, 1)
|
| 1157 |
+
f1 = 2 * precision * recall / max(precision + recall, 1e-8)
|
| 1158 |
+
|
| 1159 |
+
return {
|
| 1160 |
+
"accuracy": round(accuracy, 4),
|
| 1161 |
+
"precision": round(precision, 4),
|
| 1162 |
+
"recall": round(recall, 4),
|
| 1163 |
+
"f1": round(f1, 4),
|
| 1164 |
+
"loss": round(total_loss / max(total, 1), 4),
|
| 1165 |
+
"tp": tp, "fp": fp, "fn": fn, "tn": tn,
|
| 1166 |
+
"fp_rate": round(fp / max(fp + tn, 1), 4),
|
| 1167 |
+
"fn_rate": round(fn / max(fn + tp, 1), 4),
|
| 1168 |
+
}
|
| 1169 |
+
|
| 1170 |
+
|
| 1171 |
+
def _log_metrics(name: str, metrics: dict) -> None:
|
| 1172 |
+
log.info("%s results:", name)
|
| 1173 |
+
log.info(" Accuracy: %.3f", metrics["accuracy"])
|
| 1174 |
+
log.info(" Precision: %.3f", metrics["precision"])
|
| 1175 |
+
log.info(" Recall: %.3f", metrics["recall"])
|
| 1176 |
+
log.info(" F1: %.3f", metrics["f1"])
|
| 1177 |
+
log.info(" FP rate: %.3f", metrics["fp_rate"])
|
| 1178 |
+
log.info(" FN rate: %.3f", metrics["fn_rate"])
|
| 1179 |
+
log.info(" TP=%d FP=%d FN=%d TN=%d", metrics["tp"], metrics["fp"], metrics["fn"], metrics["tn"])
|
| 1180 |
+
|
| 1181 |
+
|
| 1182 |
+
if __name__ == "__main__":
|
| 1183 |
+
logging.basicConfig(
|
| 1184 |
+
level=logging.INFO,
|
| 1185 |
+
format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
|
| 1186 |
+
)
|
| 1187 |
+
random.seed(42)
|
| 1188 |
+
np.random.seed(42)
|
| 1189 |
+
torch.manual_seed(42)
|
| 1190 |
+
|
| 1191 |
+
train(
|
| 1192 |
+
epochs=6,
|
| 1193 |
+
batch_size=128,
|
| 1194 |
+
lr=5e-5,
|
| 1195 |
+
warmup_ratio=0.2,
|
| 1196 |
+
max_pipecat_samples=5000,
|
| 1197 |
+
max_tts_samples=10000,
|
| 1198 |
+
max_l2_samples=2000,
|
| 1199 |
+
whisper_model="openai/whisper-tiny",
|
| 1200 |
+
loss_fn="focal", # or "bce" for Pipecat's original
|
| 1201 |
+
fp_penalty=2.0, # interrupting learner costs 2x more than waiting
|
| 1202 |
+
)
|
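The `dual_threshold` entry written into `training_results.json` above follows the Deepgram Flux idea of pairing an eager and a final end-of-turn threshold. A minimal sketch of how a runtime might consume those two values (the `DualThresholdGate` class and its action names are hypothetical, not part of the training script):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

class DualThresholdGate:
    """Toy dual-threshold gate: crossing 'eager' starts speculative LLM
    prep, crossing 'final' commits the end of turn (hypothetical names)."""

    def __init__(self, eager: float = 0.4, final: float = 0.7):
        self.eager = eager
        self.final = final

    def decide(self, logit: float) -> str:
        p = sigmoid(logit)
        if p >= self.final:
            return "commit"      # confirmed end of turn: respond now
        if p >= self.eager:
            return "speculate"   # likely end: prep LLM output, keep listening
        return "wait"            # learner may still be mid-hesitation

gate = DualThresholdGate(eager=0.4, final=0.7)
print(gate.decide(2.0))   # high-confidence logit
print(gate.decide(0.0))   # p = 0.5, between the two thresholds
print(gate.decide(-2.0))  # low-confidence logit
```

The gap between the two thresholds is what buys the 150-250ms head start mentioned in the rationale: speculative generation begins at "speculate" and is either confirmed at "commit" or discarded.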
03-finetune-pipecat-pt/05_evaluate.py
ADDED
@@ -0,0 +1,511 @@
"""Evaluate fine-tuned model against Pipecat baseline and ONNX reference.

Comparisons:
1. Our fine-tuned model (PyTorch) on Pipecat PT test set
2. Pipecat original ONNX model on same test set
3. Threshold sweep to find optimal operating point
4. Per-source breakdown (pipecat data vs our TTS data)

Run:
    python 05_evaluate.py --model results/best_model.pt
"""

from __future__ import annotations

import argparse
import json
import logging
import time
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

log = logging.getLogger(__name__)

DATA_DIR = Path(__file__).parent / "data"
_workspace = Path("/workspace") if Path("/workspace").exists() else Path(".")
RESULTS_DIR = _workspace / "results"
CACHE_DIR = _workspace / "hf_cache"

SAMPLE_RATE = 16000
WINDOW_SECONDS = 8
WINDOW_SAMPLES = WINDOW_SECONDS * SAMPLE_RATE

# Pipecat baseline (from their blog: Smart Turn v3 Portuguese)
PIPECAT_BASELINE = {
    "accuracy": 0.9542,
    "fp_rate": 0.0279,
    "fn_rate": 0.0179,
    "source": "https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/",
}


def load_model(model_path: Path, device: str) -> nn.Module:
    """Load fine-tuned SmartTurnV3Model."""
    # Import from finetune script
    import importlib.util
    spec = importlib.util.spec_from_file_location(
        "finetune", Path(__file__).parent / "04_finetune.py"
    )
    finetune = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(finetune)

    model = finetune.SmartTurnV3Model(whisper_model="openai/whisper-tiny")
    checkpoint = torch.load(model_path, map_location=device, weights_only=True)
    model.load_state_dict(checkpoint["model_state_dict"])
    model = model.to(device)
    model.eval()

    log.info("Loaded model from %s (epoch %d, val_f1=%.4f)",
             model_path, checkpoint.get("epoch", -1), checkpoint.get("val_f1", -1))
    return model


def load_onnx_model(onnx_dir: Path | None = None):
    """Load Pipecat ONNX model for comparison."""
    try:
        import onnxruntime as ort
    except ImportError:
        log.warning("onnxruntime not installed — skipping ONNX evaluation")
        return None

    if onnx_dir is None:
        onnx_dir = DATA_DIR / "pipecat_model"

    cpu_path = onnx_dir / "smart-turn-v3.2-cpu.onnx"
    gpu_path = onnx_dir / "smart-turn-v3.2-gpu.onnx"

    onnx_path = cpu_path if cpu_path.exists() else (gpu_path if gpu_path.exists() else None)
    if onnx_path is None:
        log.warning("No Pipecat ONNX model found in %s", onnx_dir)
        return None

    session = ort.InferenceSession(str(onnx_path))
    log.info("Loaded ONNX model: %s", onnx_path)
    return session


def load_test_data() -> tuple[list, list]:
    """Load test datasets: Pipecat PT test + our TTS test."""
    import soundfile as sf
    from transformers import WhisperFeatureExtractor

    feature_extractor = WhisperFeatureExtractor(chunk_length=8)

    # 1. Pipecat test data
    pipecat_samples = []
    test_dir = DATA_DIR / "pipecat_pt_test"
    if test_dir.exists():
        for wav_path in sorted(test_dir.glob("*.wav")):
            try:
                audio, sr = sf.read(str(wav_path))
                audio = np.array(audio, dtype=np.float32)
                if sr != SAMPLE_RATE:
                    import torchaudio
                    tensor = torch.from_numpy(audio).float().unsqueeze(0)
                    audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()

                audio = _extract_window(audio)
                if audio is not None:
                    label = 1.0 if "_complete" in wav_path.name else 0.0
                    features = feature_extractor(
                        audio, sampling_rate=SAMPLE_RATE,
                        return_tensors="np", padding="max_length",
                        max_length=WINDOW_SAMPLES, truncation=True,
                        do_normalize=True,
                    ).input_features.squeeze(0).astype(np.float32)

                    pipecat_samples.append({
                        "features": features,
                        "label": label,
                        "source": "pipecat_test",
                        "file": wav_path.name,
                    })
            except Exception:
                pass

    log.info("Pipecat test: %d samples", len(pipecat_samples))

    # 2. Our TTS test data
    tts_samples = []
    tts_dir = DATA_DIR / "tts_dataset"
    meta_path = tts_dir / "metadata.json"
    if meta_path.exists():
        with open(meta_path) as f:
            metadata = json.load(f)

        # Use last 20% as test
        test_start = int(len(metadata) * 0.8)
        test_meta = metadata[test_start:]

        audio_dir = tts_dir / "audio"
        for meta in test_meta:
            wav_path = audio_dir / meta["file"]
            if not wav_path.exists():
                continue
            try:
                audio, sr = sf.read(str(wav_path))
                audio = np.array(audio, dtype=np.float32)
                if sr != SAMPLE_RATE:
                    import torchaudio
                    tensor = torch.from_numpy(audio).float().unsqueeze(0)
                    audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()

                audio = _extract_window(audio)
                if audio is not None:
                    label = 1.0 if meta["label"] == "complete" else 0.0
                    features = feature_extractor(
                        audio, sampling_rate=SAMPLE_RATE,
                        return_tensors="np", padding="max_length",
                        max_length=WINDOW_SAMPLES, truncation=True,
                        do_normalize=True,
                    ).input_features.squeeze(0).astype(np.float32)

                    tts_samples.append({
                        "features": features,
                        "label": label,
                        "source": meta.get("accent", "unknown"),
                        "file": meta["file"],
                    })
            except Exception:
                pass

    log.info("TTS test: %d samples", len(tts_samples))
    return pipecat_samples, tts_samples


def _extract_window(audio: np.ndarray) -> np.ndarray | None:
    """Extract 8s window from end of audio."""
    if len(audio) < SAMPLE_RATE:
        return None
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak * 0.9
    if len(audio) > WINDOW_SAMPLES:
        audio = audio[-WINDOW_SAMPLES:]
    elif len(audio) < WINDOW_SAMPLES:
        audio = np.pad(audio, (WINDOW_SAMPLES - len(audio), 0), mode="constant")
    audio[-int(0.2 * SAMPLE_RATE):] = 0.0
    return audio.astype(np.float32)


# ---------------------------------------------------------------------------
# Evaluation functions
# ---------------------------------------------------------------------------

def evaluate_pytorch(
    model: nn.Module,
    samples: list[dict],
    device: str,
    threshold: float = 0.5,
) -> dict:
    """Evaluate PyTorch model on samples."""
    if not samples:
        return {}

    tp = fp = fn = tn = 0
    latencies = []

    with torch.no_grad():
        for s in samples:
            features = torch.from_numpy(s["features"]).unsqueeze(0).to(device)

            t0 = time.perf_counter()
            logits = model(features)
            latency_ms = (time.perf_counter() - t0) * 1000
            latencies.append(latency_ms)

            pred = float(torch.sigmoid(logits).item() > threshold)
            label = s["label"]

            if pred == 1 and label == 1:
                tp += 1
            elif pred == 1 and label == 0:
                fp += 1
            elif pred == 0 and label == 1:
                fn += 1
            else:
                tn += 1

    return _compute_metrics(tp, fp, fn, tn, latencies)


def evaluate_onnx(
    session,
    samples: list[dict],
    threshold: float = 0.5,
) -> dict:
    """Evaluate ONNX model on samples."""
    if not samples or session is None:
        return {}

    tp = fp = fn = tn = 0
    latencies = []

    input_name = session.get_inputs()[0].name
    for s in samples:
        features = s["features"][np.newaxis, ...]

        t0 = time.perf_counter()
        outputs = session.run(None, {input_name: features})
        latency_ms = (time.perf_counter() - t0) * 1000
        latencies.append(latency_ms)

        logit = outputs[0].item() if outputs[0].size == 1 else outputs[0][0].item()
        pred = float(1.0 / (1.0 + np.exp(-logit)) > threshold)
        label = s["label"]

        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
        else:
            tn += 1

    return _compute_metrics(tp, fp, fn, tn, latencies)


def _compute_metrics(tp, fp, fn, tn, latencies=None) -> dict:
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / max(total, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    fp_rate = fp / max(fp + tn, 1)
    fn_rate = fn / max(fn + tp, 1)

    metrics = {
        "accuracy": round(accuracy, 4),
        "precision": round(precision, 4),
        "recall": round(recall, 4),
        "f1": round(f1, 4),
        "fp_rate": round(fp_rate, 4),
        "fn_rate": round(fn_rate, 4),
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "total": total,
    }

    if latencies:
        metrics["latency_mean_ms"] = round(np.mean(latencies), 2)
        metrics["latency_p50_ms"] = round(np.median(latencies), 2)
        metrics["latency_p95_ms"] = round(np.percentile(latencies, 95), 2)

    return metrics


# ---------------------------------------------------------------------------
# Main evaluation
# ---------------------------------------------------------------------------

def run_evaluation(model_path: Path) -> dict:
    """Run full evaluation suite."""
    device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
    log.info("Evaluating on device: %s", device)

    # Load models
    model = load_model(model_path, device)
    onnx_session = load_onnx_model()

    # Load test data
    pipecat_samples, tts_samples = load_test_data()
    all_samples = pipecat_samples + tts_samples

    results = {"pipecat_baseline": PIPECAT_BASELINE}

    # 1. Our model on all test data
    log.info("\n=== Our model (all test data) ===")
    our_all = evaluate_pytorch(model, all_samples, device)
    if our_all:
        _log_metrics("Our model (all)", our_all)
        results["our_model_all"] = our_all

    # 2. Our model on Pipecat test only
    if pipecat_samples:
        log.info("\n=== Our model (Pipecat PT test only) ===")
        our_pipecat = evaluate_pytorch(model, pipecat_samples, device)
        _log_metrics("Our model (Pipecat test)", our_pipecat)
        results["our_model_pipecat_test"] = our_pipecat

        # Compare with baseline
        diff_acc = our_pipecat["accuracy"] - PIPECAT_BASELINE["accuracy"]
        diff_fp = our_pipecat["fp_rate"] - PIPECAT_BASELINE["fp_rate"]
        diff_fn = our_pipecat["fn_rate"] - PIPECAT_BASELINE["fn_rate"]
        log.info("  vs Pipecat baseline: accuracy %+.2f%%, FP %+.2f%%, FN %+.2f%%",
                 diff_acc * 100, diff_fp * 100, diff_fn * 100)

    # 3. Our model per accent
    for accent_name, accent_filter in [("native_pt_br", "native_pt_br"), ("french_pt", "french_pt")]:
        accent_samples = [s for s in tts_samples if s["source"] == accent_filter]
        if accent_samples:
            log.info("\n=== Our model (%s) ===", accent_name)
            m = evaluate_pytorch(model, accent_samples, device)
            _log_metrics(f"Our model ({accent_name})", m)
            results[f"our_model_{accent_name}"] = m

    # 4. ONNX reference model
    if onnx_session and pipecat_samples:
        log.info("\n=== Pipecat ONNX model (reference) ===")
        onnx_metrics = evaluate_onnx(onnx_session, pipecat_samples)
        if onnx_metrics:
            _log_metrics("Pipecat ONNX", onnx_metrics)
            results["pipecat_onnx"] = onnx_metrics

    # 5. Threshold sweep
    log.info("\n=== Threshold Sweep (all test data) ===")
    sweep = {}
    for thresh in [0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]:
        m = evaluate_pytorch(model, all_samples, device, threshold=thresh)
        sweep[str(thresh)] = m
        log.info("  threshold=%.2f: acc=%.3f prec=%.3f rec=%.3f f1=%.3f FP=%.3f FN=%.3f",
                 thresh, m["accuracy"], m["precision"], m["recall"],
                 m["f1"], m["fp_rate"], m["fn_rate"])

    results["threshold_sweep"] = sweep

    # Find optimal thresholds
    best_f1_thresh = max(sweep.items(), key=lambda x: x[1]["f1"])
    best_prec_thresh = max(
        [(k, v) for k, v in sweep.items() if v["recall"] >= 0.90],
        key=lambda x: x[1]["precision"],
        default=(None, None),
    )

    log.info("\n=== Recommendations ===")
    log.info("  Best F1: threshold=%.2f (F1=%.3f, prec=%.3f, rec=%.3f)",
             float(best_f1_thresh[0]), best_f1_thresh[1]["f1"],
             best_f1_thresh[1]["precision"], best_f1_thresh[1]["recall"])
    if best_prec_thresh[0]:
        log.info("  Best precision (recall>=90%%): threshold=%.2f (prec=%.3f, rec=%.3f)",
                 float(best_prec_thresh[0]), best_prec_thresh[1]["precision"],
                 best_prec_thresh[1]["recall"])

    # ----- Dual Threshold for Language Learning Avatar -----
    # Inspired by Deepgram Flux (eot_threshold + eager_eot_threshold)
    # For L2 learners: higher final threshold to avoid interrupting mid-hesitation
    log.info("\n=== Dual Threshold (Language Learning Mode) ===")
    log.info("  Context: conversational avatar for francophones learning Portuguese")
    log.info("  Priority: minimize interruptions (FP) over fast response (FN)")

    # Final threshold: minimize FP rate while keeping recall >= 85%
    final_candidates = [(k, v) for k, v in sweep.items()
                        if v["recall"] >= 0.85 and float(k) >= 0.6]
    if final_candidates:
        best_final = min(final_candidates, key=lambda x: x[1]["fp_rate"])
        final_thresh = float(best_final[0])
        log.info("  Final threshold: %.2f (FP=%.1f%%, rec=%.1f%%)",
                 final_thresh, best_final[1]["fp_rate"] * 100,
                 best_final[1]["recall"] * 100)
    else:
        final_thresh = 0.7
        log.info("  Final threshold: 0.70 (default)")

    # Eager threshold: lower confidence for speculative LLM generation
    eager_thresh = max(0.3, final_thresh - 0.3)
    log.info("  Eager threshold: %.2f (start speculative LLM prep)", eager_thresh)
    log.info("  Expected latency savings: ~150-250ms on response start")

    # Simulate dual-threshold behavior
    if all_samples:
        log.info("\n  Dual-threshold simulation:")
        eager_m = evaluate_pytorch(model, all_samples, device, threshold=eager_thresh)
        final_m = evaluate_pytorch(model, all_samples, device, threshold=final_thresh)
        log.info("    Eager (%.2f): would trigger on %.1f%% of samples (%.1f%% false triggers)",
                 eager_thresh, (eager_m["tp"] + eager_m["fp"]) / max(eager_m["total"], 1) * 100,
                 eager_m["fp_rate"] * 100)
        log.info("    Final (%.2f): confirms %.1f%% of turns (%.1f%% false confirms, %.1f%% missed)",
                 final_thresh, final_m["recall"] * 100,
                 final_m["fp_rate"] * 100, final_m["fn_rate"] * 100)
        wasted_speculative = eager_m["fp"] - final_m["fp"]
        log.info("    Wasted speculative preps: %d (started but not confirmed)", max(0, wasted_speculative))

    results["recommended"] = {
        "best_f1_threshold": float(best_f1_thresh[0]),
        "best_precision_threshold": float(best_prec_thresh[0]) if best_prec_thresh[0] else None,
        "dual_threshold": {
            "eager": eager_thresh,
            "final": final_thresh,
            "mode": "language_learning",
            "rationale": "Higher final threshold minimizes interruptions for L2 learners "
                         "who pause 500-2000ms mid-sentence. Eager threshold enables "
                         "speculative LLM prep for lower perceived latency.",
        },
    }

    # 6. Summary comparison table
    log.info("\n" + "=" * 70)
    log.info("SUMMARY COMPARISON")
    log.info("=" * 70)
    log.info("%-30s %8s %8s %8s %8s", "Model", "Acc", "FP%", "FN%", "F1")
    log.info("-" * 70)
    log.info("%-30s %7.1f%% %7.2f%% %7.2f%% %8s",
             "Pipecat v3.2 (baseline)",
             PIPECAT_BASELINE["accuracy"] * 100,
             PIPECAT_BASELINE["fp_rate"] * 100,
             PIPECAT_BASELINE["fn_rate"] * 100,
             "~0.96")

    if "pipecat_onnx" in results:
        m = results["pipecat_onnx"]
        log.info("%-30s %7.1f%% %7.2f%% %7.2f%% %7.3f",
                 "Pipecat ONNX (our eval)",
                 m["accuracy"] * 100, m["fp_rate"] * 100,
                 m["fn_rate"] * 100, m["f1"])

    if "our_model_pipecat_test" in results:
        m = results["our_model_pipecat_test"]
        log.info("%-30s %7.1f%% %7.2f%% %7.2f%% %7.3f",
                 "Our model (Pipecat test)",
                 m["accuracy"] * 100, m["fp_rate"] * 100,
                 m["fn_rate"] * 100, m["f1"])

    if "our_model_all" in results:
        m = results["our_model_all"]
        log.info("%-30s %7.1f%% %7.2f%% %7.2f%% %7.3f",
                 "Our model (all test)",
                 m["accuracy"] * 100, m["fp_rate"] * 100,
                 m["fn_rate"] * 100, m["f1"])

    log.info("=" * 70)

    # Save results
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    results_path = RESULTS_DIR / "evaluation_results.json"
    with open(results_path, "w") as f:
        json.dump(results, f, indent=2)
    log.info("\nResults saved to %s", results_path)

    return results


def _log_metrics(name: str, m: dict) -> None:
    log.info("%s:", name)
    log.info("  Accuracy:  %.3f (%.1f%%)", m["accuracy"], m["accuracy"] * 100)
    log.info("  Precision: %.3f", m["precision"])
    log.info("  Recall:    %.3f", m["recall"])
    log.info("  F1:        %.3f", m["f1"])
    log.info("  FP rate:   %.3f (%.1f%%)", m["fp_rate"], m["fp_rate"] * 100)
    log.info("  FN rate:   %.3f (%.1f%%)", m["fn_rate"], m["fn_rate"] * 100)
    log.info("  TP=%d FP=%d FN=%d TN=%d (total=%d)", m["tp"], m["fp"], m["fn"], m["tn"], m["total"])
    if "latency_mean_ms" in m:
        log.info("  Latency: mean=%.1fms p50=%.1fms p95=%.1fms",
                 m["latency_mean_ms"], m["latency_p50_ms"], m["latency_p95_ms"])


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )

    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default=str(RESULTS_DIR / "best_model.pt"),
                        help="Path to fine-tuned model checkpoint")
    args = parser.parse_args()

    run_evaluation(Path(args.model))
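The recommendation step in `run_evaluation` above picks three operating points from the sweep: best F1, and a final/eager pair for the language-learning mode (minimum FP rate subject to a recall floor, with the eager threshold 0.3 below the final). A self-contained sketch of that selection logic on toy sweep numbers (the metric values here are made up, only the selection rules mirror the script):

```python
# Toy sweep results: threshold -> metrics, same shape as the `sweep`
# dict built in run_evaluation (hypothetical numbers for illustration).
sweep = {
    "0.5": {"f1": 0.91, "recall": 0.95, "fp_rate": 0.08},
    "0.6": {"f1": 0.92, "recall": 0.92, "fp_rate": 0.05},
    "0.7": {"f1": 0.90, "recall": 0.87, "fp_rate": 0.03},
    "0.8": {"f1": 0.84, "recall": 0.78, "fp_rate": 0.02},
}

# Best-F1 operating point (general-purpose default).
best_f1 = max(sweep.items(), key=lambda kv: kv[1]["f1"])

# Language-learning "final" threshold: minimise false positives
# (interruptions) among thresholds keeping recall >= 0.85 and t >= 0.6.
candidates = [(k, v) for k, v in sweep.items()
              if v["recall"] >= 0.85 and float(k) >= 0.6]
final = min(candidates, key=lambda kv: kv[1]["fp_rate"])

# Eager threshold sits 0.3 below final, floored at 0.3.
eager = max(0.3, float(final[0]) - 0.3)

print(best_f1[0], final[0], eager)
```

With these numbers 0.8 is excluded by the recall floor despite its lowest FP rate, which is exactly the trade the recall constraint is there to enforce.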
03-finetune-pipecat-pt/06_inference.py
ADDED
|
@@ -0,0 +1,560 @@
"""Inference engine for the language-learning avatar turn-taking model.

This module implements the dual-threshold + backchannel system designed
for a conversational avatar that teaches Portuguese to French speakers.

Key insight: L2 learners pause 1-3s mid-sentence (word search, conjugation,
code-switching). A naive system either:
  a) interrupts them (bad UX — kills confidence), or
  b) stays silent too long (the learner thinks the avatar froze).

Solution: dual thresholds combined with backchannel signals.

Architecture:
    Audio stream → SmartTurnV3 model → confidence score (0.0-1.0)
        │
        ├─ score < eager_threshold (0.4)  → LISTENING (do nothing)
        ├─ eager ≤ score < final (0.7)    → PREPARING (backchannel + speculative LLM)
        └─ score ≥ final_threshold (0.7)  → RESPONDING (take the turn)

Silence timer (runs in parallel):
    0-600ms    → normal (no action)
    600ms-1.5s → visual backchannel (nod, eye contact)
    1.5s-3.0s  → verbal backchannel ("mhm", "continue...")
    3.0s+      → encouragement ("sem pressa, pode pensar...")

References:
    - Deepgram Flux: eot_threshold + eager_eot_threshold (2025)
    - ConversAR: "infinite thinking period" for L2 learners (2025)
    - Tavus: 600ms response latency threshold (2025)
    - Kosmala (2022): L2 repair pauses average 844ms
    - Cenoz (2000): L2 silent pauses range from 205ms to 11,569ms

Usage:
    engine = TurnTakingEngine(model_path="results/smart_turn_pt_v3.onnx")
    engine.on_state_change(my_callback)

    # Feed audio chunks continuously
    for chunk in audio_stream:
        engine.feed_audio(chunk)
"""

from __future__ import annotations

import logging
import time
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Callable

import numpy as np

log = logging.getLogger(__name__)

SAMPLE_RATE = 16000
WINDOW_SECONDS = 8
WINDOW_SAMPLES = WINDOW_SECONDS * SAMPLE_RATE

# ---------------------------------------------------------------------------
# States and events
# ---------------------------------------------------------------------------

class TurnState(Enum):
    """Avatar turn-taking state machine."""
    LISTENING = "listening"            # User is speaking, avatar listens
    SILENCE = "silence"                # User stopped, waiting to classify
    BACKCHANNEL_VISUAL = "bc_visual"   # Nod/eye contact (600ms-1.5s silence)
    BACKCHANNEL_VERBAL = "bc_verbal"   # "Mhm" / "continue" (1.5s-3.0s)
    ENCOURAGEMENT = "encouragement"    # "Sem pressa..." (3.0s+ silence)
    PREPARING = "preparing"            # Eager threshold hit, speculative LLM started
    RESPONDING = "responding"          # Final threshold hit, avatar speaks


@dataclass
class TurnEvent:
    """Event emitted on state transitions."""
    state: TurnState
    confidence: float           # Model's end-of-turn confidence (0-1)
    silence_duration_ms: float  # How long the user has been silent
    timestamp: float            # time.monotonic()
    message: str = ""           # Backchannel text (if applicable)


@dataclass
class BackchannelConfig:
    """Configurable backchannel behavior for the avatar.

    These can be tuned per learner profile or CEFR level:
    - A1/A2: longer delays, more encouraging messages
    - B1/B2: shorter delays, less frequent backchannels
    """
    # Silence thresholds (milliseconds)
    visual_backchannel_ms: float = 600.0
    verbal_backchannel_ms: float = 1500.0
    encouragement_ms: float = 3000.0

    # Model confidence thresholds
    eager_threshold: float = 0.4   # Start speculative LLM generation
    final_threshold: float = 0.7   # Confirm end-of-turn, avatar speaks

    # Backchannel messages (Portuguese, with French-speaker-friendly variants)
    visual_actions: list[str] = field(default_factory=lambda: [
        "nod",           # Head nod
        "eye_contact",   # Attentive gaze
        "slight_smile",  # Light smile
    ])

    verbal_backchannels: list[str] = field(default_factory=lambda: [
        "Mhm...",
        "Uhum...",
        "Sim...",
        "Tá...",
        "Sei...",
        "Continue...",
    ])

    encouragement_messages: list[str] = field(default_factory=lambda: [
        "Pode continuar, sem pressa...",
        "Tá pensando? Tranquilo...",
        "Pode pensar, eu espero...",
        "Sem pressa, tá tudo bem...",
        "Take your time... pode falar em português...",    # Code-switch friendly
        "Prenez votre temps... quando estiver pronto...",  # French reassurance
    ])

    # Cooldowns (don't spam backchannels)
    verbal_cooldown_ms: float = 3000.0         # Min time between verbal backchannels
    encouragement_cooldown_ms: float = 8000.0  # Min time between encouragements

# ---------------------------------------------------------------------------
# CEFR-aware presets
# ---------------------------------------------------------------------------

CEFR_PRESETS: dict[str, BackchannelConfig] = {
    "A1": BackchannelConfig(
        visual_backchannel_ms=500.0,
        verbal_backchannel_ms=1200.0,
        encouragement_ms=2500.0,
        eager_threshold=0.5,   # Higher — need more confidence before preparing
        final_threshold=0.8,   # Much higher — very patient, rarely interrupts
    ),
    "A2": BackchannelConfig(
        visual_backchannel_ms=550.0,
        verbal_backchannel_ms=1300.0,
        encouragement_ms=2800.0,
        eager_threshold=0.45,
        final_threshold=0.75,
    ),
    "B1": BackchannelConfig(
        visual_backchannel_ms=600.0,
        verbal_backchannel_ms=1500.0,
        encouragement_ms=3000.0,
        eager_threshold=0.4,
        final_threshold=0.7,   # Default
    ),
    "B2": BackchannelConfig(
        visual_backchannel_ms=600.0,
        verbal_backchannel_ms=1800.0,
        encouragement_ms=4000.0,
        eager_threshold=0.35,
        final_threshold=0.65,  # More responsive — B2 learners pause less
    ),
    "C1": BackchannelConfig(
        visual_backchannel_ms=600.0,
        verbal_backchannel_ms=2000.0,
        encouragement_ms=5000.0,
        eager_threshold=0.35,
        final_threshold=0.6,   # Near-native responsiveness
    ),
}


# ---------------------------------------------------------------------------
# Turn-taking engine
# ---------------------------------------------------------------------------

class TurnTakingEngine:
    """Real-time turn-taking engine for a language-learning avatar.

    Combines the SmartTurnV3 model with a silence timer and a backchannel
    state machine to create a patient, encouraging conversational partner.

    Usage:
        engine = TurnTakingEngine("results/smart_turn_pt_v3.onnx")
        engine.on_state_change(handle_event)

        # In the audio loop:
        for chunk in mic_stream:
            engine.feed_audio(chunk)

        # Or feed pre-computed scores:
        engine.feed_score(model_confidence, is_speech=True)
    """

    def __init__(
        self,
        model_path: str | Path | None = None,
        config: BackchannelConfig | None = None,
        cefr_level: str = "B1",
    ):
        # Config: use a CEFR preset or a custom config
        if config is not None:
            self.config = config
        elif cefr_level in CEFR_PRESETS:
            self.config = CEFR_PRESETS[cefr_level]
            log.info("Using CEFR %s preset: eager=%.2f, final=%.2f",
                     cefr_level, self.config.eager_threshold, self.config.final_threshold)
        else:
            self.config = BackchannelConfig()

        # State
        self._state = TurnState.LISTENING
        self._silence_start: float | None = None
        self._last_speech_time: float = time.monotonic()
        self._last_verbal_bc: float = 0.0
        self._last_encouragement: float = 0.0
        self._last_confidence: float = 0.0
        self._callbacks: list[Callable[[TurnEvent], None]] = []
        self._audio_buffer = np.zeros(WINDOW_SAMPLES, dtype=np.float32)
        self._speculative_started = False

        # Load the ONNX model if a path was provided
        self._session = None
        self._feature_extractor = None
        if model_path is not None:
            self._load_model(model_path)

    def _load_model(self, model_path: str | Path) -> None:
        """Load the ONNX model for inference."""
        try:
            import onnxruntime as ort
            self._session = ort.InferenceSession(
                str(model_path),
                providers=["CPUExecutionProvider"],
            )
            log.info("Loaded turn-taking model: %s", model_path)
        except Exception as e:
            log.warning("Failed to load model %s: %s — running in manual mode", model_path, e)

        try:
            from transformers import WhisperFeatureExtractor
            self._feature_extractor = WhisperFeatureExtractor(chunk_length=WINDOW_SECONDS)
        except ImportError:
            log.warning("transformers not available — cannot extract features")

    def on_state_change(self, callback: Callable[[TurnEvent], None]) -> None:
        """Register a callback for state-change events."""
        self._callbacks.append(callback)

    @property
    def state(self) -> TurnState:
        return self._state

    @property
    def silence_duration_ms(self) -> float:
        if self._silence_start is None:
            return 0.0
        return (time.monotonic() - self._silence_start) * 1000

    # ----- Audio input -----

    def feed_audio(self, chunk: np.ndarray, is_speech: bool | None = None) -> TurnEvent | None:
        """Feed an audio chunk and get a turn-taking decision.

        chunk: float32 audio at 16 kHz
        is_speech: if None, a simple energy-based VAD is used

        Returns a TurnEvent if the state changed, None otherwise.
        """
        now = time.monotonic()

        # Update the audio buffer (sliding window)
        chunk = chunk.astype(np.float32)
        if len(chunk) >= WINDOW_SAMPLES:
            self._audio_buffer = chunk[-WINDOW_SAMPLES:]
        else:
            self._audio_buffer = np.roll(self._audio_buffer, -len(chunk))
            self._audio_buffer[-len(chunk):] = chunk

        # Simple VAD if not provided
        if is_speech is None:
            rms = np.sqrt(np.mean(chunk ** 2))
            is_speech = rms > 0.01  # Simple energy threshold

        if is_speech:
            return self._on_speech(now)
        return self._on_silence(now)

    def feed_score(self, confidence: float, is_speech: bool) -> TurnEvent | None:
        """Feed a pre-computed model confidence score.

        Use this when you run the model externally and only want
        the backchannel/threshold logic.

        confidence: 0.0 (definitely incomplete) to 1.0 (definitely complete)
        is_speech: whether the VAD detected speech in this frame
        """
        now = time.monotonic()
        self._last_confidence = confidence

        if is_speech:
            return self._on_speech(now)
        return self._on_silence(now, confidence=confidence)

    # ----- Internal state machine -----

    def _on_speech(self, now: float) -> TurnEvent | None:
        """User is speaking."""
        self._last_speech_time = now
        self._silence_start = None
        self._speculative_started = False

        if self._state != TurnState.LISTENING:
            return self._transition(TurnState.LISTENING, confidence=0.0, now=now)
        return None

    def _on_silence(self, now: float, confidence: float | None = None) -> TurnEvent | None:
        """User is silent — run the state machine."""
        # Mark the start of the silence
        if self._silence_start is None:
            self._silence_start = now

        silence_ms = (now - self._silence_start) * 1000

        # Run the model if we have one and no external score was given
        if confidence is None:
            confidence = self._run_model()
            self._last_confidence = confidence

        cfg = self.config

        # ----- Decision tree -----

        # 1. Final threshold → RESPONDING (take the turn)
        if confidence >= cfg.final_threshold:
            return self._transition(TurnState.RESPONDING, confidence, now)

        # 2. Eager threshold → PREPARING (speculative LLM, backchannel)
        if confidence >= cfg.eager_threshold and not self._speculative_started:
            self._speculative_started = True
            return self._transition(TurnState.PREPARING, confidence, now)

        # 3. Silence-based backchannels (even if the model is unsure).
        # These keep the learner engaged so they don't think the avatar froze.

        if silence_ms >= cfg.encouragement_ms:
            if now - self._last_encouragement >= cfg.encouragement_cooldown_ms / 1000:
                self._last_encouragement = now
                return self._transition(TurnState.ENCOURAGEMENT, confidence, now)

        if silence_ms >= cfg.verbal_backchannel_ms:
            if now - self._last_verbal_bc >= cfg.verbal_cooldown_ms / 1000:
                self._last_verbal_bc = now
                return self._transition(TurnState.BACKCHANNEL_VERBAL, confidence, now)

        if silence_ms >= cfg.visual_backchannel_ms:
            if self._state not in (TurnState.BACKCHANNEL_VISUAL,
                                   TurnState.BACKCHANNEL_VERBAL,
                                   TurnState.ENCOURAGEMENT,
                                   TurnState.PREPARING):
                return self._transition(TurnState.BACKCHANNEL_VISUAL, confidence, now)

        # Still in silence, no state change
        if self._state == TurnState.LISTENING:
            return self._transition(TurnState.SILENCE, confidence, now)

        return None

    def _transition(self, new_state: TurnState, confidence: float, now: float) -> TurnEvent:
        """Transition to a new state and emit an event."""
        import random

        old_state = self._state
        self._state = new_state

        silence_ms = (now - self._silence_start) * 1000 if self._silence_start else 0.0

        # Pick an appropriate message
        message = ""
        if new_state == TurnState.BACKCHANNEL_VISUAL:
            message = random.choice(self.config.visual_actions)
        elif new_state == TurnState.BACKCHANNEL_VERBAL:
            message = random.choice(self.config.verbal_backchannels)
        elif new_state == TurnState.ENCOURAGEMENT:
            message = random.choice(self.config.encouragement_messages)
        elif new_state == TurnState.PREPARING:
            message = "speculative_llm_start"
        elif new_state == TurnState.RESPONDING:
            message = "take_turn"

        event = TurnEvent(
            state=new_state,
            confidence=confidence,
            silence_duration_ms=silence_ms,
            timestamp=now,
            message=message,
        )

        if old_state != new_state:
            log.debug("Turn state: %s → %s (conf=%.2f, silence=%.0fms, msg=%s)",
                      old_state.value, new_state.value, confidence, silence_ms, message)

        for cb in self._callbacks:
            try:
                cb(event)
            except Exception as e:
                log.warning("Callback error: %s", e)

        return event

    def _run_model(self) -> float:
        """Run the ONNX model on the current audio buffer."""
        if self._session is None or self._feature_extractor is None:
            return 0.0

        try:
            inputs = self._feature_extractor(
                self._audio_buffer,
                sampling_rate=SAMPLE_RATE,
                return_tensors="np",
                padding="max_length",
                max_length=WINDOW_SAMPLES,
                truncation=True,
                do_normalize=True,
            )
            features = inputs.input_features.astype(np.float32)

            input_name = self._session.get_inputs()[0].name
            outputs = self._session.run(None, {input_name: features})
            logit = outputs[0].item() if outputs[0].size == 1 else outputs[0][0].item()

            # Sigmoid to map the logit to a 0-1 confidence
            confidence = 1.0 / (1.0 + np.exp(-logit))
            return float(confidence)
        except Exception as e:
            log.warning("Model inference error: %s", e)
            return 0.0

    # ----- Convenience -----

    def reset(self) -> None:
        """Reset state (e.g., when starting a new conversation turn)."""
        self._state = TurnState.LISTENING
        self._silence_start = None
        self._speculative_started = False
        self._last_confidence = 0.0
        self._audio_buffer = np.zeros(WINDOW_SAMPLES, dtype=np.float32)

    def set_cefr_level(self, level: str) -> None:
        """Change the CEFR level at runtime (e.g., after a proficiency assessment)."""
        if level in CEFR_PRESETS:
            self.config = CEFR_PRESETS[level]
            log.info("Switched to CEFR %s: eager=%.2f, final=%.2f",
                     level, self.config.eager_threshold, self.config.final_threshold)
        else:
            log.warning("Unknown CEFR level: %s", level)


# ---------------------------------------------------------------------------
# Example usage and demo
# ---------------------------------------------------------------------------

def demo_simulation():
    """Simulate a conversation to demonstrate the backchannel system.

    Scenario: a French speaker (B1) learning Portuguese, pausing frequently.
    """
    print("=" * 70)
    print("DEMO: Turn-Taking Engine for Language Learning Avatar")
    print("Scenario: French B1 learner speaking Portuguese")
    print("=" * 70)

    events_log = []

    def on_event(event: TurnEvent):
        events_log.append(event)
        state_emoji = {
            TurnState.LISTENING: "👂",
            TurnState.SILENCE: "⏸️",
            TurnState.BACKCHANNEL_VISUAL: "😊",
            TurnState.BACKCHANNEL_VERBAL: "💬",
            TurnState.ENCOURAGEMENT: "🤗",
            TurnState.PREPARING: "🧠",
            TurnState.RESPONDING: "🗣️",
        }
        emoji = state_emoji.get(event.state, "❓")
        print(f"  {emoji} [{event.silence_duration_ms:6.0f}ms] "
              f"conf={event.confidence:.2f} → {event.state.value}: {event.message}")

    engine = TurnTakingEngine(config=CEFR_PRESETS["B1"])
    engine.on_state_change(on_event)

    # Simulate a conversation timeline.
    # Each entry: (description, duration_ms, is_speech, model_confidence)
    timeline = [
        # Learner starts speaking
        ("Learner: 'Eu fui ao...'", 1200, True, 0.1),
        # Pause — searching for a word (conjugation hesitation)
        ("  [silence — thinking about conjugation]", 300, False, 0.15),
        ("  [still thinking...]", 400, False, 0.20),
        ("  [600ms — visual backchannel]", 300, False, 0.22),
        ("  [1s — still silent]", 400, False, 0.25),
        ("  [1.5s — verbal backchannel]", 500, False, 0.28),
        # Learner continues
        ("Learner: '...mercado... euh...'", 800, True, 0.1),
        # Another pause — code-switching hesitation
        ("  [silence — 'comment dit-on...']", 500, False, 0.20),
        ("  [1s silence]", 500, False, 0.30),
        ("  [1.5s — verbal backchannel]", 500, False, 0.35),
        ("  [2s — model getting uncertain]", 500, False, 0.42),
        # Learner continues again
        ("Learner: '...para comprar... uma tesoura.'", 1500, True, 0.1),
        # Final silence — this time it is a real end of turn
        ("  [silence after complete sentence]", 300, False, 0.45),
        ("  [600ms]", 300, False, 0.55),
        ("  [model confidence rising]", 300, False, 0.65),
        ("  [model confident — end of turn]", 200, False, 0.75),
    ]

    print("\nTimeline:")
    print("-" * 70)

    for description, duration_ms, is_speech, confidence in timeline:
        print(f"\n{description} ({duration_ms}ms)")
        # Simulate in 100ms chunks. Sleep in real time so the engine's
        # monotonic-clock silence timer actually crosses the 600ms/1.5s/3s
        # thresholds (the full demo takes ~9s of wall-clock time).
        n_chunks = max(1, duration_ms // 100)
        for _ in range(n_chunks):
            engine.feed_score(confidence, is_speech)
            time.sleep(duration_ms / n_chunks / 1000)

    print("\n" + "=" * 70)
    print(f"Total events: {len(events_log)}")
    print(f"Final state: {engine.state.value}")

    # Show the CEFR comparison
    print("\n" + "=" * 70)
    print("CEFR LEVEL COMPARISON")
    print("=" * 70)
    print(f"{'Level':<6} {'Eager':<8} {'Final':<8} {'Visual BC':<10} {'Verbal BC':<10} {'Encourage':<10}")
    print("-" * 52)
    for level, preset in CEFR_PRESETS.items():
        print(f"{level:<6} {preset.eager_threshold:<8.2f} {preset.final_threshold:<8.2f} "
              f"{preset.visual_backchannel_ms:<10.0f} {preset.verbal_backchannel_ms:<10.0f} "
              f"{preset.encouragement_ms:<10.0f}")
    print()
    print("A1/A2: Very patient — high final threshold (0.75-0.80), early encouragement")
    print("B1:    Default — balanced patience and responsiveness")
    print("B2/C1: More responsive — lower final threshold (0.60-0.65)")


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )
    demo_simulation()
03-finetune-pipecat-pt/README.md
ADDED
@@ -0,0 +1,489 @@
| 1 |
+
# Experimento 03 — Fine-tune Pipecat Smart Turn para Portugues + Frances
|
| 2 |
+
|
| 3 |
+
> Fine-tuning do Pipecat Smart Turn v3 para **avatar conversacional de aprendizado de portugues** por francofonos. Primeiro modelo de turn-taking otimizado para aprendizes L2.
|
| 4 |
+
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
## Sumario
|
| 8 |
+
|
| 9 |
+
1. [Objetivo](#objetivo)
|
| 10 |
+
2. [Por que a partir do Pipecat](#por-que-a-partir-do-pipecat)
|
| 11 |
+
3. [Metodo — Pipeline de dados](#metodo--pipeline-de-dados)
|
| 12 |
+
- [Pipeline de criacao de dados](#pipeline-de-criacao-de-dados)
|
| 13 |
+
- [Projetos que validam este metodo](#projetos-que-validam-este-metodo)
|
| 14 |
+
4. [Fine-tuning](#fine-tuning)
|
| 15 |
+
- [Modelo base](#modelo-base)
|
| 16 |
+
- [Estrategia de fine-tuning](#estrategia-de-fine-tuning)
|
| 17 |
+
- [Infra](#infra)
|
| 18 |
+
5. [Melhorias para aprendizado de idiomas (L2)](#melhorias-para-aprendizado-de-idiomas-l2)
|
| 19 |
+
- [Custo assimetrico](#custo-assimetrico)
|
| 20 |
+
- [Threshold duplo (Deepgram Flux)](#threshold-duplo-deepgram-flux)
|
| 21 |
+
- [Dados L2 reais (Speak & Improve)](#dados-l2-reais-speak--improve)
|
| 22 |
+
- [CEFR-aware presets](#cefr-aware-presets)
|
| 23 |
+
6. [Sistema de backchannel](#sistema-de-backchannel)
|
| 24 |
+
- [Problema: "o avatar travou?"](#problema-o-avatar-travou)
|
| 25 |
+
- [Solucao: sinais de escuta ativa](#solucao-sinais-de-escuta-ativa)
|
| 26 |
+
- [Presets por nivel CEFR](#presets-por-nivel-cefr)
|
| 27 |
+
7. [Engine de inferencia (06_inference.py)](#engine-de-inferencia-06_inferencepy)
|
| 28 |
+
- [Estados do turn-taking](#estados-do-turn-taking)
|
| 29 |
+
- [Exemplo: aprendiz B1 conjugando verbo](#exemplo-aprendiz-b1-conjugando-verbo)
|
| 30 |
+
8. [Dados especificos — Frances falando portugues](#dados-especificos--frances-falando-portugues)
|
| 31 |
+
- [Tipos de hesitacao](#tipos-de-hesitacao)
|
| 32 |
+
- [Geracoes com Claude](#geracoes-com-claude)
|
| 33 |
+
9. [Benchmarks de referencia](#benchmarks-de-referencia)
|
| 34 |
+
- [Pipecat Smart Turn v3.0](#pipecat-smart-turn-v30)
|
| 35 |
+
- [LiveKit v0.4.1](#livekit-v041)
|
| 36 |
+
- [Outros modelos](#outros-modelos)
|
| 37 |
+
- [Gap sintetico → real](#gap-sintetico--real)
|
| 38 |
+
10. [Metricas alvo](#metricas-alvo)
|
| 39 |
+
11. [Correcoes apos analise de referencias](#correcoes-apos-analise-de-referencias)
|
| 40 |
+
12. [Estrutura de arquivos](#estrutura-de-arquivos)
|
| 41 |
+
13. [Cronograma](#cronograma)
|
| 42 |
+
14. [Dependencias](#dependencias)
|
| 43 |
+
15. [Referencias](#referencias)
|
| 44 |
+
|
| 45 |
+
---

## Objective

Fine-tune the pre-trained **Pipecat Smart Turn v3** model (already trained on 23 languages, 270K samples) specifically for:

1. **Brazilian Portuguese**: improve end-of-turn detection in PT-BR conversations
2. **French speakers of Portuguese**: detect end of turn for native French speakers talking in Portuguese (with the typical accent, hesitations, and code-switching)
3. **L2 conversational avatar**: the model will power an AI avatar that teaches Portuguese to francophones, which demands extra patience with learner pauses

> **First of its kind**: no turn-taking model optimized for L2 learners exists, commercial or academic. BabelCast is the first.

## Why start from Pipecat

In the previous experiments (`previous-experiments/02-finetune-scratch/`) we trained from scratch with Whisper Tiny + CORAA/MUPE data. Results:

| Metric | From scratch (best) | Original Pipecat (PT) |
|--------|---------------------|-----------------------|
| Accuracy | 78.2% | **95.42%** |
| False positive | 16.8% @0.5 | **2.79%** |
| False negative | 5.1% @0.5 | **1.79%** |
| F1 | 0.798 | ~0.96 (estimated) |
| Data | 15K samples | 270K samples |
| Labels | Punctuation + artificial truncation | LLM-curated + TTS |

The gap is stark: Pipecat reaches **95.42% on Portuguese** with LLM-curated + TTS data, while we reached 78.2% with punctuation-based labels. The problem is not the model; it is the **quality of the data**.

**Problems with training from scratch:**
- Low-quality labels (punctuation does not reliably indicate end of turn)
- Artificial truncation at 30-75% does not simulate real hesitation pauses
- Too little data (15K vs Pipecat's 270K)
- The model does not understand hesitation pauses: only 67.7% of pauses correct @threshold=0.5

**Advantages of starting from Pipecat:**
- The model already understands turn-taking in 23 languages
- Far less data is needed to adapt (5-10K vs 270K)
- Transfer learning: keeps the general knowledge, adapts it to PT-BR
- Much faster training (~30 min vs hours)

## Method: data pipeline

### Data creation pipeline

Based on the proven Pipecat v3.1 pipeline:

```
Step 1: Source sentences
 - CORAA transcripts (real conversational Portuguese)
 - Sentences generated by Claude (specific contexts)
 - Typical sentences from French learners speaking PT
 - Speak & Improve Corpus 2025 (340h of L2 speech with disfluencies)

Step 2: LLM processing (Claude Haiku, low cost)
 - Filters flawed/ambiguous sentences (Pipecat: Gemini removed 50-80%)
 - Classifies: COMPLETO vs INCOMPLETO (semantic, not punctuation-based)
 - Inserts Brazilian fillers: "hum", "tipo", "ne", "entao", "e..."
 - Inserts fillers typical of French speakers of PT: "euh", "comment dit-on", "como se diz"
 - Generates incomplete variants: cuts sentences at natural points

Step 3: TTS audio generation
 - Native PT-BR voices (Kokoro, Google Chirp3)
 - French-accented voices (XTTS voice cloning, or Chirp3 with accent)
 - Variation in speed, pitch, and background noise

Step 4: Final dataset
 - 5-10K balanced samples (50% complete / 50% incomplete)
 - Metadata: lingua, sotaque, tipo_filler, confianca_label
```
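The filler insertion and truncation of Step 2 can be sketched in a few lines of plain Python (an illustrative sketch, not the real `02_generate_labels.py`; the function names and filler lists are our own):

```python
import random

FILLERS_PT = ["hum", "tipo", "ne", "entao"]            # Brazilian fillers
FILLERS_FR = ["euh", "comment dit-on", "como se diz"]  # French-learner fillers

def insert_filler(sentence: str, fillers: list[str], rng: random.Random) -> str:
    """Insert one filler at a random word boundary to simulate hesitation."""
    words = sentence.split()
    pos = rng.randint(1, len(words) - 1)  # never before the first word
    return " ".join(words[:pos] + [f"{rng.choice(fillers)}..."] + words[pos:])

def truncate_variant(sentence: str, rng: random.Random) -> str:
    """Cut the sentence at a natural word boundary to build an INCOMPLETO sample."""
    words = sentence.split()
    cut = rng.randint(1, len(words) - 1)
    return " ".join(words[:cut])

rng = random.Random(42)
s = "Eu fui ao mercado ontem de manha"
print(insert_filler(s, FILLERS_FR, rng))
print(truncate_variant(s, rng))
```

Cutting only at word boundaries keeps the truncation acoustically natural once the sentence goes through TTS.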

### Projects validating this method

| Project | What they did | Result |
|---------|---------------|--------|
| **Pipecat v3.1** | Gemini filters sentences + inserts fillers; Claude/GPT generate filler lists; Chirp3 TTS | 270K samples, 81-97% accuracy, 23 languages |
| **LiveKit** | Qwen 7B teacher generates soft labels, distilled into 0.5B | 99.3% TP rate, -39% interruptions |
| **Vogent Turn 80M** | Human + synthetic data with edge cases (disfluencies, lists, pauses) | 94.1% accuracy, state of the art |
| **Deepgram** | 100+ hours annotated by humans, iterative label refinement | Best calibration of all |
| **SpeculativeETD** | MultiWOZ text → TTS + injected fillers + synthetic pauses | 120K samples, public dataset |
| **SODA (Allen AI)** | GPT-3.5 generates 1.5M dialogues from a knowledge graph | Preferred over BlenderBot, Vicuna |
| **Refuel Autolabel** | GPT-4 as annotator: 88.4% agreement (vs 86% human) | 20x faster, 7x cheaper |

**Main reference:** [Pipecat Data Generation Contribution Guide](https://github.com/pipecat-ai/smart-turn/blob/main/docs/data_generation_contribution_guide.md)

## Fine-tuning

### Base model

```python
# Download the pre-trained model from HuggingFace
from huggingface_hub import hf_hub_download
model_path = hf_hub_download("pipecat-ai/smart-turn-v3", "model.onnx")

# Whisper Tiny encoder (39M) + linear classifier
# Hidden size: 384, ONNX INT8: 8 MB, FP32: 32 MB
```

### Fine-tuning strategy

```python
# Load Pipecat's pre-trained weights
model = SmartTurnModel(whisper_model="openai/whisper-tiny")
model.load_state_dict(pipecat_weights)

# Uniform LR of 5e-5 (same as Pipecat's train.py)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# Focal Loss with asymmetric cost for L2 learners
criterion = FocalLoss(gamma=2.0, alpha=0.25, fp_penalty=2.0)

# 6 epochs, batch_size=128, cosine schedule + 0.2 warmup
train(model, portuguese_dataset, epochs=6, batch_size=128)
```
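The cosine schedule with 0.2 warmup can be expressed as a pure function (a sketch of standard linear warmup followed by cosine decay; an actual run would more likely call `transformers.get_cosine_schedule_with_warmup`):

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 5e-5,
               warmup_ratio: float = 0.2) -> float:
    """Linear warmup for the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(f"{lr_at_step(0, total):.2e}")     # warmup start: tiny LR
print(f"{lr_at_step(200, total):.2e}")   # end of warmup: peak 5e-5
print(f"{lr_at_step(1000, total):.2e}")  # end of training: decayed to ~0
```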

### Infrastructure

- **GPU**: Modal A10G (~$0.50/run)
- **Estimated time**: 15-30 min
- **Alternative**: ai-gateway → TensorDock/Vast.ai

## Improvements for language learning (L2)

> Research carried out on 2026-03-16 confirmed that **no turn-taking model for L2 learners exists**, commercial (Praktika, ELSA, Gliglish, TalkPal) or academic. All use generic approaches.

### Asymmetric cost

For language learning, **interrupting the learner (a false positive) is far worse than waiting too long (a false negative)**. A learner who gets interrupted loses confidence and stops trying.

```python
# fp_penalty=2.0: false positives cost 2x more in the loss
criterion = FocalLoss(gamma=2.0, alpha=0.25, fp_penalty=2.0)
```

Inspired by ConversAR (Meta, 2025), which uses an "infinite thinking period" for L2.
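A per-sample sketch of this asymmetric focal loss (pure Python for clarity; interpreting `fp_penalty` as an extra weight on the negative class is our assumption about the intended asymmetry):

```python
import math

def asymmetric_focal_loss(p: float, y: int, gamma: float = 2.0,
                          alpha: float = 0.25, fp_penalty: float = 2.0) -> float:
    """Focal loss (Lin et al. 2017) with extra weight on false-positive errors.

    p: predicted probability of 'turn complete'; y: 1 = complete, 0 = incomplete.
    When y == 0, a confident high p is a would-be interruption (FP), so the
    negative-class term is scaled by fp_penalty.
    """
    eps = 1e-7
    if y == 1:
        p_t, weight = p, alpha
    else:
        p_t, weight = 1.0 - p, (1.0 - alpha) * fp_penalty
    return -weight * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

# An overconfident wrong 'complete' (FP) costs more than the mirror-image FN:
fp = asymmetric_focal_loss(0.9, 0)  # model says complete, learner mid-sentence
fn = asymmetric_focal_loss(0.1, 1)  # model says incomplete, turn was complete
print(fp > fn)  # True
```

With these defaults the effective class weight is 1.5 on negatives versus 0.25 on positives, so a confident interruption weighs 6x its mirror-image miss.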

### Dual threshold (Deepgram Flux)

Two separate thresholds trade latency against precision:

| Threshold | Action | Purpose |
|-----------|--------|---------|
| **eager** (0.3-0.5) | Speculatively prepare the LLM response | Reduces perceived latency |
| **final** (0.7+) | Authorize the avatar to speak | Minimizes interruptions |

If the score drops before reaching `final`, the speculative preparation is discarded, at no cost to the learner.
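This logic can be sketched as a small gate over the per-frame completion score (the class and state names are illustrative):

```python
class DualThresholdGate:
    """Track turn-end scores against eager/final thresholds (Deepgram Flux style)."""

    def __init__(self, eager: float = 0.35, final: float = 0.70):
        self.eager, self.final = eager, final
        self.preparing = False  # is an LLM response being speculatively prepared?

    def update(self, score: float) -> str:
        if score >= self.final:
            self.preparing = False
            return "SPEAK"        # avatar may take the turn
        if score >= self.eager:
            self.preparing = True
            return "PREPARE"      # start the speculative LLM call
        if self.preparing:
            self.preparing = False
            return "DISCARD"      # learner resumed: throw away the draft
        return "WAIT"

gate = DualThresholdGate()
print([gate.update(s) for s in (0.1, 0.4, 0.2, 0.5, 0.8)])
# ['WAIT', 'PREPARE', 'DISCARD', 'PREPARE', 'SPEAK']
```

The `DISCARD` branch is what makes the speculation free: the draft response is dropped without the learner ever hearing it.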

### Real L2 data (Speak & Improve)

The [Speak & Improve Corpus 2025](https://huggingface.co/datasets/speak-improve/corpus) contains **340 hours** of L2 English speech with disfluency annotations, spanning CEFR levels A2-C1.

Although it is English (not Portuguese), **L2 hesitation patterns are cross-linguistic**: long pauses, repetitions, self-corrections. We use it as a source of hesitation patterns in the fine-tuning.

### CEFR-aware presets

The model adapts its patience to the learner's level:

| Level | final_threshold | eager_threshold | Behavior |
|-------|-----------------|-----------------|----------|
| **A1** (beginner) | 0.80 | 0.40 | Very patient: waits through long pauses |
| **A2** | 0.75 | 0.38 | Patient |
| **B1** (intermediate) | 0.70 | 0.35 | Moderate |
| **B2** | 0.65 | 0.33 | Responsive |
| **C1** (advanced) | 0.60 | 0.30 | Near-native: fast responses |
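The table maps directly to a small preset lookup (a sketch; the dictionary and function names are ours):

```python
CEFR_PRESETS = {
    "A1": {"final": 0.80, "eager": 0.40},
    "A2": {"final": 0.75, "eager": 0.38},
    "B1": {"final": 0.70, "eager": 0.35},
    "B2": {"final": 0.65, "eager": 0.33},
    "C1": {"final": 0.60, "eager": 0.30},
}

def thresholds_for(level: str) -> tuple[float, float]:
    """Return (eager, final) thresholds, defaulting to the patient A1 preset."""
    preset = CEFR_PRESETS.get(level.upper(), CEFR_PRESETS["A1"])
    return preset["eager"], preset["final"]

print(thresholds_for("b1"))  # (0.35, 0.7)
```

Defaulting to A1 is the safe choice: an unknown learner gets the most patient behavior.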

## Backchannel system

### Problem: "did the avatar freeze?"

L2 learners often pause for 1-3 seconds to think about conjugations, vocabulary, or sentence structure. Without feedback, the learner concludes that the avatar froze and gives up on speaking.

> **Research**: Tavus identifies a **600ms** threshold: beyond that, the speaker expects some response from the system. For L2 speakers, this window is even more critical.

### Solution: active listening signals

The avatar emits progressive signals that it is still listening:

| Silence duration | Signal | Type | Example |
|------------------|--------|------|---------|
| **600ms** | Visual nod | Visual | Avatar nods, a non-verbal "mhm" |
| **1.5s** | Verbal backchannel | Audio | "mhm", "sim", "uhum" |
| **3.0s** | Encouragement | Audio | "sem pressa", "pode continuar", "estou ouvindo" |

The signals **do not interrupt the learner's turn**; they are short overlaps that indicate active listening.

### Presets by CEFR level

| Level | Visual (ms) | Verbal (ms) | Encouragement (ms) |
|-------|-------------|-------------|---------------------|
| **A1** | 500 | 1200 | 2500 |
| **B1** | 600 | 1500 | 3000 |
| **C1** | 800 | 2000 | 4000 |

A1 learners get signals earlier; C1 learners need less support.
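Combining the two tables gives a simple escalation schedule (a sketch; the signal names are illustrative):

```python
BACKCHANNEL_MS = {
    "A1": [(500, "visual_nod"), (1200, "verbal_mhm"), (2500, "encourage")],
    "B1": [(600, "visual_nod"), (1500, "verbal_mhm"), (3000, "encourage")],
    "C1": [(800, "visual_nod"), (2000, "verbal_mhm"), (4000, "encourage")],
}

def due_signals(level: str, silence_ms: int) -> list[str]:
    """Signals that should have fired after silence_ms of learner silence."""
    return [name for t, name in BACKCHANNEL_MS[level] if silence_ms >= t]

print(due_signals("B1", 1600))  # ['visual_nod', 'verbal_mhm']
```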

## Inference engine (06_inference.py)

The `06_inference.py` file implements the complete turn-taking engine with backchannel.

### Turn-taking states

```
LISTENING → SILENCE → BACKCHANNEL_VISUAL → BACKCHANNEL_VERBAL → ENCOURAGEMENT
    ↑          ↓               ↓                    ↓                  ↓
    ←──────────←───────────────←────────────────────←──────────────────←
                       (learner resumes speaking)

SILENCE → PREPARING → RESPONDING
          (eager)     (final)
```

- **LISTENING**: the learner is speaking; the model monitors the score
- **SILENCE**: pause detected; the timer starts
- **BACKCHANNEL_***: active listening signals (do not interrupt the turn)
- **PREPARING**: the score reached eager_threshold; the LLM starts preparing a response
- **RESPONDING**: the score reached final_threshold; the avatar speaks
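The listening branch of the diagram can be sketched as a transition table (illustrative; the actual `06_inference.py` implementation may differ):

```python
# Allowed transitions in the listening branch; any state returns to
# LISTENING as soon as the learner resumes speaking.
TRANSITIONS = {
    "LISTENING": {"silence": "SILENCE"},
    "SILENCE": {"timer_600ms": "BACKCHANNEL_VISUAL", "speech": "LISTENING"},
    "BACKCHANNEL_VISUAL": {"timer_1500ms": "BACKCHANNEL_VERBAL", "speech": "LISTENING"},
    "BACKCHANNEL_VERBAL": {"timer_3000ms": "ENCOURAGEMENT", "speech": "LISTENING"},
    "ENCOURAGEMENT": {"speech": "LISTENING"},
}

def step(state: str, event: str) -> str:
    """Follow a transition, staying put on events the state ignores."""
    return TRANSITIONS.get(state, {}).get(event, state)

s = "LISTENING"
for ev in ("silence", "timer_600ms", "speech"):
    s = step(s, ev)
print(s)  # back to LISTENING: the learner resumed speaking
```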

### Example: B1 learner conjugating a verb

```
Learner: "Ontem eu... [600ms pause]"
Avatar:  [visual nod]
Learner: "... fui? fiz? [1.5s pause]"
Avatar:  "mhm" [verbal backchannel]
Learner: "... fui ao mercado."
Avatar:  [waits for final_threshold] → responds normally
```

## Specific data: French speakers of Portuguese

### Types of hesitation

| Type | Example | Label |
|------|---------|-------|
| French filler | "Eu fui... euh... ao mercado" | INCOMPLETO |
| Word search | "Eu preciso de... comment dit-on... uma tesoura" | INCOMPLETO |
| Code-switching | "Eu gosto de... enfin... tipo... de praia" | INCOMPLETO |
| Conjugation pause | "Eu... fui? fiz?... ontem" | INCOMPLETO |
| French intonation | Complete sentence but with flat pitch (no final fall) | COMPLETO (hard) |
| Syllabic rhythm | French is syllable-timed, PT is stress-timed | Both |

### Generation with Claude

```
Prompt for Claude to generate learner sentences:

"Generate 100 sentences that a B1-level French speaker of Portuguese
would say in a work meeting. Include:
- Typical hesitations (euh, alors, comment dire)
- Common conjugation errors
- Natural pauses while searching for a word
- Involuntary code-switching (French words mid-sentence)
- Complete sentences with flat intonation (no pitch fall)

For each sentence, label it COMPLETO or INCOMPLETO"
```
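A sketch of submitting this prompt and parsing the labels (the `anthropic` call is commented out and shown for shape only; the `sentence | LABEL` output format and the model id are assumptions):

```python
def parse_labeled_lines(output: str) -> list[tuple[str, str]]:
    """Parse 'sentence | LABEL' lines from the model output (assumed format)."""
    pairs = []
    for line in output.splitlines():
        if "|" not in line:
            continue  # skip preamble or blank lines
        text, _, label = line.rpartition("|")
        label = label.strip().upper()
        if label in ("COMPLETO", "INCOMPLETO"):
            pairs.append((text.strip(), label))
    return pairs

# Request shape with the anthropic SDK (requires ANTHROPIC_API_KEY):
# import anthropic
# client = anthropic.Anthropic()
# msg = client.messages.create(
#     model="claude-haiku-...",  # placeholder: use the current Haiku model id
#     max_tokens=4096,
#     messages=[{"role": "user", "content": PROMPT}],
# )
# pairs = parse_labeled_lines(msg.content[0].text)

sample = "Eu fui... euh... ao mercado | INCOMPLETO\nEu gosto de praia | COMPLETO"
print(parse_labeled_lines(sample))
```

Keeping the parser strict (only the two known labels pass) means malformed model output is silently dropped instead of polluting the dataset.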

## Reference benchmarks

### Pipecat Smart Turn v3.0

Audio-only, 8M params:

| Language | Accuracy | FP | FN |
|----------|----------|-----|-----|
| Turkish | 97.10% | 1.66% | 1.24% |
| Korean | 96.85% | 1.12% | 2.02% |
| Japanese | 96.76% | 2.04% | 1.20% |
| French | 96.01% | 1.60% | 2.39% |
| **Portuguese** | **95.42%** | **2.79%** | **1.79%** |
| English | 94.31% | 2.64% | 3.06% |
| Spanish | 91.97% | 4.48% | 3.55% |

Source: [Smart Turn v3 blog](https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/)

### LiveKit v0.4.1

Text-only, 500M params, at a 99.3% TP rate:

| Language | TNR | Improvement vs previous |
|----------|------|-------------------------|
| Hindi | 96.3% | +31.48% |
| Korean | 94.5% | +30.38% |
| French | 88.9% | +33.93% |
| **Portuguese** | **87.4%** | **+45.97%** |
| English | 87.0% | +21.69% |

Source: [LiveKit v0.4.1 blog](https://livekit.com/blog/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/)

### Other models

| Model | Type | Accuracy | Languages | Note |
|-------|------|----------|-----------|------|
| Vogent Turn 80M | Audio+text | 94.1% | 1 (EN) | Multimodal state of the art |
| Krisp v2 | Audio | 82.3% bal. acc | "Agnostic" | Proprietary |
| Deepgram Flux | ASR+EoT | #1 VAQI | 1 (EN) | Proprietary |
| VAP | Audio | 79.6% bal. acc | 3 (EN/ZH/JP) | Academic, self-supervised |

### Synthetic → real gap

SpeculativeETD showed that models trained only on synthetic (TTS) data lose **badly** on real data:

| Dataset | Wav2vec F1 |
|---------|-----------|
| Synthetic | 94.7% |
| Real | **30.3%** |

This shows that besides generating data with TTS, we need to include **real audio** in the fine-tuning.

## Target metrics

| Metric | Exp 02 (from scratch) | Pipecat PT (ref) | Target (this exp) |
|--------|-----------------------|------------------|-------------------|
| Accuracy | 78.2% | 95.42% | > 96% |
| False positive | 16.8% | 2.79% | < 2.5% |
| False negative | 5.1% | 1.79% | < 2.0% |
| F1 | 0.798 | ~0.96 | > 0.96 |
| Model size | 30.5 MB | 8 MB (ONNX INT8) | ~8 MB |
| CPU inference | ~12ms | 12ms | ~12ms |
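The dual threshold sweep used to pick operating points against these targets reduces to computing FP/FN rates over a grid of thresholds (a self-contained sketch on toy scores):

```python
def sweep(scores, labels, thresholds):
    """For each threshold, return (threshold, fp_rate, fn_rate).

    scores: model 'turn complete' probabilities; labels: 1 = complete, 0 = incomplete.
    """
    out = []
    for t in thresholds:
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        neg = labels.count(0) or 1
        pos = labels.count(1) or 1
        out.append((t, fp / neg, fn / pos))
    return out

scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.95]
labels = [1,   1,   0,   0,   0,   1]
for t, fp_rate, fn_rate in sweep(scores, labels, [0.5, 0.7]):
    print(f"t={t}: FP={fp_rate:.2f} FN={fn_rate:.2f}")
```

The `final` threshold is chosen where the FP rate meets the < 2.5% target; `eager` can sit lower since a spurious trigger only wastes a speculative LLM call.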

## Corrections after reference analysis

*(2026-03-16)*

We downloaded 6 papers, 10 blog posts, and 7 technical guides (see `references/`). Cross-checking them revealed problems in the original plan:

| Change | Before | After | Source |
|--------|--------|-------|--------|
| Focal Loss alpha | 0.6 | **0.25** | Lin et al. 2017, Table 1a |
| Batch size | 32 | **128** | Pipecat's train.py uses 384 |
| Epochs | 10 | **6** | Pipecat uses 4 on 270K |
| Label smoothing | 0.05 | **removed** | EMNLP 2022: double regularization with FL |
| Learning rate | differential (0.1x encoder) | **uniform 5e-5** | Pipecat's train.py uses a single LR |
| Noise augmentation | Gaussian | **real noise** (cafe/office) | Pipecat v3.2: -40% errors |
| ONNX opset | 17 | **18** | Pipecat's train.py |
| Quantization | none | **static INT8** (entropy, 1024 calib) | Pipecat deploy: 32MB → 8MB |
| Alternative loss | Focal only | **Focal + BCE** compared | Original Pipecat uses BCE |

### Additional techniques identified

1. **Knowledge distillation** (LiveKit: -39% interruptions, +45.97% improvement on PT)
2. **Short-utterance dataset** (Pipecat v3.2: -40% errors on short replies); implemented
3. **Real audio mixed with TTS** (SpeculativeETD: F1 drops from 94.7% to 30.3% on synthetic-only); implemented via CORAA
4. **1.5-3s pause after fillers** (SpeculativeETD V3: best variant); implemented
5. **Per-language thresholds** (LiveKit: languages.json with a threshold per language)

## File structure

```
03-finetune-pipecat-pt/
    README.md                  # This document
    01_download_pipecat.py     # Downloads the pre-trained model from HuggingFace
    02_generate_labels.py      # Claude API generates/filters/classifies PT sentences
    03_generate_audio.py       # TTS generates audio (native + French accent)
    04_finetune.py             # Fine-tuning with asymmetric cost + L2 data
    05_evaluate.py             # Evaluation with dual threshold sweep
    06_inference.py            # Inference engine with backchannel + CEFR presets
    modal_run.py               # Modal (GPU) deployment

    references/
        hesitation_turn_taking_l2_review.md    # Mini-article with 35 references
        papers/                                # 6 academic PDFs + summaries
            focal_loss_lin_2017.*
            speculative_etd_2025.*
            vap_turn_taking_ekstedt_2024.*
            turn_taking_review_skantze_2021.*
            soda_dialog_distillation_2023.*
            finite_state_turn_taking_raux_2009.*
        papers/hesitation-l2-french/           # 19 PDFs + 10 MDs (L2 hesitation/francophone)
        papers/language-learning-turn-taking/  # L2 turn-taking research (2026-03-16)
            survey_turn_taking_iwsds2025.*     # IWSDS 2025 survey
            multilingual_vap_2024.*            # Multilingual VAP
            speak_improve_corpus_2025.*        # Speak & Improve L2 corpus
            hesitation_tagging_l2_whisper.*    # Whisper+LoRA hesitation
            conversar_mixed_reality_l2.*       # ConversAR (Meta Quest)
            deepgram_flux.*                    # Deepgram Flux overview
            hume_evi.*                         # Hume EVI overview
            praktika_openai.*                  # Praktika case study
            tavus_turn_taking_guide.*          # Tavus guide
        blogs/                                 # 10 blog posts
        guides/                                # 7 technical guides

    data/                      # (gitignored)
        pipecat_pt_audio/      # PT audio from Pipecat v3.2
        pipecat_pt_test/       # PT test audio from Pipecat v3.2
        claude_labeled/        # Sentences processed by Claude (JSON)
        tts_dataset/           # Generated audio (native + French accent)
        noise_samples/         # Real CC-0 noise (cafe, office)

    results/                   # (gitignored)
        best_model.pt              # Trained model
        smart_turn_pt_v3.onnx      # ONNX INT8 (~8 MB)
        smart_turn_pt_v3_fp32.onnx # ONNX FP32 (~32 MB)
        training_results.json      # Metrics
        evaluation_results.json    # Comparison with baseline
```

## Schedule

| Step | Description | Time |
|------|-------------|------|
| 1 | Download and inspect the Pipecat model | 1h |
| 2 | Labeling script with the Claude API | 1 day |
| 3 | Generate TTS audio (native PT-BR) | 1 day |
| 4 | Generate French-accented audio | 1-2 days |
| 5 | Fine-tune on Modal | 30 min |
| 6 | Evaluation + dual threshold sweep | 2h |
| 7 | Test in the conversational avatar | 1 day |
| **Total** | | **~5 days** |

## Dependencies

```
# Python
torch, torchaudio, transformers   # Model
anthropic                         # Claude API (labeling)
kokoro, TTS                       # Audio generation
datasets, soundfile, librosa      # Processing
modal                             # GPU deployment

# APIs
ANTHROPIC_API_KEY                 # Claude Haiku for labeling
MODAL_TOKEN                       # Modal for GPU training

# Models
pipecat-ai/smart-turn-v3          # HuggingFace: pre-trained model
openai/whisper-tiny               # HuggingFace: base encoder
```

## References

Full research document with 35 references: [`references/hesitation_turn_taking_l2_review.md`](references/hesitation_turn_taking_l2_review.md)

**Key references:**

| # | Reference | Contribution |
|---|-----------|--------------|
| 1 | Pipecat Smart Turn v3 (Daily, 2025) | Base model, data pipeline |
| 2 | Lin et al. (ICCV 2017) | Focal Loss, alpha=0.25 |
| 3 | Skantze (CSL 2021) | Turn-taking survey |
| 4 | Knill et al. (2025) | Speak & Improve L2 corpus |
| 5 | Saeki et al. (2025) | Whisper+LoRA hesitation tagging |
| 6 | Gamboa et al. (2025) | ConversAR, asymmetric L2 cost |
| 7 | Deepgram Flux (2025) | Dual threshold, speculative ASR |
| 8 | Ekstedt et al. (2024) | Multilingual VAP |
| 9 | Raux & Eskenazi (NAACL 2009) | FSM turn-taking |
| 10 | LiveKit (2025) | Knowledge distillation, 500M params |
03-finetune-pipecat-pt/modal_run.py
ADDED
@@ -0,0 +1,267 @@
"""Run the full fine-tuning pipeline on Modal (A10G GPU).

Usage:
    modal run modal_run.py

This deploys the pipeline on a Modal A10G GPU (~$0.50/run):
  1. Downloads Pipecat Portuguese data
  2. Generates labels with Claude API (if ANTHROPIC_API_KEY set)
  3. Generates TTS audio with Kokoro
  4. Fine-tunes SmartTurnV3Model
  5. Evaluates against Pipecat baseline
  6. Saves results to Modal volume

Estimated time: 30-60 min total
Estimated cost: ~$0.50-1.00
"""

from __future__ import annotations

import modal

app = modal.App("babelcast-finetune-pipecat-pt")

# Persistent volume for data + results
volume = modal.Volume.from_name("finetune-pipecat-pt", create_if_missing=True)

# GPU image with all dependencies
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        # Core ML
        "torch>=2.1",
        "torchaudio>=2.1",
        "transformers>=4.36",
        "datasets>=2.16",
        # Audio processing
        "soundfile>=0.12",
        "librosa>=0.10",
        "numpy<2",
        # TTS
        "kokoro>=0.3",
        # ONNX for evaluation
        "onnxruntime>=1.16",
        # Claude API for labeling
        "anthropic>=0.40",
        # Utilities
        "huggingface-hub>=0.20",
    )
    .apt_install("ffmpeg", "libsndfile1")
)

@app.function(
    image=image,
    gpu="A10G",
    timeout=3600,  # 1 hour max
    volumes={"/workspace": volume},
    secrets=[
        modal.Secret.from_name("anthropic-api-key", required=False),
        modal.Secret.from_name("huggingface-token", required=False),
    ],
)
def run_pipeline(
    skip_download: bool = False,
    skip_labels: bool = False,
    skip_audio: bool = False,
    skip_train: bool = False,
    max_pipecat_samples: int = 5000,
    max_tts_samples: int = 10000,
    max_l2_samples: int = 2000,
    epochs: int = 6,
    batch_size: int = 128,
    lr: float = 5e-5,
    loss_fn: str = "focal",
    fp_penalty: float = 2.0,
):
    """Run the full pipeline on Modal GPU."""
    import logging
    import os
    import sys
    from pathlib import Path

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )
    log = logging.getLogger("modal_run")

    # Add script directory to path
    script_dir = Path(__file__).parent
    sys.path.insert(0, str(script_dir))

    # Symlink data dir to /workspace for persistence
    data_dir = script_dir / "data"
    workspace_data = Path("/workspace/data")
    workspace_data.mkdir(parents=True, exist_ok=True)
    if not data_dir.exists():
        data_dir.symlink_to(workspace_data)

    # Step 1: Download Pipecat data
    if not skip_download:
        log.info("=" * 60)
        log.info("STEP 1: Download Pipecat Portuguese data")
        log.info("=" * 60)
        try:
            # Use importlib since filenames start with numbers
            import importlib.util
            spec = importlib.util.spec_from_file_location("dl", script_dir / "01_download_pipecat.py")
            dl = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(dl)

            dl.download_pipecat_dataset(max_pt_samples=max_pipecat_samples)
            dl.download_pipecat_test_data(max_pt_samples=2000)
            dl.download_onnx_model()
        except Exception as e:
            log.error("Download failed: %s", e)
            raise

        volume.commit()
        log.info("Step 1 complete: data saved to volume")

    # Step 2: Generate labels (requires ANTHROPIC_API_KEY)
    if not skip_labels:
        log.info("=" * 60)
        log.info("STEP 2: Generate labels with Claude API")
        log.info("=" * 60)
        try:
            import importlib.util
            spec = importlib.util.spec_from_file_location("labels", script_dir / "02_generate_labels.py")
            labels = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(labels)

            api_key = os.environ.get("ANTHROPIC_API_KEY")
            if api_key:
                log.info("ANTHROPIC_API_KEY found: using Claude for labeling")
            else:
                log.info("No ANTHROPIC_API_KEY: using rule-based fallback")

            labels.run_full_pipeline(max_transcripts=5000, max_fr_sentences=500)
        except Exception as e:
            log.warning("Label generation failed: %s; continuing without custom labels", e)

    volume.commit()

    # Step 3: Generate TTS audio
    if not skip_audio:
        log.info("=" * 60)
        log.info("STEP 3: Generate TTS audio")
        log.info("=" * 60)
        try:
            import importlib.util
            spec = importlib.util.spec_from_file_location("audio", script_dir / "03_generate_audio.py")
            audio = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(audio)

            audio.run_audio_generation(
                max_native_sentences=3000,
                max_french_sentences=1000,
                augment_copies=2,
            )
        except Exception as e:
            log.warning("TTS generation failed: %s; continuing with Pipecat data only", e)

    volume.commit()

    # Step 4: Fine-tune
    if not skip_train:
        log.info("=" * 60)
        log.info("STEP 4: Fine-tune SmartTurnV3Model")
        log.info("=" * 60)
        try:
            import importlib.util
            spec = importlib.util.spec_from_file_location("ft", script_dir / "04_finetune.py")
            ft = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(ft)

            ft.train(
                epochs=epochs,
                batch_size=batch_size,
                lr=lr,
                warmup_ratio=0.2,
                max_pipecat_samples=max_pipecat_samples,
                max_tts_samples=max_tts_samples,
                max_l2_samples=max_l2_samples,
                loss_fn=loss_fn,
                fp_penalty=fp_penalty,
            )
        except Exception as e:
            log.error("Training failed: %s", e)
            raise

    volume.commit()

    # Step 5: Evaluate
    log.info("=" * 60)
    log.info("STEP 5: Evaluate")
    log.info("=" * 60)
    try:
        import importlib.util
        spec = importlib.util.spec_from_file_location("ev", script_dir / "05_evaluate.py")
        ev = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(ev)

        results_dir = Path("/workspace/results")
        model_path = results_dir / "best_model.pt"
        if model_path.exists():
            results = ev.run_evaluation(model_path)
            log.info("Evaluation complete!")
        else:
            log.warning("No model found at %s; skipping evaluation", model_path)
    except Exception as e:
        log.error("Evaluation failed: %s", e)

    volume.commit()
    log.info("=" * 60)
    log.info("PIPELINE COMPLETE: results saved to Modal volume 'finetune-pipecat-pt'")
    log.info("=" * 60)


@app.function(
|
| 222 |
+
image=image,
|
| 223 |
+
volumes={"/workspace": volume},
|
| 224 |
+
)
|
| 225 |
+
def download_results(local_dir: str = "results") -> list[str]:
|
| 226 |
+
"""Download results from Modal volume."""
|
| 227 |
+
from pathlib import Path
|
| 228 |
+
import shutil
|
| 229 |
+
|
| 230 |
+
results_dir = Path("/workspace/results")
|
| 231 |
+
if not results_dir.exists():
|
| 232 |
+
print("No results found on volume")
|
| 233 |
+
return []
|
| 234 |
+
|
| 235 |
+
files = []
|
| 236 |
+
for f in results_dir.rglob("*"):
|
| 237 |
+
if f.is_file():
|
| 238 |
+
files.append(str(f.relative_to(results_dir)))
|
| 239 |
+
print(f" {f.name}: {f.stat().st_size / 1024:.1f} KB")
|
| 240 |
+
|
| 241 |
+
return files
|
| 242 |
+
|
| 243 |
+
|
| 244 |
+
@app.local_entrypoint()
|
| 245 |
+
def main(
|
| 246 |
+
skip_download: bool = False,
|
| 247 |
+
skip_labels: bool = False,
|
| 248 |
+
skip_audio: bool = False,
|
| 249 |
+
skip_train: bool = False,
|
| 250 |
+
download_only: bool = False,
|
| 251 |
+
epochs: int = 10,
|
| 252 |
+
batch_size: int = 32,
|
| 253 |
+
):
|
| 254 |
+
"""Entry point for `modal run modal_run.py`."""
|
| 255 |
+
if download_only:
|
| 256 |
+
files = download_results.remote()
|
| 257 |
+
print(f"\nFound {len(files)} result files on volume")
|
| 258 |
+
return
|
| 259 |
+
|
| 260 |
+
run_pipeline.remote(
|
| 261 |
+
skip_download=skip_download,
|
| 262 |
+
skip_labels=skip_labels,
|
| 263 |
+
skip_audio=skip_audio,
|
| 264 |
+
skip_train=skip_train,
|
| 265 |
+
epochs=epochs,
|
| 266 |
+
batch_size=batch_size,
|
| 267 |
+
)
|
03-finetune-pipecat-pt/references/blogs/assemblyai_turn_detection.md
ADDED
---
title: "How Intelligent Turn Detection (Endpointing) Solves the Biggest Challenge in Voice Agent Development"
source: https://www.assemblyai.com/blog/turn-detection-endpointing-voice-agent
date_accessed: 2026-03-16
---

# How Intelligent Turn Detection (Endpointing) Solves the Biggest Challenge in Voice Agent Development

## Introduction

Voice agents enable natural conversations between humans and AI systems, representing one of Speech AI's fastest-growing applications. However, a persistent challenge affects all voice agent implementations: **turn detection**, or determining when a human finishes speaking so the AI can respond appropriately.

## The Latency Problem in Voice Agents

Voice agent developers prioritize low latency for optimal user experience. Three fundamental latencies exist in streaming speech-to-text models:

1. **Partial transcript latency** -- Speed of returning initial transcript predictions
2. **Final transcript latency** -- Speed of returning finalized transcripts after speech ends
3. **Endpointing latency** -- Speed of detecting when someone stops speaking

Voice agents specifically require fast endpointing latency. Developers have historically over-optimized for general latency reduction, treating end-of-turn detection as secondary. However, addressing turn detection provides "far greater improvements to the user experience than incremental latency optimizations," as endpointing delay exists at a different magnitude than millisecond-level latency gains.

## Three Endpointing Methods

### Manual Endpointing
Users explicitly indicate completion through button presses or voice commands. This creates a poor user experience and harms adoption despite potentially fast response times.

### Silence Detection
The current industry standard waits for a specified silence duration threshold. This approach balances responsiveness against premature interruptions but struggles to find optimal threshold values -- long enough to avoid cutoffs mid-thought, yet short enough to maintain conversational fluidity.

### Semantic Endpointing
Advanced systems analyze what someone says rather than just silence duration. This approach understands when thoughts are complete, distinguishing natural pauses mid-sentence from genuine utterance endings. Modern implementations use language models to predict sentence boundaries and semantic completeness.

## How Semantic Endpointing Works

Semantic endpointing encompasses multiple implementation approaches. AssemblyAI's Universal-Streaming model predicts a special token that the model learns to identify during training based on context.

### Configuration Parameters

Several user-configurable options enable developers to tune endpointing behavior:

- **end_of_turn_confidence_threshold** (default: 0.7) -- Required confidence level for end-of-turn predictions
- **min_end_of_turn_silence_when_confident** (default: 160ms) -- Minimum silence duration after a confident prediction to prevent false positives
- **max_turn_silence** (default: 2400ms) -- Fallback silence-based endpointing for unusual speech patterns

The system requires three conditions for a semantic end-of-turn:

1. The model predicts end-of-turn with sufficient confidence
2. The minimum silence duration has passed
3. Minimal speech is present, to prevent false positives from audio noise

Traditional silence-based endpointing remains enabled as a final catch-all mechanism for atypical patterns.
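As a rough sketch, the three-condition rule plus the silence fallback can be expressed as a single predicate. The parameter names mirror the configuration options above; the noise-guard tolerance (`speech_in_silence_ms < 50`) is an assumption, since the article does not quantify "minimal speech present":

```python
from dataclasses import dataclass


@dataclass
class EndpointConfig:
    # Names mirror the documented configuration options; the decision
    # logic below is an illustrative reconstruction, not AssemblyAI's code.
    end_of_turn_confidence_threshold: float = 0.7
    min_end_of_turn_silence_when_confident: float = 160.0  # ms
    max_turn_silence: float = 2400.0  # ms


def is_end_of_turn(eot_confidence: float,
                   silence_ms: float,
                   speech_in_silence_ms: float,
                   cfg: EndpointConfig = EndpointConfig()) -> bool:
    """Return True when the turn should be considered finished."""
    # Semantic path: confident prediction + short confirmation silence,
    # with almost no speech energy during that silence (noise guard).
    semantic = (
        eot_confidence >= cfg.end_of_turn_confidence_threshold
        and silence_ms >= cfg.min_end_of_turn_silence_when_confident
        and speech_in_silence_ms < 50.0  # assumed noise tolerance
    )
    # Fallback path: plain silence-based endpointing for atypical speech.
    fallback = silence_ms >= cfg.max_turn_silence
    return semantic or fallback
```

With the defaults, a confident prediction endpoints after 160 ms of silence, while a low-confidence one falls back to the 2400 ms silence threshold.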
## Comparative Analysis: Turn Detection Approaches

### LiveKit: Semantic-Only Approach

**Input:** Transcribed text only
**Key Dependency:** Voice Activity Detection (VAD)
**Processing:** Runs after VAD detects speech end

**Strengths:**
- Simple, focused semantic analysis
- Open source customization available

**Limitations:**
- Heavy VAD dependency creates delays if background noise triggers continuous VAD
- Ignores audio cues humans naturally use for turn boundaries
- Tends toward maximum delay periods, creating sluggish conversations

### Pipecat: Audio-Centric Detection

**Input:** Audio features only (prosody, intonation)
**Key Dependency:** Direct audio analysis
**Processing:** Real-time audio pattern recognition

**Strengths:**
- Leverages natural prosodic and intonation cues
- Potentially faster turn detection than text-dependent models
- Open source flexibility

**Limitations:**
- Sensitive to background noise interference
- Performance varies significantly with speaker accents
- Struggles with atypical prosodic patterns
- Limited by audio quality availability

### AssemblyAI: Hybrid Approach

**Input:** Both transcribed text and audio context
**Key Dependency:** Integrated approach with dynamic thresholds
**Processing:** Context-aware analysis enabling early turn detection during silence based on syntactic completeness

**Strengths:**
- Robust across varying acoustic conditions
- Dynamic adaptation based on sentence completeness rather than static VAD
- Better background noise handling through semantic backup
- More natural conversation flow with context-aware timing

**Limitations:**
- Closed source limits customization
- Potentially higher computational requirements
- Depends on transcription quality for optimal performance

## Feature Comparison Matrix

| Feature | LiveKit | Pipecat | AssemblyAI |
|---------|---------|---------|-----------|
| Input Type | Text only | Audio only | Text + Audio |
| VAD Dependency | High | None | Low |
| Noise Robustness | Poor (via VAD) | Poor | Good |
| Speaker Variation | Good | Poor | Good |
| Response Speed | Slow | Fast | Adaptive |
| Complexity | Low | Medium | High |

## Selection Guidance

**Choose LiveKit if:**
- You need simple, semantic-focused approaches
- You have clean audio environments with minimal background noise
- Response speed is less critical than accuracy

**Choose Pipecat if:**
- You have consistent speaker profiles and clean audio
- You want to experiment with pure audio-based detection

**Choose AssemblyAI if:**
- You want combined semantic and acoustic endpointing
- You handle diverse speakers and noisy environments
- Natural conversation flow is critical
- Deployment flexibility matters

## The Future of Turn Detection

Single-modality solutions offer simplicity and specialized performance, while hybrid approaches like AssemblyAI's demonstrate benefits of combining multiple signal types for robust and natural conversational experiences. Future developments will likely incorporate conversational context, speaker identification, and multimodal cues (including visual information in relevant scenarios).

The optimal choice depends on specific use case requirements, technical constraints, and acceptable trade-offs between accuracy, speed, and robustness.
03-finetune-pipecat-pt/references/blogs/deepgram_eot_evaluation.md
ADDED
---
title: "Evaluating End-of-Turn Detection Models"
source: https://deepgram.com/learn/evaluating-end-of-turn-detection-models
date_accessed: 2026-03-16
---

# Evaluating End-of-Turn (Turn Detection) Models

## Overview

Natural voice agent interactions depend critically on accurate turn-taking behavior. End-of-Turn (EoT) detection -- predicting when a speaker has finished and awaits a response -- is essential for creating conversational agents that feel responsive rather than interruptive.

Deepgram developed a comprehensive evaluation methodology for turn detection after finding existing approaches inadequate. Their approach prioritizes real-world performance over proxy metrics, evaluating "complete conversations between humans" rather than isolated turns.

## Full Conversational Evaluation

Rather than testing individual turns, Deepgram analyzes entire conversations. This methodology reflects realistic scenarios where:

- **Natural labeling**: Conversation partners provide high-quality, low-latency ground truth detection
- **Variable detection budgets**: Different turns demand faster responses based on context (simple queries versus complex reasoning)
- **Natural pause patterns**: Pre-turn silences enable evaluation of start-of-speech detection and backchannel phenomena

The team labeled over 100 hours of real conversations with ground truth transcripts and timestamped EoT labels. Notably, refining their annotation specification from "End-of-Thought" to "End-of-Turn" improved annotator confidence from 5/10 to 8.5/10 by reducing ambiguity.

## The Challenge with Timestamps

Human annotations revealed an unexpected problem: annotators consistently left small gaps between speech completion and label placement. Deepgram discovered their system frequently detected EoT within these gaps -- appearing as false positives under naive temporal alignment.

### Forced Alignment Solution

Rather than trusting raw human timestamps, Deepgram applied "forced alignment to update the human timestamps," extracting more precise end-of-speech markers. While conservative human placement might reflect natural pauses before responses, voice agents benefit from the fastest possible detection since "their turn starts are frequently delayed relative to EoT detection due to extra latency associated with LLM and TTS generation."

## Sequence Alignment for Improved Evaluation

Standard temporal matching proved insufficiently precise. Deepgram adopted sequence alignment -- treating turn boundaries as special tokens (`[EoT]`) within transcripts, similar to Word Error Rate (WER) calculation.

**Impact**: This shift yielded "3-5% absolute increases in precision and recall across models," and "manual investigation of results suggested the new values were more representative of the true performance."

The approach handled all detector types: all-in-one STT solutions, text-based detectors, and audio-only models (using Nova-3 for inter-turn transcription).

## Handling Dropped Turns

A modified Levenshtein algorithm addressed cases where intermediate turns were missed. Beyond simple edit distance, the system "determines the best alignment not only based on overall edit distance but also based on the most likely alignment between EoT tokens," preventing misalignment when turns are dropped.
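Deepgram's exact algorithm is not published in the post, but the core idea -- aligning reference and hypothesis token streams that contain `[EoT]` markers, then scoring boundary precision and recall from the alignment -- can be sketched with plain Levenshtein alignment (the EoT-aware tie-breaking described above is omitted):

```python
def align(ref, hyp):
    """Levenshtein alignment; returns matched/substituted (ref_idx, hyp_idx) pairs."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    # Backtrace, preferring diagonal moves (match/substitution).
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        cost = 0 if ref[i - 1] == hyp[j - 1] else 1
        if d[i][j] == d[i - 1][j - 1] + cost:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1  # deletion: ref token unmatched
        else:
            j -= 1  # insertion: hyp token unmatched
    return pairs


def eot_precision_recall(ref, hyp, tok="[EoT]"):
    """Score predicted turn boundaries against reference boundaries."""
    hits = sum(1 for i, j in align(ref, hyp) if ref[i] == tok and hyp[j] == tok)
    return hits / max(1, hyp.count(tok)), hits / max(1, ref.count(tok))
```

A spurious mid-turn `[EoT]` in the hypothesis then lowers precision without disturbing the alignment of the surrounding words, which is exactly the property temporal matching lacked.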
## Start-of-Turn (SoT) Evaluation

Complementing EoT analysis, Deepgram evaluated "start-of-turn (SoT)" detection -- identifying when users resume speaking, critical for handling interruptions or "barge-in" scenarios.

**Flux performance metrics**:
- Detection within ~100-200ms of first word onset
- False positive rate: less than or equal to 1-2%
- Analysis used word start times as detection targets (representing unachievable lower bounds)

Negative latency directly indicates false positives -- detecting speech before it actually begins.

## Future Directions

The team highlighted emerging evaluation frontiers: "one could combine the various aspects of turn detection, such as false positive rate and latency, into single quality metrics." They specifically mentioned their Voice Agent Quality Index (VAQI) framework and anticipated incorporating semantic/conversational flow considerations -- enabling "stratified or weighted metrics that reflect that some turns merit faster responses than others."
03-finetune-pipecat-pt/references/blogs/krisp_turn_taking_v1.md
ADDED
---
title: "Turn-Taking for Voice AI"
source: https://krisp.ai/blog/turn-taking-for-voice-ai/
date_accessed: 2026-03-16
---

# Audio-only, 6M Weights Turn-Taking Model for Voice AI Agents

## Overview

Krisp has developed a lightweight turn-taking model designed to determine when speakers should transition in voice conversations. This technology is included at no additional cost in Krisp's VIVA SDK.

## What is Turn-Taking?

Turn-taking represents "the fundamental mechanism by which participants in a conversation coordinate who speaks when." In voice AI contexts, it manages when agents should listen, speak, or remain silent. The model addresses two primary tasks:

- **End-of-turn prediction**: Detecting when the current speaker will finish
- **Backchannel prediction**: Recognizing brief acknowledgments like "uh-huh" without speaker transitions

## Implementation Approaches

**Audio-based methods** analyze acoustic features including pitch variations, energy levels, intonation, pauses, and speaking rate. These enable low-latency responses critical for real-time scenarios.

**Text-based approaches** examine transcribed content, identifying linguistic cues like sentence boundaries and discourse markers, though they typically require larger architectures.

**Multimodal (fusion) systems** combine both modalities, leveraging acoustic cues alongside semantic understanding.

## Key Challenges

- **Hesitation vs. completion**: Distinguishing filler words ("um," "you know") from true turn endings
- **Natural pauses**: Differentiating conversational pauses from actual turn boundaries
- **Response speed**: Minimizing latency while maintaining accuracy
- **Speaking diversity**: Accounting for varying rhythms, accents, and intonation patterns

## Model Architecture

The Krisp model processes 100ms audio frames, outputting confidence scores (0-1) indicating shift likelihood. A configurable threshold determines binary shift predictions, with a default 5-second maximum hold duration.
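The frame-wise decision rule can be sketched as below. The threshold value is a placeholder (the article only says it is configurable), and the function names are illustrative, not Krisp SDK API:

```python
def ms_to_shift(confidences, threshold=0.5, frame_ms=100, max_hold_ms=5000):
    """Scan per-frame shift confidences and return the millisecond offset at
    which a shift is declared, or None if the turn is still held.

    `threshold` is an assumed placeholder; `frame_ms` and `max_hold_ms`
    reflect the 100 ms frames and 5-second default maximum hold described
    in the article. The max hold forces a shift even when confidence never
    crosses the threshold.
    """
    for k, conf in enumerate(confidences):
        elapsed_ms = (k + 1) * frame_ms
        if conf >= threshold or elapsed_ms >= max_hold_ms:
            return elapsed_ms
    return None
```

Raising the threshold trades fewer premature shifts (interruptions) for slower responses, which is the trade-off the evaluation metrics below quantify.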
## Performance Comparison

| Attribute | Krisp TT | SmartTurn v1 | SmartTurn v2 | VAD-based |
|-----------|----------|-------------|-------------|-----------|
| Parameters | 6.1M | 581M | 95M | 260k |
| Model Size | 65 MB | 2.3 GB | 360 MB | 2.3 MB |
| Execution | CPU | GPU | GPU | CPU |

## Evaluation Metrics

### Accuracy Results

| Model | Balanced Accuracy | AUC Shift | F1 Score Shift | F1 Score Hold | AUC (MST vs FPR) |
|-------|------------------|-----------|----------------|---------------|-----------------|
| Krisp TT | 0.82 | 0.89 | 0.80 | 0.83 | 0.21 |
| VAD-based | 0.59 | -- | 0.48 | 0.70 | -- |
| SmartTurn V1 | 0.78 | 0.86 | 0.73 | 0.84 | 0.39 |
| SmartTurn V2 | 0.78 | 0.83 | 0.76 | 0.78 | 0.44 |
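For reference, the balanced-accuracy and per-class F1 figures in the table follow the standard definitions over shift/hold confusion counts. A minimal sketch (the counts in the test are hypothetical, not Krisp's data):

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of shift recall (sensitivity) and hold recall (specificity)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for a single class
    (computed separately for shift and hold in the table above)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Balanced accuracy is the right headline metric here because hold cases outnumber shift cases, so plain accuracy would reward always predicting hold.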
### Training Data

- Approximately 2,000 hours of conversational speech
- Around 700,000 speaker turns
- Test dataset: 1,875 manually labeled audio samples

## Key Performance Findings

The Krisp model achieves "considerably faster average response time (0.9 vs. 1.3 seconds at a 0.06 FPR) compared to SmartTurn" while being 5-10 times smaller and optimized for CPU execution.

## Future Development

**Planned enhancements include:**

- Text-based turn-taking using custom neural networks
- Multimodal audio-text fusion for improved accuracy
- Backchannel detection to distinguish meaningful interruptions from casual listening acknowledgments

---

*Article published August 5, 2025 by Krisp Engineering Team*
03-finetune-pipecat-pt/references/blogs/krisp_turn_taking_v2.md
ADDED
---
title: "Krisp Turn-Taking v2 - Voice AI VIVA SDK"
source: https://krisp.ai/blog/krisp-turn-taking-v2-voice-ai-viva-sdk/
date_accessed: 2026-03-16
---

# Audio-Only Turn-Taking Model v2 - Krisp

## Overview

Krisp has released Turn-Taking v2, an updated model for detecting end-of-turns in real-time conversational AI systems. The model processes audio input only, making it suitable for "human-bot interactions" and other voice AI applications integrated through Krisp's VIVA SDK.

## Technical Architecture

The latest iteration, **krisp-viva-tt-v2**, represents a substantial advancement over its predecessor. According to the engineering team, it was "trained on a more diverse and better-structured dataset, with richer data augmentations that help the model perform more reliably in real-world conditions."

## Key Improvements

- Enhanced robustness in noisy environments
- Superior accuracy when combined with Krisp's Voice Isolation models
- Faster and more stable turn detection during live conversations

## Performance Benchmarks

### Clean Audio Testing

Testing evaluated approximately 1,800 real conversation samples (~1,000 "hold" cases and ~800 "shift" cases) with mild background noise:

| Model | Balanced Accuracy | AUC | F1 Score |
|-------|-------------------|-----|----------|
| krisp-viva-tt-v1 | 0.82 | 0.89 | 0.804 |
| **krisp-viva-tt-v2** | **0.823** | **0.904** | **0.813** |

### Noisy Audio Testing (5-15 dB noise levels)

| Model | Balanced Accuracy | AUC | F1 Score |
|-------|-------------------|-----|----------|
| krisp-viva-tt-v1 | 0.723 | 0.799 | 0.71 |
| **krisp-viva-tt-v2** | **0.768** | **0.842** | **0.757** |

V2 demonstrated "up to a 6% improvement in F1 score under noisy conditions."

### Post-Processing with Voice Isolation

After applying background noise and voice removal through krisp-viva-tel-v2:

| Model | Balanced Accuracy | AUC | F1 Score |
|-------|-------------------|-----|----------|
| krisp-viva-tt-v1 | 0.787 | 0.854 | 0.775 |
| **krisp-viva-tt-v2** | **0.816** | **0.885** | **0.808** |

## Availability

The model is now available as part of Krisp's VIVA SDK, designed for developers building Voice AI agents and conversational systems.
03-finetune-pipecat-pt/references/blogs/livekit_eot_v0_4.md
ADDED
---
title: "Improved End-of-Turn Model Cuts Voice AI Interruptions 39%"
source: https://livekit.com/blog/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/
date_accessed: 2026-03-16
---

# Improved End-of-Turn Model Cuts Voice AI Interruptions 39%

## Overview

LiveKit released an updated transformer-based end-of-turn detection model, `v0.4.1-intl`, featuring "a 39.23% relative reduction in false-positive interruptions" compared to the previous version. The model now emphasizes structured data handling and multilingual performance across supported languages.

## Key Improvements

### Performance Metrics

The benchmark demonstrates consistent gains across all tested languages:

| Language | v0.3.0 Error Rate | v0.4.1 Error Rate | Relative Improvement |
|----------|------------------|------------------|----------------------|
| Chinese | 18.70% | 13.40% | 28.34% |
| Dutch | 26.10% | 11.90% | 54.33% |
| English | 16.60% | 13.00% | 21.69% |
| French | 16.80% | 11.10% | 33.93% |
| German | 23.40% | 12.20% | 47.86% |
| Hindi | 5.40% | 3.70% | 31.48% |
| Indonesian | 17.00% | 10.60% | 37.65% |
| Italian | 20.10% | 14.90% | 25.87% |
| Japanese | 19.70% | 11.20% | 43.15% |
| Korean | 7.90% | 5.50% | 30.38% |
| Portuguese | 23.30% | 12.60% | 45.97% |
| Russian | 19.50% | 12.00% | 38.46% |
| Spanish | 21.50% | 14.00% | 33.88% |
| Turkish | 25.40% | 12.70% | 50.0% |
| **All** | **18.66%** | **11.34%** | **39.23%** |

Error rate represents the false positive rate at a fixed true positive rate of 99.3%.
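The "Relative Improvement" column is simply the relative reduction of the error rate; the headline aggregate figure checks out:

```python
def relative_improvement(old_rate, new_rate):
    """Relative reduction in error rate, as a percentage."""
    return 100.0 * (old_rate - new_rate) / old_rate


# Headline aggregate: 18.66% -> 11.34% error rate yields 39.23%.
assert round(relative_improvement(18.66, 11.34), 2) == 39.23
```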
## Technical Architecture

### The Challenge of End-of-Turn Detection

Voice AI systems must detect speech completion using three primary cues:

- **Semantic content**: Word meaning and language understanding
- **Context**: Dialogue history and conversational flow
- **Prosody**: Tone, pauses, and rhythmic patterns

The model uses an LLM backbone to effectively combine content and context information.

### Structured Data Handling

A primary enhancement addresses structured information collection. When users provide phone numbers, emails, or addresses, natural speech markers (intonation changes, grammatical endings) are absent. The updated model "addresses this by inferring expected formats from the agent's prompt," enabling:

- Detection of complete phone numbers (typically 10 digits for US numbers)
- Email pattern recognition
- Address validation including street, city, state, and zip code

The visualization tool demonstrates the improvement: `v0.3.0-intl` incorrectly detected end points after individual digits, while `v0.4.1-intl` waited for the complete sequence.

### Handling Speech-to-Text Variability

Training data incorporated multiple STT output formats. One engine might transcribe "forty two" as words while another outputs "42" numerically. By training across "common STT output formats," the model maintains consistent performance across different provider implementations.

### Multilingual Generalization

Despite structured data enhancements targeting English training data, improvements transferred across languages -- particularly for phone number and address detection in Spanish, French, and other languages. This stems from the Qwen2.5 base model's multilingual pretraining, which encodes "knowledge of global formats."

## Model Architecture & Optimization

### Base Model Selection

The team selected **Qwen2.5-0.5B-Instruct** for its balance of performance and low-latency CPU inference capabilities.

### Knowledge Distillation

To improve multilingual accuracy without sacrificing efficiency:

1. A larger **Qwen2.5-7B-Instruct** teacher model was trained
2. Knowledge was distilled into the smaller 0.5B student model
3. The distilled model achieves "higher multilingual accuracy with the efficiency of the smaller size"

Training curves show the distilled model outperforming the baseline 0.5B version and approaching teacher performance after approximately 1,500 steps.

## Availability & Deprecation

- **Deployment**: Available in Agents Python 1.3.0 and Agents JS 1.0.19
- **Model**: `MultilingualModel` now recommended for all use cases
- **Deprecation**: The legacy `EnglishModel` is being deprecated as the multilingual version "not only matches but in most cases exceeds the performance"

## Observability Integration

LiveKit Agents now include built-in observability for end-of-turn detection. When Agent Observability is enabled, "every turn detection decision is logged with the exact input the model saw." Production debugging accesses `eou_detection` traces showing full prediction context.

## Future Directions

Current implementation relies on transcribed text. Future iterations will integrate raw audio features like pauses and emphasis through multimodal architectures, "fusing prosody directly into predictions for more precise detection."

---

**Resources**:
- [GitHub Repository](https://github.com/livekit/agents)
- [Visualization Tool](https://huggingface.co/spaces/livekit/eot-visualization)
- [Voice AI Quickstart](https://docs.livekit.io/agents/start/voice-ai/)
03-finetune-pipecat-pt/references/blogs/livekit_transformer_turn_detection.md
ADDED
---
title: "Using a Transformer to Improve End of Turn Detection"
source: https://livekit.com/blog/using-a-transformer-to-improve-end-of-turn-detection
date_accessed: 2026-03-16
---

# Using a Transformer to Improve End of Turn Detection

**Published:** December 20, 2024 | **Author:** Russ D'Sa | **Reading Time:** 6 minutes

## Overview

End-of-turn detection is one of the most challenging problems in voice AI applications. The core task is determining when a user has finished speaking, so that an AI model can respond without unintentionally interrupting the user.

## Current Approach: Phrase Endpointing

The predominant technique for turn detection is phrase endpointing, typically implemented through voice activity detection (VAD). This method operates as follows:

- Audio packets are processed through a neural network to determine whether human speech is present
- If speech is detected, the user continues speaking
- If silence is detected, a timer begins tracking the duration of the absence of detectable human speech
- Once a configured silence threshold passes, an end-of-turn event triggers, allowing LLM inference to begin

LiveKit Agents uses "Silero VAD to detect voice activity and provide timing parameters" to adjust sensitivity. The framework introduces a configurable delay via the `min_endpointing_delay` parameter (default: 500ms) between when VAD detects speech cessation and when LLM inference begins.
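The endpointing loop described above can be sketched as a small state machine. This is a minimal illustration, not LiveKit's actual API: the class name and per-frame call pattern are hypothetical; only the 500ms default mirrors `min_endpointing_delay`.

```python
class PhraseEndpointer:
    """Toy phrase endpointer: fires end-of-turn after sustained silence."""

    def __init__(self, min_endpointing_delay: float = 0.5):
        self.min_endpointing_delay = min_endpointing_delay  # seconds (500ms default)
        self.silence_started_at = None  # None while speech is ongoing

    def process_frame(self, is_speech: bool, now: float) -> bool:
        """Feed one VAD decision per audio frame; returns True on end of turn."""
        if is_speech:
            self.silence_started_at = None  # user is still talking: reset the timer
            return False
        if self.silence_started_at is None:
            self.silence_started_at = now  # silence just began
        return (now - self.silence_started_at) >= self.min_endpointing_delay

# Simulated VAD stream: 200ms of speech, then silence, sampled every 100ms.
ep = PhraseEndpointer(min_endpointing_delay=0.5)
frames = [True, True, False, False, False, False, False, False]
decisions = [ep.process_frame(s, t * 0.1) for t, s in enumerate(frames)]
print(decisions)  # end-of-turn fires once 500ms of silence has accumulated
```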
## The VAD Limitation

VAD only detects *when* someone is speaking, based on the presence of an audio signal. Humans, however, use semantic understanding -- analyzing what is said and how it's expressed -- to determine turn-taking. For example, "I understand your point, but..." would trigger VAD's end-of-turn signal even though a human listener recognizes the speaker intends to continue.

## The EOU (End of Utterance) Model

LiveKit released "an open source transformer model that uses the content of speech to predict when a user has" finished speaking. The model incorporates semantic understanding into turn detection.

### Model Architecture

- **Base Model:** 135M parameter transformer based on SmolLM v2 from HuggingFace
- **Design Choice:** Small model size enables CPU-based, real-time inference
- **Context Window:** Sliding window of the last four conversational turns
- **Language Support:** Currently English transcriptions only; additional languages planned

### How It Works

During user speech, transcriptions from a speech-to-text (STT) service are appended word-by-word to the model's context window. For each final STT transcription, the model generates a confidence-level prediction of whether the current context represents turn completion.

The model integrates with VAD by dynamically adjusting the silence timeout: longer silence periods are permitted when EOU indicates the user hasn't finished speaking, reducing interruptions while maintaining responsiveness.
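One way to picture this dynamic adjustment is a mapping from the EOU model's turn-completion probability to a silence timeout. A sketch under assumed values: the function name, linear interpolation, and the 0.5s/3.0s bounds are illustrative, not LiveKit's actual implementation.

```python
def endpointing_delay(eou_prob: float,
                      min_delay: float = 0.5,
                      max_delay: float = 3.0) -> float:
    """Map EOU turn-completion probability to a silence timeout.

    High probability the turn is over -> respond quickly (min_delay).
    Low probability -> tolerate a longer pause before ending the turn.
    """
    # Linear interpolation between max_delay (prob=0) and min_delay (prob=1).
    return max_delay - eou_prob * (max_delay - min_delay)

print(endpointing_delay(1.0))  # confident the turn ended -> 0.5
print(endpointing_delay(0.0))  # mid-utterance pause -> 3.0
```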
## Performance Results

Compared to VAD alone, the EOU + VAD approach achieves:

- **85% reduction** in unintentional interruptions
- **3% false negative rate** (incorrectly indicating turn continuation)

### Use Cases

The model proves particularly valuable for conversational AI and customer support applications requiring data collection:

- Conducting interviews
- Collecting addresses for shipments
- Gathering phone numbers
- Processing payment information
- Handling ordering transactions (pizza ordering demonstrated)

## Implementation

The turn detector is packaged as a LiveKit Agents plugin, enabling simple integration through a single additional parameter in the `VoicePipelineAgent` constructor:

```
turn_detector=TurnDetector.from_plugin()
```

Example implementation code is available in the LiveKit agents repository.

## Future Development

Planned improvements include:

- **Multi-language Support:** Extending EOU beyond English
- **Inference Optimization:** Reducing current ~50ms inference latency
- **Context Window Expansion:** Increasing the four-turn window
- **Audio-Based Detection:** Developing models for multimodal systems that process audio directly, accounting for prosodic features like intonation and cadence

The team acknowledges that EOU's text-based training limits its use with natively multimodal models such as OpenAI's Realtime API. An audio-native model is under development to address this constraint.

## Broader Implications

The researchers emphasize that "even humans don't get this right all the time" and that non-verbal cues improve human turn-taking performance. They consider this research essential for developing natural, humanlike AI interactions and anticipate significant community innovation in this area.
03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3.md
ADDED
@@ -0,0 +1,75 @@
---
title: "Announcing Smart Turn v3: CPU Inference in Just 12ms"
source: https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/
date_accessed: 2026-03-16
---

# Smart Turn v3: CPU Inference Breakthrough

## Overview

Daily announced Smart Turn v3, a dramatically improved voice turn detection model featuring unprecedented efficiency. The system achieves "CPU inference in just 12ms" while maintaining open-source accessibility for weights, training data, and scripts.

## Key Improvements

### Size and Performance

- **Model size:** Reduced to 8 MB (nearly 50x smaller than v2)
- **CPU inference speed:** 12ms on modern processors; 60ms on budget AWS instances
- **No GPU required:** Runs directly within Pipecat Cloud instances

### Language Coverage

Expanded to 23 languages, including Arabic, Bengali, Chinese, Danish, Dutch, German, English, Finnish, French, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.

### Architecture

- **Foundation:** Whisper Tiny encoder (39M parameters)
- **Classification layers:** Adapted from Smart Turn v2
- **Total parameters:** 8M
- **Optimization:** int8 quantization via static QAT
- **Export format:** ONNX

## Competitive Comparison

| Metric | Smart Turn v3 | Krisp | Ultravox |
|--------|---------------|-------|----------|
| Size | 8 MB | 65 MB | 1.37 GB |
| Languages | 23 | English only | 26 |
| Availability | Open weights/data | Proprietary | Open weights |
| Focus | Single-inference latency | Decision confidence | Conversation context |

## Performance Benchmarks

CPU inference results (including preprocessing):

| Platform | Speed |
|----------|-------|
| AWS c7a.2xlarge | 12.6 ms |
| AWS c8g.2xlarge | 15.2 ms |
| Modal (6 cores) | 17.7 ms |
| AWS t3.2xlarge | 33.8 ms |
| AWS c8g.medium | 59.8 ms |
| AWS t3.medium | 94.8 ms |

GPU performance shows 3.3-6.6ms latency across various NVIDIA processors.

## Accuracy Results

Highest performers: Turkish (97.10%), Korean (96.85%), Japanese (96.76%)

Lower performers: Vietnamese (81.27%), Bengali (84.10%), Marathi (87.60%)

English achieved 94.31% accuracy across 2,846 test samples.

## Implementation

### With Pipecat

`LocalSmartTurnAnalyzerV3` integration is available (v0.0.85+). Users download the ONNX model from the HuggingFace repository.

### Standalone

Direct ONNX runtime usage via the provided inference scripts. Requires an accompanying VAD model (Silero recommended) for optimal results.

## Resources

- Model weights on HuggingFace
- GitHub repository containing training code and inference examples
- Open test datasets available for benchmark reproduction
- Community dataset annotation available at smart-turn-dataset.pipecat.ai
03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3_1.md
ADDED
@@ -0,0 +1,69 @@
---
title: "Improved Accuracy in Smart Turn v3.1"
source: https://www.daily.co/blog/improved-accuracy-in-smart-turn-v3-1/
date_accessed: 2026-03-16
---

# Improved Accuracy in Smart Turn v3.1

Smart Turn v3.1 represents a significant advancement in conversation turn detection, leveraging expanded training datasets to enhance model performance across languages.

## Overview

Smart Turn v3.1 maintains the same architecture as its predecessor while offering improved accuracy through additional human audio training data and refined quantization methods. The update functions as a direct replacement for v3.0, requiring no code modifications to existing implementations.

## Training Data Enhancement

The model benefits from contributions by three specialized data partners:

**Liva AI** provides "real human voice data to improve speech models" across multiple languages and dialects, founded by researchers with publications in IEEE and machine learning conferences.

**Midcentury** develops "multimodal-native research" datasets spanning 12+ languages, emphasizing real-world performance scenarios.

**MundoAI** constructs "the world's largest and highest quality multimodal datasets" across 16+ languages, prioritizing multilingual diversity.

These partners contributed audio samples in English and Spanish, publicly released through HuggingFace as the `smart-turn-data-v3.1-train` and `smart-turn-data-v3.1-test` datasets.

## Model Variants

Two deployment options accommodate different hardware configurations:

- **CPU Model (8MB, int8 quantized)**: Lightweight variant enabling ~12ms CPU inference, matching v3.0 sizing
- **GPU Model (32MB, unquantized)**: Enhanced variant for GPU deployment with ~1% accuracy improvement

## Accuracy Improvements

Testing on the new v3.1 dataset demonstrates substantial gains:

| Language | v3.0 | v3.1 (8MB) | v3.1 (32MB) |
|----------|------|------------|-------------|
| English | 88.3% | 94.7% | 95.6% |
| Spanish | 86.7% | 90.1% | 91.0% |

The model supports 23 languages in total; the remaining 21 maintain parity with v3.0 performance.

## Performance Benchmarks

Single-inference latency across representative hardware:

| Device | v3.1 (8MB) | v3.1 (32MB) | Preprocessing |
|--------|------------|-------------|---------------|
| GPU (NVIDIA L40S) | 2 ms | 1 ms | 1 ms |
| GPU (NVIDIA T4) | 5 ms | 4 ms | 2 ms |
| CPU (AWS c7a.2xlarge) | 9 ms | 13 ms | 7 ms |
| CPU (AWS c8g.2xlarge) | 20 ms | 32 ms | 9 ms |
| CPU (AWS c7a.medium) | 37 ms | 73 ms | 7 ms |
| CPU (AWS c8g.medium) | 57 ms | 159 ms | 9 ms |

Performance optimization is achievable through environment variable configuration:

```
OMP_NUM_THREADS=1
OMP_WAIT_POLICY="PASSIVE"
```

## Resources

- Model weights: https://huggingface.co/pipecat-ai/smart-turn-v3
- GitHub repository: https://github.com/pipecat-ai/smart-turn
- Datasets: https://huggingface.co/pipecat-ai
03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3_2.md
ADDED
@@ -0,0 +1,41 @@
---
title: "Smart Turn v3.2: Handling Noisy Environments and Short Responses"
source: https://www.daily.co/blog/smart-turn-v3-2-handling-noisy-environments-and-short-responses/
date_accessed: 2026-03-16
---

# Smart Turn v3.2: Better Accuracy for AI Voice Agents

## Overview

Smart Turn v3.2 represents an update to an open-source turn detection model designed to help AI voice agents identify when users finish speaking. The model listens to raw audio and determines optimal response timing -- preventing interruptions while eliminating unnecessary delays.

## Key Improvements in v3.2

### Short Utterances Enhancement

The model now handles brief verbal responses significantly better. Single words like "yes" or "okay" are "miscategorized 40% less often according to our public benchmarks." Two changes enabled this improvement:

- Introduction of a new dataset focused on short utterances (planned for expansion)
- Resolution of a padding issue during training that was compromising accuracy

### Background Noise Robustness

The updated version performs better in real-world environments by incorporating realistic cafe and office noise into training and testing datasets, moving beyond studio-quality audio assumptions.

## Technical Specifications

**Model Variants:**

- CPU version: 8MB
- GPU version: 32MB

Both serve as drop-in replacements for v3.1, maintaining compatibility with existing implementations.

## Available Resources

**Open-source components:**

- Model weights: Available on HuggingFace
- Training code: Published on GitHub
- Datasets: Two new datasets released for training and testing purposes

## Integration

The model integrates with Pipecat through the `LocalSmartTurnAnalyzerV3` constructor, allowing immediate use via the `smart_turn_model_path` parameter.
03-finetune-pipecat-pt/references/blogs/speechmatics_semantic_turn.md
ADDED
@@ -0,0 +1,103 @@
---
title: "How to Build Smarter Turn Detection for Voice AI"
source: https://blog.speechmatics.com/semantic-turn-detection
date_accessed: 2026-03-16
---

# How to Build Smarter Turn Detection for Voice AI

**Published:** May 12, 2025 | **Author:** Aaron Ng, Machine Learning Engineer | **Read Time:** 13 minutes

## Introduction

Voice AI systems face a fundamental challenge: determining when users have finished speaking. Traditional approaches relying on fixed silence periods (like 500ms) create frustrating interruptions. This article explores how semantic understanding can improve turn detection.

## The Problem with Traditional Voice Activity Detection

VAD systems detect speech boundaries through audio patterns alone. As the article explains, "VAD only understands audio patterns. It knows _when_ there's speech and when there isn't. What it doesn't know is _why_ there's a pause."

The example demonstrates this limitation: when a customer pauses to check information ("Sure it's 123 764... (pauses to check their notes)"), VAD incorrectly interprets the silence as turn completion and triggers an interruption.

## Semantic Turn Detection Solution

Rather than relying solely on silence detection, the proposed approach uses instruction-tuned Small Language Models (SLMs) to understand conversational context. These models contain fewer than 10 billion parameters, enabling the fast local inference critical for voice interactions.

### Key Benefits

- **Reduced Latency**: Local SLM inference avoids the 500ms+ delays of external API calls
- **Cost Savings**: Fewer false interruptions mean fewer unnecessary LLM API calls for responses that get discarded
- **Better UX**: Systems recognize contextual pauses versus actual turn completion

## Technical Implementation

### Core Mechanism

The approach monitors the probability of the `<|im_end|>` token -- a special ChatML marker indicating turn completion. When this token's probability is high, the model predicts the user has finished speaking.

The article illustrates this with examples:

- "Can I have two chicken McNuggets and" -> Low `<|im_end|>` probability (incomplete thought)
- "I have a problem with my card" -> Higher `<|im_end|>` probability (complete thought)

### Architecture Components

**1. Message Tokenization**

Messages are formatted using the ChatML structure with special tokens (`<|im_start|>`, `<|im_end|>`, `<|im_sep|>`). The implementation removes the final `<|im_end|>` token, since predicting its presence is exactly the model's job.

**2. Model Inference**

The code uses `AutoModelForCausalLM` to extract logits for the final token position, computing log-softmax probabilities across the vocabulary.

**3. Token Probability Extraction**

The system identifies the `<|im_end|>` token among the top-k predictions (k=20) and converts log probabilities to standard probabilities via exponentiation.

### Code Example Structure

The implementation includes an `EndOfTurnModel` class with:

- `_convert_messages_to_chatml()`: Formats conversations in ChatML format
- `get_next_token_logprobs()`: Performs local inference to retrieve next-token probabilities
- `process_result()`: Extracts target token probabilities
- `predict_eot_prob()`: Orchestrates the full pipeline, returning probability scores

**Model Details**: The implementation uses `SmolLM2-360M-Instruct` from Hugging Face, chosen for efficiency and CPU compatibility.
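Two of these steps can be sketched without loading a model: building the ChatML prompt with the trailing `<|im_end|>` removed, and converting a top-k log-probability result into an end-of-turn probability. A minimal illustration; the function names echo the article's class, and the toy log-probabilities stand in for real model output.

```python
import math

IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def convert_messages_to_chatml(messages: list) -> str:
    """Format a conversation in ChatML, dropping the final <|im_end|>
    so the model is asked to predict whether that token comes next."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}" for m in messages]
    text = "\n".join(parts)
    return text[: -len(IM_END)]  # remove the trailing end-of-turn marker

def eot_probability(top_logprobs: dict) -> float:
    """Extract P(<|im_end|>) from a top-k {token: logprob} mapping."""
    if IM_END not in top_logprobs:
        return 0.0  # not in top-k: treat as negligible
    return math.exp(top_logprobs[IM_END])

prompt = convert_messages_to_chatml([
    {"role": "assistant", "content": "How can I help?"},
    {"role": "user", "content": "I have a problem with my card"},
])
assert not prompt.endswith(IM_END)  # model must decide if the turn is over

# Toy log-probabilities, as a model might return for the final position.
prob = eot_probability({IM_END: math.log(0.21), ".": math.log(0.40)})
print(round(prob, 2))  # 0.21
```

A threshold (the article's default is 0.03) is then applied to `prob` to decide whether the turn is complete.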
## Configuration Parameters

- **MAX_HISTORY**: 4 messages (recent context window)
- **DEFAULT_THRESHOLD**: 0.03 (default probability threshold)
- **Top-k consideration**: 20 tokens

## Practical Considerations

### Token Selection Strategy

Beyond `<|im_end|>`, punctuation marks (periods, question marks, exclamation points) can signal thought completion. Hybrid approaches monitoring multiple tokens may improve accuracy but risk introducing noise.

### Hybrid Approach with VAD

The article recommends combining semantic detection with VAD:

- VAD provides high recall in detecting speech presence
- Semantic turn detection improves precision to reduce false interruptions
- Dynamic grace periods can adjust based on speaking patterns

### Threshold Determination

The default 0.03 threshold serves as a starting point. "The optimal threshold depends on your specific SLM and can vary significantly between models." Precision-focused optimization using representative test sets is recommended.

### Multilingual Support

Supporting multiple languages requires instruction-tuned SLMs trained on diverse multilingual data, accounting for varied grammar and conversational norms.

### Fine-Tuning Opportunities

While general SLMs provide solid conversational understanding, task-specific fine-tuning on domain conversational data can improve accuracy for specialized terminology or industry-specific use cases.

## Future Directions

The article acknowledges limitations: "True conversational understanding requires more than just reading text or listening for gaps -- it needs an audio-native model" that processes tone, cadence, hesitation, and emphasis directly from audio signals.

## Resource Reference

A complete implementation is available on GitHub at the repository referenced in the article.
03-finetune-pipecat-pt/references/guides/focal_loss_calibration_emnlp_2022.md
ADDED
@@ -0,0 +1,209 @@
---
title: "Calibrating Imbalanced Classifiers with Focal Loss: An Empirical Study"
source: https://aclanthology.org/2022.emnlp-industry.14/
date: 2026-03-16
---

# Calibrating Imbalanced Classifiers with Focal Loss: An Empirical Study

**Authors:** Cheng Wang, Jorge Balazs, Gyuri Szarvas, Patrick Ernst, Lahari Poddar, Pavel Danchenko (Amazon)

**Venue:** EMNLP 2022 Industry Track, December 9-11, Abu Dhabi, UAE, pages 155-163

**DOI:** 10.18653/v1/2022.emnlp-industry.14

**PDF:** https://aclanthology.org/2022.emnlp-industry.14.pdf

## Abstract

Imbalanced data distribution is a practical and common challenge in building ML models in industry, where data usually exhibits long-tail distributions. For instance, in virtual AI Assistants (Google Assistant, Amazon Alexa, Apple Siri), the "play music" or "set timer" utterance is exposed to an order of magnitude more traffic than other skills. This can easily cause trained models to overfit to the majority classes, leading to model miscalibration. The uncalibrated models output unreliable (mostly overconfident) predictions, which are at high risk of affecting downstream decision-making systems. The authors empirically show the effectiveness of model training with focal loss in learning better calibrated models, as compared to standard cross-entropy loss. Better calibration enables better control of the precision-recall trade-off for trained models.

## 1. Introduction

Building ML models in industry faces practical challenges from imbalanced data distributions, particularly long-tail distributions, which make models overfit to majority data classes and lead to miscalibration -- the model-predicted probability fails to estimate the likelihood of true correctness and provides over- or under-confident predictions.

The study focuses on **return reason code prediction** in customer service chatbots as a practical application of focal loss for calibration.

### Contributions

- Empirically examine the effectiveness of using focal loss in handling model miscalibration in a practical application setting
- Show that good calibration is important to achieve a desired precision or recall target by tuning classification thresholds (standard cross-entropy loss is incapable of this due to its skewed predicted probability distribution)
- Demonstrate performance of calibrated models through a chatbot serving customers across three conversational bot use-cases

## 2. Focal Loss Formulation

Focal loss is defined as:

```
L_f = - sum_{i=1}^{N} (1 - p_{i,y_i})^gamma * log(p_{i,y_i})
```

where `p_{i,y_i}` is the predicted probability of the i-th sample for its true class `y_i`, and `gamma` is a hyper-parameter, typically set to `gamma = 1`.
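The formula above can be sketched in a few lines of plain Python; setting gamma = 0 recovers standard cross-entropy, which makes a quick sanity check.

```python
import math

def focal_loss(probs_true_class, gamma: float = 1.0) -> float:
    """Focal loss over predicted probabilities of the true class.

    The (1 - p)^gamma factor down-weights easy, already-confident
    examples, shifting training pressure onto hard examples.
    """
    return -sum((1.0 - p) ** gamma * math.log(p) for p in probs_true_class)

def cross_entropy(probs_true_class) -> float:
    return -sum(math.log(p) for p in probs_true_class)

probs = [0.9, 0.6, 0.99]
# gamma = 0 reduces focal loss to cross-entropy.
assert abs(focal_loss(probs, gamma=0.0) - cross_entropy(probs)) < 1e-12
# Larger gamma shrinks the loss contribution of confident predictions.
print(focal_loss(probs, gamma=1.0) < cross_entropy(probs))  # True
```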
### Theoretical Interpretation

Focal loss can be interpreted as a trade-off between minimizing KL divergence and maximizing entropy:

```
L_f >= KL(q || p) + H(q) - gamma * H(p)
```

The rationale: we learn a probability `p` to have a high value (confident) due to the KL term, but not too high (overconfident) due to the entropy regularization term.

### Practical Advantages

Compared to other calibration methods (temperature scaling, Bayesian methods, label smoothing, kernel-based methods), focal loss:

- Neither increases computational overhead nor requires architectural modifications
- Offers **in-training implicit calibration** (unlike temperature scaling, which requires post-training calibration)

## 3. Calibration Metrics

### Reliability Diagrams

Predictions are grouped into N interval bins; accuracy vs. confidence is computed per bin:

```
acc(b_n) = (1/I_n) * sum_i 1(y_hat_i = y_i)
conf(b_n) = (1/I_n) * sum_i p_hat_i
```

A perfectly calibrated model has `acc(b_n) = conf(b_n)` for all n.

### Expected Calibration Error (ECE)

```
ECE = (1/I) * sum_{n=1}^{N} I_n * |acc(b_n) - conf(b_n)|
```

### Maximum Calibration Error (MCE)

```
MCE = max_n |acc(b_n) - conf(b_n)|
```

MCE is particularly important in high-risk applications. For a perfectly calibrated classifier, both ECE and MCE equal 0.
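These binned metrics can be computed directly from per-sample predictions. A minimal sketch with equal-width confidence bins; variable names follow the formulas, and the binning convention (half-open bins, with 0.0 assigned to the first bin) is one common choice.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Compute (ECE, MCE) from per-sample confidence and 0/1 correctness."""
    total = len(confidences)
    ece, mce = 0.0, 0.0
    for n in range(n_bins):
        lo, hi = n / n_bins, (n + 1) / n_bins
        # Samples whose confidence falls in bin (lo, hi]; bin 0 also takes 0.0.
        idx = [i for i, p in enumerate(confidences)
               if (lo < p <= hi) or (n == 0 and p == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)       # acc(b_n)
        conf = sum(confidences[i] for i in idx) / len(idx)  # conf(b_n)
        gap = abs(acc - conf)
        ece += (len(idx) / total) * gap  # weighted by bin size I_n / I
        mce = max(mce, gap)             # worst bin
    return ece, mce

# Overconfident toy model: 95% confident on every sample, only half correct.
ece, mce = calibration_errors([0.95, 0.95, 0.95, 0.95], [1, 0, 1, 0])
print(round(ece, 3), round(mce, 3))  # 0.45 0.45
```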
|
| 85 |
+
|
| 86 |
+
## 4. Datasets and Implementation
|
| 87 |
+
|
| 88 |
+
### Task
|
| 89 |
+
|
| 90 |
+
Binary and multi-class return reason code prediction in an online retail store:
|
| 91 |
+
- **Binary:** "item is defective or does not work" (Label 0) vs. OTHERS (Label 1)
|
| 92 |
+
- **Multi-class:** 4 specific return reasons + OTHERS (5 total classes)
|
| 93 |
+
|
| 94 |
+
Both datasets exhibit class imbalance with OTHERS as the most frequent class.
|
| 95 |
+
|
| 96 |
+
### Model Architecture
|
| 97 |
+
|
| 98 |
+
- 2 bidirectional LSTM layers + 2 dense layers
|
| 99 |
+
- Embedding dimension: 1024
|
| 100 |
+
- Hidden layer dimension: 128 (binary), 512 (multi-class)
- Dropout: 0.1 (embedding), 0.2 (dense)
- Optimizer: Adam
- Framework: PyTorch

### Golden Dataset

- Binary model: 1,013 human-annotated samples
- Multi-class model: 1,839 human-annotated samples
- Data split: 8:1:1 (train/val/test)

## 5. Results

### Binary Reason Code Prediction (Table 1)

| Metric | CE | FL1 | FL3 | FL5 | FL10 |
|--------|------|------|------|------|------|
| Accuracy | **0.836** | 0.824 | 0.831 | 0.834 | 0.816 |
| Precision | **0.838** | 0.822 | 0.830 | 0.834 | 0.807 |
| Recall | **0.823** | 0.814 | 0.821 | 0.823 | 0.805 |
| F1 | **0.828** | 0.817 | 0.824 | 0.827 | 0.806 |
| NLL | 2.159 | 1.438 | 0.608 | 0.258 | **0.178** |
| ECE | 0.168 | 0.166 | 0.139 | 0.080 | **0.078** |
| MCE | 0.720 | 0.730 | 0.236 | **0.134** | 0.143 |

**Key finding:** CE achieves the best predictive performance, but FL significantly outperforms it on the calibration metrics (NLL, ECE, MCE). Higher gamma yields better calibration at a modest cost in predictive performance.

### Multi-Reason Code Prediction (Table 2)

| Metric | CE | FL1 | FL5 |
|--------|------|------|------|
| Accuracy | 0.751 | **0.760** | 0.751 |
| Precision | **0.814** | 0.807 | **0.814** |
| Recall | **0.757** | 0.755 | **0.757** |
| F1 | **0.764** | 0.760 | **0.764** |
| NLL | 0.599 | 0.429 | **0.309** |
| ECE | 0.037 | **0.023** | 0.037 |
| MCE | 0.296 | **0.197** | 0.299 |

### Reliability Diagrams

- CE model: ECE = 16.78%, MCE = 72.00% (binary)
- FL5 model: ECE = 7.98%, MCE = 13.40% (binary)
- FL10 model: ECE = 7.76%, MCE = 14.34% (binary)

As gamma increases from CE to FL10, the predicted probability distributions shift from "spiking" (overconfident, with p close to 0 or 1) to flatter distributions (e.g., p = {0.6, 0.4}).

### Precision-Recall Trade-Off

The CE model produces polarized probabilities, making it difficult to tune precision for a given recall (or vice versa) by moving the decision threshold. FL models learn better-distributed probabilities across [0, 1], enabling effective threshold-based precision/recall tuning.
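For reference, ECE and MCE are computed by binning predictions by confidence and comparing per-bin accuracy with per-bin mean confidence. A minimal sketch, assuming equal-width bins (the bin count here is an illustrative choice, not taken from the paper):

```python
def ece_mce(confidences, correct, n_bins=10):
    """Expected / Maximum Calibration Error over equal-width confidence bins.

    confidences: predicted probability of the predicted class, per sample.
    correct: 1 if the prediction was right, else 0.
    """
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # samples whose confidence falls in this bin (include 0.0 in the first bin)
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)       # empirical accuracy
        conf = sum(confidences[i] for i in idx) / len(idx)  # mean confidence
        gap = abs(acc - conf)
        ece += (len(idx) / n) * gap  # weighted average gap
        mce = max(mce, gap)          # worst-bin gap
    return ece, mce
```

A perfectly calibrated model (e.g., confidence 0.75 with 3-of-4 correct) yields zero for both metrics, while an overconfident "spiking" model inflates them.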
## 6. Deployment Results (Online A/B Test)

The model was deployed in three conversational chatbot use-cases with a decision threshold of 0.512, chosen to target 85% precision:

### Online Evaluation (Table 3) -- Relative Improvements

| Application | Metric | Treatment vs. Control |
|------------|--------|----------------------|
| Use-case A | AR | +2.13% |
| Use-case A | PRR | +3.18% |
| Use-case A | 24RR | -0.65% |
| Use-case B | AR | +2.10% |
| Use-case B | PRR | +0.97% |
| Use-case B | 24RR | -0.68% |
| Use-case C | AR | +3.98% |
| Use-case C | PRR | +12.85% |
| Use-case C | 24RR | -1.02% |

**Metrics:**
- **AR (Automation Rate):** % of contacts resolved without human involvement (higher = better)
- **PRR (Positive Response Rate):** % of positive customer responses to chatbot resolution (higher = better)
- **24RR (24h Repeat Rate):** % of customers contacting again within 24h for the same issue (lower = better)

### Intrinsic Evaluation

- Precision of the deployed model: 384/485 = 83.8% (consistent with the offline 81.4%)
- Negative predictive value: 194/200 = 97% for the OTHERS class

## 7. Key Takeaways

1. Focal loss provides simple, effective in-training calibration via entropy regularization
2. Higher gamma values improve calibration without significantly hurting predictive performance
3. Well-calibrated models enable practical precision/recall threshold tuning that CE-trained models cannot achieve
4. The discrimination-calibration trade-off is modest -- the best model balances both (FL5 recommended)
5. Better calibration directly translates to improved downstream application metrics
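The regularizing effect behind these takeaways follows directly from the focal loss definition FL(p_t) = -(1 - p_t)^gamma * log(p_t) (Lin et al., 2017): as gamma grows, already-confident predictions contribute almost nothing to the loss, discouraging overconfidence. A minimal numeric sketch:

```python
import math

def focal_loss(p_t, gamma):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma = 0 recovers cross-entropy
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# A confidently correct example (p_t = 0.9) is strongly down-weighted as
# gamma grows, so the model gains little from pushing p_t toward 1.0:
for gamma in (0, 1, 3, 5, 10):
    print(gamma, focal_loss(0.9, gamma))
```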
## Citation

```bibtex
@inproceedings{wang-etal-2022-calibrating,
    title = "Calibrating Imbalanced Classifiers with Focal Loss: An Empirical Study",
    author = {Wang, Cheng and Balazs, Jorge and Szarvas, Gy{\"o}rgy and Ernst, Patrick and Poddar, Lahari and Danchenko, Pavel},
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    pages = "155--163",
    doi = "10.18653/v1/2022.emnlp-industry.14"
}
```

## References (Selected)

- Lin et al., 2017 -- Focal loss for dense object detection (original focal loss paper)
- Mukhoti et al., 2020 -- Calibrating deep neural networks using focal loss (NeurIPS)
- Guo et al., 2017 -- On calibration of modern neural networks (ICML)
- Pereyra et al., 2017 -- Regularizing neural networks by penalizing confident output distributions
- Naeini et al., 2015 -- Obtaining well calibrated probabilities using Bayesian binning (AAAI)
03-finetune-pipecat-pt/references/guides/livekit_turn_detector.md
ADDED
---
title: "LiveKit Turn Detector Model Card"
source: https://huggingface.co/livekit/turn-detector
date: 2026-03-16
---

# LiveKit Turn Detector

An open-weights language model for contextually-aware end-of-utterance (EOU) detection in voice AI applications. The model predicts whether a user has finished speaking based on the semantic content of their transcribed speech, providing a critical complement to voice activity detection (VAD) systems.

> For installation, usage examples, and integration guides, see the [LiveKit documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/).

## Overview

Traditional voice agents rely on voice activity detection (VAD) to determine when a user has finished speaking. VAD works by detecting the presence or absence of speech in an audio signal and applying a silence timer. While effective for detecting pauses, VAD lacks language understanding and frequently causes false positives.

**Example:** A user who says *"I need to think about that for a moment..."* and then pauses will be interrupted by a VAD-only system, even though they clearly intend to continue.

This model adds semantic understanding to the turn detection process by:
- Analyzing transcribed text of conversations in real time
- Predicting the probability that the user has completed their turn
- Reducing unwanted interruptions when integrated with VAD
- Handling structured data input effectively (addresses, phone numbers, email addresses, credit card numbers)

## Model Variants

**Multilingual** (recommended) and **English-only** (deprecated) variants are distributed as INT8-quantized ONNX models (`model_q8.onnx`) optimized for CPU inference.

> **The English-only model (`EnglishModel`) is deprecated.** Use the **multilingual model (`MultilingualModel`)** for all new projects. The multilingual model provides better accuracy across all languages thanks to knowledge distillation from a larger teacher model and an expanded training dataset.

## How It Works

The model operates on transcribed text from a speech-to-text (STT) system, not raw audio.

### Process Flow

1. **Input**: Recent conversation history (up to **6 turns**, truncated to **128 tokens**) is formatted using the [Qwen chat template](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) with `<|im_start|>` / `<|im_end|>` delimiters. The final user message is left _without_ the closing `<|im_end|>` token.

2. **Prediction**: The model predicts the probability of the `<|im_end|>` token appearing next:
   - **High probability** -> the user has likely finished their utterance
   - **Low probability** -> the user is likely to continue

3. **Thresholding**: Per-language thresholds (stored in `languages.json`) convert the raw probability into a binary decision, tuned to balance responsiveness and accuracy for each supported language.

4. **Integration with VAD**: Works alongside the [Silero VAD](https://docs.livekit.io/agents/logic/turns/vad/) plugin. VAD handles speech presence detection and interruption triggering, while this model provides the semantic signal for when to commit a turn.
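The prompt-construction idea behind the first two steps can be sketched as follows (the helper name is an illustrative assumption; the actual plugin builds the prompt internally):

```python
def format_eou_prompt(turns):
    """Format conversation history in Qwen chat-template style, leaving the
    final user message without its closing <|im_end|> token. Hypothetical
    helper for illustration only."""
    parts = []
    for role, text in turns[:-1]:
        parts.append(f"<|im_start|>{role}\n{text}<|im_end|>")
    role, text = turns[-1]
    # no closing <|im_end|>: the model's probability of emitting it next
    # is the end-of-utterance score
    parts.append(f"<|im_start|>{role}\n{text}")
    return "\n".join(parts)

prompt = format_eou_prompt([
    ("assistant", "How can I help?"),
    ("user", "I need to think about that for a"),
])
# Here the EOU probability should be low: the user is mid-sentence.
```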
### Text Preprocessing

**Multilingual variant** applies:
- NFKC unicode normalization
- Lowercasing
- Punctuation removal (preserving apostrophes and hyphens)
- Whitespace collapsing

**English-only variant** passes raw transcribed text through without normalization.
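A minimal sketch of the multilingual normalization pipeline, assuming standard NFKC and regex-based rules (the plugin's exact rules may differ):

```python
import re
import unicodedata

def normalize_multilingual(text):
    """Sketch of the multilingual variant's preprocessing: NFKC normalization,
    lowercasing, punctuation removal (keeping apostrophes and hyphens), and
    whitespace collapsing. Illustrative only."""
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    text = re.sub(r"[^\w\s'-]", "", text)      # drop punctuation except ' and -
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text

print(normalize_multilingual("Well…  I CAN'T wait, e-mail me!"))
```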
## Architecture and Training

### Base Model

Both variants are fine-tuned from [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), selected for strong performance on this task while enabling low-latency CPU inference.

**Model Specifications:**
- **Parameters**: 0.1B
- **Tensor Type**: F32
- **Chat Template**: Qwen format

### Knowledge Distillation

A **Qwen2.5-7B-Instruct** model was first fine-tuned as a teacher on end-of-turn prediction. Its knowledge was then distilled into the 0.5B student model:
- The distilled model approaches teacher-level accuracy
- It maintains the efficiency of the smaller architecture
- It converges after approximately 1,500 training steps

### Training Data

The training dataset is a mix of:

- **Real call center transcripts** covering diverse conversational patterns
- **Synthetic dialogues** emphasizing structured data input:
  - Addresses
  - Email addresses
  - Phone numbers
  - Credit card numbers
- **Multi-format STT outputs** to handle provider variation (e.g., "forty two" vs. "42"), ensuring consistent predictions across different STT engines without runtime overhead

*Note: Structured data enhancements were added only to the English training set, but the performance improvements generalized across languages due to the multilingual knowledge encoded in Qwen2.5.*

### Quantization

The trained model is exported to ONNX format and quantized to INT8 (`model_q8.onnx`), enabling efficient CPU-only inference with ONNX Runtime.

## Supported Languages

The multilingual model supports **14 languages**. The model relies on the STT provider to report the detected language, which is then used to select the appropriate per-language threshold.

**Supported:** English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Indonesian, Turkish, Russian, Hindi

## Benchmarks

### Detection Accuracy (Multilingual Variant)

- **True positive** -- the model correctly identifies that the user has finished speaking
- **True negative** -- the model correctly identifies that the user will continue speaking

| Language | True Positive Rate | True Negative Rate |
|----------|-------------------|-------------------|
| Hindi | 99.4% | 96.3% |
| Korean | 99.3% | 94.5% |
| French | 99.3% | 88.9% |
| Indonesian | 99.3% | 89.4% |
| Japanese | 99.3% | 88.8% |
| Dutch | 99.3% | 88.1% |
| Russian | 99.3% | 88.0% |
| German | 99.3% | 87.8% |
| Portuguese | 99.4% | 87.4% |
| Turkish | 99.3% | 87.3% |
| English | 99.3% | 87.0% |
| Chinese | 99.3% | 86.6% |
| Spanish | 99.3% | 86.0% |
| Italian | 99.3% | 85.1% |

### Improvement Over Prior Version

The multilingual v0.4.1 release achieved a **39.23% relative improvement** in handling structured inputs (emails, addresses, phone numbers, credit card numbers) compared to the prior version, reducing premature interruptions during data collection scenarios.

## Usage

The model is designed for use as a turn detection plugin within the [LiveKit Agents](https://github.com/livekit/agents) framework.

For complete installation instructions, code examples (Python and Node.js), and configuration options, see the **[LiveKit turn detector plugin documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/)**.

For broader context on how turn detection fits into the voice pipeline -- including VAD configuration, interruption handling, and manual turn control -- see the **[Turns overview](https://docs.livekit.io/agents/logic/turns/)**.

## Deployment Requirements

- **Runtime**: CPU-only (no GPU required). Uses [ONNX Runtime](https://onnxruntime.ai/) with the `CPUExecutionProvider`.
- **RAM**: <500 MB for the multilingual model
- **Instance type**: Use compute-optimized instances (e.g., AWS c6i, c7i). Avoid burstable instances (e.g., AWS t3, t4g) to prevent inference timeouts from CPU credit exhaustion.
- **LiveKit Cloud**: The model is deployed globally on LiveKit Cloud. Agents running there automatically use the optimized remote inference service with no local resource requirements.

## Limitations

- **Text-only input**: The model operates on STT-transcribed text and cannot incorporate prosodic cues such as pauses, intonation, or emphasis. Future versions may integrate multimodal audio features.
- **STT dependency**: Prediction quality depends on the accuracy and output format of the upstream STT provider. Mismatches between training and deployment STT formats may degrade performance.
- **Context window**: Limited to 128 tokens across a maximum of 6 conversation turns.
- **Language coverage**: Currently supports 14 languages. Performance on unsupported languages is undefined.
- **Realtime model compatibility**: Cannot be used with audio-native realtime models (e.g., OpenAI Realtime API) without adding a separate STT service, which incurs additional cost and latency.

## License

This model is released under the [LiveKit Model License](https://huggingface.co/livekit/turn-detector/blob/main/LICENSE).

## Resources

- **[Documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/)**: Full plugin documentation, installation, and integration guide
- **[Turns Overview](https://docs.livekit.io/agents/logic/turns/)**: How turn detection fits into the LiveKit Agents voice pipeline
- **[Blog: Improved End-of-Turn Model](https://blog.livekit.io/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/)**: Technical deep dive on the multilingual distillation approach and benchmarks
- **[Blog: Using a Transformer for Turn Detection](https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection/)**: Original blog post introducing the concept and architecture
- **[Video: LiveKit Turn Detector](https://youtu.be/OZG0oZKctgw)**: Overview video demonstrating the plugin
- **[GitHub: Plugin Source](https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector)**: Source code for the `livekit-plugins-turn-detector` package
- **[PyPI](https://pypi.org/project/livekit-plugins-turn-detector/)** | **[npm](https://www.npmjs.com/package/@livekit/agents-plugin-livekit)**: Package registries

---

**Model Stats:**
- Downloads last month: 240,507
- Model size: 0.1B params
- Format: Safetensors (INT8 quantized ONNX)
- Base Model: Qwen/Qwen2.5-0.5B-Instruct
03-finetune-pipecat-pt/references/guides/pipecat_data_generation_guide.md
ADDED
---
title: "Pipecat Smart Turn - Data Generation Contribution Guide"
source: https://raw.githubusercontent.com/pipecat-ai/smart-turn/main/docs/data_generation_contribution_guide.md
date: 2026-03-16
---

# Contributing Training Data to Smart Turn

## Audio Format Requirements

**FLAC is the preferred format** for training data; lossy formats such as MP3 or Opus are discouraged. Audio should be **mono with a bit depth of 16 bits**, and works optimally at a 16kHz sample rate, though higher rates are acceptable.

## File Organization

Contributors have flexibility in naming and directory structure. The recommended approach uses unique identifiers for files combined with directory labels by language and completeness status (e.g., `eng/incomplete/b3799254-8d6c-11f0-a90e-e7e92780240b.flac`).

## Audio Length and Variation

Each file must contain **one speech sample, no longer than 16 seconds.** Variety in length is encouraged, ranging from single words to complex sentences.

## Content Guidelines

Samples should represent **a single turn in the conversation**, with only one speaker per file. Speech should resemble interactions with voice assistants or customer service representatives. The guide emphasizes avoiding sentence repetition and background noise, and excluding real personally identifiable information.

## Complete vs. Incomplete Classification

Samples require binary labeling. "Complete" samples represent finished thoughts suitable for an immediate response, while "Incomplete" samples suggest the speaker will continue, ending with filler words, connectives, or suggestive prosody. Critically, incomplete samples **must not be cut off in the middle of a word**; they should end with full words followed by approximately 200ms of silence.

A 50:50 split between the two categories is recommended for unbiased training.
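One way to sanity-check the trailing-silence requirement on decoded PCM samples before submitting a clip (the amplitude threshold below is an illustrative assumption, not from the guide):

```python
def trailing_silence_ms(samples, sample_rate=16000, threshold=500):
    """Duration of trailing silence in a 16-bit mono PCM sample list, where
    'silence' means |amplitude| below `threshold`. Threshold and helper are
    illustrative, not part of the contribution guide."""
    n = 0
    for s in reversed(samples):
        if abs(s) < threshold:
            n += 1
        else:
            break
    return 1000.0 * n / sample_rate

# A clip ending in 0.2 s of near-silence at 16 kHz has 3200 trailing
# low-amplitude samples:
clip = [4000, -3500, 2000] + [0] * 3200
print(trailing_silence_ms(clip))  # 200.0
```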
## Licensing and Submission

Contributors must own the recordings and secure speaker consent for public release. Datasets are published via HuggingFace. Submissions can occur through various cloud storage methods, with GitHub issues as the contact point.
03-finetune-pipecat-pt/references/guides/pipecat_dataset_v3_2.md
ADDED
---
title: "Pipecat Smart Turn Data v3.2 - Training Dataset Card"
source: https://huggingface.co/datasets/pipecat-ai/smart-turn-data-v3.2-train
date: 2026-03-16
---

# Smart Turn Data v3.2 Training Dataset

The official training dataset for **Smart Turn v3.2**, hosted on Hugging Face Datasets.

## Key Specifications

| Attribute | Value |
|-----------|-------|
| **Organization** | Pipecat |
| **Dataset Size** | 100K - 1M rows |
| **Total Rows** | 270,946 |
| **Download Size** | 41.4 GB |
| **Parquet Size** | 41.4 GB |
| **Format** | Parquet |
| **Split** | train (271k rows) |
| **Subset** | default (271k rows) |

## Data Modalities

- **Audio** - Audio samples with durations ranging from 0.36s to 32.6s
- **Text** - Language and metadata fields

## Supported Libraries

- Datasets
- Dask
- Polars

## Dataset Schema

### Fields

| Field | Type | Description |
|-------|------|-------------|
| `audio` | AudioObject | Audio samples |
| `id` | Text | UUID identifier (36 chars) |
| `language` | Text | 23 language classes |
| `endpoint_bool` | Boolean | End-of-turn indicator |
| `midfiller` | Boolean | Mid-utterance filler detection |
| `endfiller` | Boolean | End-of-utterance filler detection |
| `synthetic` | Boolean | Synthetic data flag |
| `dataset` | Text | Source dataset identifier (12 classes) |
| `spoken_text` | Null | Skipped column |

### Language Coverage

English, Chinese (zho), Vietnamese, Finnish, Spanish, Bengali, Hindi, Japanese, Portuguese, Russian, Korean, German, French, Dutch, Arabic, Marathi, Turkish, Icelandic, Polish, Ukrainian, Italian, Indonesian, Norwegian

### Dataset Sources

- chirp3_1, chirp3_2
- liva_1
- midcentury_1
- mundo_1
- rime_2
- orpheus_endfiller_1

## Metadata (Croissant Format)

The dataset uses the **Croissant ML Commons** metadata schema (v1.1):

```json
{
  "@context": "https://schema.org/",
  "conformsTo": "http://mlcommons.org/croissant/1.1",
  "@type": "sc:Dataset",
  "name": "smart-turn-data-v3.2-train"
}
```

## Contributors

### Direct Contributors
- The Pipecat team
- [Liva AI](https://www.theliva.ai/)
- [Midcentury](https://www.midcentury.xyz/)
- [MundoAI](https://mundoai.world/)

### Background Noise Attribution
CC-0 licensed background noise samples sourced from Freesound.org contributors including:
- 4team, tomhannen, craigsmith, mrmayo, martats, and 26+ additional contributors

## Access Methods

- **Dataset Viewer:** https://huggingface.co/datasets/pipecat-ai/smart-turn-data-v3.2-train/viewer/
- **Data Studio:** https://huggingface.co/datasets/pipecat-ai/smart-turn-data-v3.2-train/viewer/default/train
- **Files Browser:** Git repository with parquet conversion available
03-finetune-pipecat-pt/references/guides/pipecat_smart_turn_readme.md
ADDED
---
title: "Pipecat Smart Turn v3.2 - README"
source: https://raw.githubusercontent.com/pipecat-ai/smart-turn/main/README.md
date: 2026-03-16
---

# Smart Turn v3.2

An open-source, community-driven native audio turn detection model designed to improve upon traditional voice activity detection (VAD) approaches. The model determines when a voice agent should respond to human speech by analyzing linguistic and acoustic cues rather than simply detecting the presence of speech.

## Key Features

**Language Support:** The system supports 23 languages, including Arabic, Bengali, Chinese, Danish, Dutch, German, English, Finnish, French, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.

**Performance Characteristics:**
- Inference completes in as little as 10ms on certain CPUs
- Most cloud instances see sub-100ms execution times
- Integrates efficiently with lightweight VAD models like Silero
- Two versions are available: an 8MB quantized CPU variant and a 32MB unquantized GPU variant
- The GPU version uses fp32 weights for marginally faster inference and a ~1% accuracy improvement
- The CPU version uses int8 quantization for reduced size and faster CPU execution, with a minimal accuracy trade-off

**Technical Approach:** The model operates directly on PCM audio samples rather than text transcriptions, capturing prosodic nuances that inform turn-taking decisions.

## Setup Instructions

**Environment Configuration:**
```bash
python3.12 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

**Platform-Specific Dependencies:**

Ubuntu/Debian systems require:
```bash
sudo apt-get update
sudo apt-get install portaudio19-dev python3-dev
```

macOS with Homebrew:
```bash
brew install portaudio
```

**Running the Demonstration:**
```bash
python record_and_predict.py
```

Initial startup requires approximately 30 seconds. Test phrases include "I can't seem to, um ..." and "I can't seem to, um, find the return label."

## Audio Input Specifications

The model accepts 16kHz mono PCM audio with a maximum duration of 8 seconds. The recommended approach is to provide the complete audio of the current user turn. When audio exceeds 8 seconds, truncate from the beginning while maintaining context.

For shorter audio, prepend zero-value padding to reach the required length, ensuring the actual speech content occupies the end of the input vector.
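These truncation and padding rules can be sketched as follows (an illustrative sketch on plain sample lists; the repository's `inference.py` implements the real preprocessing):

```python
MAX_SAMPLES = 8 * 16000  # 8 seconds at 16 kHz

def prepare_turn_audio(samples):
    """Fit a turn's 16 kHz mono PCM samples into the model's fixed window:
    truncate from the beginning when too long, left-pad with zeros when too
    short, so real speech always sits at the end of the vector."""
    if len(samples) > MAX_SAMPLES:
        return samples[-MAX_SAMPLES:]  # keep the most recent 8 s of context
    return [0] * (MAX_SAMPLES - len(samples)) + samples  # left-pad with zeros

out = prepare_turn_audio([1, 2, 3])
# out has length 128000, with the speech samples [1, 2, 3] at the end
```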
## Model Architecture

Smart Turn v3 employs Whisper Tiny as its foundation with an added linear classification layer. The transformer-based architecture contains approximately 8 million parameters. Development involved experimentation with wav2vec2-BERT, wav2vec2, LSTM implementations, and additional transformer classifier configurations.

## Integration Methods

**Pipecat Integration:** The framework supports local inference via `LocalSmartTurnAnalyzerV3` (version 0.0.85 and later). On Pipecat Cloud's standard 1x instance, inference typically completes in around 65ms.

**Direct Integration:** Import `model.py` and `inference.py` from the Smart Turn repository and invoke the `predict_endpoint()` function. A reference implementation is available in `predict.py`.

## Training Infrastructure

Training code resides in `train.py` and downloads datasets from the pipecat-ai HuggingFace repository. Training can execute locally or via Modal using `train_modal.py`. Training sessions log to Weights & Biases unless disabled.

**Training command for Modal:**
```bash
modal run --detach train_modal.py
```

**Current Datasets:**
- pipecat-ai/smart-turn-data-v3.2-train
- pipecat-ai/smart-turn-data-v3.2-test

## Community Contributions

**Data Classification:** Help with manual training data categorization is needed at https://smart-turn-dataset.pipecat.ai/

**Human Data Contribution:** Participants can contribute through turn training games at https://turn-training.pipecat.ai/ or by following the data generation contribution guide.

**Future Development Areas:**
- Additional language support expansion
- Performance optimization and architecture refinement
- Expanded human dataset collection
- Text-conditioned inference for specialized input modes
- Training platform diversification

## Licensing and Attribution

Smart Turn is released under the BSD 2-clause license, permitting unrestricted usage, modification, and contribution.

**Project Contributors:**
- Marcus (marcus-daily)
- Eli (ebb351)
- Mark (markbackman)
- Kwindla (kwindla)

**Data Contributors:**
- Liva AI
- Midcentury
- MundoAI

## Resources

- [HuggingFace Model Repository](https://huggingface.co/pipecat-ai/smart-turn-v3)
- [Pipecat Documentation](https://docs.pipecat.ai/server/utilities/smart-turn/smart-turn-overview)
- [Pipecat Framework](https://pipecat.ai)
03-finetune-pipecat-pt/references/guides/pipecat_train_py.md
ADDED
---
title: "Pipecat Smart Turn - Training Code (train.py)"
source: https://raw.githubusercontent.com/pipecat-ai/smart-turn/main/train.py
date: 2026-03-16
---

# SmartTurn V3 Speech Endpointing Model -- Training Code Reference

> This document describes the training pipeline implemented in `train.py` from the
> [pipecat-ai/smart-turn](https://github.com/pipecat-ai/smart-turn) repository.
> The original file is Python source code; key architecture and configuration details
> are extracted below.

## Core Architecture

### SmartTurnV3Model

The model extends `WhisperPreTrainedModel` and uses Whisper's encoder as its backbone:

- **Input**: Log-mel spectrogram features with shape `(batch_size, 80, 800)`
- **Encoder**: Modified Whisper encoder with `max_source_positions=400`
- **Attention Pooling**: Neural network layer that learns weighted attention across time steps
- **Binary Classifier**: Multi-layer sequential network outputting probability scores

```
Input Features (batch, 80, 800)
        |
Whisper Encoder
        |
Attention Pooling [Linear -> Tanh -> Linear -> Softmax]
        |
Weighted Pooling (reduces to batch_size, hidden_size)
        |
Classifier [Linear -> LayerNorm -> GELU -> Dropout -> Linear -> GELU -> Linear]
        |
Sigmoid Output (probability of completion)
```
|
| 38 |
+
|
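The attention-pooling stage above can be sketched in a few lines of NumPy. The function name, weight shapes, and the attention dimension `A` are illustrative, not taken from `train.py`; only the Linear -> Tanh -> Linear -> Softmax -> weighted-sum structure comes from the diagram.

```python
import numpy as np

def attention_pool(hidden, w1, b1, w2, b2):
    """Collapse encoder outputs (T, H) into a single (H,) vector.

    scores = softmax(Linear(tanh(Linear(hidden)))) over the time axis,
    then a score-weighted sum of the frames (the "weighted pooling" step).
    """
    u = np.tanh(hidden @ w1 + b1)          # (T, A) attention space
    scores = u @ w2 + b2                   # (T, 1) unnormalized weights
    scores = np.exp(scores - scores.max())
    weights = scores / scores.sum()        # softmax over time steps
    return (weights * hidden).sum(axis=0)  # (H,) pooled representation

rng = np.random.default_rng(0)
T, H, A = 400, 384, 128                    # Whisper-tiny hidden size is 384
hidden = rng.normal(size=(T, H))
pooled = attention_pool(
    hidden,
    rng.normal(size=(H, A)), np.zeros(A),
    rng.normal(size=(A, 1)), np.zeros(1),
)
print(pooled.shape)  # (384,)
```

The pooled vector then feeds the classifier head, which ends in a single logit per sample.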
### Loss Function

Uses `BCEWithLogitsLoss` with dynamic positive sample weighting, clamped between 0.1 and 10.0 to handle class imbalance.
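A minimal NumPy sketch of this weighting scheme. Using the batch's negative/positive ratio as the weight is an assumption; `train.py` only documents that the weight is dynamic and clamped to [0.1, 10.0].

```python
import numpy as np

def bce_with_logits_pos_weight(logits, labels):
    """BCE-with-logits with a per-batch positive-class weight.

    ASSUMPTION: pos_weight = n_neg / n_pos, clamped to [0.1, 10.0],
    mirroring the clamp range described in the text.
    """
    labels = np.asarray(labels, dtype=np.float64)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    pos_weight = np.clip(n_neg / max(n_pos, 1.0), 0.1, 10.0)
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    eps = 1e-12
    loss = -(pos_weight * labels * np.log(p + eps)
             + (1 - labels) * np.log(1 - p + eps))
    return loss.mean()

logits = np.array([2.0, -1.0, 0.5, -3.0])
labels = np.array([1.0, 0.0, 1.0, 0.0])  # balanced batch -> pos_weight = 1.0
print(round(bce_with_logits_pos_weight(logits, labels), 4))  # 0.2407
```

With a balanced batch the clamp is inert; on a heavily negative batch the positive term is boosted up to 10x.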
## Training Configuration

```
Base Model: openai/whisper-tiny
Learning Rate: 5e-5
Epochs: 4
Train Batch Size: 384
Eval Batch Size: 128
Warmup Ratio: 0.2
Weight Decay: 0.01
Eval/Save Steps: 500
Logging Steps: 100
LR Scheduler: Cosine
Dataloader Workers: 6
```
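These values map directly onto HuggingFace `TrainingArguments` keyword names. The sketch below groups them as a plain dict and works out how `warmup_ratio` translates into warmup steps; the dataset size is hypothetical, purely to show the arithmetic.

```python
import math

# Hyperparameters from the block above, written as the kwargs they would
# map to in transformers.TrainingArguments (keyword names per that API).
training_args = dict(
    learning_rate=5e-5,
    num_train_epochs=4,
    per_device_train_batch_size=384,
    per_device_eval_batch_size=128,
    warmup_ratio=0.2,
    weight_decay=0.01,
    eval_steps=500,
    save_steps=500,
    logging_steps=100,
    lr_scheduler_type="cosine",
    dataloader_num_workers=6,
)

# warmup_ratio is a fraction of total optimizer steps.
# n_train_samples is a hypothetical figure, not from the dataset card.
n_train_samples = 300_000
steps_per_epoch = math.ceil(
    n_train_samples / training_args["per_device_train_batch_size"]
)
total_steps = steps_per_epoch * training_args["num_train_epochs"]
warmup_steps = int(training_args["warmup_ratio"] * total_steps)
print(total_steps, warmup_steps)  # 3128 625
```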
## Datasets

**Training Sources**:
- `pipecat-ai/smart-turn-data-v3.2-train` (split 90/10 for train/eval)

**Test Sources**:
- `pipecat-ai/smart-turn-data-v3.2-test`

The system supports stratified analysis by language, midfiller presence, and endfiller presence.

## Data Pipeline

### OnDemandSmartTurnDataset

On-demand feature extraction pipeline:
- Audio truncation to last 8 seconds
- 16kHz sampling rate
- Padding to max length: `8 * 16000 = 128,000 samples`
- Normalization enabled

### SmartTurnDataCollator

Batches samples while preserving metadata (language, midfiller, endfiller flags) for downstream analysis.

## Export and Quantization

### ONNX FP32 Export

The wrapper ensures consistent output shapes across variable batch sizes:
- Dynamic batch dimension
- Fixed output shape: `(batch_size, 1)` for compatibility
- ONNX opset version 18
- Validation with batch sizes 1 and 2

### INT8 Static Quantization

**Configuration**:
- Calibration dataset size: 1024 samples
- Quantization format: QDQ (Quantize-Dequantize)
- Activation type: QUInt8
- Weight type: QInt8
- Per-channel quantization enabled
- Calibration method: Entropy
- Quantized operations: Conv, MatMul, Gemm

Process:
1. `quant_pre_process` for optimization and shape inference
2. Entropy-based calibration on training data subset
3. Static quantization with calibration reader

## Evaluation Metrics

Per-sample metrics computed during training and evaluation:

- Accuracy
- Precision (with zero_division warning handling)
- Recall
- F1 Score
- Confusion matrix components (TP, FP, TN, FN)
- Predicted positive/negative counts

### External Evaluation Callback

During training checkpoints, the system evaluates on all test splits and logs:
- Dataset-specific accuracies
- Language-stratified metrics
- Midfiller-stratified metrics
- Probability distributions (Weights & Biases histograms)
- Min/max/mean accuracy across categories
- Sample count distributions

## Key Utility Functions

- **`truncate_audio_to_last_n_seconds`**: Crops audio to preserve final n seconds (prevents padding inflation)
- **`process_predictions`**: Converts logits to probabilities and binary predictions using 0.5 threshold, with NaN validation
- **`compute_metrics`**: Scikit-learn based metric computation with detailed confusion matrix breakdown
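Illustrative reconstructions of the first two utilities, assuming signatures that mirror the descriptions above rather than the exact source:

```python
import numpy as np

def truncate_audio_to_last_n_seconds(audio, n_seconds=8, sample_rate=16_000):
    """Keep only the final n seconds so padding never dominates the window."""
    max_samples = n_seconds * sample_rate  # 8 * 16000 = 128,000
    return audio[-max_samples:]

def process_predictions(logits, threshold=0.5):
    """Logits -> probabilities -> binary predictions, rejecting NaN/Inf."""
    logits = np.asarray(logits, dtype=np.float64)
    if not np.all(np.isfinite(logits)):
        raise ValueError("non-finite values in model outputs")
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    preds = (probs > threshold).astype(int)
    return probs, preds

audio = np.zeros(10 * 16_000)  # 10 s of audio
print(len(truncate_audio_to_last_n_seconds(audio)))  # 128000
probs, preds = process_predictions([1.5, -0.3])
print(preds.tolist())  # [1, 0]
```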
## Workflows

### Training Run (`do_training_run`)

1. Initialize Weights & Biases project "speech-endpointing"
2. Load pretrained Whisper-tiny with custom head
3. Prepare datasets (load, split, wrap)
4. Train with specified hyperparameters
5. Save final model and feature extractor
6. Export to ONNX FP32
7. Return export path

### Quantization Run (`do_quantization_run`)

1. Load FP32 ONNX model
2. Create calibration dataset from training split
3. Apply pre-processing (optimization)
4. Execute static INT8 quantization
5. Return quantized model path

### Benchmark Run (`do_benchmark_run`)

1. Load feature extractor
2. Prepare test merged dataset
3. For each model path:
   - Create benchmark output directory
   - Run benchmark with batch size 256
   - Generate markdown report

## Dependencies

- PyTorch with ONNX export capabilities
- Transformers (HuggingFace) for Whisper and training
- ONNX Runtime with quantization support
- Scikit-learn for metrics
- Weights & Biases for experiment tracking
- torchcodec (runtime requirement for audio decoding)

## Error Handling

- Fast-fail check for missing `torchcodec` dependency
- ONNX model validation after export
- Non-finite value detection in predictions
- Exception logging for failed exports
03-finetune-pipecat-pt/references/guides/vogent_turn_80m.md
ADDED
---
title: "Vogent-Turn-80M Model Card"
source: https://huggingface.co/vogent/Vogent-Turn-80M
date: 2026-03-16
---

# Vogent-Turn-80M

A state-of-the-art multimodal turn detection model for voice AI systems, achieving **94.1% accuracy** by combining acoustic and linguistic signals for real-time conversational applications.

## Overview

| Attribute | Value |
|-----------|-------|
| **Developed by** | Vogent AI |
| **Model type** | Multimodal Turn Detection (Binary Classification) |
| **Language(s)** | English |
| **License** | Modified Apache-2.0 (horizontal voice agent platforms cannot set as default) |
| **Base model** | SmolLM2-135M (reduced to 80M parameters using first 12 layers) |
| **Model size** | 79.2M parameters (F32) |
| **Inference speed** | ~7ms on T4 GPU |

## Model Description

Vogent-Turn-80M determines when a speaker has finished their turn in a conversation by processing both:
- **Acoustic features** (via Whisper encoder)
- **Semantic context** (via language model)

This multimodal approach enables real-time, accurate predictions where audio-only or text-only methods fail.

### Resources

- **GitHub Repository:** https://github.com/vogent/vogent-turn
- **Technical Report:** https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents
- **Demo Space:** https://huggingface.co/spaces/vogent/vogent-turn-demo

## Quick Install

```bash
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn
pip install -e .
```

## Basic Usage

```python
from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request

# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)

# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")

# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)

print(f"Turn complete: {result['is_endpoint']}")
print(f"Done speaking probability: {result['prob_endpoint']:.1%}")
```

## Available Interfaces

- **Python Library:** Direct integration with `TurnDetector` class
- **CLI Tool:** `vogent-turn-predict speech.wav --prev "What is your phone number" --curr "My number is 804"`

## Technical Architecture

### Model Architecture

**Components:**
1. **Audio Encoder:** Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
2. **Text Model:** SmolLM-135M (12 layers, ~80M parameters)
3. **Multimodal Fusion:** Audio embeddings projected into the LLM's input space
4. **Classifier:** Binary classification head (turn complete/incomplete)

### Processing Flow

```
1. Audio (16kHz PCM) -> Whisper Encoder -> Audio Embeddings (~400 tokens)
2. Text Context -> SmolLM Tokenizer -> Text Embeddings
3. Concatenate embeddings -> SmolLM Transformer -> Last token hidden state
4. Linear Classifier -> Softmax -> [P(continue), P(endpoint)]
```
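The four steps can be sketched with NumPy and dummy tensors. The transformer itself is stubbed out (we jump straight to a "last token hidden state"); the 576-dimensional hidden size matches SmolLM-135M, but all shapes and weights here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 576                                    # SmolLM-135M hidden size

# Step 1: audio embeddings (already projected into the LLM input space).
audio_tokens = rng.normal(size=(400, d))   # ~400 tokens for 8 s of audio
# Step 2: tokenized text context (arbitrary length here).
text_tokens = rng.normal(size=(12, d))

# Step 3: concatenate along the sequence axis; a real forward pass would
# run the SmolLM transformer here -- we just take the final position.
sequence = np.concatenate([audio_tokens, text_tokens], axis=0)
last_hidden = sequence[-1]

# Step 4: linear head + softmax over [P(continue), P(endpoint)].
W, b = rng.normal(size=(d, 2)), np.zeros(2)
logits = last_hidden @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(sequence.shape)  # (412, 576)
```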
## Training Details

### Preprocessing

- **Audio:** Last 8 seconds extracted via Whisper-Tiny encoder -> ~400 audio tokens
- **Text:** Full conversational context (assistant and user utterances)
- **Labels:** Binary classification (turn complete/incomplete)
- **Fusion:** Audio embeddings projected into the LLM's input space and concatenated with text

### Training Hyperparameters

- **Training regime:** fp16 mixed precision
- **Base model:** SmolLM2-135M (first 12 layers)
- **Architecture:** Reduced from 135M to ~80M parameters via layer ablation

### Training Data

Diverse dataset combining:
- Human-collected conversational data
- Synthetic conversational data

## Evaluation Results

### Performance Metrics

- **Accuracy:** 94.1%
- **AUPRC:** 0.975

### Testing Data

Internal test set covering diverse conversational scenarios and edge cases where audio-only or text-only approaches fail.

## Compute Infrastructure

### Hardware Optimization

- `torch.compile` with max-autotune mode
- Dynamic tensor shapes without recompilation
- Pre-warmed bucket sizes (64, 128, 256, 512, 1024)

### Software

- **Framework:** PyTorch with torch.compile
- **Audio processing:** Whisper encoder (up to 8 seconds)

## Limitations

- **English-only** support; turn-taking conventions vary across languages and cultures
- **CPU inference** may be too slow for some real-time applications

## Citation

```bibtex
@misc{voturn2025,
  title={Vogent-Turn-80M: State-of-the-Art Turn Detection for Voice Agents},
  author={Varadarajan, Vignesh and Vytheeswaran, Jagath},
  year={2025},
  publisher={Vogent AI},
  howpublished={\url{https://huggingface.co/vogent/Vogent-Turn-80M}},
  note={Blog: \url{https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents}}
}
```

## Additional Resources

- **Full Documentation & Code:** https://github.com/vogent/vogent-turn
- **Platform:** https://vogent.ai
- **Enterprise Contact:** j@vogent.ai
- **Issues:** https://github.com/vogent/vogent-turn/issues

## Upcoming Releases

- Int8 quantized model for faster CPU deployment
- Multilingual versions
- Domain-specific adaptations
03-finetune-pipecat-pt/references/hesitation_turn_taking_l2_review.md
ADDED
# Hesitation Pauses, End-of-Turn Detection, and L2 Speech: A Review for Fine-Tuning in Portuguese

**Context**: BabelCast, simultaneous translation in meetings. The end-of-turn detection model must distinguish hesitation pauses (the speaker will continue) from genuine end of turn (translation can begin). The challenge is especially acute for French speakers learning Portuguese, who produce long hesitation pauses that are frequently mistaken for end of turn.

**Authors**: Marcos Remar, with assistance from Claude (Anthropic)
**Date**: March 2026

---

## 1. The Problem: Silence Does Not Mean End of Turn

Commercial conversational AI systems use silence thresholds between 700ms and 1000ms to detect end of turn (Castillo-Lopez, de Chalendar & Semmar, 2025). This approach is fundamentally inadequate for two reasons:

1. **Humans are faster**: the average gap between turns in natural conversation is only ~200ms (Levinson & Torreira, 2015, cited in Skantze, 2021). Waiting 700ms+ produces responses that are perceived as slow.

2. **Within-turn pauses are frequently longer than between-turn gaps** (Skantze, 2021). A speaker may pause for 2-3 seconds mid-sentence (searching for a word or planning the next stretch) and then continue normally. Treating that pause as end of turn causes an interruption.

The problem worsens dramatically when the speaker is using a second language (L2), as the literature below demonstrates.

---
## 2. Hesitation Pauses in L2 Speakers

### 2.1 Duration: L2 speakers pause 39% longer than natives

Kosmala & Crible (2022) analyzed filled pauses (FPs) in native and non-native speakers of French, using the SITAF corpus:

| Metric | Natives | Non-natives | Difference |
|--------|---------|-------------|------------|
| Mean FP duration | 378ms (SD=200) | **524ms** (SD=222) | +146ms (+39%) |
| FP rate per 100 words | 4.4 | 5.3 | +0.9 (n.s.) |
| Clustering with other fluencemes | 72% | **82%** | +10 p.p. |

The difference in **frequency** is not significant: non-natives do not produce more pauses, but their pauses are **significantly longer** (p < .001). Duration is a more reliable indicator of proficiency than frequency.

In self-repair situations, pauses are even more extreme. Kosmala (2025) analyzed 167 repairs by French learners speaking English and found:

- Mean silent pause in the editing phase: **844ms** (SD=573ms)
- Mean filled pause in the editing phase: **522ms** (SD=571ms)
- 82% of self-repairs contain an editing phase (pause and/or filler)

Those 844ms are **far above** any silence threshold used in commercial systems (700ms).

### 2.2 Pause types: silent pauses dominate

Cenoz (2000) studied 15 native Spanish speakers speaking English as an L2, analyzing 1,085 non-juncture pauses:

| Type | Proportion | Duration range |
|------|------------|----------------|
| Silent | **64%** | 205ms to 11,569ms |
| Filled | 36% | — |

Duration distribution of the silent pauses:
- 70% between 200ms and 1,000ms
- 21% between 1,001ms and 2,000ms
- 7% between 2,001ms and 4,000ms
- 2% above 4,000ms

**Critical implication**: most hesitations are **pure silence**; they contain no fillers such as "euh" or "hum" that the model could use as a cue. This makes detection harder, because the model must distinguish hesitation silence from end-of-turn silence using prosody alone.

### 2.3 Pause function: planning vs. lexical search

Cenoz (2000) classified pause functions:

| Function | Silent pauses | Filled pauses |
|----------|---------------|---------------|
| Planning (phrases, syntax) | 59% | **73%** |
| Lexical search (words) | 36% | 26% |
| Other | 5% | 1% |

Filled pauses are used primarily to **hold the turn** (floor-holding): 77% occur in isolation, without other hesitation markers. Silent pauses, by contrast, co-occur with other repair strategies 54% of the time.

### 2.4 The proficiency paradox

A counterintuitive finding: **more proficient** speakers produce **more** filled pauses, not fewer (Cenoz, 2000). In the highest proficiency band, the proportion shifts to 53% silent + 46% filled (vs. 69%/31% among the least proficient). This happens because advanced speakers have learned to use fillers as a floor-holding strategy, signaling to the interlocutor that they are still speaking.

Individual variation is enormous: the proportion of filled pauses ranges from **4% to 74.5%** across individual speakers.

---
## 3. French Fillers in L2 Speech

### 3.1 Linguistic transfer

French speakers transfer their native fillers into the L2. Lo (2018) analyzed 15 German-French bilinguals and showed that the acoustic properties of fillers are language-specific:

- **French "euh"**: longer duration, higher F1-F3 (central schwa vowel)
- **German "ah"**: shorter duration, lower F1-F3 (open vowel)

Bilinguals maintain phonetically distinct forms for each language, indicating that **the form of the filler reveals which language is being processed**. This is confirmed by Kosmala (2025): French learners speaking English use "euh" (24 instances) and "eum" (12 instances), the French fillers, not the English ones.

### 3.2 Fillers are lexical items, not universal sounds

Bottcher & Zellers (2024) analyzed the RUEG corpus (736 speakers, 5 languages, 4,468 narrations) and demonstrated that fillers are **language-specific lexical items**, not universal hesitation sounds. The nasal ("uhm") vs. non-nasal ("uh") proportion varies by language, gender, and age.

Heritage speakers (bilingual since childhood) produce **more fillers overall** due to the cognitive load of keeping two languages active. **Silence tolerance** also transfers from the L1: Japanese speakers retain a higher tolerance for silence even when speaking English.

### 3.3 Code-switching: 93% at unit boundaries

Beatty-Martinez, Navarro-Torres & Dussias (2020) analyzed 10 Spanish-English bilinguals and found that **93% of code switches occur at intonational unit boundaries**, the same points where turn transitions happen. Before a switch there is **prosodic reorganization** (a change in speech rate), which can be mistaken for prosodic end-of-turn markers.

**Implication for the model**: when a French speaker of Portuguese switches to a French word ("Eu preciso de... *comment dit-on*... uma tesoura"), the prosodic change before the switch does NOT indicate end of turn.

---
## 4. Prosodic Cues of End of Turn

### 4.1 The F0 contour is the key signal

Ekstedt & Skantze (2022) ran prosodic perturbation experiments with the VAP model to isolate the relative importance of each cue:

| Prosodic cue | Importance |
|--------------|------------|
| Phonetic/spectral information | Most important overall |
| **F0 (pitch) contour** | **Most important for disambiguating syntactically complete points** |
| Intensity | Comparable to F0 overall |
| Duration normalization | Least important individually |

The central finding: **F0 is the main disambiguator at points where syntax alone could signal completion**. A rising F0 plus a lengthened final syllable signals turn completion. The VAP model can predict turn shifts **before** the utterance is complete, from prosodic dynamics alone.

### 4.2 Prosody before filled pauses

Wu, Didirkova & Simon (2023) analyzed the LOCAS-F corpus (2h38m, >1,000 FPs, inter-annotator kappa = 0.86) and found specific prosodic patterns before fillers:

- F0 drop of **1-2 semitones** during filled pauses (17.90 vs. 19.01 ST)
- **Significantly negative** melodic reset before the FP (pitch discontinuity)
- Preparation effect: a drop of **4.66 ST** between unprepared (18.89 ST) and prepared speech (14.24 ST)
- **No significant reset** between the FP and the following syllable

**Implication**: the pitch drop before an "euh" filler differs from a sentence-final pitch drop. The model must learn this subtle distinction, which the Whisper encoder can capture through spectral representations.

### 4.3 Summary of prosodic turn cues

Skantze (2021) compiled the literature:

| Cue | End of turn | Hesitation pause |
|-----|-------------|------------------|
| Pitch (F0) | Fall (declarative) or fall-rise | Suspended or smaller fall |
| Intensity | Decreases gradually | Holds steady or drops abruptly |
| Speech rate | Slows before the end | May speed up before stopping |
| Syllable lengthening | Final syllable lengthened | Absent |
| Syntactic completeness | Syntactically complete phrase | Incomplete phrase |

---
## 5. State of the Art in End-of-Turn Detection

### 5.1 Models and their accuracies

| Model | Type | Params | Size | Latency | Accuracy | Languages |
|-------|------|--------|------|---------|----------|-----------|
| **Pipecat Smart Turn v3.1** | Audio (Whisper Tiny) | 8M | 8 MB (INT8) | 12ms CPU | 94.7% (EN) | 23 |
| **LiveKit v0.4.1** | Text (Qwen2.5-0.5B) | 500M | — | — | TNR 87.4% (PT) | 15+ |
| **Vogent-Turn-80M** | Multimodal | 79.2M | — | 7ms T4 | 94.1% (EN) | 1 |
| **Krisp TT v1** | Audio | 6.1M | 65 MB | — | 82% bal. acc. | 1 |
| **VAP** | Audio (CPC+Transformer) | — | — | 14.6ms CPU | 76.2% bal. acc. | 3 |
| **SpeculativeETD** | Audio (GRU+Wav2vec) | 1M+94M | — | 0.26ms+server | 66% real | 1 |

None of these models was specifically optimized for Portuguese or for L2 speech.

### 5.2 Pipecat Smart Turn: results by language

| Language | Accuracy | FP | FN |
|----------|----------|-----|-----|
| Turkish | 97.10% | 1.66% | 1.24% |
| Korean | 96.85% | 1.12% | 2.02% |
| Japanese | 96.76% | 2.04% | 1.20% |
| French | 96.01% | 1.60% | 2.39% |
| **Portuguese** | **95.42%** | **2.79%** | **1.79%** |
| English | 94.31% | 2.64% | 3.06% |
| Spanish | 91.97% | 4.48% | 3.55% |

Portuguese trails French, Japanese, Korean, and Turkish. The false-positive rate (2.79%) means the model says "finished" when the speaker has not finished in nearly 3% of cases, which is problematic for simultaneous translation.

### 5.3 The synthetic → real gap: the biggest risk

Ok, Yoo & Lee (2025) demonstrated a devastating collapse when models trained on synthetic (TTS) data are evaluated on real speech:

| Model | Synthetic F1 | Real F1 | Loss |
|-------|--------------|---------|------|
| Wav2vec 2.0 | 94.7% | **30.3%** | -64.4 p.p. |
| SpeculativeETD | 88.9 IoU | **16.4 IoU** | -72.5 p.p. |
| VAP | 87.7 IoU | **10.7 IoU** | -77.0 p.p. |

**All models collapse** when moving from synthetic data to real conversation. The improvement from Pipecat v3.0 to v3.1 came precisely from adding **real human audio** from three specialized partners (Liva AI, Midcentury, MundoAI).

---
## 6. Validated Training Techniques

### 6.1 Focal Loss (Lin et al., 2017)

Focal Loss addresses extreme class imbalance (most frames are "speech in progress"; end of turn is rare):

```
FL(p_t) = -α_t · (1 - p_t)^γ · log(p_t)
```

With gamma=2, the loss for well-classified examples (p_t=0.9) is **100x smaller** than standard cross-entropy; with p_t=0.968, **~1,000x smaller**. The best hyperparameters: **gamma=2, alpha=0.25** (Lin et al., 2017, Table 1a).
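A quick sketch verifying the down-weighting factors quoted above: with gamma=2 the ratio of alpha-balanced cross-entropy to focal loss is exactly 1/(1 - p_t)^2, which gives 100x at p_t=0.9 and ~977x (roughly the quoted 1,000x) at p_t=0.968.

```python
import math

def focal_loss(p_t, gamma=2.0, alpha=0.25):
    """FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t)  (Lin et al., 2017)."""
    return -alpha * (1 - p_t) ** gamma * math.log(p_t)

def balanced_ce(p_t, alpha=0.25):
    """Alpha-balanced cross-entropy, the term focal loss down-weights."""
    return -alpha * math.log(p_t)

# Down-weighting factor for well-classified examples with gamma = 2:
for p_t in (0.9, 0.968):
    print(p_t, round(balanced_ce(p_t) / focal_loss(p_t)))
# 0.9 100
# 0.968 977
```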
| 201 |
+
### 6.2 Knowledge Distillation (LiveKit)
|
| 202 |
+
|
| 203 |
+
O LiveKit treinou um modelo professor (Qwen2.5-7B) e destilou para um aluno (Qwen2.5-0.5B), que converge apos ~1.500 steps. O resultado: reducao de 39% nos falsos positivos de interrupcao. Para portugues especificamente, a melhoria relativa foi de **45.97%** — a segunda maior entre todas as linguas.
|
| 204 |
+
|
| 205 |
+
### 6.3 Dados curtos e ruidosos (Pipecat v3.2)
|
| 206 |
+
|
| 207 |
+
O Pipecat v3.2 adicionou dois tipos de dados ao treinamento:
|
| 208 |
+
1. **Respostas curtas** ("yes", "no", "ok"): reduziu erros de classificacao em **40%**
|
| 209 |
+
2. **Ruido de fundo** (cafe, escritorio, CC-0 Freesound): melhorou robustez

### 6.4 Filler and pause injection (SpeculativeETD)

Ok et al. (2025) tested three synthetic-data variants:
- V1: direct TTS (baseline)
- V2: hesitation pauses extended to 1.5-3.0 seconds
- **V3: fillers ("um", "uh") injected at random positions + pauses after them**

V3 was the best variant, validating the approach of generating data with LLM-inserted fillers.

### 6.5 Cross-lingual transfer learning

Castillo-Lopez et al. (2025) document that a VAP model pre-trained on English (Switchboard) and fine-tuned on Japanese **outperforms** training directly on Japanese. This validates the approach of starting from Pipecat (pre-trained on 23 languages) and fine-tuning for Portuguese.

---

## 7. Research: Turn-Taking Models for Language Learning (March 2026)

**Research date**: March 16, 2026
**Goal**: identify models and techniques specific to turn-taking in a language-learning context with a conversational avatar

### 7.1 Landscape: no dedicated model exists

There is no open-source turn-taking model trained specifically for L2 learners. Commercial products use generic approaches:

| Product | Approach | Limitation |
|---------|-----------|-----------|
| **Praktika** (OpenAI) | GPT-5.2 Realtime API, native OpenAI turn detection | Black box, no L2 adaptation |
| **ConversAR** (Meta Quest) | Fixed 2000ms timeout + "infinite thinking period" | Cannot distinguish hesitation from end of turn |
| **Gliglish** | ChatGPT + generic speech recognition | No specialized turn detection |
| **ELSA Speak** | Proprietary model trained on varied accents | Focused on pronunciation, not turn-taking |
| **TalkPal/SpeakPal/Talkio** | GPT + generic NLP | No audio model for end-of-turn |
| **Hume EVI** | Prosody (tone of voice) for turn detection | Commercial, no L2 focus |
| **Deepgram Flux** | Fused model (ASR + turn detection) with configurable `eot_threshold` | English only, no proficiency adaptation |

### 7.2 Relevant research models

- **VAP (Voice Activity Projection)**: multilingual (EN/ZH/JA); per-language fine-tuning improves results significantly, but it does not cover L2 or non-native speakers (Inoue et al., 2024)
- **Speak & Improve Corpus 2025**: 340 hours of L2 English speech with disfluency annotations and CEFR scores (A2-C1, mostly B1-B2) (Knill et al., 2025)
- **Whisper + LoRA for hesitation tagging**: fine-tuning Whisper Large V3 with acoustically precise hesitation tags: **11.3% WER improvement** for L2 speech (arXiv:2506.04076)
- **Deepgram Flux**: fused model (ASR + turn detection in a single model); ~260ms end-of-turn detection, ~30% fewer interruptions vs. a traditional pipeline (Deepgram, 2025)
- **IWSDS 2025 survey**: 72% of turn-taking papers do NOT compare against prior methods, suggesting a lack of established benchmarks (Castillo-Lopez et al., 2025)

### 7.3 Techniques identified to improve the fine-tuning

**1. Dual threshold (Eager + Final), inspired by Deepgram Flux**

Instead of a single threshold, use two:
- **Eager threshold (0.3-0.5)**: the avatar starts preparing a response speculatively (kicks off LLM generation)
- **Final threshold (0.7+)**: confirms the end of turn and speaks

Estimated latency savings: 150-250ms at response onset, without interrupting the speaker.

Deepgram Flux implements this as `eot_threshold` (default 0.7) and `eager_eot_threshold` (0.3-0.5), with an `eot_timeout_ms` parameter to configure sensitivity.
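
A sketch of how the two thresholds could drive the avatar's pipeline (the action names such as `start_llm_generation` are hypothetical stand-ins for the real handlers):

```python
EAGER_THRESHOLD = 0.4   # start speculative LLM generation
FINAL_THRESHOLD = 0.7   # commit: the user's turn is over

def on_eot_probability(p_eot: float, state: dict) -> str:
    """Map the model's end-of-turn probability to a pipeline action."""
    if p_eot >= FINAL_THRESHOLD:
        state["speculating"] = False
        return "speak"                  # deliver the (pre-generated) response
    if p_eot >= EAGER_THRESHOLD and not state.get("speculating"):
        state["speculating"] = True
        return "start_llm_generation"   # warm up the response speculatively
    if p_eot < EAGER_THRESHOLD and state.get("speculating"):
        state["speculating"] = False
        return "cancel_generation"      # the user kept talking: discard the draft
    return "wait"

state = {}
print(on_eot_probability(0.45, state))  # start_llm_generation
print(on_eot_probability(0.85, state))  # speak
```

The speculative draft is free when the turn really ends (latency saved) and merely wasted compute when it does not, which is why a lower eager threshold is affordable.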

**2. Asymmetric cost (FP >> FN)**

For a language tutor, **interrupting the learner is far worse than waiting too long**:
- ConversAR (2025): "infinite thinking period"; learners control when to speak
- Praktika: extended timeout for fragmented, accented L2 speech
- Implementation: `fp_penalty=2.0` in the focal loss; interruption errors cost 2x more
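
One way to encode this asymmetry is an extra weight on mistakes made on incomplete turns, since over-predicting "complete" there is what triggers interruptions. The `fp_penalty` name mirrors the parameter in the text, but the exact wiring below is an assumption:

```python
import math

def asymmetric_focal_loss(p: float, target: int,
                          gamma: float = 2.0, alpha: float = 0.25,
                          fp_penalty: float = 2.0) -> float:
    """Focal loss where errors on incomplete turns (target=0) cost more:
    predicting 'complete' too eagerly is what makes the avatar interrupt."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    loss = -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    if target == 0:            # a mistake here is a false positive (interruption)
        loss *= fp_penalty
    return loss

# A confident wrong 'complete' on an incomplete turn costs 2x the symmetric loss
base = asymmetric_focal_loss(0.9, 0, fp_penalty=1.0)
hard = asymmetric_focal_loss(0.9, 0, fp_penalty=2.0)
print(hard / base)  # 2.0
```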

**3. Real L2 data: the Speak & Improve Corpus**

The Speak & Improve 2025 corpus (Knill et al., 2025) has:
- 340 hours of L2 English audio
- Disfluency annotations (false starts, repetitions, hesitations)
- CEFR scores: B1 (18.3%), B1+ (25.3%), B2 (25.1%), B2+ (18.3%)
- Although it is L2 English, L2 hesitation patterns transfer across languages (Cenoz, 2000)

Fine-tuning Whisper with precise hesitation annotation (acoustically aligned "um"/"uh" tags) showed an 11.3% WER improvement over ignoring hesitations.

**4. Proficiency-aware turn-taking (future work)**

No system implements this yet, but it follows logically:
- An A1 learner pauses **3-5x more** than a B2 learner
- A B2 learner uses fillers as a floor-holding strategy (Cenoz, 2000)
- De Jong & Bosker: optimal threshold of 250ms for L2 pauses in Dutch
- Shea & Leonard (2019): **1000ms** threshold for L2 Spanish
- Future implementation: feed the CEFR level to the model as input, or adapt the threshold based on the speaker's profile

### 7.5 Backchannel system for a language-learning avatar

A critical problem identified: if the avatar waits 3 seconds in silence, the learner thinks the system froze and stops talking. The solution is a system of **backchannel signals** that shows the learner the avatar is listening, without taking the turn.

| Silence duration | Avatar action | Type |
|-------------------|----------------|------|
| 0-600ms | Nothing (normal) | — |
| 600ms-1.5s | Head nod, attentive gaze | Visual backchannel |
| 1.5s-3.0s | "Mhm...", "Continue...", "Uhum..." | Verbal backchannel |
| 3.0s+ | "Sem pressa, pode pensar...", "Prenez votre temps..." | Encouragement |

The system ties the model's dual threshold to the backchannels:
- **Eager threshold** reached → avatar gives a backchannel + starts speculative LLM generation
- **Final threshold** reached → avatar speaks the response

Presets per CEFR level:

| CEFR | Eager | Final | Visual BC | Verbal BC | Encouragement |
|------|-------|-------|-----------|-----------|---------------|
| A1 | 0.50 | **0.80** | 500ms | 1200ms | 2500ms |
| A2 | 0.45 | 0.75 | 550ms | 1300ms | 2800ms |
| B1 | 0.40 | 0.70 | 600ms | 1500ms | 3000ms |
| B2 | 0.35 | 0.65 | 600ms | 1800ms | 4000ms |
| C1 | 0.35 | **0.60** | 600ms | 2000ms | 5000ms |

The avatar is more patient with A1/A2 (final threshold 0.80 = it rarely interrupts) and more responsive with B2/C1 (0.60-0.65 = more natural conversation). Implemented in `06_inference.py`.
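
The presets above can be sketched as a simple lookup (values copied from the table; the field names are illustrative and not necessarily those used in `06_inference.py`):

```python
# (eager, final, visual_bc_ms, verbal_bc_ms, encourage_ms) per CEFR level
CEFR_PRESETS = {
    "A1": (0.50, 0.80, 500, 1200, 2500),
    "A2": (0.45, 0.75, 550, 1300, 2800),
    "B1": (0.40, 0.70, 600, 1500, 3000),
    "B2": (0.35, 0.65, 600, 1800, 4000),
    "C1": (0.35, 0.60, 600, 2000, 5000),
}

def backchannel_action(level: str, silence_ms: int) -> str:
    """Pick the backchannel tier for the current stretch of silence."""
    _, _, visual, verbal, encourage = CEFR_PRESETS[level]
    if silence_ms >= encourage:
        return "encourage"      # "Sem pressa, pode pensar..."
    if silence_ms >= verbal:
        return "verbal"         # "Mhm...", "Uhum..."
    if silence_ms >= visual:
        return "visual"         # head nod, attentive gaze
    return "none"

print(backchannel_action("A1", 1400))  # verbal (A1's verbal tier starts at 1200ms)
print(backchannel_action("C1", 1400))  # visual (C1's verbal tier starts at 2000ms)
```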

### 7.6 Research conclusion

The BabelCast work is confirmed as **pioneering**: no turn-taking model exists for L2 learners, let alone for francophones speaking Portuguese. The improvements implemented (asymmetric cost, dual threshold, real L2 data, CEFR-based backchannels) bring the model closer to the ideal behavior of a patient tutor.

---

## 8. Implications for the BabelCast Fine-Tuning

### 8.1 What the model needs to learn

Based on the literature, the following patterns are critical for the model to distinguish:

| Pattern | Typical duration | Label | Prosodic cue |
|--------|---------------|-------|----------------|
| True end of turn | 200-500ms silence | COMPLETE | Falling F0, falling intensity, lengthened final syllable |
| Native hesitation (PT-BR) | 378ms filled pause + 200-1000ms silence | INCOMPLETE | Suspended F0, "hum"/"tipo"/"ne" |
| L2 hesitation (francophone) | **524ms** filled pause + **844ms** silence | INCOMPLETE | Suspended F0, "euh"/"alors"/"comment dire" |
| L2 word search | 1000-3000ms silence | INCOMPLETE | Pure silence, no clear prosodic cue |
| Code-switching | prosodic reset + 500-1500ms silence | INCOMPLETE | Speed change before the switch |
| Complete short response | "Sim." + 200ms silence | COMPLETE | Falling F0, short duration |
| Incomplete short response | "Sim, mas..." + silence | INCOMPLETE | Suspended or rising F0 |

### 8.2 Data needed

1. **Real Portuguese audio** (CORAA MUPE 365h, NURC-SP 239h), to avoid the synthetic→real gap
2. **TTS with fillers**, Brazilian ("hum", "tipo", "ne") and French ("euh", "alors"), inserted by an LLM
3. **Long hesitation pauses** (1.5-3.0s) injected into incomplete samples
4. **Short responses** ("sim", "nao", "ok", "tá bom") with complete and incomplete variants
5. **French-Portuguese code-switching** at unit boundaries
6. **Real background noise** (cafe, office), CC-0 Freesound

### 8.3 Validated architectural choices

- **Whisper Tiny encoder** (39M params, 8s window, 384-dim): same architecture as Pipecat; captures prosody without decoding text
- **Attention pooling**: learns which frames near the silence are most informative (Ekstedt & Skantze, 2022)
- **Focal Loss** (gamma=2, alpha=0.25): implicit calibration without label smoothing (EMNLP 2022)
- **INT8 quantization**: 32MB → 8MB, 12ms on CPU (Pipecat deployment pipeline)
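
As a sketch, attention pooling over the encoder's frame outputs can be written as follows (NumPy, with a random scoring vector standing in for the learned one; shapes follow the Whisper Tiny figures above):

```python
import numpy as np

def attention_pool(frames: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Attention pooling over encoder frames.

    frames -- (T, D) encoder outputs (e.g. Whisper Tiny: D=384)
    w      -- (D,) scoring vector; learned in training, passed in here
    Returns a single (D,) vector: a weighted mean whose weights are learned,
    so frames near the trailing silence can dominate the summary.
    """
    scores = frames @ w                            # (T,) one relevance score per frame
    scores = scores - scores.max()                 # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ frames                        # (D,) convex combination of frames

rng = np.random.default_rng(0)
frames = rng.normal(size=(400, 384)).astype(np.float32)
w = rng.normal(size=384).astype(np.float32)
pooled = attention_pool(frames, w)
print(pooled.shape)  # (384,)
```

Unlike mean pooling, the classifier is not diluted by the many uninformative mid-utterance frames.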

### 8.4 Gap in the literature

**No paper was found specifically on end-of-turn detection in L2 speech**, and certainly none on francophones speaking Portuguese. The intersection of:
- End-of-turn detection (ML/audio)
- Hesitation pauses in L2 (linguistics)
- Francophone speech in Portuguese (phonology)

...is a completely unexplored area. The BabelCast work is, to our knowledge, the first to tackle this specific problem.

---

## References

### Academic Papers

1. Beatty-Martinez, A. L., Navarro-Torres, C. A., & Dussias, P. E. (2020). Codeswitching: A bilingual toolkit for opportunistic speech planning. *Frontiers in Psychology*, 11, 1699.

2. Bottcher, A. & Zellers, M. (2024). Do you say uh or uhm? A cross-linguistic approach to filler particle use in heritage and majority speakers across three languages. *Frontiers in Psychology*, 15, 1358182.

3. Castillo-Lopez, V., de Chalendar, G., & Semmar, N. (2025). A survey of recent advances on turn-taking modeling in spoken dialogue systems. *arXiv:2503.xxxxx*.

4. Cenoz, J. (2000). Pauses and hesitation phenomena in second language production. *ITL - International Journal of Applied Linguistics*, 127-128, 53-69.

5. Christodoulides, G. & Avanzi, M. (2014). DisMo: A morphosyntactic, disfluency and multi-word unit annotator. *Proceedings of LREC 2014*.

6. Christodoulides, G. & Avanzi, M. (2015). Automatic detection and annotation of disfluencies in spoken French corpora. *Proceedings of Interspeech 2015*.

7. Ekstedt, E. & Skantze, G. (2022). How much does prosody help turn-taking? Investigations using voice activity projection models. *Proceedings of Interspeech 2022*.

8. Inoue, K., Jiang, D., Ekstedt, E., Kawahara, T., & Skantze, G. (2024). Real-time and continuous turn-taking prediction using voice activity projection. *arXiv:2401.04868*.

9. Inoue, K., Jiang, D., Ekstedt, E., Kawahara, T., & Skantze, G. (2024). Multilingual turn-taking prediction using voice activity projection. *Proceedings of LREC-COLING 2024*.

10. Kosmala, L. & Crible, L. (2022). The dual status of filled pauses: Evidence from genre, proficiency and co-occurrence. *Language and Speech*, 65(4), 1-25.

11. Kosmala, L. (2023). Exploring the status of filled pauses as pragmatic markers: The role of gaze and gesture. *Journal of Pragmatics*, 212, 1-15.

12. Kosmala, L. (2025). Multimodal self- and other-initiated repairs in L2 peer interactions. *Proceedings of DiSS 2025*, Lisbon.

13. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense object detection. *Proceedings of ICCV 2017*. arXiv:1708.02002.

14. Lo, S. L. (2018). Between ah(m) and euh(m): Filled pauses in German-French bilinguals. *BAAP 2018 Poster Presentation*.

15. Ok, J., Yoo, I. C., & Lee, Y. (2025). Speculative end-turn detector: Revisiting an efficient design for low-latency end-of-turn detection. *arXiv:2503.23439*.

16. Peters, M. (2017). L2 fluency development in French. *Ph.D. Thesis*.

17. Raux, A. & Eskenazi, M. (2009). A finite-state turn-taking model for spoken dialog systems. *Proceedings of NAACL-HLT 2009*.

18. Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: A review. *Computer Speech & Language*, 67, 101178.

19. Wu, X., Didirkova, I., & Simon, A. C. (2023). Disfluencies in continuous speech in French: Prosodic parameters of filled pauses and vowel lengthening. *Proceedings of ICPhS 2023*.

20. Knill, K., et al. (2025). Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback. *arXiv:2412.11986*.

21. Saeki, T., et al. (2025). Acoustically precise hesitation tagging is essential for end-to-end verbatim transcription systems. *arXiv:2506.04076*.

22. Gamboa, H. & Wohlgenannt, G. (2025). ConversAR: Practicing a second language without fear — Mixed reality agents for interactive group conversation. *arXiv:2510.08227*.

23. De Jong, N. & Bosker, H. R. (2013). Choosing a threshold for silent pauses to measure second language fluency. *The 6th Workshop on Disfluency in Spontaneous Speech (DiSS)*.

24. Shea, C. & Leonard, K. (2019). Evaluating measures of pausing for second language fluency research. *Journal of Second Language Pronunciation*, 5(2), 254-277.

### Models and Technical Blogs

25. Daily/Pipecat. (2025). Announcing Smart Turn v3, with CPU inference in just 12ms. https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/

26. Daily/Pipecat. (2025). Improved accuracy in Smart Turn v3.1. https://www.daily.co/blog/improved-accuracy-in-smart-turn-v3-1/

27. Daily/Pipecat. (2026). Smart Turn v3.2: Handling noisy environments and short responses. https://www.daily.co/blog/smart-turn-v3-2-handling-noisy-environments-and-short-responses/

28. LiveKit. (2025). Improved end-of-turn model cuts voice AI interruptions 39%. https://livekit.com/blog/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/

29. LiveKit. (2025). Using a transformer to improve end-of-turn detection. https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection

30. Krisp. (2025). Audio-only 6M weights turn-taking model for voice AI agents. https://krisp.ai/blog/turn-taking-for-voice-ai/

31. Deepgram. (2025). Evaluating end-of-turn detection models. https://deepgram.com/learn/evaluating-end-of-turn-detection-models

32. Vogent. (2025). Vogent-Turn-80M model card. https://huggingface.co/vogent/Vogent-Turn-80M

33. Deepgram. (2025). Introducing Flux: Conversational Speech Recognition. https://deepgram.com/learn/introducing-flux-conversational-speech-recognition

34. Hume AI. (2025). Empathic Voice Interface (EVI) documentation. https://dev.hume.ai/docs/speech-to-speech-evi/overview

35. OpenAI. (2025). Inside Praktika's conversational approach to language learning. https://openai.com/index/praktika/

36. Tavus. (2025). The Complete Guide to AI Turn-Taking. https://www.tavus.io/post/ai-turn-taking

### Datasets

37. Pipecat-AI. (2026). Smart Turn Data v3.2 Training Set. HuggingFace: `pipecat-ai/smart-turn-data-v3.2-train`. 270,946 samples, 23 languages.

38. NILC-NLP. CORAA-MUPE-ASR. HuggingFace: `nilc-nlp/CORAA-MUPE-ASR`. 365h, Portuguese interviews.

39. NILC-NLP. CORAA NURC-SP Audio Corpus. HuggingFace: `nilc-nlp/CORAA-NURC-SP-Audio-Corpus`. 239h, Portuguese dialogues.

40. Cambridge English. Speak & Improve Corpus 2025. 340h, L2 English learner speech with disfluency annotations and CEFR scores.
03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.md
ADDED
---
title: "A Finite-State Turn-Taking Model for Spoken Dialog Systems"
authors:
- Antoine Raux
- Maxine Eskenazi
year: 2009
source: https://aclanthology.org/N09-1071/
date_converted: 2026-03-16
---

## Abstract

This paper introduces the Finite-State Turn-Taking Machine (FSTTM), a new model to control the turn-taking behavior of conversational agents. Based on a non-deterministic finite-state machine, the FSTTM uses a cost matrix and decision-theoretic principles to select a turn-taking action at any time. The authors show how the model can be applied to the problem of end-of-turn detection. Evaluation results on a deployed spoken dialog system show that the FSTTM provides significantly higher responsiveness than previous approaches.

**Note**: The PDF file `finite_state_turn_taking_raux_2009.pdf` in this directory is corrupted (contains an unrelated physics paper). Content for this summary was sourced from the actual paper PDF downloaded from ACL Anthology.

## Key Contributions

1. **FSTTM model**: A principled, unified framework for turn-taking control based on a 6-state non-deterministic finite-state machine with decision-theoretic action selection.
2. **Data-driven optimization**: Unlike prior hand-coded models, the FSTTM's cost parameters can be optimized from data using logistic regression.
3. **Anytime endpointing**: The model can make end-of-turn decisions both during pauses and during speech (before a pause is even detected by VAD), enabling faster response times.
4. **Deployed evaluation**: Tested on a real, publicly deployed bus information system, not just in lab conditions.

## Architecture / Method Details

### Six-State Finite-State Model

Based on Jaffe & Feldstein (1970) and Brady (1969), the FSTTM uses six states representing who holds the conversational floor:

| State | Description |
|-------|-------------|
| **USER** | User has the floor (obligation or intention to speak) |
| **SYSTEM** | System has the floor |
| **FREES** | Floor is free, following a SYSTEM state |
| **FREEU** | Floor is free, following a USER state |
| **BOTHS** | Both claim the floor, following a SYSTEM state |
| **BOTHU** | Both claim the floor, following a USER state |

States are defined in terms of **intentions and obligations**, not surface-level speech/silence observations. For example, USER state persists during mid-turn pauses.

### Four Turn-Taking Actions

At any time, the system can take one of four actions:
- **Grab (G)**: Claim the floor
- **Release (R)**: Relinquish the floor
- **Keep (K)**: Maintain floor claim
- **Wait (W)**: Remain silent without claiming floor

### Turn-Taking Phenomena Captured

The model formalizes common turn-taking patterns as 2-step state transitions:

1. **Turn transitions with gap**: `SYSTEM --(R,W)--> FREES --(W,G)--> USER` (most common)
2. **Turn transitions with overlap**: `SYSTEM --(K,G)--> BOTHS --(R,K)--> USER` (barge-in)
3. **Failed interruptions**: `USER --(G,K)--> BOTHU --(R,K)--> USER` (system interrupts then backs off)
4. **Time outs**: `SYSTEM --(R,W)--> FREES --(G,W)--> SYSTEM` (user doesn't respond)

### Decision-Theoretic Action Selection

The optimal action minimizes expected cost:

```
C(A) = sum over S in States: P(s=S|O) * C(A, S)
```

Where `P(s=S|O)` is estimated from observable features and `C(A,S)` comes from a cost matrix.
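
The decision rule can be sketched directly from the formula (the probabilities and cost entries below are illustrative placeholders, not the paper's fitted values):

```python
def best_action(p_state: dict, cost: dict) -> str:
    """Pick the action A minimizing C(A) = sum_S P(s=S|O) * C(A, S)."""
    actions = {a for (a, _) in cost}
    expected = {
        a: sum(p * cost[(a, s)] for s, p in p_state.items())
        for a in actions
    }
    return min(expected, key=expected.get)

# Toy numbers: the user has very probably released the floor, so the
# expected cost of Grab (cut-in risk) drops below the cost of Waiting.
p_state = {"FREEU": 0.95, "USER": 0.05}
cost = {
    ("G", "FREEU"): 0,      # grabbing a free floor resolves the gap
    ("G", "USER"): 5000,    # cut-in: grabbing while the user holds the floor
    ("W", "FREEU"): 300,    # waiting in a gap costs ~ CG * tau
    ("W", "USER"): 0,       # waiting while the user speaks is correct
}
print(best_action(p_state, cost))  # G  (expected cost 250 vs. 285 for W)
```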

### Cost Matrix Design Principles

Derived from Sacks et al. (1974) -- "participants minimize gaps and overlaps":

1. Actions that resolve gaps/overlaps have zero cost.
2. Actions that create unwanted gaps/overlaps have a constant cost parameter.
3. Actions that maintain gaps/overlaps have cost proportional to time in that state.

Four cost parameters:
- **CS**: Cost of false interruption (interrupting system prompt when user is not claiming floor)
- **CO(tau)**: Cost of remaining in overlap for tau ms
- **CU**: Cost of cut-in (grabbing floor when user holds it)
- **CG(tau)**: Cost of remaining in a gap for tau ms (set as CG^p * tau)

### Probability Estimation

The key estimation task is `P(FREEU | Ot)` -- the probability that the user has released the floor.

**At pauses** (VAD detects silence): Uses Bayes rule combining:
- `P(F|O)`: Prior probability of floor release from pause-onset features (logistic regression)
- `P(d >= tau | O, U)`: Probability pause lasts at least tau ms given user holds floor (exponential distribution)

**During speech** (before VAD pause detection): Separate logistic regression on each ASR partial hypothesis, enabling endpointing before a full pause is detected.
### Features Used
|
| 93 |
+
|
| 94 |
+
All features are automatically extractable at runtime:
|
| 95 |
+
- **Dialog state**: Open question / Closed question / Confirmation
|
| 96 |
+
- **Turn-taking features**: Whether current utterance is a barge-in
|
| 97 |
+
- **Semantic features**: From dialog state and partial ASR hypotheses
|
| 98 |
+
- **Boundary LM score**: Ratio of log-likelihood of hypothesis being complete vs. incomplete (trained on ASR output, no human transcription needed). **Most informative feature across all states.**
|
| 99 |
+
- **Average words per utterance** so far
|
| 100 |
+
- **Confirmation markers**: "YES", "SURE", etc.
|
| 101 |
+
- **Pause duration** within partial hypothesis (0-200ms range)

## Experimental Results

### Logistic Regression for P(F|O)

Trained with stepwise regression, 10-fold cross-validation on 586 dialogs (2008 corpus):

| Dialog State | Pause: Class. Error | Pause: Log-Likelihood | Speech: Class. Error | Speech: Log-Likelihood |
|---|---|---|---|---|
| Open question | 35% (vs 38% baseline) | -0.61 (vs -0.66) | 17% (vs 20%) | -0.40 (vs -0.50) |
| Closed question | 26% (vs 25% baseline) | -0.50 (vs -0.56) | 22% (vs 32%) | -0.49 (vs -0.63) |
| Confirmation | 12% (vs 12% baseline) | -0.30 (vs -0.36) | 17% (vs 36%) | -0.40 (vs -0.65) |

The "in speech" model achieves much larger gains, especially for Closed questions and Confirmations -- classification error drops from 32% to 22% and 36% to 17% respectively.

### Batch Evaluation: Latency vs. Cut-in Rate

**In-pause FSTTM vs. baselines** (2007 corpus):
- FSTTM outperforms fixed threshold baseline by up to **29.5% latency reduction**.
- Slight improvement over Ferrer et al. (2003) reimplementation.

**Anytime-FSTTM vs. in-pause FSTTM** (2008 corpus):
- At 5% cut-in rate: anytime-FSTTM yields latencies **17% shorter** than in-pause-FSTTM, and **40% shorter** than fixed threshold baseline.
- **30-40% of turns are endpointed before the pause is detected by VAD** (during speech).

### Live Evaluation (Deployed System)

A/B test over 10 days: 171 FSTTM dialogs vs. 148 fixed-threshold control dialogs.

Settings: `CG^p = 1, CG^s = 500, CU = 5000` (FSTTM) vs. 555ms fixed threshold (control). Both calibrated for ~6.3% cut-in rate.

| Metric | FSTTM | Fixed Threshold |
|--------|-------|-----------------|
| Average latency | **320ms** | 513ms |
| Cut-in rate | **4.8%** | 6.3% |

Results:
- **193ms latency reduction** (p < 0.05, statistically significant).
- 1.5% cut-in rate reduction (not statistically significant, but directionally correct).

## Relevance to Turn-Taking / End-of-Turn Detection

The FSTTM is a foundational reference for BabelCast's end-of-turn detection:

1. **Decision-theoretic framework**: The cost-based approach directly applies to our system. We can define costs for premature LLM invocation (wasted compute, incorrect partial translation) vs. delayed response (poor user experience), and optimize the trade-off.

2. **Anytime endpointing**: The insight that 30-40% of turns can be endpointed *before the pause is detected by VAD* is powerful. In our pipeline, this means we could start the translation process before the speaker's pause is even fully formed, dramatically reducing perceived latency.

3. **Boundary LM score**: The most informative feature in Raux's system was a language model score measuring syntactic completeness. We already have ASR partial results from Whisper -- we could compute a similar feature to predict turn completeness without waiting for silence.

4. **Cost parameter tuning**: The four-parameter cost matrix (CS, CO, CU, CG) provides a compact, interpretable way to tune system behavior. We could expose these as configuration parameters in our Pipecat pipeline, allowing adjustment of responsiveness vs. interruption rate.

5. **Real-world validation**: This is one of the few ETD papers validated on a deployed system with real users, not just lab data. The 193ms latency improvement is a concrete, practically meaningful result.

6. **Simplicity**: The FSTTM is mathematically elegant and computationally trivial -- just a few probability estimates and a cost comparison. It could be implemented alongside our Silero VAD as an additional decision layer with negligible overhead.

7. **Limitation for our case**: The system was designed for a task-oriented telephone dialog (bus information), which has more predictable turn structures than the open-domain meeting conversations we handle. The boundary LM approach would need adaptation for our more varied content.
03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.pdf
ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:aaa1054cf8f25ccdfd847cd0a323e8d0473867404c5895551a1ec9889e4f7d97
size 254541
|
03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.md
ADDED
---
title: "Focal Loss for Dense Object Detection"
authors:
- Tsung-Yi Lin
- Priya Goyal
- Ross Girshick
- Kaiming He
- Piotr Dollar
year: 2017
source: https://arxiv.org/abs/1708.02002
date_converted: 2026-03-16
---

## Abstract

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. The authors discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. They propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. The novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. Using focal loss, their one-stage RetinaNet detector matches the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.

## Key Contributions

1. **Focal Loss function**: A dynamically scaled cross entropy loss where the scaling factor decays to zero as confidence in the correct class increases, automatically down-weighting easy examples and focusing on hard examples.
2. **RetinaNet**: A simple one-stage dense object detector built on Feature Pyramid Networks (FPN) that, when trained with focal loss, achieves state-of-the-art results.
3. **Identification of class imbalance** as the central obstacle preventing one-stage detectors from matching two-stage detector accuracy.
|
| 23 |
+
|
| 24 |
+
## Method Details
|
| 25 |
+
|
| 26 |
+
### Focal Loss Definition
|
| 27 |
+
|
| 28 |
+
Standard cross entropy loss:
|
| 29 |
+
|
| 30 |
+
```
|
| 31 |
+
CE(pt) = -log(pt)
|
| 32 |
+
```
|
| 33 |
+
|
| 34 |
+
Focal loss adds a modulating factor `(1 - pt)^gamma`:
|
| 35 |
+
|
| 36 |
+
```
|
| 37 |
+
FL(pt) = -alpha_t * (1 - pt)^gamma * log(pt)
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
Key properties:
|
| 41 |
+
- When `gamma = 0`, FL is equivalent to CE.
|
| 42 |
+
- When an example is misclassified (`pt` is small), the modulating factor is near 1 and the loss is unaffected.
|
| 43 |
+
- As `pt -> 1`, the factor goes to 0, down-weighting well-classified examples.
|
| 44 |
+
- With `gamma = 2`, an example classified with `pt = 0.9` has 100x lower loss compared to CE; with `pt ~ 0.968`, 1000x lower loss.
|
| 45 |
+
- Best setting found: `gamma = 2`, `alpha = 0.25`.
|
| 46 |
+
|
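The definitions above can be sketched in plain Python (a minimal scalar version for intuition, not the paper's vectorized implementation):

```python
import math

def cross_entropy(pt: float) -> float:
    """CE(pt) = -log(pt), where pt is the predicted prob of the true class."""
    return -math.log(pt)

def focal_loss(pt: float, gamma: float = 2.0, alpha_t: float = 0.25) -> float:
    """FL(pt) = -alpha_t * (1 - pt)^gamma * log(pt)."""
    return -alpha_t * (1.0 - pt) ** gamma * math.log(pt)

# With gamma = 2 (and alpha_t = 1 to isolate the modulating factor),
# a well-classified example at pt = 0.9 contributes 100x less loss than CE:
print(round(cross_entropy(0.9) / focal_loss(0.9, gamma=2.0, alpha_t=1.0)))  # 100
```

The ratio is exactly `(1 - pt)^-gamma`, which is where the 100x / 1000x figures in the bullet list come from.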

### RetinaNet Architecture

- **Backbone**: ResNet + Feature Pyramid Network (FPN), pyramid levels P3-P7 with C=256 channels.
- **Classification subnet**: Small FCN with four 3x3 conv layers (256 filters, ReLU) + a final 3x3 conv with K*A filters (K object classes x A anchors per location) + sigmoid. Shared across all pyramid levels.
- **Box regression subnet**: Identical in structure to the classification subnet, but terminating in 4*A linear outputs per location. Class-agnostic.
- **Anchors**: 9 anchors per level (3 aspect ratios x 3 scales), covering the 32-813 pixel range.
- **Initialization**: The final classification layer is biased so the foreground prior probability is pi = 0.01 at initialization, preventing instability from background-class dominance early in training.

## Experimental Results

### Ablation Studies (ResNet-50-FPN, COCO minival)

| Loss | gamma | alpha | AP | AP50 | AP75 |
|------|-------|-------|------|------|------|
| CE (balanced) | 0 | 0.75 | 31.1 | 49.4 | 33.0 |
| FL | 0.5 | 0.75 | 31.4 | 49.9 | 33.1 |
| FL | 1.0 | 0.50 | 32.9 | 51.7 | 35.2 |
| **FL** | **2.0** | **0.25** | **34.0** | **52.5** | **36.5** |
| FL | 5.0 | 0.25 | 32.2 | 49.6 | 34.8 |

### FL vs. OHEM (ResNet-101-FPN)

| Method | AP | AP50 | AP75 |
|--------|------|------|------|
| OHEM (best) | 32.8 | 50.3 | 35.1 |
| FL | **36.0** | **54.9** | **38.7** |

FL outperforms the best OHEM variant by **3.2 AP points**.

### State-of-the-Art Comparison (COCO test-dev)

| Method | Backbone | AP | AP50 | AP75 |
|--------|----------|------|------|------|
| Faster R-CNN w FPN | ResNet-101-FPN | 36.2 | 59.1 | 39.0 |
| Faster R-CNN by G-RMI | Inception-ResNet-v2 | 34.7 | 55.5 | 36.7 |
| DSSD513 | ResNet-101-DSSD | 33.2 | 53.3 | 35.2 |
| **RetinaNet** | **ResNet-101-FPN** | **39.1** | **59.1** | **42.3** |
| **RetinaNet** | **ResNeXt-101-FPN** | **40.8** | **61.1** | **44.1** |

### Speed vs. Accuracy

- RetinaNet-101-600: 122ms inference at 36.0 AP, matching Faster R-CNN accuracy at similar speed.
- RetinaNet-101-800: 198ms inference at 37.8 AP -- surpassing all prior methods.

## Relevance to Turn-Taking / End-of-Turn Detection

Focal loss is directly applicable to end-of-turn detection for BabelCast:

1. **Class imbalance problem**: End-of-turn detection faces severe class imbalance -- the vast majority of audio frames are "continuing speech" (easy negatives), while actual turn-end boundaries are rare events. This is analogous to the foreground-background imbalance in object detection.

2. **Down-weighting easy examples**: In a streaming ETD system, most 100ms frames are clearly mid-utterance. Focal loss prevents these trivially classified frames from dominating the gradient, focusing learning on the ambiguous boundary cases (pauses vs. turn-ends).

3. **Drop-in replacement**: Focal loss is a simple modification to standard cross entropy -- just the added `(1 - pt)^gamma` factor -- making it trivial to integrate into any binary or ternary classifier for end-of-turn detection (e.g., the GRU or Wav2Vec models in SpeculativeETD).

4. **Hyperparameter robustness**: The paper shows `gamma = 2, alpha = 0.25` works well across settings, reducing the need for extensive tuning.

5. **No sampling needed**: Unlike OHEM or hard negative mining, focal loss naturally operates on all examples, which is important for real-time streaming where every frame is processed.
03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2e6f45b97007009f32b82a3c19f35638d8da7022d8d4e4416b9078de91339f71
size 1297358
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.md
ADDED
@@ -0,0 +1,74 @@
---
title: "The Dual Status of Filled Pauses: Evidence from Genre, Proficiency and Co-occurrence"
authors:
- Loulou Kosmala
- Ludivine Crible
year: 2022
source_url: "https://doi.org/10.1177/00238309211010862"
date_converted: 2026-03-16
---

## Abstract

A corpus study examining the lexical vs. non-lexical status of filled pauses ("euh" and "eum") in spoken French. Analyzes their distribution across communication settings (prepared monologues vs. spontaneous conversations) and language proficiency levels (native vs. non-native French). Quantitative findings reveal differences in frequency, duration, position, and co-occurrence patterns. Qualitative analysis identifies two distinct patterns: (1) initial position clustered with a discourse marker (fluent, structuring use) vs. (2) medial position clustered with other hesitation markers (disfluent, processing use). The authors argue for a **dual status** of filled pauses based on formal, functional, and contextual features.

## Key Findings Relevant to L2 Turn-Taking

### French Filled Pause Forms
- Two main variants in French: **"euh"** [schwa] and **"eum"** [nasal]
- "Eum" is associated with longer delays and major discourse transitions/boundaries
- "Euh" is the dominant form in both native and non-native French (84-92% of all FPs)

### Genre Effects (DisReg Corpus -- French Native Speakers)
- **Class presentations**: 6.8 FPs per 100 words (mean duration 415ms)
- **Casual conversations**: 4.2 FPs per 100 words (mean duration 343ms)
- More filled pauses in monologues than dialogues (significant, LL = 47.02, p < .001)
- More "eum" in presentations (23%) vs. conversations (8%)
- More initial-position FPs in presentations (40%) vs. conversations (24%)
- More final-position FPs in conversations (14%) vs. presentations (3%) -- reflects **turn-yielding** function

### Native vs. Non-Native French (SITAF Corpus)
- **Native rate**: 4.4 FPs per 100 words (mean duration **378ms**)
- **Non-native rate**: 5.3 FPs per 100 words (mean duration **524ms**)
- Rate difference not significant, but **duration difference highly significant** (p < .001)
- Both groups: ~60% medial position, ~30% initial, ~10% final
- Non-native speakers had more **standalone and interrupted** positions (exclusive to learners)
- Non-native speakers cluster FPs with other fluencemes more often (82% clustered vs. 72% for natives)
- Longer fluenceme sequences in learner speech signal higher disruption

### Co-occurrence with Discourse Markers
- 69 instances of FP + discourse marker clusters found
- Common French discourse markers paired with FPs: "donc" (so), "mais" (but), "ben" (well), "alors" (then/well), "en fait" (actually), "enfin" (I mean)
- FP position shifts when clustered with discourse markers: **57% initial** (vs. 73% medial when isolated)
- **Two distinct patterns emerge**:
  1. **Fluent pattern**: FP + discourse marker at turn/phrase boundary (initial position) -- structuring function
  2. **Disfluent pattern**: FP in medial position clustered with repetitions, lengthenings, silent pauses -- processing difficulty

### Functional Analysis of Discourse Marker + FP Clusters
Four discourse domains identified:
- **Ideational**: connecting facts (e.g., "alors euh" = "then euh")
- **Rhetorical**: expressing opinions (e.g., "mais euh" = "but euh")
- **Sequential**: marking transitions (e.g., "donc euh" = "so euh")
- **Interpersonal**: monitoring/agreement (e.g., "bah oui euh" = "well yes euh")

## Specific Hesitation Pattern Data

- Native French FP mean duration: **378ms** (SD=200ms)
- Non-native French FP mean duration: **524ms** (SD=222ms) -- 146ms longer
- FP duration is a more reliable index of proficiency than frequency
- Duration of FPs with/without discourse markers: no significant difference (~433ms vs. ~467ms)
- Example extreme disfluent sequence from a learner: 8 fluencemes in a row (lengthening + 3 silent pauses + 3 filled pauses + self-interruption), with pauses up to **1,177ms**

## Implications for Turn-Taking Detection in L2 Speech

1. **Dual function detection is essential**: The same sound "euh" can signal either (a) turn/phrase structuring (fluent, initial position + discourse marker) or (b) processing difficulty (disfluent, medial position + hesitation cluster). Both mean "don't take the turn yet," but for different reasons.

2. **Duration as proficiency proxy**: Non-native FPs are ~150ms longer on average. A model trained on native French data will encounter systematically longer hesitations from L2 speakers. This duration difference should NOT be interpreted as turn-yielding.

3. **Final-position FPs signal turn-yielding**: In conversations, 14% of FPs occur in final position (vs. 3% in monologues). The combination of FP + silent pause at utterance end is a reliable turn-yielding signal (e.g., "but he has a role euh (0.924)...").

4. **Discourse marker + FP combinations**: Patterns like "donc euh" (so euh) and "mais euh" (but euh) are strong turn-HOLD signals in French. A model should recognize these common French fluenceme clusters.

5. **Cluster length matters**: Isolated FPs may go unnoticed; clustered FPs with repetitions and silent pauses signal genuine difficulty. Longer clusters = speaker is struggling but still holding the floor.

6. **L2 speakers underuse discourse markers**: Non-native speakers rely more heavily on filled pauses alone, whereas native speakers combine FPs with rich discourse markers. The model should expect L2 French-Portuguese speakers to show heavy FP reliance with fewer Portuguese discourse markers.
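The dual-status distinction described above could be operationalized as a rough heuristic (an illustrative sketch, NOT from the paper; the marker sets and decision order are assumptions for illustration):

```python
# Illustrative heuristic (not from the paper): classify a French filled pause
# ("euh"/"eum") as fluent-structuring vs. disfluent-processing from its
# position and co-occurring items, following the two observed patterns.
FRENCH_DMS = {"donc", "mais", "ben", "alors", "en fait", "enfin"}
HESITATIONS = {"repetition", "lengthening", "silent_pause", "self_interruption"}

def classify_filled_pause(position, neighbors):
    """position: 'initial' | 'medial' | 'final';
    neighbors: set of discourse markers / hesitation phenomena around the FP."""
    if position == "final":
        return "turn-yielding"         # utterance-final FP: yield signal
    if position == "initial" and neighbors & FRENCH_DMS:
        return "fluent-structuring"    # FP + discourse marker at a boundary
    if neighbors & HESITATIONS:
        return "disfluent-processing"  # FP inside a hesitation cluster
    return "ambiguous"                 # isolated FP: no strong signal

print(classify_filled_pause("initial", {"donc"}))         # fluent-structuring
print(classify_filled_pause("medial", {"silent_pause"}))  # disfluent-processing
```

Note that every outcome except `turn-yielding` argues for holding the floor; the categories differ only in why the speaker is pausing.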
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:46985a0e6f9345c0ae5a756f3bf13523e9ba2a76694c60ef2b861cf0aef67b4b
size 631099
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.md
ADDED
@@ -0,0 +1,68 @@
---
title: "How Much Does Prosody Help Turn-taking? Investigations using Voice Activity Projection Models"
authors:
- Erik Ekstedt
- Gabriel Skantze
year: 2022
source_url: "https://doi.org/10.18653/v1/2023.acl-long.304"
date_converted: 2026-03-16
---

## Abstract

Investigates the role of prosody in turn-taking using the Voice Activity Projection (VAP) model, which incrementally models the upcoming speech activity of interlocutors in a self-supervised manner, without relying on explicit annotation of turn-taking events or explicit prosodic feature modeling. Through systematic manipulation of the speech signal (F0 flattening, intensity flattening, low-pass filtering, duration averaging, F0 shifting), the authors show that VAP models learn to utilize various prosodic aspects of speech for turn-taking prediction. The study uses both aggregate quantitative metrics on long-form conversations (Switchboard corpus) and controlled utterance-level experiments with synthesized short/long phrase pairs.

## Key Findings Relevant to L2 Turn-Taking

### VAP Model Architecture
- Self-supervised model trained on raw waveforms (no hand-crafted features)
- Predicts future voice activity for both speakers at every time frame
- Tested at 20Hz, 50Hz, and 100Hz frame rates
- Trained on the **Switchboard corpus** (English telephone conversations, 260h)
- Evaluated on 4 zero-shot tasks: shift vs. hold, shift prediction at voice activity, backchannel prediction, backchannel vs. turn-shift

### Prosodic Perturbation Results
Five signal manipulations tested:
1. **Low-pass filter** (removes phonetic info, keeps F0 + intensity): Largest overall impact -- phonetic information is crucial
2. **F0 flat** (removes pitch contour): Second most impactful -- intonation is important for disambiguating turn completion
3. **Intensity flat** (removes loudness variation): Comparable impact to F0
4. **F0 shift** (arbitrary pitch shift): Minimal effect on slower models (50Hz), larger effect on faster models (100Hz)
5. **Duration average** (normalizes segment durations): Least important individual cue

### Key Prosodic Finding: F0 Contour for Turn Projection
- At **syntactically ambiguous completion points** (where lexical information is identical for both short and long utterances), the VAP model correctly uses prosody to distinguish turn-yielding from turn-holding
- **Higher F0 rise + longer duration** at the last syllable = turn completion signal
- The model predicts turn shifts **before** the utterance is actually complete -- it projects completions from prosodic dynamics
- Even when F0 is flattened, the model still partially distinguishes hold/shift, indicating **redundant information in intensity and/or duration**

### Relative Importance of Prosodic Cues
1. **Phonetic information** (segmental, captured by spectral content): Most important overall
2. **F0 contour** (intonation): Most important for disambiguating syntactically equivalent completion points
3. **Intensity**: At least as important as pitch for general turn-taking
4. **Duration**: Less important, but contributes as a redundant cue

### Model Frame Rate Findings
- 20Hz and 50Hz models: comparable performance, more robust to perturbations
- 100Hz models: more sensitive to phonetic information and acoustic artifacts
- Lower frame rates preferred for computational efficiency and robustness

## Specific Data Points

- Human inter-turn gap: ~200ms average (Levinson & Torreira 2015)
- Switchboard corpus: 2,400 dyadic telephone conversations, 260 hours
- Short/long phrase pairs: 9 pairs, 10 TTS voices each (5 male, 5 female)
- Three evaluation regions at the short completion point: hold (start to -200ms), predictive (-200ms to end), reactive (last frame)

## Implications for Turn-Taking Detection in L2 Speech

1. **VAP models work without explicit prosodic features**: The self-supervised approach learns prosodic patterns from raw audio. This is crucial for L2 speech, where prosodic patterns deviate from native norms -- the model can potentially learn L2-specific patterns if fine-tuned on L2 data.

2. **F0 contour is the key disambiguation signal**: When words alone cannot determine turn boundaries (common in L2 speech with simpler syntax), the pitch contour becomes the primary cue. French speakers' intonation patterns in Portuguese will be a critical signal for the model.

3. **Prosody is redundant and multi-dimensional**: Even removing one prosodic dimension (F0 or intensity) doesn't completely collapse turn-taking prediction. This redundancy is useful for L2 speech, where some prosodic cues may be atypical.

4. **50Hz frame rate is optimal**: For our Pipecat implementation, a 50Hz (20ms frame) VAP model provides the best balance of performance, robustness, and computational efficiency. This aligns with our current 16kHz/512-sample (32ms) Silero VAD window.

5. **Pre-completion projection**: The VAP model predicts turn shifts BEFORE the speaker finishes, based on prosodic dynamics. This is essential for reducing response latency in real-time systems. For L2 speakers, this projection may be less reliable due to non-standard prosodic patterns, requiring L2-specific fine-tuning.

6. **Cross-linguistic transfer needed**: The model was trained on English (Switchboard). For French speakers in Portuguese, it needs fine-tuning on Romance-language conversation data. The underlying prosodic principles (F0 rise at completion, intensity patterns) likely transfer across Romance languages, but language-specific patterns must be learned.
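The VAP training target summarized above can be sketched as follows. This is a simplified reconstruction: the 2 s projection window, the 0.2/0.4/0.6/0.8 s bin widths, and the 0.5 activity threshold follow the common VAP setup in the literature but are assumptions here, and the real model predicts a distribution over these states from raw audio rather than computing them from labels:

```python
# Sketch: discretize the next 2 s of voice activity for both speakers into
# 4 bins each -> an 8-bit "projection window" class (256 possible states).

def vap_label(va_a, va_b, frame_hz=50):
    """va_a/va_b: per-frame voice activity (0/1) for the next 2 seconds."""
    bin_edges_s = [0.0, 0.2, 0.6, 1.2, 2.0]   # bin widths 0.2/0.4/0.6/0.8 s
    bits = []
    for va in (va_a, va_b):
        for lo, hi in zip(bin_edges_s, bin_edges_s[1:]):
            seg = va[int(lo * frame_hz):int(hi * frame_hz)]
            bits.append(1 if sum(seg) / len(seg) >= 0.5 else 0)
    label = 0
    for b in bits:                             # pack 8 bits into one class id
        label = (label << 1) | b
    return label

# Speaker A silent for the whole window, speaker B fully active:
print(vap_label([0] * 100, [1] * 100))  # -> 15 (0b00001111: only B's bins set)
```

A frame-level classifier trained against such labels is what lets the model project an upcoming turn shift before the current speaker has finished.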
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:472f85f79713f457e2ec5aa866a1e88cbd58d119a8b3e1561d50e7f1c1722661
size 2825968
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.md
ADDED
@@ -0,0 +1,77 @@
---
title: "A Survey of Recent Advances on Turn-taking Modeling in Spoken Dialogue Systems"
authors:
- Galo Castillo-Lopez
- Gael de Chalendar
- Nasredine Semmar
year: 2025
source_url: "https://aclanthology.org/2025.iwsds-1.24/"
date_converted: 2026-03-16
---

## Abstract

A comprehensive review of recent methods for turn-taking modeling in spoken dialogue systems. Covers end-of-utterance prediction, backchannel prediction, and multi-party turn-taking. Notes that 72% of reviewed works do not compare their methods with previous efforts, and argues that the lack of well-established benchmarks is a key challenge. Provides the first detailed review of datasets used in the field, discusses overlooked limitations, and examines new ideas and approaches since 2021. Published at IWSDS 2025.

## Key Findings Relevant to L2 Turn-Taking

### Turn-Taking Modeling Taxonomy
Three main approaches to end-of-utterance (EOU) prediction:
1. **Silence-based**: Uses VAD + a silence threshold (e.g., 700ms). Results in poor user experience due to unnaturalness. Current spoken dialogue systems wait 700-1000ms (vs. the human ~200ms average).
2. **IPU-based (Inter-Pausal Unit)**: Predictions after each detected silence. Assumes turns cannot be taken while the user speaks.
3. **Continuous**: Predictions at every frame (e.g., every 50ms) regardless of silence. The most recent and promising approach.

### Voice Activity Projection (VAP) Models -- Current State of the Art
- VAP models predict future voice activity of both interlocutors incrementally
- Self-supervised learning -- no explicit turn-taking event annotation needed
- Best results in Japanese when trained on English + fine-tuned with Japanese data (vs. direct Japanese training)
- Cross-lingual performance is poor without proper label alignment across datasets
- Emerging trend: VAP models dominate continuous methods

### Feature Importance for Turn-Taking
- **Combined features outperform individual ones**: prosody + words > prosody alone or words alone
- **Turn-taking cues have an additive effect** in human communication
- Word embeddings + acoustic features together improve backchannel detection
- Gaze, head pose, and non-verbal features enhance predictions when available
- **ASR-based linguistic features**: Fine-tuning wav2vec 2.0 for ASR outperforms acoustic-only features

### Key Datasets

| Dataset | Language | Duration | Dialogues |
|---|---|---|---|
| Switchboard | English | 260h | 2,400 |
| Fisher Corpus | English | 1,960h | 11,700 |
| NoXi Database | en/es/fr/de/it/ar/id | 25h | 84 |
| AMI Meeting Corpus | English | 100h | 175 (multi-party) |

- **Switchboard** is the dominant benchmark: used in 69% of backchannel and 41% of EOU papers
- Most research is on **English and Japanese** -- very limited work on other languages
- IPU silence thresholds vary from **50ms to 1s** across studies -- no standard

### Backchannel Prediction
- Backchannels are short feedback tokens ("mhm", "yeah") produced during another's speech
- Key distinction: backchannel vs. actual interruption vs. non-lexical noise
- Multi-task learning (sentiment + dialogue acts + backchannel) improves prediction
- Context-aware models using BERT text embeddings + wav2vec acoustic embeddings show strong results

### Open Challenges Identified
1. **No standardized benchmarks**: Only 28% of papers compare with prior work
2. **Multilinguality**: Very limited -- most work is English/Japanese only
3. **Multi-party conversations**: Understudied, requires the visual channel
4. **Special populations**: Senior adults and people with mental health disorders need adapted models
5. **LLMs are inefficient** at detecting mid-utterance turn opportunities

## Implications for Turn-Taking Detection in L2 Speech

1. **Continuous VAP models are the right approach**: For our fine-tuning task, continuous frame-level prediction (not silence-threshold-based) is the current state of the art. This aligns with our Pipecat architecture.

2. **Cross-lingual fine-tuning works**: VAP models trained on English and fine-tuned on target-language data outperform direct target-language training. This suggests training on English Switchboard + fine-tuning on French-Portuguese conversation data is a viable strategy.

3. **Multi-feature approach is essential**: For L2 speakers, combining prosodic + lexical + timing features provides the best predictions. L2 speakers may have atypical prosody but more predictable syntax, or vice versa. The additive effect of features provides robustness.

4. **Silence threshold tuning**: The survey notes thresholds from 50ms to 1s. For L2 speakers with longer pauses, the threshold must be adjusted upward. Our current 300ms SILENCE_HANGOVER_MS may need extension for L2 French speakers in Portuguese.

5. **French is available in NoXi**: The NoXi database includes French conversations (among 7 languages) with 25h total. This could be a valuable fine-tuning resource for our French-specific model.

6. **Backchannel handling**: L2 speakers may produce non-standard backchannels or transfer French backchannels ("ouais", "mmh") into Portuguese conversation. The model must distinguish these from turn-taking attempts.

7. **Benchmarking gap**: Since 72% of papers don't compare methods, we should establish clear metrics for our L2 turn-taking task from the start, enabling future comparison.
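The silence-based baseline from the taxonomy above can be sketched in a few lines (illustrative only; a real system streams VAD decisions frame by frame rather than consuming a precomputed list):

```python
# Minimal sketch of silence-based end-of-utterance detection:
# fire EOU once VAD reports uninterrupted silence for `threshold_ms`.

def silence_based_eou(vad_frames, frame_ms=20, threshold_ms=700):
    """Return the frame index at which EOU fires, or None.
    vad_frames[i] is True when speech is detected in frame i."""
    needed = threshold_ms // frame_ms   # consecutive silent frames required
    run = 0
    for i, speech in enumerate(vad_frames):
        run = 0 if speech else run + 1  # silence run resets on any speech
        if run >= needed:
            return i
    return None

# 1 s of speech followed by 1 s of silence, in 20 ms frames:
frames = [True] * 50 + [False] * 50
print(silence_based_eou(frames))  # fires 700 ms into the silence -> frame 84
```

The 700 ms wait before the system can even begin responding is exactly the latency cost that motivates the continuous, VAP-style approaches the survey favors.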
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f3fd4a6e8a77a46ad20634ba4b05c5731dbf4c692eb1acdb2fc1bb71062f12ee
size 473840
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.md
ADDED
|
@@ -0,0 +1,76 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: "Pauses and Communication Strategies in Second Language Speech"
|
| 3 |
+
authors:
|
| 4 |
+
- Jasone Cenoz
|
| 5 |
+
year: 2000
|
| 6 |
+
source_url: "https://eric.ed.gov/?id=ED426630"
|
| 7 |
+
date_converted: 2026-03-16
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## Abstract
|
| 11 |
+
|
| 12 |
+
A study of silent and filled pauses in second language speech analyzing (1) which types of pauses are produced, (2) the functions of non-juncture pauses, (3) whether pauses co-occur with other hesitation phenomena, and (4) whether the occurrence of pauses is associated with second language proficiency. Subjects were 15 intermediate and advanced learners of English as a second language (L1 Spanish, university students). Each told a story in English which was recorded and transcribed. Silent and filled pauses at non-grammatical junctures were identified and analyzed.
|
| 13 |
+
|
| 14 |
+
## Key Findings Relevant to L2 Turn-Taking
|
| 15 |
+
|
| 16 |
+
### Pause Distribution
|
| 17 |
+
- Total non-juncture pauses: **1,085** across 15 subjects
|
| 18 |
+
- **64% silent pauses**, **36% filled pauses**
|
| 19 |
+
- The most common filler was **"eh"** (transferred from L1 Spanish), demonstrating L1 transfer of hesitation markers
|
| 20 |
+
- Wide individual variation: filled pauses ranged from 4% to 74.5% of all pauses per individual
|
| 21 |
+
|
| 22 |
+
### Silent Pause Durations
|
| 23 |
+
| Duration Range | % of Pauses | Subjects Affected | Individual Variation |
|
| 24 |
+
|---|---|---|---|
|
| 25 |
+
| 200-1000ms | 70% | 100% | 39%-96% |
|
| 26 |
+
| 1001-2000ms | 21% | 100% | 4%-39% |
|
| 27 |
+
| 2001-4000ms | 7% | 40% | 10%-18% |
|
| 28 |
+
| 4001ms+ | 2% | 40% | 1.5%-7% |
|
| 29 |
+
|
| 30 |
+
### Functional Categories of Pauses
|
| 31 |
+
| Function | Silent Pauses | Filled Pauses |
|
| 32 |
+
|---|---|---|
|
| 33 |
+
| Lexical (retrieval) | 36% | 26% |
|
| 34 |
+
| Morphological | 5% | 1% |
|
| 35 |
+
| Planning | 59% | 73% |
|
| 36 |
+
|
| 37 |
+
- Both silent and filled pauses serve the same functions, but filled pauses are overwhelmingly used for general **planning** (73%)
|
| 38 |
+
- More silent pauses associated with **lexical retrieval** than filled pauses
|
| 39 |
+
|
| 40 |
+
### Co-occurrence with Other Hesitation Phenomena
|
| 41 |
+
| Strategy | Silent Pauses | Filled Pauses |
|
| 42 |
+
|---|---|---|
|
| 43 |
+
| Pauses + other hesitations | 54% | 23% |
|
| 44 |
+
| Only pauses | 46% | 77% |
|
| 45 |
+
|
| 46 |
+
- Most common hesitation phenomena: **repetition, self-correction, and reformulation**
|
| 47 |
+
- Silent pauses much more likely to co-occur with other repair strategies (54%) vs. filled pauses (23%)
|
| 48 |
+
- Filled pauses function as **standalone repair devices**; silent pauses tend to **precede** other repairs
|
| 49 |
+
|
| 50 |
+
### Proficiency Effects
|
| 51 |
+
- Higher-proficiency learners produced **more total pauses** and **more filled pauses** (53% silent + 46% filled)
|
| 52 |
+
- Lower-proficiency learners: 69% silent + 31% filled
|
| 53 |
+
- Lower-proficiency learners used **more hesitation strategies combined with pauses** (62% for silent pauses) vs. higher-proficiency (38%)
|
| 54 |
+
- Interpretation: high-proficiency learners need only time to retrieve information; low-proficiency learners need to vocalize different options
|
| 55 |
+
|
| 56 |
+
## Specific Hesitation Pattern Data
|
| 57 |
+
|
| 58 |
+
- Silent pause minimum threshold: **200ms**
|
| 59 |
+
- Pause range: **205ms to 11,569ms**
|
| 60 |
+
- Most pauses (70%) are under 1 second
|
| 61 |
+
- Long pauses (>2s) only in 40% of subjects -- marks a subset of struggling speakers
|
| 62 |
+
- L1 Spanish filler "eh" dominates filled pauses (L1 transfer effect)
|
| 63 |
+
|
| 64 |
+
## Implications for Turn-Taking Detection in L2 Speech

1. **Pause duration is critical**: 70% of L2 pauses are 200-1000ms, overlapping with natural turn-transition gaps (~200ms). A turn-taking model must distinguish these within-turn hesitation pauses from actual turn boundaries.

2. **Filled pauses signal continuation**: Since filled pauses are primarily standalone floor-holding devices (77% occur without other hesitations), detecting "eh/euh/um" should strongly indicate that the speaker intends to continue, NOT yield the turn.

3. **L1 transfer of filler forms**: French speakers speaking Portuguese will likely transfer the French "euh" rather than adopting Portuguese fillers. The model should recognize L1-influenced fillers as valid hold signals.

4. **Proficiency paradox**: More proficient L2 speakers use MORE filled pauses, not fewer. The model should not interpret high filler rates as low fluency or turn-yielding intent.

5. **Cluster detection**: When silent pauses co-occur with repetitions, self-corrections, or reformulations (54% of cases), this strongly signals ongoing speech processing, not turn completion. Detecting hesitation clusters is key to avoiding premature turn-taking.

6. **Individual variation is massive**: Some speakers use almost no filled pauses (4%) while others fill 74.5% of their pauses. The model needs speaker adaptation or robust handling of diverse hesitation profiles.

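The duration bands above suggest a simple gating heuristic. The sketch below is illustrative only: the function name, the flags, and the 1000 ms upper band are assumptions layered on the paper's 200 ms silent-pause floor and its 200-1000 ms hesitation range.

```python
def classify_gap(gap_ms: float,
                 after_filled_pause: bool = False,
                 in_hesitation_cluster: bool = False,
                 turn_boundary_ms: float = 1000.0) -> str:
    """Illustrative gap classifier; thresholds follow the L2 pause stats above."""
    if gap_ms < 200.0:
        # Below the paper's 200 ms silent-pause floor: not a pause at all.
        return "no_pause"
    if after_filled_pause or in_hesitation_cluster:
        # Filled pauses and hesitation clusters are strong hold signals.
        return "hold"
    if gap_ms < turn_boundary_ms:
        # 70% of L2 pauses land here, overlapping ~200 ms turn-transition gaps.
        return "ambiguous"
    # Long pauses remain speaker-dependent (>2 s pauses occur in only 40% of subjects).
    return "possible_turn_end"
```

A downstream model would treat `ambiguous` gaps as requiring further cues (prosody, semantics) rather than committing to a turn end.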
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3e6e1ea3fb31493034539cd59a0274c8bcc11e08ff4104b2fb882c5766106e7a
size 238029

03-finetune-pipecat-pt/references/papers/hesitation-l2-french/05-kosmala-2023-filled-pauses-pragmatic-markers-gaze-gesture.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7b4003c9f58c976ea687d117ea25b507885d0472d440eaad0e8386880e036734
size 920978

03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.md
ADDED
@@ -0,0 +1,76 @@
---
title: "Disfluencies in Continuous Speech in French: Prosodic Parameters of Filled Pauses and Vowel Lengthening"
authors:
- Yaru Wu
- Ivana Didirkova
- Anne-Catherine Simon
year: 2023
source_url: null
date_converted: 2026-03-16
---

## Abstract

This study examines prosodic parameters in two types of disfluencies -- vowel lengthenings and filled pauses -- in approximately 2.5 hours of continuous French speech across 11 different speech genres (prepared, semi-prepared, and unprepared). It analyzes mean fundamental frequency (F0), pitch resets, and duration to compare disfluent syllables with surrounding fluent syllables as a function of speech preparation level. Results show that F0 is lower in filled pauses and disfluent vowel lengthenings than in fluent speech, with filled pauses produced at a lower F0 than vowel lengthenings. Larger pitch resets occur between disfluent units and their preceding contexts.

## Key Findings Relevant to L2 Turn-Taking

### Prosodic Characteristics of French Filled Pauses

#### Fundamental Frequency (F0) Drop
- **Fluent speech**: mean 19.01 ST (SD=5.97)
- **Vowel lengthening**: mean 18.10 ST (SD=6.43) -- ~1 ST drop
- **Filled pauses**: mean 17.90 ST (SD=7.31) -- ~1.1 ST drop
- F0 drop in filled pauses vs. fluent speech:
  - **Females**: 2.12 ST decrease (from 23.66 to 21.54 ST)
  - **Males**: 1.75 ST decrease (from 16.72 to 14.97 ST)
- Both disfluency types significantly lower than fluent speech (p < 0.001)
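Since the F0 statistics above are reported in semitones (ST), a conversion helper is handy when working with raw F0 in Hz. This is a standard log-frequency conversion, not code from the paper; the 100 Hz reference is an arbitrary assumption, since only F0 differences in ST matter here.

```python
import math

def hz_to_semitones(f0_hz: float, ref_hz: float = 100.0) -> float:
    """Convert F0 in Hz to semitones relative to an arbitrary reference."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def f0_drop_st(fluent_mean_hz: float, candidate_hz: float) -> float:
    """How far (in ST) a candidate syllable sits below the fluent-speech mean."""
    return hz_to_semitones(fluent_mean_hz) - hz_to_semitones(candidate_hz)

# A 1 ST drop is about a 5.6% decrease in Hz (2**(1/12) ~ 1.0595),
# so a speaker averaging 200 Hz would dip to roughly 189 Hz in a filled pause.
```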

#### F0 and Speech Preparation Level
- **Unprepared speech**: FP average F0 = 18.89 ST
- **Semi-prepared speech**: FP average F0 = 17.41 ST
- **Prepared speech**: FP average F0 = 14.24 ST
- A massive **4.66 ST drop** between unprepared and prepared speech for FPs
- Vowel lengthening does NOT follow this trend (only a 0.07 ST difference)
- Fluent speech shows a moderate 1.2 ST drop between unprepared and prepared

#### Pitch Reset (Melodic Discontinuity)
- Filled pauses create a **larger negative pitch reset** from the preceding syllable than vowel lengthening
- Both FPs and lengthenings produce significantly more negative pitch changes than fluent speech (p < 0.001)
- No significant pitch-reset difference between the disfluent unit and the FOLLOWING syllable
- Key insight: the pitch drop **before** a filled pause is a reliable acoustic cue

#### Duration
- **Vowel lengthening** is significantly LONGER than **filled pauses** (p < 0.001)
- Vowel lengthening mean: ~347ms; filled pause mean: ~268ms (from a prior French study, Grosman 2018)
- Filled pauses are shorter in prepared speech and longer in unprepared speech
- Vowel lengthening shows the opposite pattern: shorter in unprepared, longer in prepared/semi-prepared speech

### Corpus Details (LOCAS-F)
- Multi-genre French oral corpus (14 speech genres)
- Analyzed: 2h38m of recordings
- Data: >1,500 vowel prolongation sequences, >1,000 filled pauses, ~59,000 fluent syllables
- Disfluency affects ~4% of all data (typical for non-pathological speech)
- Inter-annotator agreement: kappa 0.86 for FPs (near-perfect), 0.64 for lengthening (substantial)

## Specific Hesitation Pattern Data

- French filled pauses are typically transcribed as **"euh"**
- Duration range for hesitation vowels (cross-linguistic): **200ms to 650ms** (from Vasilescu et al. 2004, 8 languages)
- French FP F0 is similar to the **onset value of breath groups** for a given speaker
- Mean FP duration in French: **268.4ms** (Grosman 2018)
- Mean vowel lengthening duration in French: **347.25ms** (Grosman 2018)

## Implications for Turn-Taking Detection in L2 Speech

1. **F0 drop is a reliable filled-pause detector**: Filled pauses in French show a consistent 1-2 ST drop in F0 compared to fluent speech. A turn-taking model can use this prosodic signature to identify FPs even without lexical recognition -- especially useful when ASR does not reliably transcribe L2 "euh" sounds.

2. **Pitch reset before FPs marks processing onset**: The significant negative pitch reset between the preceding syllable and the filled pause signals the START of a hesitation event. This could serve as an early warning that the speaker is about to hesitate but intends to continue.

3. **Preparation level affects FP prosody**: In more spontaneous speech (like real-time conversation), FPs have HIGHER F0 (closer to fluent speech). In prepared speech, FP F0 drops dramatically. Since L2 speakers in conversation are in "unprepared" mode, their FPs may be harder to distinguish from fluent speech by F0 alone.

4. **Vowel lengthening is a separate hesitation signal**: French speakers (both L1 and L2) elongate vowels at word boundaries as a hesitation strategy. These are LONGER than filled pauses (~347ms vs. ~268ms) but show a smaller F0 drop. The model should recognize elongated schwa-like sounds as hesitation, not turn completion.

5. **4% disfluency rate in native speech**: This baseline helps calibrate expectations. L2 speakers will show significantly higher rates -- potentially 8-15% or more -- so the model will encounter disfluency markers very frequently in L2 French-to-Portuguese speech.

6. **Cross-linguistic prosodic transfer**: French speakers speaking Portuguese will likely carry over their French FP prosodic patterns (F0 drop magnitude, pitch reset patterns). The model should accommodate French-influenced prosodic contours on Portuguese hesitations.

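Taken together with the F0 findings, the duration means suggest a rough two-cue discriminator between the two disfluency types. The point thresholds below are invented for illustration; a real system would model distributions rather than hard cutoffs.

```python
def hesitation_type(duration_ms: float, f0_drop_st: float) -> str:
    """Rough disfluency-type guess from the reported means (illustrative only).

    Filled pauses: mean ~268 ms, larger F0 drop (~1-2 ST).
    Vowel lengthening: mean ~347 ms, smaller F0 drop.
    The 300 ms and 1.0 ST cutoffs are hypothetical midpoints, not paper values.
    """
    if duration_ms < 300.0 and f0_drop_st >= 1.0:
        return "filled_pause"
    if duration_ms >= 300.0 and f0_drop_st < 1.0:
        return "vowel_lengthening"
    return "uncertain"
```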
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c77a54e77105dc6ad1f3306d0a991c9730f70ed327559f2686bc5897c8f6540c
size 436423

03-finetune-pipecat-pt/references/papers/hesitation-l2-french/07-ekstedt-2023-turn-taking-cues-speech-synthesis.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9422e0c9df73fd3e2a91bb7bea6eba9393ba5865ea91b36463e1d50c23c9215b
size 294389

03-finetune-pipecat-pt/references/papers/hesitation-l2-french/08-inoue-2024-realtime-turn-taking-vap.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:129a900833fb9a775f6abc8869de60f3526a3a489ed6fda5f8b267136a5a8cec
size 496267

03-finetune-pipecat-pt/references/papers/hesitation-l2-french/09-inoue-2024-multilingual-turn-taking-vap.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a12189557e7022189609c710cd8071f359ae4e05a3c755c7e1aceef1b19a86bc
size 974569

03-finetune-pipecat-pt/references/papers/hesitation-l2-french/10-hutin-2024-filled-pauses-conversational-convergence.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e48ef19c610ebb2f2dca05458c801d11f2899563ff165aa7b1a90c17a563501a
size 951784

03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.md
ADDED
@@ -0,0 +1,67 @@
---
title: "Codeswitching: A Bilingual Toolkit for Opportunistic Speech Planning"
authors:
- Anne L. Beatty-Martinez
- Christian A. Navarro-Torres
- Paola E. Dussias
year: 2020
source_url: "https://doi.org/10.3389/fpsyg.2020.01699"
date_converted: 2026-03-16
---

## Abstract

This paper reviews codeswitching as a bilingual strategy for opportunistic speech planning. Recent findings show that codeswitching is not haphazard but subject to distinct linguistic and cognitive constraints, and that bilinguals who codeswitch exhibit usage patterns conforming to community-based norms. Corpus evidence (from the Puerto Rico Codeswitching Map Task corpus of Spanish-English bilinguals) shows that codeswitching serves as a tool for navigating linguistic interference during production, letting speakers circumvent speech-planning difficulties by opportunistically drawing on whichever language is most active or accessible.

## Key Findings Relevant to L2 Turn-Taking

### Codeswitching as a Fluency Strategy
- Codeswitching is NOT random -- it is **structured and strategic**
- Functions as a tool for **opportunistic speech planning**: bilinguals take advantage of whichever language's words/structures are most active to achieve communicative goals
- Reduces the cost in time and resources during speech production
- **93% of codeswitches occur at Intonation Unit (IU) boundaries** (Plaistowe 2015, from the NMSEB corpus)
  - This means switches overwhelmingly happen at natural prosodic break points, not mid-phrase

### Prosodic Signatures of Codeswitching
- Speech rate changes before codeswitches: a momentary **reorganization of prosodic and phonetic systems**
- These changes serve a dual purpose:
  1. Help the speaker negotiate lexical competition and minimize cross-language interference
  2. Provide reliable **acoustic cues for listeners** to anticipate and process the switch
- Codeswitching affects bilingual speech at different levels: word-level (speech rate of individual words) and sentence-level (speech rate within prosodic sentences)

### Language Control Modes
- **Single-language context**: language control is COMPETITIVE -- one language is suppressed at the expense of the other
- **Codeswitching context**: language control is COOPERATIVE -- coactivation is maintained, and items from both languages remain available for selection
- Dense codeswitchers minimize language membership tagging and keep both languages active

### Corpus Data (Puerto Rico Codeswitching Map Task)
- 10 Spanish-English bilinguals (6 female), all native Spanish speakers
- Equal self-reported proficiency in both languages (9.6/10 each)
- ~2.5 hours of unscripted, task-oriented dialogs
- Participants exposed to both languages: more Spanish with family, more English in media, equal use among friends
- Paired with an in-group confederate (a close friend from the same speech community) -- this increased codeswitching 4x vs. out-group pairing

### Variable Equivalence and Switch Sites
- Bilinguals do NOT consistently avoid "conflict sites" between languages
- Instead, they **opportunistically use** sites of variable equivalence (partial structural overlap between languages)
- This challenges the strict "equivalence constraint" (Poplack 1980)

## Specific Data Points

- Self-reported proficiency: Spanish 9.6/10 (SD=0.8), English 9.6/10 (SD=0.5)
- Mean age: 23.3 years (SD=1.8)
- Codeswitching at IU boundaries: 93% (from a related corpus study)

## Implications for Turn-Taking Detection in L2 Speech

1. **Codeswitches happen at prosodic boundaries**: Since 93% of codeswitches align with intonation-unit boundaries, a turn-taking model should expect language switches at natural break points -- the same locations where turn transitions might occur. The model must not interpret a language switch as a turn-ending signal.

2. **Prosodic reorganization before switches**: The speech-rate and prosodic changes before a codeswitch could be confused with turn-ending prosody. For French speakers occasionally inserting French words while speaking Portuguese, the model needs to tolerate these prosodic perturbations without triggering a false turn-shift prediction.

3. **L1-L2 fluency trade-off**: When French speakers struggle with Portuguese lexical retrieval, they may briefly switch to French (a natural bilingual strategy). The turn-taking model should recognize this as a hold signal (the speaker is using an alternative strategy to continue) rather than a breakdown signal.

4. **Cooperative language mode in bilingual contexts**: If the conversation partner also speaks French, the French-Portuguese speaker may engage in cooperative codeswitching. The model should handle mixed-language utterances without treating language boundaries as turn boundaries.

5. **Community norms matter**: Codeswitching frequency and patterns depend heavily on the interactional context and community norms. The model should be configurable for different bilingual settings (e.g., high-codeswitch vs. monolingual-target contexts).

6. **Intonation unit alignment**: Since codeswitches cluster at IU boundaries, and IU boundaries are also key sites for turn-taking decisions, the model must weigh additional cues (gaze, content completeness, prosodic finality) at these ambiguous boundary points, where both a switch-and-continue and a turn-yield are possible.

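The need to weigh multiple cues at IU boundaries can be sketched as a toy score combiner. Everything here (the function name, weights, discount factor, and cue inputs) is a hypothetical illustration, not a method from the paper.

```python
def turn_yield_score(prosodic_finality: float,
                     content_complete: float,
                     codeswitch_detected: bool,
                     w_prosody: float = 0.5,
                     w_content: float = 0.5,
                     codeswitch_discount: float = 0.5) -> float:
    """Toy cue fusion at an intonation-unit boundary (all weights invented).

    Cue inputs are scores in [0, 1]. Because ~93% of codeswitches align
    with IU boundaries, a detected language switch discounts the yield
    score (switch-and-continue is likely) instead of counting as a turn end.
    """
    score = w_prosody * prosodic_finality + w_content * content_complete
    if codeswitch_detected:
        score *= codeswitch_discount
    # Clamp to [0, 1] so the result reads as a probability-like score.
    return max(0.0, min(1.0, score))
```

The design choice mirrors implication 1 above: a language switch never raises the yield score, it only tempers it.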
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:40ac9947f5813c936324bc2e266a1a574f1edb9d72a0d18f851bc3ddadf1d25b
size 1159649

03-finetune-pipecat-pt/references/papers/hesitation-l2-french/12-lingref-code-switching-bilinguals.pdf
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ca2aa2330e88de35a7f0254aefd3c024d14ce282f3f73ee28914f656f62f7820
size 874378
