marcosremar2 committed
Commit 3c1eb61 · verified · 1 Parent(s): 926483b

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.

Files changed (50)
  1. .gitattributes +30 -0
  2. .gitignore +11 -0
  3. 03-finetune-pipecat-pt/01_download_pipecat.py +176 -0
  4. 03-finetune-pipecat-pt/02_generate_labels.py +474 -0
  5. 03-finetune-pipecat-pt/03_generate_audio.py +697 -0
  6. 03-finetune-pipecat-pt/04_finetune.py +1202 -0
  7. 03-finetune-pipecat-pt/05_evaluate.py +511 -0
  8. 03-finetune-pipecat-pt/06_inference.py +560 -0
  9. 03-finetune-pipecat-pt/README.md +489 -0
  10. 03-finetune-pipecat-pt/modal_run.py +267 -0
  11. 03-finetune-pipecat-pt/references/blogs/assemblyai_turn_detection.md +137 -0
  12. 03-finetune-pipecat-pt/references/blogs/deepgram_eot_evaluation.md +58 -0
  13. 03-finetune-pipecat-pt/references/blogs/krisp_turn_taking_v1.md +78 -0
  14. 03-finetune-pipecat-pt/references/blogs/krisp_turn_taking_v2.md +54 -0
  15. 03-finetune-pipecat-pt/references/blogs/livekit_eot_v0_4.md +104 -0
  16. 03-finetune-pipecat-pt/references/blogs/livekit_transformer_turn_detection.md +87 -0
  17. 03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3.md +75 -0
  18. 03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3_1.md +69 -0
  19. 03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3_2.md +41 -0
  20. 03-finetune-pipecat-pt/references/blogs/speechmatics_semantic_turn.md +103 -0
  21. 03-finetune-pipecat-pt/references/guides/focal_loss_calibration_emnlp_2022.md +209 -0
  22. 03-finetune-pipecat-pt/references/guides/livekit_turn_detector.md +170 -0
  23. 03-finetune-pipecat-pt/references/guides/pipecat_data_generation_guide.md +33 -0
  24. 03-finetune-pipecat-pt/references/guides/pipecat_dataset_v3_2.md +93 -0
  25. 03-finetune-pipecat-pt/references/guides/pipecat_smart_turn_readme.md +115 -0
  26. 03-finetune-pipecat-pt/references/guides/pipecat_train_py.md +179 -0
  27. 03-finetune-pipecat-pt/references/guides/vogent_turn_80m.md +171 -0
  28. 03-finetune-pipecat-pt/references/hesitation_turn_taking_l2_review.md +453 -0
  29. 03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.md +158 -0
  30. 03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.pdf +3 -0
  31. 03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.md +103 -0
  32. 03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.pdf +3 -0
  33. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.md +74 -0
  34. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.pdf +3 -0
  35. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.md +68 -0
  36. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.pdf +3 -0
  37. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.md +77 -0
  38. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.pdf +3 -0
  39. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.md +76 -0
  40. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.pdf +3 -0
  41. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/05-kosmala-2023-filled-pauses-pragmatic-markers-gaze-gesture.pdf +3 -0
  42. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.md +76 -0
  43. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.pdf +3 -0
  44. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/07-ekstedt-2023-turn-taking-cues-speech-synthesis.pdf +3 -0
  45. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/08-inoue-2024-realtime-turn-taking-vap.pdf +3 -0
  46. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/09-inoue-2024-multilingual-turn-taking-vap.pdf +3 -0
  47. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/10-hutin-2024-filled-pauses-conversational-convergence.pdf +3 -0
  48. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.md +67 -0
  49. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.pdf +3 -0
  50. 03-finetune-pipecat-pt/references/papers/hesitation-l2-french/12-lingref-code-switching-bilinguals.pdf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,33 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/05-kosmala-2023-filled-pauses-pragmatic-markers-gaze-gesture.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/07-ekstedt-2023-turn-taking-cues-speech-synthesis.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/08-inoue-2024-realtime-turn-taking-vap.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/09-inoue-2024-multilingual-turn-taking-vap.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/10-hutin-2024-filled-pauses-conversational-convergence.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/12-lingref-code-switching-bilinguals.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/13-christodoulides-avanzi-2014-DisMo-disfluency-french.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/14-jouvet-2019-speech-processing-prosody.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/15-filler-particles-cross-linguistic-heritage-2024.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/16-hal-peters-2017-L2-fluency-french.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/17-christodoulides-2015-automatic-disfluency-detection-french.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/hesitation-l2-french/19-hal-multimodal-self-other-repair-L2.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/conversar_mixed_reality_l2.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/hesitation_tagging_l2_whisper_lora.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/iwsds2025_survey_turn_taking.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/multilingual_vap_2024.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/language-learning-turn-taking/speak_improve_corpus_2025.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/soda_dialog_distillation_2023.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/speculative_etd_2025.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/turn_taking_review_skantze_2021.pdf filter=lfs diff=lfs merge=lfs -text
+03-finetune-pipecat-pt/references/papers/vap_turn_taking_ekstedt_2024.pdf filter=lfs diff=lfs merge=lfs -text
+previous-experiments/01-benchmarks/report/figures/radar_chart.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,11 @@
+data/
+checkpoints/
+__pycache__/
+*.pyc
+.deploy_state.json
+benchmark.log
+*.egg-info/
+hf_cache/
+results/
+*.pt
+*.onnx
03-finetune-pipecat-pt/01_download_pipecat.py ADDED
@@ -0,0 +1,176 @@ (new file; shown as plain source below)
"""Download Pipecat Smart Turn v3 dataset + model architecture.

The Pipecat model is only published as ONNX (no PyTorch weights).
So we initialize from openai/whisper-tiny and train using their dataset,
which already includes Portuguese samples.

This script:
1. Downloads Pipecat's training dataset (270K samples, 23 langs)
2. Filters Portuguese samples
3. Downloads the ONNX model for reference/benchmarking
4. Saves Portuguese subset locally
"""

from __future__ import annotations

import json
import logging
from pathlib import Path

log = logging.getLogger(__name__)

DATA_DIR = Path(__file__).parent / "data"
CACHE_DIR = Path("/workspace/hf_cache") if Path("/workspace").exists() else DATA_DIR / "hf_cache"


def download_pipecat_dataset(max_pt_samples: int = 5000) -> Path:
    """Download Pipecat v3.2 training data and extract Portuguese samples."""
    from datasets import load_dataset

    DATA_DIR.mkdir(parents=True, exist_ok=True)

    # Download full dataset (streaming to avoid 37GB download)
    log.info("Loading Pipecat Smart Turn v3.2 training data (streaming)...")
    ds = load_dataset(
        "pipecat-ai/smart-turn-data-v3.2-train",
        split="train",
        streaming=True,
        cache_dir=str(CACHE_DIR),
    )

    # Filter Portuguese samples
    pt_samples = []
    other_count = 0
    for row in ds:
        lang = row.get("language", "")
        if lang == "por":
            pt_samples.append({
                "id": row["id"],
                "language": lang,
                "endpoint_bool": row["endpoint_bool"],
                "midfiller": row.get("midfiller", False),
                "endfiller": row.get("endfiller", False),
                "synthetic": row.get("synthetic", True),
                "dataset": row.get("dataset", ""),
                "spoken_text": row.get("spoken_text", ""),
                "audio": row["audio"],
            })
            if len(pt_samples) % 100 == 0:
                log.info(" Found %d Portuguese samples (scanned %d others)",
                         len(pt_samples), other_count)
            if len(pt_samples) >= max_pt_samples:
                break
        else:
            other_count += 1

    log.info("Total: %d Portuguese samples found (scanned %d other languages)",
             len(pt_samples), other_count)

    # Save metadata (without audio)
    meta_path = DATA_DIR / "pipecat_pt_metadata.json"
    meta = [{k: v for k, v in s.items() if k != "audio"} for s in pt_samples]
    with open(meta_path, "w") as f:
        json.dump(meta, f, indent=2, ensure_ascii=False)
    log.info("Metadata saved to %s", meta_path)

    # Save audio files
    audio_dir = DATA_DIR / "pipecat_pt_audio"
    audio_dir.mkdir(exist_ok=True)

    import soundfile as sf
    import numpy as np

    complete = 0
    incomplete = 0
    for s in pt_samples:
        audio_data = s["audio"]
        audio = np.array(audio_data["array"], dtype=np.float32)
        sr = audio_data["sampling_rate"]

        label = "complete" if s["endpoint_bool"] else "incomplete"
        if s["endpoint_bool"]:
            complete += 1
        else:
            incomplete += 1

        out_path = audio_dir / f"{s['id']}_{label}.wav"
        sf.write(str(out_path), audio, sr)

    log.info("Audio saved: %d complete, %d incomplete → %s",
             complete, incomplete, audio_dir)

    return audio_dir


def download_pipecat_test_data(max_pt_samples: int = 2000) -> Path:
    """Download Pipecat test data for evaluation."""
    from datasets import load_dataset

    log.info("Loading Pipecat Smart Turn v3.2 test data (streaming)...")
    ds = load_dataset(
        "pipecat-ai/smart-turn-data-v3.2-test",
        split="train",
        streaming=True,
        cache_dir=str(CACHE_DIR),
    )

    pt_samples = []
    for row in ds:
        if row.get("language", "") == "por":
            pt_samples.append(row)
            if len(pt_samples) >= max_pt_samples:
                break

    log.info("Found %d Portuguese test samples", len(pt_samples))

    # Save
    test_dir = DATA_DIR / "pipecat_pt_test"
    test_dir.mkdir(exist_ok=True)

    import soundfile as sf
    import numpy as np

    for s in pt_samples:
        audio = np.array(s["audio"]["array"], dtype=np.float32)
        sr = s["audio"]["sampling_rate"]
        label = "complete" if s["endpoint_bool"] else "incomplete"
        sf.write(str(test_dir / f"{s['id']}_{label}.wav"), audio, sr)

    log.info("Test audio saved → %s", test_dir)
    return test_dir


def download_onnx_model() -> Path:
    """Download Pipecat ONNX model for benchmarking."""
    from huggingface_hub import hf_hub_download

    model_dir = DATA_DIR / "pipecat_model"
    model_dir.mkdir(exist_ok=True)

    for fname in ["smart-turn-v3.2-cpu.onnx", "smart-turn-v3.2-gpu.onnx"]:
        path = hf_hub_download(
            "pipecat-ai/smart-turn-v3",
            fname,
            local_dir=str(model_dir),
        )
        log.info("Downloaded %s → %s", fname, path)

    return model_dir


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )

    log.info("=== Step 1: Download Pipecat Portuguese training data ===")
    download_pipecat_dataset(max_pt_samples=5000)

    log.info("\n=== Step 2: Download Pipecat Portuguese test data ===")
    download_pipecat_test_data(max_pt_samples=2000)

    log.info("\n=== Step 3: Download ONNX model for benchmarking ===")
    download_onnx_model()

    log.info("\nDone!")
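
A quick way to sanity-check what this script wrote (a minimal sketch, not part of the commit; it assumes the default layout used above, i.e. data/pipecat_pt_metadata.json next to data/pipecat_pt_audio/):

# Sketch only: load the metadata written by 01_download_pipecat.py and open a few WAVs.
import json
from pathlib import Path

import soundfile as sf

DATA_DIR = Path("03-finetune-pipecat-pt/data")  # adjust if DATA_DIR resolves elsewhere

with open(DATA_DIR / "pipecat_pt_metadata.json") as f:
    metadata = json.load(f)

# Each WAV is named "<id>_<label>.wav", so the label is recoverable from the
# filename as well as from "endpoint_bool" in the metadata.
for entry in metadata[:5]:
    label = "complete" if entry["endpoint_bool"] else "incomplete"
    audio, sr = sf.read(DATA_DIR / "pipecat_pt_audio" / f"{entry['id']}_{label}.wav")
    print(entry["id"], label, round(len(audio) / sr, 2), "s")
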
03-finetune-pipecat-pt/02_generate_labels.py ADDED
@@ -0,0 +1,474 @@ (new file; shown as plain source below)
"""Generate high-quality turn-taking labels using Claude API.

Based on the Pipecat v3.1 pipeline:
1. Load Portuguese transcripts from CORAA MuPe + NURC-SP
2. Claude filters bad/ambiguous sentences (Pipecat: Gemini removed 50-80%)
3. Claude classifies: COMPLETE vs INCOMPLETE (semantic, not punctuation-based)
4. Claude inserts Brazilian Portuguese fillers
5. Claude generates French-accented Portuguese variants

Uses Claude Haiku for cost efficiency (~$0.001 per 20 sentences).
"""

from __future__ import annotations

import json
import logging
import os
import re
import time
from dataclasses import dataclass, field
from pathlib import Path

log = logging.getLogger(__name__)

DATA_DIR = Path(__file__).parent / "data"
CACHE_DIR = Path("/workspace/hf_cache") if Path("/workspace").exists() else DATA_DIR / "hf_cache"

# Brazilian Portuguese fillers (from research: Claude + GPT generate these for Pipecat)
PT_BR_FILLERS = [
    "hum", "eh", "ah", "tipo", "ne", "entao", "assim", "quer dizer",
    "como e que eu falo", "deixa eu pensar", "olha", "bom", "pois e",
    "sabe", "veja bem", "na verdade", "digamos", "e...",
]

# French speaker fillers when speaking Portuguese
FR_PT_FILLERS = [
    "euh", "alors", "comment dire", "como se diz", "enfin",
    "c'est-a-dire", "voila", "bon", "donc",
]


@dataclass
class LabeledSentence:
    text: str
    label: str  # "complete" or "incomplete"
    confidence: float  # 0-1
    source: str  # "coraa", "mupe", "claude_generated"
    has_filler: bool = False
    filler_type: str = ""  # "pt_br" or "fr_pt"
    original_text: str = ""


def load_coraa_transcripts(max_samples: int = 10000) -> list[dict]:
    """Load transcripts from CORAA MuPe (365h interviews, diarization kappa 0.947)."""
    from datasets import load_dataset

    log.info("Loading CORAA MuPe transcripts (streaming)...")
    ds = load_dataset(
        "nilc-nlp/CORAA-MUPE-ASR",
        split="train",
        streaming=True,
        cache_dir=str(CACHE_DIR),
    )

    transcripts = []
    for i, row in enumerate(ds):
        text = str(row.get("text", row.get("normalized_text", "")))
        if not text or len(text.split()) < 3:
            continue

        transcripts.append({
            "text": text,
            "speaker_type": row.get("speaker_type", ""),
            "speaker_code": str(row.get("speaker_code", f"mupe_{i}")),
            "source": "mupe",
        })

        if len(transcripts) >= max_samples:
            break

        if i % 5000 == 0 and i > 0:
            log.info(" Scanned %d rows, collected %d transcripts", i, len(transcripts))

    log.info("CORAA MuPe: %d transcripts loaded", len(transcripts))
    return transcripts


def load_nurc_transcripts(max_samples: int = 10000) -> list[dict]:
    """Load transcripts from CORAA NURC-SP (239h dialogues, speaker IDs)."""
    from datasets import load_dataset

    log.info("Loading CORAA NURC-SP transcripts (streaming)...")
    try:
        ds = load_dataset(
            "nilc-nlp/CORAA-NURC-SP-Audio-Corpus",
            split="train",
            streaming=True,
            cache_dir=str(CACHE_DIR),
        )
    except Exception as e:
        log.warning("Failed to load NURC-SP: %s", e)
        return []

    transcripts = []
    for i, row in enumerate(ds):
        text = str(row.get("text", row.get("sentence", "")))
        if not text or len(text.split()) < 3:
            continue

        transcripts.append({
            "text": text,
            "speaker_type": row.get("speech_genre", ""),
            "speaker_code": str(row.get("speaker_id", f"nurc_{i}")),
            "source": "nurc_sp",
        })

        if len(transcripts) >= max_samples:
            break

    log.info("NURC-SP: %d transcripts loaded", len(transcripts))
    return transcripts


def classify_with_claude(
    sentences: list[str],
    batch_size: int = 20,
    model: str = "claude-haiku-4-5-20251001",
) -> list[dict]:
    """Use Claude to classify sentences as complete/incomplete.

    Based on Pipecat's approach: Gemini 2.5 Flash filtered 50-80% of sentences.
    We use Claude Haiku for cost efficiency.

    Returns list of {text, label, confidence, keep}.
    """
    import anthropic

    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        log.warning("ANTHROPIC_API_KEY not set — using rule-based fallback")
        return _rule_based_classify(sentences)

    client = anthropic.Anthropic(api_key=api_key)
    results = []

    for batch_start in range(0, len(sentences), batch_size):
        batch = sentences[batch_start:batch_start + batch_size]
        numbered = "\n".join(f"{i+1}. {s}" for i, s in enumerate(batch))

        prompt = f"""Voce e um anotador de turn-taking para portugues brasileiro.
Para cada frase abaixo, classifique:
- COMPLETO: o falante terminou o pensamento (pode comecar a traduzir)
- INCOMPLETO: o falante vai continuar falando (nao traduzir ainda)
- RUIM: frase com erro, ambigua, ou inutilizavel (descartar)

Responda APENAS em JSON, sem explicacao:
[{{"n": 1, "label": "COMPLETO", "confidence": 0.95}}, ...]

Frases:
{numbered}"""

        try:
            response = client.messages.create(
                model=model,
                max_tokens=2000,
                messages=[{"role": "user", "content": prompt}],
            )
            text = response.content[0].text

            # Parse JSON from response
            json_match = re.search(r'\[.*\]', text, re.DOTALL)
            if json_match:
                batch_results = json.loads(json_match.group())
                for item in batch_results:
                    idx = item["n"] - 1
                    if 0 <= idx < len(batch):
                        results.append({
                            "text": batch[idx],
                            "label": item["label"].lower(),
                            "confidence": item.get("confidence", 0.5),
                            "keep": item["label"].upper() != "RUIM",
                        })
            else:
                log.warning("No JSON in Claude response for batch %d", batch_start)
                results.extend(_rule_based_classify(batch))

        except Exception as e:
            log.warning("Claude API error at batch %d: %s — falling back to rules", batch_start, e)
            results.extend(_rule_based_classify(batch))

        # Rate limiting
        if batch_start % 100 == 0 and batch_start > 0:
            log.info(" Classified %d/%d sentences", batch_start, len(sentences))
        time.sleep(0.5)

    return results


def _rule_based_classify(sentences: list[str]) -> list[dict]:
    """Fallback rule-based classification (less accurate than Claude)."""
    results = []
    for s in sentences:
        text = s.strip()
        if not text or len(text) < 5:
            results.append({"text": text, "label": "ruim", "confidence": 0.5, "keep": False})
            continue

        if re.search(r'[.!?]+\s*$', text):
            results.append({"text": text, "label": "completo", "confidence": 0.7, "keep": True})
        elif re.search(r'[,;:\-]\s*$', text):
            results.append({"text": text, "label": "incompleto", "confidence": 0.6, "keep": True})
        elif re.search(r'\b(e|que|mas|porque|quando|se|como|pra|para)\s*$', text, re.I):
            results.append({"text": text, "label": "incompleto", "confidence": 0.8, "keep": True})
        else:
            results.append({"text": text, "label": "completo", "confidence": 0.5, "keep": True})

    return results


def insert_fillers_with_claude(
    sentences: list[str],
    filler_type: str = "pt_br",
    model: str = "claude-haiku-4-5-20251001",
) -> list[dict]:
    """Use Claude to insert fillers at natural points.

    Based on Pipecat: Gemini Flash inserted fillers at natural break points.

    filler_type: "pt_br" for native Brazilian, "fr_pt" for French-accented
    """
    import anthropic

    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        log.warning("ANTHROPIC_API_KEY not set — using simple filler insertion")
        return _simple_filler_insert(sentences, filler_type)

    client = anthropic.Anthropic(api_key=api_key)
    fillers = PT_BR_FILLERS if filler_type == "pt_br" else FR_PT_FILLERS

    filler_list = ", ".join(f'"{f}"' for f in fillers)

    if filler_type == "fr_pt":
        context = """O falante e um frances de nivel B1 falando portugues.
Ele hesita ao conjugar verbos, busca palavras, e as vezes usa fillers em frances.
Insira hesitacoes naturais como: """ + filler_list
    else:
        context = """O falante e brasileiro nativo.
Insira fillers naturais do portugues brasileiro como: """ + filler_list

    results = []
    batch_size = 15

    for batch_start in range(0, len(sentences), batch_size):
        batch = sentences[batch_start:batch_start + batch_size]
        numbered = "\n".join(f"{i+1}. {s}" for i, s in enumerate(batch))

        prompt = f"""{context}

Para cada frase, crie uma versao com 1-2 fillers inseridos em pontos naturais.
A frase deve parecer INCOMPLETA (como se o falante fosse continuar).

Responda APENAS em JSON:
[{{"n": 1, "original": "frase original", "with_filler": "frase com filler"}}]

Frases:
{numbered}"""

        try:
            response = client.messages.create(
                model=model,
                max_tokens=3000,
                messages=[{"role": "user", "content": prompt}],
            )
            text = response.content[0].text
            json_match = re.search(r'\[.*\]', text, re.DOTALL)
            if json_match:
                batch_results = json.loads(json_match.group())
                for item in batch_results:
                    results.append({
                        "original": item.get("original", ""),
                        "with_filler": item.get("with_filler", ""),
                        "filler_type": filler_type,
                    })

        except Exception as e:
            log.warning("Claude filler API error: %s", e)
            results.extend(_simple_filler_insert(batch, filler_type))

        time.sleep(0.3)

    return results


def _simple_filler_insert(sentences: list[str], filler_type: str) -> list[dict]:
    """Simple rule-based filler insertion fallback."""
    import random

    fillers = PT_BR_FILLERS if filler_type == "pt_br" else FR_PT_FILLERS
    results = []

    for s in sentences:
        words = s.split()
        if len(words) < 4:
            results.append({"original": s, "with_filler": s, "filler_type": filler_type})
            continue

        # Insert filler at ~40% of sentence
        pos = max(1, len(words) * 2 // 5)
        filler = random.choice(fillers)
        words.insert(pos, f"{filler}...")
        results.append({
            "original": s,
            "with_filler": " ".join(words),
            "filler_type": filler_type,
        })

    return results


def generate_french_portuguese_sentences(
    n_sentences: int = 500,
    model: str = "claude-haiku-4-5-20251001",
) -> list[dict]:
    """Generate sentences typical of French speakers learning Portuguese.

    Based on Pipecat method: use LLM to generate diverse training text.
    """
    import anthropic

    api_key = os.environ.get("ANTHROPIC_API_KEY")
    if not api_key:
        log.warning("ANTHROPIC_API_KEY not set — cannot generate French-PT sentences")
        return []

    client = anthropic.Anthropic(api_key=api_key)
    results = []

    contexts = [
        "reuniao de trabalho",
        "conversa informal com amigos",
        "apresentacao de projeto",
        "negociacao comercial",
        "aula de portugues",
        "pedindo informacoes na rua",
        "restaurante pedindo comida",
        "entrevista de emprego",
        "ligacao telefonica profissional",
        "discussao tecnica sobre software",
    ]

    for ctx in contexts:
        prompt = f"""Gere {n_sentences // len(contexts)} frases que um FRANCES de nivel B1-B2 diria em portugues durante: {ctx}

Inclua variedade:
- Frases COMPLETAS (pensamento terminado, pode traduzir)
- Frases INCOMPLETAS (vai continuar falando, nao traduzir)
- Com hesitacoes tipicas (euh, alors, como se diz, tipo)
- Com erros de conjugacao comuns de franceses
- Com code-switching involuntario (palavra em frances no meio)

Responda em JSON:
[{{"text": "frase", "label": "completo" ou "incompleto", "notes": "o que torna dificil classificar"}}]"""

        try:
            response = client.messages.create(
                model=model,
                max_tokens=4000,
                messages=[{"role": "user", "content": prompt}],
            )
            text = response.content[0].text
            json_match = re.search(r'\[.*\]', text, re.DOTALL)
            if json_match:
                batch = json.loads(json_match.group())
                for item in batch:
                    results.append({
                        "text": item["text"],
                        "label": item["label"],
                        "source": "claude_fr_pt",
                        "context": ctx,
                        "notes": item.get("notes", ""),
                    })
                log.info(" Generated %d sentences for context: %s", len(batch), ctx)

        except Exception as e:
            log.warning("Error generating for %s: %s", ctx, e)

        time.sleep(1)

    log.info("Total French-PT sentences generated: %d", len(results))
    return results


def run_full_pipeline(
    max_transcripts: int = 5000,
    max_fr_sentences: int = 500,
) -> Path:
    """Run the full label generation pipeline."""
    output_dir = DATA_DIR / "claude_labeled"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: Load transcripts
    log.info("=== Step 1: Loading Portuguese transcripts ===")
    mupe = load_coraa_transcripts(max_samples=max_transcripts)
    nurc = load_nurc_transcripts(max_samples=max_transcripts)
    all_transcripts = mupe + nurc
    log.info("Total transcripts: %d (MuPe=%d, NURC=%d)", len(all_transcripts), len(mupe), len(nurc))

    # Step 2: Claude classifies
    log.info("\n=== Step 2: Claude classifies complete/incomplete ===")
    sentences = [t["text"] for t in all_transcripts]
    classified = classify_with_claude(sentences)

    # Filter bad sentences (Pipecat removed 50-80%)
    kept = [c for c in classified if c["keep"]]
    removed = len(classified) - len(kept)
    log.info("Classified: %d total, %d kept, %d removed (%.0f%%)",
             len(classified), len(kept), removed, 100 * removed / max(len(classified), 1))

    complete = [c for c in kept if c["label"] == "completo"]
    incomplete = [c for c in kept if c["label"] == "incompleto"]
    log.info(" Complete: %d, Incomplete: %d", len(complete), len(incomplete))

    # Save classified
    with open(output_dir / "classified_pt.json", "w") as f:
        json.dump(kept, f, indent=2, ensure_ascii=False)

    # Step 3: Insert fillers (creates INCOMPLETE variants)
    log.info("\n=== Step 3: Insert fillers (PT-BR + FR-PT) ===")
    complete_texts = [c["text"] for c in complete[:1000]]

    pt_fillers = insert_fillers_with_claude(complete_texts, filler_type="pt_br")
    fr_fillers = insert_fillers_with_claude(complete_texts[:500], filler_type="fr_pt")

    with open(output_dir / "fillers_pt_br.json", "w") as f:
        json.dump(pt_fillers, f, indent=2, ensure_ascii=False)
    with open(output_dir / "fillers_fr_pt.json", "w") as f:
        json.dump(fr_fillers, f, indent=2, ensure_ascii=False)

    log.info("Fillers: %d PT-BR, %d FR-PT", len(pt_fillers), len(fr_fillers))

    # Step 4: Generate French-Portuguese sentences
    log.info("\n=== Step 4: Generate French-Portuguese sentences ===")
    fr_sentences = generate_french_portuguese_sentences(n_sentences=max_fr_sentences)
    with open(output_dir / "french_portuguese.json", "w") as f:
        json.dump(fr_sentences, f, indent=2, ensure_ascii=False)

    # Summary
    summary = {
        "total_transcripts": len(all_transcripts),
        "classified_kept": len(kept),
        "classified_removed": removed,
        "complete": len(complete),
        "incomplete": len(incomplete),
        "fillers_pt_br": len(pt_fillers),
        "fillers_fr_pt": len(fr_fillers),
        "french_portuguese": len(fr_sentences),
    }
    with open(output_dir / "summary.json", "w") as f:
        json.dump(summary, f, indent=2)

    log.info("\n=== Summary ===")
    for k, v in summary.items():
        log.info(" %s: %d", k, v)

    return output_dir


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )
    run_full_pipeline(max_transcripts=5000, max_fr_sentences=500)
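
For reference, a minimal sketch (not part of the commit) of inspecting the files this pipeline writes to data/claude_labeled/; the field names ("text", "label", "confidence", "keep") follow the script above:

# Sketch only: report the label balance produced by 02_generate_labels.py.
import json
from collections import Counter
from pathlib import Path

labeled_dir = Path("03-finetune-pipecat-pt/data/claude_labeled")  # assumed default location
with open(labeled_dir / "classified_pt.json") as f:
    classified = json.load(f)

counts = Counter(item["label"] for item in classified)
mean_conf = sum(item["confidence"] for item in classified) / max(len(classified), 1)
print("kept sentences:", len(classified))
print("label balance:", dict(counts))   # e.g. {"completo": ..., "incompleto": ...}
print("mean confidence:", round(mean_conf, 3))
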
03-finetune-pipecat-pt/03_generate_audio.py ADDED
@@ -0,0 +1,697 @@ (new file; shown as plain source below)
"""Generate TTS audio for the turn-taking dataset.

Based on the Pipecat v3.1 pipeline:
1. Native PT-BR voices via Kokoro (pf_dora, pm_alex, pm_santa)
2. French-accented Portuguese via XTTS v2 voice cloning
3. Speed/pitch variation + noise augmentation
4. Hesitation pause injection (1.5-3s silence after fillers) — SpeculativeETD V3
5. Short utterance dataset ("sim", "nao", "ok") — Pipecat v3.2 (-40% errors)

Pipecat used Google Chirp3 TTS; we use Kokoro (open-source, runs locally)
+ XTTS v2 (voice cloning for accent simulation).

Run locally or on Modal GPU:
    python 03_generate_audio.py
"""

from __future__ import annotations

import json
import logging
import random
import re
from dataclasses import dataclass
from pathlib import Path

import numpy as np

log = logging.getLogger(__name__)

DATA_DIR = Path(__file__).parent / "data"
SAMPLE_RATE = 16000  # Pipecat model expects 16kHz


@dataclass
class AudioSample:
    """A single audio sample for the dataset."""
    audio: np.ndarray  # float32, 16kHz
    text: str
    label: str  # "complete" or "incomplete"
    voice: str  # e.g. "pf_dora", "fr_clone_1"
    accent: str  # "native_pt_br" or "french_pt"
    source: str  # "kokoro", "xtts", "pipecat"
    speed: float = 1.0


# ---------------------------------------------------------------------------
# Kokoro TTS — native PT-BR voices
# ---------------------------------------------------------------------------

def generate_kokoro_audio(
    sentences: list[dict],
    voices: list[str] | None = None,
    speed_range: tuple[float, float] = (0.85, 1.15),
) -> list[AudioSample]:
    """Generate audio using Kokoro TTS (PT-BR voices).

    sentences: list of {"text": str, "label": "complete"|"incomplete"}
    voices: Kokoro voice IDs (default: PT-BR voices)
    """
    try:
        from kokoro import KPipeline
    except ImportError:
        log.error("kokoro not installed — pip install kokoro")
        return []

    if voices is None:
        voices = ["pf_dora", "pm_alex", "pm_santa"]

    log.info("Initializing Kokoro pipeline (lang=pt)...")
    pipeline = KPipeline(lang_code="p")  # Portuguese

    samples = []
    errors = 0

    for i, sent in enumerate(sentences):
        text = sent["text"]
        label = sent["label"]
        voice = random.choice(voices)
        speed = random.uniform(*speed_range)

        try:
            # Generate audio
            generator = pipeline(text, voice=voice, speed=speed)
            audio_chunks = []
            for _, _, chunk in generator:
                audio_chunks.append(chunk.numpy() if hasattr(chunk, 'numpy') else np.array(chunk))

            if not audio_chunks:
                errors += 1
                continue

            audio = np.concatenate(audio_chunks).astype(np.float32)

            # Resample to 16kHz if needed (Kokoro outputs 24kHz)
            if hasattr(pipeline, 'sample_rate') and pipeline.sample_rate != SAMPLE_RATE:
                import torchaudio
                import torch
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(
                    tensor, pipeline.sample_rate, SAMPLE_RATE
                ).squeeze().numpy()

            # Normalize
            peak = np.max(np.abs(audio))
            if peak > 0:
                audio = audio / peak * 0.9

            samples.append(AudioSample(
                audio=audio,
                text=text,
                label=label,
                voice=voice,
                accent="native_pt_br",
                source="kokoro",
                speed=speed,
            ))

        except Exception as e:
            errors += 1
            if errors <= 5:
                log.warning("Kokoro error on sentence %d: %s", i, e)

        if (i + 1) % 100 == 0:
            log.info(" Kokoro: %d/%d generated (%d errors)", len(samples), i + 1, errors)

    log.info("Kokoro: %d samples generated (%d errors)", len(samples), errors)
    return samples


# ---------------------------------------------------------------------------
# XTTS v2 — French-accented Portuguese via voice cloning
# ---------------------------------------------------------------------------

def generate_xtts_audio(
    sentences: list[dict],
    reference_audio_dir: Path | None = None,
    speed_range: tuple[float, float] = (0.9, 1.1),
) -> list[AudioSample]:
    """Generate audio with French accent using XTTS v2 voice cloning.

    Uses short French speech samples as reference to clone the accent,
    then synthesizes Portuguese text with that voice.

    sentences: list of {"text": str, "label": "complete"|"incomplete"}
    reference_audio_dir: directory with .wav files of French speakers
    """
    try:
        from TTS.api import TTS
    except ImportError:
        log.error("TTS (coqui) not installed — pip install TTS")
        return []

    log.info("Initializing XTTS v2...")
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    # Find reference audio files (French speakers)
    ref_files = []
    if reference_audio_dir and reference_audio_dir.exists():
        ref_files = list(reference_audio_dir.glob("*.wav"))

    if not ref_files:
        log.warning("No French reference audio found in %s — using default voice", reference_audio_dir)
        # Fall back to generating without cloning (still works, but no accent)
        return _generate_xtts_no_clone(tts, sentences, speed_range)

    log.info("Found %d French reference voices", len(ref_files))

    samples = []
    errors = 0

    for i, sent in enumerate(sentences):
        text = sent["text"]
        label = sent["label"]
        ref = random.choice(ref_files)

        try:
            audio = tts.tts(
                text=text,
                speaker_wav=str(ref),
                language="pt",
            )
            audio = np.array(audio, dtype=np.float32)

            # Resample to 16kHz if needed
            if hasattr(tts, 'synthesizer') and tts.synthesizer.output_sample_rate != SAMPLE_RATE:
                import torchaudio
                import torch
                tensor = torch.from_numpy(audio).float().unsqueeze(0)
                audio = torchaudio.functional.resample(
                    tensor, tts.synthesizer.output_sample_rate, SAMPLE_RATE
                ).squeeze().numpy()

            # Normalize
            peak = np.max(np.abs(audio))
            if peak > 0:
                audio = audio / peak * 0.9

            # Random speed variation
            speed = random.uniform(*speed_range)
            if abs(speed - 1.0) > 0.05:
                indices = np.arange(0, len(audio), speed).astype(int)
                indices = indices[indices < len(audio)]
                audio = audio[indices]

            samples.append(AudioSample(
                audio=audio,
                text=text,
                label=label,
                voice=f"fr_clone_{ref.stem}",
                accent="french_pt",
                source="xtts",
                speed=speed,
            ))

        except Exception as e:
            errors += 1
            if errors <= 5:
                log.warning("XTTS error on sentence %d: %s", i, e)

        if (i + 1) % 50 == 0:
            log.info(" XTTS: %d/%d generated (%d errors)", len(samples), i + 1, errors)

    log.info("XTTS: %d samples generated (%d errors)", len(samples), errors)
    return samples


def _generate_xtts_no_clone(
    tts,
    sentences: list[dict],
    speed_range: tuple[float, float],
) -> list[AudioSample]:
    """XTTS generation without voice cloning (fallback)."""
    samples = []
    errors = 0

    for i, sent in enumerate(sentences):
        try:
            audio = tts.tts(text=sent["text"], language="pt")
            audio = np.array(audio, dtype=np.float32)

            peak = np.max(np.abs(audio))
            if peak > 0:
                audio = audio / peak * 0.9

            samples.append(AudioSample(
                audio=audio,
                text=sent["text"],
                label=sent["label"],
                voice="xtts_default",
                accent="french_pt",
                source="xtts",
                speed=1.0,
            ))
        except Exception as e:
            errors += 1
            if errors <= 3:
                log.warning("XTTS fallback error: %s", e)

    log.info("XTTS (no clone): %d samples (%d errors)", len(samples), errors)
    return samples


# ---------------------------------------------------------------------------
# Hesitation pause injection (SpeculativeETD V3 — best data variant)
# ---------------------------------------------------------------------------

def inject_hesitation_pause(
    audio: np.ndarray,
    pause_duration_range: tuple[float, float] = (1.5, 3.0),
    position: str = "end",
) -> np.ndarray:
    """Inject a realistic hesitation pause into audio.

    SpeculativeETD V3 showed that inserting 1.5-3.0s pauses after fillers
    was the most effective data variant for training ETD models.

    For French speakers learning Portuguese, these long pauses happen when:
    - Searching for a word ("Eu preciso de... [2s pause] ...uma tesoura")
    - Thinking about conjugation ("Ontem eu... [2s pause] ...fui ao mercado")
    - Code-switching hesitation ("Eu gosto de... [1.5s] ...euh... praia")

    position: "end" = pause at end (simulates mid-utterance stop)
              "mid" = pause in the middle (simulates hesitation)
    """
    pause_s = random.uniform(*pause_duration_range)
    pause_samples = int(pause_s * SAMPLE_RATE)

    # Add very low-level noise to the pause (not pure silence — more realistic)
    pause = np.random.randn(pause_samples).astype(np.float32) * 0.001

    if position == "end":
        # Pause at end: speaker stops mid-sentence (INCOMPLETE)
        return np.concatenate([audio, pause])
    else:
        # Pause in middle: split audio and insert pause
        if len(audio) < SAMPLE_RATE:  # too short to split
            return np.concatenate([audio, pause])
        split_point = random.randint(len(audio) // 3, 2 * len(audio) // 3)
        return np.concatenate([audio[:split_point], pause, audio[split_point:]])


def create_hesitation_variants(
    samples: list[AudioSample],
    fraction: float = 0.3,
) -> list[AudioSample]:
    """Create hesitation-pause variants from existing INCOMPLETE samples.

    Takes a fraction of incomplete samples and adds 1.5-3s pauses,
    simulating French speakers hesitating in Portuguese.
    """
    incomplete = [s for s in samples if s.label == "incomplete"]
    n_variants = int(len(incomplete) * fraction)

    variants = []
    for s in random.sample(incomplete, min(n_variants, len(incomplete))):
        # Variant 1: long pause at end (speaker stopped to think)
        audio_end = inject_hesitation_pause(s.audio, position="end")
        variants.append(AudioSample(
            audio=audio_end,
            text=s.text,
            label="incomplete",  # still incomplete — speaker will continue
            voice=s.voice,
            accent=s.accent,
            source=f"{s.source}_hesitation_end",
            speed=s.speed,
        ))

        # Variant 2: pause in the middle (thinking mid-sentence)
        if random.random() < 0.5:
            audio_mid = inject_hesitation_pause(s.audio, position="mid")
            variants.append(AudioSample(
                audio=audio_mid,
                text=s.text,
                label="incomplete",
                voice=s.voice,
                accent=s.accent,
                source=f"{s.source}_hesitation_mid",
                speed=s.speed,
            ))

    log.info("Created %d hesitation-pause variants from %d incomplete samples",
             len(variants), len(incomplete))
    return variants


# ---------------------------------------------------------------------------
# Short utterance generation (Pipecat v3.2: -40% errors on short responses)
# ---------------------------------------------------------------------------

# Common short Portuguese responses in meetings
SHORT_UTTERANCES_COMPLETE = [
    "Sim.", "Não.", "Ok.", "Tá.", "Pode.", "Beleza.", "Certo.", "Claro.",
    "Entendi.", "Combinado.", "Perfeito.", "Exato.", "Isso.", "Verdade.",
    "Com certeza.", "Sem dúvida.", "Tá bom.", "Pode ser.", "Vamos lá.",
    "Concordo.", "Fechado.", "Ótimo.", "Legal.", "Tranquilo.", "Valeu.",
    "Obrigado.", "De nada.", "Até logo.", "Tchau.", "Bom dia.", "Boa tarde.",
    # French speaker variants
    "Oui... quer dizer, sim.", "Sim, euh, concordo.", "Ok, d'accord.",
    "Bon, tá bom.", "Voilà, é isso.", "Exactement, exato.",
]

SHORT_UTTERANCES_INCOMPLETE = [
    "Sim, mas...", "Não, porque...", "Então...", "Tipo...", "Olha...",
    "Bom...", "Na verdade...", "Quer dizer...", "Pois é...", "Sabe...",
    "É que...", "O problema é...", "A questão é...", "Eu acho que...",
    # French speaker variants (hesitating after short start)
    "Oui... euh...", "Sim, mas... comment dire...", "Alors...",
    "Bon, eu acho que... euh...", "C'est... quer dizer...",
    "Não, enfin... na verdade...", "Sim, mas... como se diz...",
]


def generate_short_utterances(
    voices: list[str] | None = None,
    n_per_utterance: int = 3,
) -> list[AudioSample]:
    """Generate audio for short utterances using Kokoro TTS.

    Pipecat v3.2 showed that a dedicated short utterance dataset
    reduced misclassification by 40%. Short responses like "sim", "não"
    are very common in Portuguese meetings.
    """
    try:
        from kokoro import KPipeline
    except ImportError:
        log.warning("kokoro not installed — skipping short utterances")
        return []

    if voices is None:
        voices = ["pf_dora", "pm_alex", "pm_santa"]

    pipeline = KPipeline(lang_code="p")
    samples = []
    errors = 0

    all_short = (
        [(text, "complete") for text in SHORT_UTTERANCES_COMPLETE]
        + [(text, "incomplete") for text in SHORT_UTTERANCES_INCOMPLETE]
    )

    for text, label in all_short:
        for _ in range(n_per_utterance):
            voice = random.choice(voices)
            speed = random.uniform(0.85, 1.15)

            try:
                generator = pipeline(text, voice=voice, speed=speed)
                audio_chunks = []
                for _, _, chunk in generator:
                    audio_chunks.append(chunk.numpy() if hasattr(chunk, 'numpy') else np.array(chunk))

                if not audio_chunks:
                    errors += 1
                    continue

                audio = np.concatenate(audio_chunks).astype(np.float32)

                # Resample to 16kHz if needed
                if hasattr(pipeline, 'sample_rate') and pipeline.sample_rate != SAMPLE_RATE:
                    import torchaudio
                    import torch
                    tensor = torch.from_numpy(audio).float().unsqueeze(0)
                    audio = torchaudio.functional.resample(
                        tensor, pipeline.sample_rate, SAMPLE_RATE
                    ).squeeze().numpy()

                # Normalize
                peak = np.max(np.abs(audio))
                if peak > 0:
                    audio = audio / peak * 0.9

                # For incomplete short utterances, add hesitation pause
                if label == "incomplete" and random.random() < 0.7:
                    audio = inject_hesitation_pause(audio, pause_duration_range=(1.0, 2.5), position="end")

                samples.append(AudioSample(
                    audio=audio,
                    text=text,
                    label=label,
                    voice=voice,
                    accent="native_pt_br",
                    source="short_utterance",
                    speed=speed,
                ))

            except Exception as e:
                errors += 1
                if errors <= 5:
                    log.warning("Short utterance error: %s", e)

    n_c = sum(1 for s in samples if s.label == "complete")
    n_i = sum(1 for s in samples if s.label == "incomplete")
    log.info("Short utterances: %d samples (%d complete, %d incomplete, %d errors)",
             len(samples), n_c, n_i, errors)
    return samples


# ---------------------------------------------------------------------------
# Audio augmentation
# ---------------------------------------------------------------------------

def augment_sample(audio: np.ndarray) -> np.ndarray:
    """Apply augmentation to diversify training data.

    Based on Pipecat/SpeculativeETD augmentation strategies.
    """
    aug = audio.copy()

    # Add background noise (simulates meeting room)
    if random.random() < 0.4:
        noise_level = random.uniform(0.002, 0.015)
        aug += np.random.randn(len(aug)).astype(np.float32) * noise_level

    # Volume variation (simulates different mic distances)
    if random.random() < 0.5:
        scale = random.uniform(0.6, 1.4)
        aug *= scale

    # Speed perturbation (simulates speaking rate variation)
    if random.random() < 0.3:
        speed = random.uniform(0.92, 1.08)
        indices = np.arange(0, len(aug), speed).astype(int)
        indices = indices[indices < len(aug)]
        aug = aug[indices]

    return np.clip(aug, -1.0, 1.0).astype(np.float32)


# ---------------------------------------------------------------------------
# Save dataset
# ---------------------------------------------------------------------------

def save_dataset(
    samples: list[AudioSample],
    output_dir: Path,
    augment_copies: int = 2,
) -> Path:
    """Save audio samples as WAV files + metadata JSON.

    augment_copies: number of augmented copies per sample (0 = no augmentation)
    """
    import soundfile as sf

    output_dir.mkdir(parents=True, exist_ok=True)
    audio_dir = output_dir / "audio"
    audio_dir.mkdir(exist_ok=True)

    metadata = []
    total_saved = 0

    for i, sample in enumerate(samples):
        # Save original
        fname = f"{i:05d}_{sample.label}_{sample.accent}_{sample.voice}.wav"
        sf.write(str(audio_dir / fname), sample.audio, SAMPLE_RATE)
        metadata.append({
            "file": fname,
            "text": sample.text,
            "label": sample.label,
            "voice": sample.voice,
            "accent": sample.accent,
            "source": sample.source,
            "speed": sample.speed,
            "augmented": False,
            "duration_s": round(len(sample.audio) / SAMPLE_RATE, 2),
        })
        total_saved += 1

        # Save augmented copies
        for aug_idx in range(augment_copies):
            aug_audio = augment_sample(sample.audio)
            aug_fname = f"{i:05d}_{sample.label}_{sample.accent}_{sample.voice}_aug{aug_idx}.wav"
            sf.write(str(audio_dir / aug_fname), aug_audio, SAMPLE_RATE)
            metadata.append({
                "file": aug_fname,
                "text": sample.text,
                "label": sample.label,
                "voice": sample.voice,
                "accent": sample.accent,
                "source": sample.source,
                "speed": sample.speed,
                "augmented": True,
                "duration_s": round(len(aug_audio) / SAMPLE_RATE, 2),
            })
            total_saved += 1

    # Save metadata
    meta_path = output_dir / "metadata.json"
    with open(meta_path, "w") as f:
        json.dump(metadata, f, indent=2, ensure_ascii=False)

    # Stats
    n_complete = sum(1 for m in metadata if m["label"] == "complete")
    n_incomplete = sum(1 for m in metadata if m["label"] == "incomplete")
    n_native = sum(1 for m in metadata if m["accent"] == "native_pt_br")
    n_french = sum(1 for m in metadata if m["accent"] == "french_pt")

    log.info("Dataset saved to %s:", output_dir)
    log.info(" Total: %d samples (%d original + %d augmented)",
             total_saved, len(samples), total_saved - len(samples))
    log.info(" Complete: %d, Incomplete: %d", n_complete, n_incomplete)
    log.info(" Native PT-BR: %d, French-accented: %d", n_native, n_french)

    return output_dir


# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------

def run_audio_generation(
    max_native_sentences: int = 3000,
    max_french_sentences: int = 1000,
    augment_copies: int = 2,
) -> Path:
    """Run the full audio generation pipeline.

    Loads labeled sentences from 02_generate_labels.py output,
    generates audio via Kokoro + XTTS, saves dataset.
    """
    labeled_dir = DATA_DIR / "claude_labeled"

    # Load labeled sentences
    all_sentences = []

    # 1. Classified sentences (from CORAA transcripts)
    classified_path = labeled_dir / "classified_pt.json"
    if classified_path.exists():
        with open(classified_path) as f:
            classified = json.load(f)
        for c in classified:
            if c.get("label") in ("completo", "incompleto"):
                all_sentences.append({
                    "text": c["text"],
                    "label": "complete" if c["label"] == "completo" else "incomplete",
                    "source": "classified",
                })
        log.info("Loaded %d classified sentences", len(all_sentences))

    # 2. Filler sentences (all are INCOMPLETE)
    for filler_file, filler_type in [
        ("fillers_pt_br.json", "pt_br"),
        ("fillers_fr_pt.json", "fr_pt"),
    ]:
        fpath = labeled_dir / filler_file
        if fpath.exists():
            with open(fpath) as f:
                fillers = json.load(f)
            for fl in fillers:
                if fl.get("with_filler"):
                    all_sentences.append({
                        "text": fl["with_filler"],
                        "label": "incomplete",
                        "source": f"filler_{filler_type}",
                    })
            log.info("Loaded %d %s filler sentences", len(fillers), filler_type)

    # 3. French-Portuguese sentences
    frpt_path = labeled_dir / "french_portuguese.json"
    if frpt_path.exists():
        with open(frpt_path) as f:
            frpt = json.load(f)
        for s in frpt:
            label = "complete" if s.get("label", "") == "completo" else "incomplete"
            all_sentences.append({
                "text": s["text"],
                "label": label,
                "source": "claude_fr_pt",
            })
        log.info("Loaded %d French-Portuguese sentences", len(frpt))

    if not all_sentences:
        log.error("No labeled sentences found in %s — run 02_generate_labels.py first", labeled_dir)
        return DATA_DIR

    log.info("Total sentences to synthesize: %d", len(all_sentences))

    # Split: native PT-BR vs French-accented
    native_sentences = [s for s in all_sentences if s["source"] != "claude_fr_pt"]
    french_sentences = [s for s in all_sentences if s["source"] == "claude_fr_pt"]

    # Also use some filler_fr_pt sentences with XTTS
    fr_filler = [s for s in all_sentences if s["source"] == "filler_fr_pt"]
    french_sentences.extend(fr_filler)

    random.shuffle(native_sentences)
    random.shuffle(french_sentences)
    native_sentences = native_sentences[:max_native_sentences]
    french_sentences = french_sentences[:max_french_sentences]

    # Generate audio
    log.info("\n=== Generating native PT-BR audio (Kokoro) ===")
    native_samples = generate_kokoro_audio(native_sentences)

    log.info("\n=== Generating French-accented audio (XTTS) ===")
    ref_dir = DATA_DIR / "french_reference_audio"
    french_samples = generate_xtts_audio(french_sentences, reference_audio_dir=ref_dir)

    # NEW: Short utterances (Pipecat v3.2: -40% errors)
    log.info("\n=== Generating short utterance dataset ===")
    short_samples = generate_short_utterances(n_per_utterance=3)

    # NEW: Hesitation pause variants (SpeculativeETD V3: best data variant)
    # Critical for French speakers: long pauses mid-sentence are NOT end-of-turn
    log.info("\n=== Creating hesitation-pause variants ===")
    all_pre_hesitation = native_samples + french_samples + short_samples
    hesitation_variants = create_hesitation_variants(all_pre_hesitation, fraction=0.3)

    all_samples = native_samples + french_samples + short_samples + hesitation_variants
    random.shuffle(all_samples)

    log.info("Total samples before augmentation: %d", len(all_samples))
    log.info(" Native PT-BR: %d", len(native_samples))
    log.info(" French-accented: %d", len(french_samples))
    log.info(" Short utterances: %d", len(short_samples))
    log.info(" Hesitation variants: %d", len(hesitation_variants))

    # Save
    log.info("\n=== Saving dataset ===")
    output_dir = DATA_DIR / "tts_dataset"
    save_dataset(all_samples, output_dir, augment_copies=augment_copies)

    return output_dir


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
    )
    random.seed(42)
    np.random.seed(42)

    run_audio_generation(
        max_native_sentences=3000,
        max_french_sentences=1000,
        augment_copies=2,
    )
03-finetune-pipecat-pt/04_finetune.py ADDED
@@ -0,0 +1,1202 @@
1
+ """Fine-tune Pipecat Smart Turn v3 for Portuguese + French-accented Portuguese.
2
+
3
+ Architecture: exact SmartTurnV3Model from Pipecat's train.py
4
+ - WhisperEncoder (openai/whisper-tiny) with max_source_positions=400 (8s)
5
+ - Attention pooling: Linear(384→256) → Tanh → Linear(256→1)
6
+ - Classifier: Linear(384→256) → LayerNorm → GELU → Dropout(0.1)
7
+ → Linear(256→64) → GELU → Linear(64→1)
8
+
9
+ Training strategy:
10
+ - Initialize from openai/whisper-tiny (no Pipecat PyTorch weights available)
11
+ - Use Pipecat's Portuguese data as primary training data
12
+ - Add our Claude-labeled + TTS-generated data for PT-BR + French accent
13
+ - Pipecat hyperparams: lr=5e-5, warmup=0.2, cosine schedule, epochs=4
14
+
15
+ Changes vs initial plan (based on reference analysis — see references/):
16
+ - alpha=0.25 (not 0.6) per Lin et al. 2017 optimal for gamma=2
17
+ - batch_size=128 (not 32) per Pipecat train.py (they use 384)
18
+ - Removed label smoothing (double-regularization with focal loss, per EMNLP 2022)
19
+ - Added BCEWithLogitsLoss option for comparison (Pipecat's original loss)
20
+ - Added real noise augmentation (not just Gaussian) per Pipecat v3.2
21
+ - Added short utterance dataset loading per Pipecat v3.2 findings
22
+ - ONNX opset 18 (not 17) per Pipecat train.py
23
+ - Added INT8 static quantization step per Pipecat's deployment pipeline
24
+
25
+ Language-learning-specific improvements (March 2026 research):
26
+ - fp_penalty=2.0: asymmetric cost — interrupting a learner costs 2x more than
27
+ waiting too long (ConversAR 2025, Praktika approach)
28
+ - Speak & Improve L2 corpus: real L2 learner speech with disfluencies (340h)
29
+ - Dual threshold evaluation: eager (speculative) + final (confirmed) per Deepgram Flux
30
+
31
+ Data sources:
32
+ 1. Pipecat v3.2 Portuguese subset (~5K samples from 270K)
33
+ 2. Our custom TTS dataset (03_generate_audio.py output)
34
+ 3. CORAA real audio (01_download_pipecat.py output)
35
+ 4. Speak & Improve L2 corpus (~2K samples, cross-lingual hesitation patterns)
36
+ """
37
+
38
+ from __future__ import annotations
39
+
40
+ import gc
41
+ import json
42
+ import logging
43
+ import random
44
+ import time
45
+ from dataclasses import dataclass
46
+ from pathlib import Path
47
+
48
+ import numpy as np
49
+ import torch
50
+ import torch.nn as nn
51
+ from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
52
+ from transformers import WhisperFeatureExtractor, WhisperPreTrainedModel, WhisperConfig
53
+
54
+ log = logging.getLogger(__name__)
55
+
56
+ SAMPLE_RATE = 16000
57
+ WINDOW_SECONDS = 8
58
+ WINDOW_SAMPLES = WINDOW_SECONDS * SAMPLE_RATE
59
+
60
+ _workspace = Path("/workspace") if Path("/workspace").exists() else Path(".")
61
+ OUTPUT_DIR = _workspace / "results"
62
+ CACHE_DIR = _workspace / "hf_cache"
63
+ DATA_DIR = Path(__file__).parent / "data"
64
+
65
+ # Label smoothing REMOVED — focal loss already provides implicit calibration
66
+ # via entropy regularization (EMNLP 2022: focal_loss_calibration_emnlp_2022.md)
67
+ # Double-regularization (FL + LS) reduces discriminative power.
68
+ LABEL_SMOOTH = 0.0
69
+
70
+
71
+ # ---------------------------------------------------------------------------
72
+ # Model — SmartTurnV3Model (exact Pipecat architecture)
73
+ # ---------------------------------------------------------------------------
74
+
75
+ class SmartTurnV3Model(nn.Module):
76
+ """Pipecat Smart Turn v3 model architecture.
77
+
78
+ Matches the exact architecture from:
79
+ https://github.com/pipecat-ai/smart-turn/blob/main/train/train.py
80
+
81
+ WhisperEncoder (whisper-tiny, 384-dim) → attention pooling → classifier
82
+ """
83
+
84
+ def __init__(self, whisper_model: str = "openai/whisper-tiny"):
85
+ super().__init__()
86
+ from transformers import WhisperModel
87
+
88
+ # Load pretrained encoder
89
+ whisper = WhisperModel.from_pretrained(
90
+ whisper_model, cache_dir=str(CACHE_DIR)
91
+ )
92
+ self.encoder = whisper.encoder
93
+
94
+ # Resize position embeddings: 1500 (30s) → 400 (8s)
95
+ max_pos = 400
96
+ old_embed = self.encoder.embed_positions.weight.data
97
+ new_embed = old_embed[:max_pos, :]
98
+ self.encoder.embed_positions = nn.Embedding(max_pos, old_embed.shape[1])
99
+ self.encoder.embed_positions.weight.data = new_embed
100
+ self.encoder.config.max_source_positions = max_pos
101
+
102
+ hidden_size = self.encoder.config.d_model # 384 for whisper-tiny
103
+
104
+ # Attention pooling (exact Pipecat architecture)
105
+ self.attention = nn.Sequential(
106
+ nn.Linear(hidden_size, 256),
107
+ nn.Tanh(),
108
+ nn.Linear(256, 1),
109
+ )
110
+
111
+ # Classifier head (exact Pipecat architecture)
112
+ self.classifier = nn.Sequential(
113
+ nn.Linear(hidden_size, 256),
114
+ nn.LayerNorm(256),
115
+ nn.GELU(),
116
+ nn.Dropout(0.1),
117
+ nn.Linear(256, 64),
118
+ nn.GELU(),
119
+ nn.Linear(64, 1),
120
+ )
121
+
122
+ def forward(self, input_features: torch.Tensor) -> torch.Tensor:
123
+ encoder_output = self.encoder(input_features).last_hidden_state
124
+ attn_weights = self.attention(encoder_output)
125
+ attn_weights = torch.softmax(attn_weights, dim=1)
126
+ pooled = (encoder_output * attn_weights).sum(dim=1)
127
+ logits = self.classifier(pooled)
128
+ return logits.squeeze(-1)
129
+
130
+
131
+ # ---------------------------------------------------------------------------
132
+ # Focal Loss
133
+ # ---------------------------------------------------------------------------
134
+
135
+ class FocalLoss(nn.Module):
136
+ """Focal Loss (Lin et al. 2017) — focuses on hard boundary cases.
137
+
138
+ alpha=0.25 is optimal for gamma=2.0 per the original paper (Table 1a).
139
+ Our initial alpha=0.6 over-weighted positives and hurt calibration.
140
+
141
+ fp_penalty: extra multiplier on false-positive loss (model says "complete"
142
+ when speaker is still talking). For a language-learning avatar, interrupting
143
+ the learner mid-thought is much worse than waiting too long.
144
+ - ConversAR (2025) gives learners "infinite thinking period"
145
+ - Praktika uses extended silence tolerance for L2 speech
146
+ - Default 2.0 means FP errors cost 2x more than FN errors
147
+ """
148
+
149
+ def __init__(self, gamma: float = 2.0, alpha: float = 0.25, fp_penalty: float = 2.0):
150
+ super().__init__()
151
+ self.gamma = gamma
152
+ self.alpha = alpha
153
+ self.fp_penalty = fp_penalty
154
+
155
+ def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
156
+ bce = nn.functional.binary_cross_entropy_with_logits(
157
+ logits, targets, reduction="none",
158
+ )
159
+ probs = torch.sigmoid(logits)
160
+ p_t = probs * targets + (1 - probs) * (1 - targets)
161
+ focal_weight = (1 - p_t) ** self.gamma
162
+ alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
163
+ loss = alpha_t * focal_weight * bce
164
+
165
+ # Asymmetric cost: penalize FP (interrupting learner) more than FN (waiting)
166
+ # FP = model predicts 1 (complete) when target is 0 (incomplete)
167
+ if self.fp_penalty != 1.0:
168
+ is_fp = (probs > 0.5).float() * (1 - targets) # predicted complete, actually incomplete
169
+ penalty = 1.0 + (self.fp_penalty - 1.0) * is_fp
170
+ loss = loss * penalty
171
+
172
+ return loss.mean()
173
+
174
+
175
+ # ---------------------------------------------------------------------------
176
+ # Data loading
177
+ # ---------------------------------------------------------------------------
178
+
179
+ @dataclass
180
+ class AudioSample:
181
+ audio: np.ndarray
182
+ label: float # 1.0 = complete, 0.0 = incomplete
183
+ source: str
184
+ speaker_id: str = ""
185
+
186
+
187
+ def load_pipecat_portuguese(max_samples: int = 5000) -> list[AudioSample]:
188
+ """Load Pipecat v3.2 Portuguese data (from 01_download_pipecat.py)."""
189
+ audio_dir = DATA_DIR / "pipecat_pt_audio"
190
+ if not audio_dir.exists():
191
+ log.warning("Pipecat PT audio not found at %s — run 01_download_pipecat.py first", audio_dir)
192
+ return []
193
+
194
+ import soundfile as sf
195
+
196
+ samples = []
197
+ for wav_path in sorted(audio_dir.glob("*.wav"))[:max_samples]:
198
+ try:
199
+ audio, sr = sf.read(str(wav_path))
200
+ audio = np.array(audio, dtype=np.float32)
201
+
202
+ if sr != SAMPLE_RATE:
203
+ import torchaudio
204
+ tensor = torch.from_numpy(audio).float().unsqueeze(0)
205
+ audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
206
+
207
+ label = 1.0 if "_complete" in wav_path.name else 0.0
208
+ audio = _extract_window(audio)
209
+ if audio is not None:
210
+ samples.append(AudioSample(
211
+ audio=audio, label=label,
212
+ source="pipecat_v3.2",
213
+ speaker_id=f"pipecat_{wav_path.stem.split('_')[0]}",
214
+ ))
215
+ except Exception as e:
216
+ log.warning("Error loading %s: %s", wav_path.name, e)
217
+
218
+ n_c = sum(1 for s in samples if s.label == 1.0)
219
+ n_i = sum(1 for s in samples if s.label == 0.0)
220
+ log.info("Pipecat PT: %d samples (%d complete, %d incomplete)", len(samples), n_c, n_i)
221
+ return samples
222
+
223
+
224
+ def load_tts_dataset(max_samples: int = 10000) -> list[AudioSample]:
225
+ """Load TTS-generated audio (from 03_generate_audio.py)."""
226
+ tts_dir = DATA_DIR / "tts_dataset"
227
+ meta_path = tts_dir / "metadata.json"
228
+
229
+ if not meta_path.exists():
230
+ log.warning("TTS dataset not found at %s — run 03_generate_audio.py first", tts_dir)
231
+ return []
232
+
233
+ import soundfile as sf
234
+
235
+ with open(meta_path) as f:
236
+ metadata = json.load(f)
237
+
238
+ audio_dir = tts_dir / "audio"
239
+ samples = []
240
+
241
+ for meta in metadata[:max_samples]:
242
+ wav_path = audio_dir / meta["file"]
243
+ if not wav_path.exists():
244
+ continue
245
+
246
+ try:
247
+ audio, sr = sf.read(str(wav_path))
248
+ audio = np.array(audio, dtype=np.float32)
249
+
250
+ if sr != SAMPLE_RATE:
251
+ import torchaudio
252
+ tensor = torch.from_numpy(audio).float().unsqueeze(0)
253
+ audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
254
+
255
+ label = 1.0 if meta["label"] == "complete" else 0.0
256
+ audio = _extract_window(audio)
257
+ if audio is not None:
258
+ samples.append(AudioSample(
259
+ audio=audio, label=label,
260
+ source=f"tts_{meta['accent']}",
261
+ speaker_id=meta["voice"],
262
+ ))
263
+ except Exception as e:
264
+ log.warning("Error loading %s: %s", meta["file"], e)
265
+
266
+ n_c = sum(1 for s in samples if s.label == 1.0)
267
+ n_i = sum(1 for s in samples if s.label == 0.0)
268
+ log.info("TTS dataset: %d samples (%d complete, %d incomplete)", len(samples), n_c, n_i)
269
+ return samples
270
+
271
+
272
+ def load_coraa_real_audio(max_samples: int = 3000) -> list[AudioSample]:
273
+ """Load REAL conversational audio from CORAA (not TTS).
274
+
275
+ SpeculativeETD showed synthetic-only training has a devastating gap:
276
+ F1 drops from 94.7% → 30.3% on real data. Mixing real audio is critical.
277
+
278
+ CORAA MUPE has 365h of real PT-BR interviews with speaker diarization.
279
+ We use these as COMPLETE samples (full utterances with natural prosody).
280
+ For INCOMPLETE, we truncate at natural pause points.
281
+ """
282
+ from datasets import load_dataset
283
+ import re
284
+
285
+ log.info("Loading CORAA MUPE real audio (streaming)...")
286
+ try:
287
+ ds = load_dataset(
288
+ "nilc-nlp/CORAA-MUPE-ASR",
289
+ split="train",
290
+ streaming=True,
291
+ cache_dir=str(CACHE_DIR),
292
+ )
293
+ except Exception as e:
294
+ log.warning("Failed to load CORAA MUPE: %s", e)
295
+ return []
296
+
297
+ samples = []
298
+ complete_count = 0
299
+ incomplete_count = 0
300
+ target_per_class = max_samples // 2
301
+
302
+ for i, row in enumerate(ds):
303
+ if complete_count >= target_per_class and incomplete_count >= target_per_class:
304
+ break
305
+
306
+ try:
307
+ audio_data = row.get("audio", {})
308
+ if not audio_data:
309
+ continue
310
+
311
+ audio = np.array(audio_data["array"], dtype=np.float32)
312
+ sr = audio_data["sampling_rate"]
313
+
314
+ if sr != SAMPLE_RATE:
315
+ import torchaudio
316
+ tensor = torch.from_numpy(audio).float().unsqueeze(0)
317
+ audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
318
+
319
+ duration = len(audio) / SAMPLE_RATE
320
+ if duration < 1.0:
321
+ continue
322
+
323
+ text = str(row.get("text", ""))
324
+ speaker_id = f"coraa_real_{i // 50}"
325
+
326
+ # COMPLETE: full utterances ending with sentence punctuation
327
+ if complete_count < target_per_class and re.search(r'[.!?]+\s*$', text):
328
+ window = _extract_window(audio)
329
+ if window is not None:
330
+ samples.append(AudioSample(
331
+ audio=window, label=1.0,
332
+ source="coraa_real",
333
+ speaker_id=speaker_id,
334
+ ))
335
+ complete_count += 1
336
+
337
+ # INCOMPLETE: truncate real audio at 40-70% (natural mid-utterance)
338
+ if incomplete_count < target_per_class and duration >= 2.0:
339
+ cut_frac = random.uniform(0.4, 0.7)
340
+ truncated = audio[:int(len(audio) * cut_frac)]
341
+ window = _extract_window(truncated)
342
+ if window is not None:
343
+ samples.append(AudioSample(
344
+ audio=window, label=0.0,
345
+ source="coraa_real",
346
+ speaker_id=speaker_id,
347
+ ))
348
+ incomplete_count += 1
349
+
350
+ except Exception:
351
+ continue
352
+
353
+ if i % 5000 == 0 and i > 0:
354
+ log.info(" CORAA real: scanned %d, %d complete, %d incomplete", i, complete_count, incomplete_count)
355
+
356
+ log.info("CORAA real audio: %d complete + %d incomplete = %d samples",
357
+ complete_count, incomplete_count, len(samples))
358
+ return samples
359
+
360
+
361
+ def load_pipecat_test_data() -> list[AudioSample]:
362
+ """Load Pipecat v3.2 Portuguese test data for evaluation."""
363
+ test_dir = DATA_DIR / "pipecat_pt_test"
364
+ if not test_dir.exists():
365
+ log.warning("Pipecat PT test data not found — run 01_download_pipecat.py first")
366
+ return []
367
+
368
+ import soundfile as sf
369
+
370
+ samples = []
371
+ for wav_path in sorted(test_dir.glob("*.wav")):
372
+ try:
373
+ audio, sr = sf.read(str(wav_path))
374
+ audio = np.array(audio, dtype=np.float32)
375
+
376
+ if sr != SAMPLE_RATE:
377
+ import torchaudio
378
+ tensor = torch.from_numpy(audio).float().unsqueeze(0)
379
+ audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
380
+
381
+ label = 1.0 if "_complete" in wav_path.name else 0.0
382
+ audio = _extract_window(audio)
383
+ if audio is not None:
384
+ samples.append(AudioSample(
385
+ audio=audio, label=label,
386
+ source="pipecat_test",
387
+ speaker_id=f"pipecat_test_{wav_path.stem.split('_')[0]}",
388
+ ))
389
+ except Exception as e:
390
+ log.warning("Error loading test %s: %s", wav_path.name, e)
391
+
392
+ log.info("Pipecat test: %d samples", len(samples))
393
+ return samples
394
+
395
+
396
+ def _extract_window(audio: np.ndarray) -> np.ndarray | None:
397
+ """Extract 8-second window from end of audio, pad if needed."""
398
+ if len(audio) < SAMPLE_RATE: # minimum 1s
399
+ return None
400
+
401
+ peak = np.max(np.abs(audio))
402
+ if peak > 0:
403
+ audio = audio / peak * 0.9
404
+
405
+ if len(audio) > WINDOW_SAMPLES:
406
+ audio = audio[-WINDOW_SAMPLES:]
407
+ elif len(audio) < WINDOW_SAMPLES:
408
+ padding = WINDOW_SAMPLES - len(audio)
409
+ audio = np.pad(audio, (padding, 0), mode="constant")
410
+
411
+ # ~200ms silence at end (VAD behavior)
412
+ silence_samples = int(0.2 * SAMPLE_RATE)
413
+ audio[-silence_samples:] = 0.0
414
+
415
+ return audio.astype(np.float32)
416
+
417
+
418
+ # ---------------------------------------------------------------------------
419
+ # Audio augmentation
420
+ # ---------------------------------------------------------------------------
421
+
422
+ # Cache for background noise samples (loaded once)
423
+ _noise_cache: list[np.ndarray] = []
424
+
425
+
426
+ def _load_noise_samples() -> list[np.ndarray]:
427
+ """Load real background noise samples for augmentation.
428
+
429
+ Pipecat v3.2 used CC-0 Freesound.org cafe/office noise and saw
430
+ 40% fewer short-utterance misclassifications (pipecat_smart_turn_v3_2.md).
431
+ """
432
+ global _noise_cache
433
+ if _noise_cache:
434
+ return _noise_cache
435
+
436
+ noise_dir = DATA_DIR / "noise_samples"
437
+ if not noise_dir.exists():
438
+ return []
439
+
440
+ import soundfile as sf
441
+ for wav_path in noise_dir.glob("*.wav"):
442
+ try:
443
+ audio, sr = sf.read(str(wav_path))
444
+ audio = np.array(audio, dtype=np.float32)
445
+ if sr != SAMPLE_RATE:
446
+ import torchaudio
447
+ tensor = torch.from_numpy(audio).float().unsqueeze(0)
448
+ audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
449
+ _noise_cache.append(audio)
450
+ except Exception:
451
+ pass
452
+
453
+ if _noise_cache:
454
+ log.info("Loaded %d background noise samples for augmentation", len(_noise_cache))
455
+ return _noise_cache
456
+
457
+
458
+ def augment_audio(audio: np.ndarray) -> np.ndarray:
459
+ """Data augmentation for training.
460
+
461
+ Per reference analysis:
462
+ - Real background noise (Pipecat v3.2: 40% fewer errors with cafe/office noise)
463
+ - Speed perturbation (standard in speech ML)
464
+ - Volume variation (simulates mic distance)
465
+ - Time shift
466
+ """
467
+ aug = audio.copy()
468
+
469
+ # Speed perturbation
470
+ if random.random() < 0.5:
471
+ speed = random.uniform(0.9, 1.1)
472
+ indices = np.arange(0, len(aug), speed).astype(int)
473
+ indices = indices[indices < len(aug)]
474
+ aug = aug[indices]
475
+ if len(aug) > WINDOW_SAMPLES:
476
+ aug = aug[:WINDOW_SAMPLES]
477
+ elif len(aug) < WINDOW_SAMPLES:
478
+ aug = np.pad(aug, (WINDOW_SAMPLES - len(aug), 0), mode="constant")
479
+
480
+ # Volume variation
481
+ if random.random() < 0.5:
482
+ aug *= random.uniform(0.6, 1.4)
483
+
484
+ # Real background noise (preferred) or Gaussian fallback
485
+ noise_samples = _load_noise_samples()
486
+ if noise_samples and random.random() < 0.5:
487
+ noise = random.choice(noise_samples)
488
+ # Loop or trim noise to match audio length
489
+ if len(noise) < len(aug):
490
+ repeats = len(aug) // len(noise) + 1
491
+ noise = np.tile(noise, repeats)[:len(aug)]
492
+ else:
493
+ start = random.randint(0, len(noise) - len(aug))
494
+ noise = noise[start:start + len(aug)]
495
+ snr_db = random.uniform(10, 25) # 10-25 dB SNR
496
+ noise_scale = np.sqrt(np.mean(aug ** 2)) / (np.sqrt(np.mean(noise ** 2)) * 10 ** (snr_db / 20) + 1e-8)
497
+ aug += noise * noise_scale
498
+ elif random.random() < 0.4:
499
+ # Gaussian noise fallback
500
+ noise_level = random.uniform(0.002, 0.02)
501
+ aug += np.random.randn(len(aug)).astype(np.float32) * noise_level
502
+
503
+ # Time shift
504
+ if random.random() < 0.3:
505
+ shift = random.randint(-int(0.3 * SAMPLE_RATE), int(0.3 * SAMPLE_RATE))
506
+ aug = np.roll(aug, shift)
507
+ if shift > 0:
508
+ aug[:shift] = 0.0
509
+ elif shift < 0:
510
+ aug[shift:] = 0.0
511
+
512
+ return np.clip(aug, -1.0, 1.0).astype(np.float32)
513
+
514
+
515
+ # ---------------------------------------------------------------------------
516
+ # PyTorch Dataset
517
+ # ---------------------------------------------------------------------------
518
+
519
+ class SmartTurnDataset(Dataset):
520
+ def __init__(
521
+ self,
522
+ samples: list[AudioSample],
523
+ feature_extractor: WhisperFeatureExtractor,
524
+ augment: bool = False,
525
+ ):
526
+ self.samples = samples
527
+ self.feature_extractor = feature_extractor
528
+ self.augment = augment
529
+
530
+ def __len__(self) -> int:
531
+ return len(self.samples)
532
+
533
+ def __getitem__(self, idx: int) -> dict:
534
+ sample = self.samples[idx]
535
+ audio = sample.audio
536
+
537
+ if self.augment:
538
+ audio = augment_audio(audio)
539
+
540
+ inputs = self.feature_extractor(
541
+ audio,
542
+ sampling_rate=SAMPLE_RATE,
543
+ return_tensors="np",
544
+ padding="max_length",
545
+ max_length=WINDOW_SAMPLES,
546
+ truncation=True,
547
+ do_normalize=True,
548
+ )
549
+
550
+ features = inputs.input_features.squeeze(0).astype(np.float32)
551
+
552
+ return {
553
+ "input_features": torch.from_numpy(features),
554
+ "labels": torch.tensor(sample.label, dtype=torch.float32),
555
+ }
556
+
557
+
558
+ # ---------------------------------------------------------------------------
559
+ # Train/val split
560
+ # ---------------------------------------------------------------------------
561
+
562
+ def split_by_speaker(
563
+ samples: list[AudioSample],
564
+ val_frac: float = 0.1,
565
+ test_frac: float = 0.1,
566
+ ) -> tuple[list[AudioSample], list[AudioSample], list[AudioSample]]:
567
+ """Split by speaker ID to prevent data leakage."""
568
+ speaker_map: dict[str, list[AudioSample]] = {}
569
+ for s in samples:
570
+ speaker_map.setdefault(s.speaker_id or "unknown", []).append(s)
571
+
572
+ speakers = list(speaker_map.keys())
573
+ random.shuffle(speakers)
574
+
575
+ n_val = max(1, int(len(speakers) * val_frac))
576
+ n_test = max(1, int(len(speakers) * test_frac))
577
+
578
+ test_spk = set(speakers[:n_test])
579
+ val_spk = set(speakers[n_test:n_test + n_val])
580
+ train_spk = set(speakers[n_test + n_val:])
581
+
582
+ train = [s for sp in train_spk for s in speaker_map[sp]]
583
+ val = [s for sp in val_spk for s in speaker_map[sp]]
584
+ test = [s for sp in test_spk for s in speaker_map[sp]]
585
+
586
+ random.shuffle(train)
587
+ random.shuffle(val)
588
+ random.shuffle(test)
589
+
590
+ for name, split in [("Train", train), ("Val", val), ("Test", test)]:
591
+ n_c = sum(1 for s in split if s.label == 1.0)
592
+ n_i = sum(1 for s in split if s.label == 0.0)
593
+ log.info(" %s: %d samples (%d complete, %d incomplete, %d speakers)",
594
+ name, len(split), n_c, n_i, len(set(s.speaker_id for s in split)))
595
+
596
+ return train, val, test
597
+
598
+
599
+ # ---------------------------------------------------------------------------
600
+ # Training
601
+ # ---------------------------------------------------------------------------
602
+
603
+ def load_speak_improve_l2(max_samples: int = 2000) -> list[AudioSample]:
604
+ """Load L2 learner speech from Speak & Improve Corpus 2025.
605
+
606
+ 340 hours of L2 English learner speech with disfluency annotations and
607
+ CEFR proficiency scores (A2-C1, majority B1-B2).
608
+
609
+ Even though this is English L2 (not Portuguese), L2 hesitation patterns
610
+ transfer across languages (Cenoz 2000). The pauses, false starts, and
611
+ disfluencies of L2 speakers share universal characteristics regardless
612
+ of target language.
613
+
614
+ Knill et al. (2025). Speak & Improve Corpus 2025: an L2 English Speech
615
+ Corpus for Language Assessment and Feedback. arXiv:2412.11986
616
+ """
617
+ from datasets import load_dataset
618
+
619
+ log.info("Loading Speak & Improve L2 corpus (streaming)...")
620
+ try:
621
+ ds = load_dataset(
622
+ "CambridgeEnglish/SpeakAndImprove2025",
623
+ split="train",
624
+ streaming=True,
625
+ cache_dir=str(CACHE_DIR),
626
+ trust_remote_code=True,
627
+ )
628
+ except Exception as e:
629
+ log.warning("Failed to load Speak & Improve corpus: %s — trying fallback name", e)
630
+ try:
631
+ ds = load_dataset(
632
+ "cambridgeenglishtests/speak-and-improve-2025",
633
+ split="train",
634
+ streaming=True,
635
+ cache_dir=str(CACHE_DIR),
636
+ trust_remote_code=True,
637
+ )
638
+ except Exception as e2:
639
+ log.warning("Speak & Improve corpus not available: %s — skipping", e2)
640
+ return []
641
+
642
+ samples = []
643
+ complete_count = 0
644
+ incomplete_count = 0
645
+ target_per_class = max_samples // 2
646
+
647
+ for i, row in enumerate(ds):
648
+ if complete_count >= target_per_class and incomplete_count >= target_per_class:
649
+ break
650
+
651
+ try:
652
+ audio_data = row.get("audio", {})
653
+ if not audio_data:
654
+ continue
655
+
656
+ audio = np.array(audio_data["array"], dtype=np.float32)
657
+ sr = audio_data["sampling_rate"]
658
+
659
+ if sr != SAMPLE_RATE:
660
+ import torchaudio
661
+ tensor = torch.from_numpy(audio).float().unsqueeze(0)
662
+ audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
663
+
664
+ duration = len(audio) / SAMPLE_RATE
665
+ if duration < 1.0:
666
+ continue
667
+
668
+ text = str(row.get("text", row.get("transcript", "")))
669
+ speaker_id = f"si_l2_{i // 20}"
670
+
671
+ # COMPLETE: full utterances (use entire audio)
672
+ if complete_count < target_per_class and duration >= 2.0:
673
+ window = _extract_window(audio)
674
+ if window is not None:
675
+ samples.append(AudioSample(
676
+ audio=window, label=1.0,
677
+ source="speak_improve_l2",
678
+ speaker_id=speaker_id,
679
+ ))
680
+ complete_count += 1
681
+
682
+ # INCOMPLETE: truncate at random point (simulates mid-utterance)
683
+ if incomplete_count < target_per_class and duration >= 3.0:
684
+ cut_frac = random.uniform(0.3, 0.6)
685
+ truncated = audio[:int(len(audio) * cut_frac)]
686
+ # Add hesitation-like pause at end (L2 speakers pause 524ms+ per Kosmala 2022)
687
+ pause_ms = random.uniform(500, 2000)
688
+ pause_samples = int(pause_ms / 1000 * SAMPLE_RATE)
689
+ pause = np.random.randn(pause_samples).astype(np.float32) * 0.001
690
+ truncated = np.concatenate([truncated, pause])
691
+ window = _extract_window(truncated)
692
+ if window is not None:
693
+ samples.append(AudioSample(
694
+ audio=window, label=0.0,
695
+ source="speak_improve_l2",
696
+ speaker_id=speaker_id,
697
+ ))
698
+ incomplete_count += 1
699
+
700
+ except Exception:
701
+ continue
702
+
703
+ if i % 2000 == 0 and i > 0:
704
+ log.info(" S&I L2: scanned %d, %d complete, %d incomplete", i, complete_count, incomplete_count)
705
+
706
+ log.info("Speak & Improve L2: %d complete + %d incomplete = %d samples",
707
+ complete_count, incomplete_count, len(samples))
708
+ return samples
709
+
710
+
711
+ def train(
712
+ epochs: int = 6,
713
+ batch_size: int = 128,
714
+ lr: float = 5e-5,
715
+ warmup_ratio: float = 0.2,
716
+ max_pipecat_samples: int = 5000,
717
+ max_tts_samples: int = 10000,
718
+ max_l2_samples: int = 2000,
719
+ whisper_model: str = "openai/whisper-tiny",
720
+ loss_fn: str = "focal", # "focal" or "bce" (Pipecat's original)
721
+ fp_penalty: float = 2.0, # asymmetric cost: FP costs 2x more than FN
722
+ ) -> Path:
723
+ """Fine-tune SmartTurnV3Model on Portuguese data.
724
+
725
+ Default hyperparams from Pipecat's train.py:
726
+ - lr=5e-5, warmup_ratio=0.2, cosine schedule
727
+ - Pipecat uses epochs=4 on 270K samples; we use 6 on 5-15K (more passes needed)
728
+ - Pipecat uses batch_size=384; we use 128 (A10G has 24GB, enough for 128)
729
+ - loss_fn="focal" (our improvement) or "bce" (Pipecat's original) for comparison
730
+
731
+ Language-learning-specific improvements (March 2026 research):
732
+ - fp_penalty=2.0: asymmetric cost — interrupting a learner mid-thought is
733
+ much worse than waiting too long (ConversAR 2025, Praktika approach)
734
+ - Speak & Improve L2 corpus: real L2 learner speech with hesitations/disfluencies
735
+ - Dual threshold in evaluation: eager (0.3-0.5) for speculative prep,
736
+ final (0.7+) for actual turn transition (inspired by Deepgram Flux)
737
+ """
738
+
739
+ device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
740
+ log.info("Training on device: %s", device)
741
+ if device == "cuda":
742
+ log.info("GPU: %s (%d MB)", torch.cuda.get_device_name(),
743
+ torch.cuda.get_device_properties(0).total_memory // 1024 // 1024)
744
+
745
+ # ----- Load all data sources -----
746
+ t0 = time.time()
747
+ all_samples: list[AudioSample] = []
748
+
749
+ # 1. Pipecat Portuguese (primary — these have LLM-curated labels)
750
+ pipecat = load_pipecat_portuguese(max_samples=max_pipecat_samples)
751
+ all_samples.extend(pipecat)
752
+ del pipecat
753
+ gc.collect()
754
+
755
+ # 2. Our TTS-generated data (native PT-BR + French accent + short utterances)
756
+ tts = load_tts_dataset(max_samples=max_tts_samples)
757
+ all_samples.extend(tts)
758
+ del tts
759
+ gc.collect()
760
+
761
+ # 3. CORAA real audio (critical: SpeculativeETD showed synthetic→real gap of 94.7%→30.3%)
762
+ coraa = load_coraa_real_audio(max_samples=3000)
763
+ all_samples.extend(coraa)
764
+ del coraa
765
+ gc.collect()
766
+
767
+ # 4. Speak & Improve L2 corpus — real L2 learner speech with disfluencies
768
+ # L2 hesitation patterns transfer across languages (Cenoz 2000)
769
+ # 340h of L2 English learners, CEFR A2-C1 (arXiv:2412.11986)
770
+ si_l2 = load_speak_improve_l2(max_samples=max_l2_samples)
771
+ all_samples.extend(si_l2)
772
+ del si_l2
773
+ gc.collect()
774
+
775
+ if not all_samples:
776
+ raise RuntimeError("No training samples loaded! Run 01/03 scripts first.")
777
+
778
+ load_time = time.time() - t0
779
+ n_c = sum(1 for s in all_samples if s.label == 1.0)
780
+ n_i = sum(1 for s in all_samples if s.label == 0.0)
781
+ sources = {}
782
+ for s in all_samples:
783
+ sources[s.source] = sources.get(s.source, 0) + 1
784
+
785
+ log.info("Total: %d samples (%d complete, %d incomplete) in %.0fs", len(all_samples), n_c, n_i, load_time)
786
+ for src, cnt in sorted(sources.items()):
787
+ log.info(" %s: %d", src, cnt)
788
+
789
+ # ----- Split -----
790
+ log.info("=== Splitting by speaker ===")
791
+ train_samples, val_samples, internal_test = split_by_speaker(all_samples)
792
+
793
+ # Also load Pipecat test data (held-out, never seen during training)
794
+ pipecat_test = load_pipecat_test_data()
795
+
796
+ # ----- Datasets -----
797
+ feature_extractor = WhisperFeatureExtractor(chunk_length=8)
798
+ train_ds = SmartTurnDataset(train_samples, feature_extractor, augment=True)
799
+ val_ds = SmartTurnDataset(val_samples, feature_extractor, augment=False)
800
+
801
+ # Balanced sampling
802
+ train_labels = [s.label for s in train_samples]
803
+ n_pos = sum(1 for l in train_labels if l == 1.0)
804
+ n_neg = len(train_labels) - n_pos
805
+ weights = [1.0 / n_neg if l == 0.0 else 1.0 / n_pos for l in train_labels]
806
+ sampler = WeightedRandomSampler(weights, len(weights))
807
+
808
+ use_pin = device == "cuda"
809
+ n_workers = 4 if device == "cuda" else 0
810
+ train_loader = DataLoader(train_ds, batch_size=batch_size, sampler=sampler,
811
+ num_workers=n_workers, pin_memory=use_pin)
812
+ val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False,
813
+ num_workers=0, pin_memory=use_pin)
814
+
815
+ # ----- Model -----
816
+ model = SmartTurnV3Model(whisper_model=whisper_model).to(device)
817
+ total_params = sum(p.numel() for p in model.parameters())
818
+ log.info("Model: %d params (%.1f MB)", total_params, total_params * 4 / 1024 / 1024)
819
+
820
+ # ----- Loss -----
821
+ # Pipecat uses BCEWithLogitsLoss with dynamic pos_weight (clamped 0.1-10.0)
822
+ # We default to Focal Loss (alpha=0.25, gamma=2.0 per Lin et al. 2017)
823
+ # but support BCE for comparison
824
+ pos_weight_val = min(max(n_neg / max(n_pos, 1), 0.1), 10.0)
825
+ pos_weight = torch.tensor([pos_weight_val], device=device)
826
+
827
+ if loss_fn == "bce":
828
+ criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
829
+ log.info("Loss: BCEWithLogitsLoss (Pipecat original), pos_weight=%.2f", pos_weight_val)
830
+ else:
831
+ criterion = FocalLoss(gamma=2.0, alpha=0.25, fp_penalty=fp_penalty)
832
+ log.info("Loss: FocalLoss (gamma=2.0, alpha=0.25, fp_penalty=%.1f)", fp_penalty)
833
+ log.info(" Asymmetric cost: FP (interrupting learner) penalized %.1fx more than FN", fp_penalty)
834
+
835
+ # ----- Optimizer: uniform LR (Pipecat uses single lr=5e-5 for all params) -----
836
+ # Pipecat's train.py does NOT use differential LR — all params at same rate.
837
+ # Since we initialize from whisper-tiny (same as Pipecat), uniform LR is correct.
838
+ optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
839
+
840
+ # Cosine schedule with warmup (Pipecat: warmup_ratio=0.2)
841
+ total_steps = epochs * len(train_loader)
842
+ warmup_steps = int(total_steps * warmup_ratio)
843
+
844
+ def lr_lambda(step):
845
+ if step < warmup_steps:
846
+ return step / max(warmup_steps, 1)
847
+ progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
848
+ return 0.5 * (1 + np.cos(np.pi * progress))
849
+
850
+ scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
851
+
852
+ # ----- Training loop -----
853
+ OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
854
+ best_f1 = 0.0
855
+ best_path = OUTPUT_DIR / "best_model.pt"
856
+ resume_path = OUTPUT_DIR / "resume_checkpoint.pt"
857
+ patience = 5
858
+ patience_counter = 0
859
+ history = []
860
+ start_epoch = 0
861
+
862
+ # Resume support
863
+ if resume_path.exists():
864
+ log.info("=== Resuming from checkpoint ===")
865
+ ckpt = torch.load(resume_path, map_location=device, weights_only=False)
866
+ model.load_state_dict(ckpt["model_state_dict"])
867
+ optimizer.load_state_dict(ckpt["optimizer_state_dict"])
868
+ scheduler.load_state_dict(ckpt["scheduler_state_dict"])
869
+ start_epoch = ckpt["epoch"]
870
+ best_f1 = ckpt.get("best_f1", 0.0)
871
+ patience_counter = ckpt.get("patience_counter", 0)
872
+ history = ckpt.get("history", [])
873
+ log.info(" Resumed at epoch %d, best_f1=%.4f", start_epoch, best_f1)
874
+
875
+ log.info("=== Training: %d epochs, batch=%d, lr=%.1e, warmup_ratio=%.1f ===",
876
+ epochs, batch_size, lr, warmup_ratio)
877
+
878
+ for epoch in range(start_epoch, epochs):
879
+ model.train()
880
+ train_loss = 0.0
881
+ train_correct = 0
882
+ train_total = 0
883
+ t_epoch = time.time()
884
+
885
+ for batch_idx, batch in enumerate(train_loader):
886
+ features = batch["input_features"].to(device)
887
+ labels = batch["labels"].to(device)
888
+
889
+ logits = model(features)
890
+ loss = criterion(logits, labels)
891
+
892
+ optimizer.zero_grad()
893
+ loss.backward()
894
+ torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
895
+ optimizer.step()
896
+ scheduler.step()
897
+
898
+ train_loss += loss.item() * len(labels)
899
+ preds = (torch.sigmoid(logits) > 0.5).float()
900
+ hard_labels = (labels > 0.5).float()
901
+ train_correct += (preds == hard_labels).sum().item()
902
+ train_total += len(labels)
903
+
904
+ if batch_idx % 50 == 0 and batch_idx > 0:
905
+ log.info(" batch %d/%d loss=%.4f", batch_idx, len(train_loader), loss.item())
906
+
907
+ # Validate
908
+ model.eval()
909
+ val_metrics = _evaluate(model, val_loader, device, criterion)
910
+ train_acc = train_correct / max(train_total, 1)
911
+ epoch_time = time.time() - t_epoch
912
+
913
+ log.info(
914
+ "Epoch %d/%d (%.0fs): train_loss=%.4f train_acc=%.3f | "
915
+ "val_acc=%.3f val_f1=%.3f prec=%.3f rec=%.3f",
916
+ epoch + 1, epochs, epoch_time,
917
+ train_loss / max(train_total, 1), train_acc,
918
+ val_metrics["accuracy"], val_metrics["f1"],
919
+ val_metrics["precision"], val_metrics["recall"],
920
+ )
921
+
922
+ history.append({
923
+ "epoch": epoch + 1,
924
+ "train_loss": train_loss / max(train_total, 1),
925
+ "train_acc": train_acc,
926
+ **{f"val_{k}": v for k, v in val_metrics.items()},
927
+ })
928
+
929
+ # Save best
930
+ if val_metrics["f1"] > best_f1:
931
+ best_f1 = val_metrics["f1"]
932
+ torch.save({
933
+ "model_state_dict": model.state_dict(),
934
+ "epoch": epoch + 1,
935
+ "val_f1": best_f1,
936
+ "val_metrics": val_metrics,
937
+ "whisper_model": whisper_model,
938
+ }, best_path)
939
+ log.info(" -> New best model (val_f1=%.4f)", best_f1)
940
+ patience_counter = 0
941
+ else:
942
+ patience_counter += 1
943
+ if patience_counter >= patience:
944
+ log.info("Early stopping at epoch %d", epoch + 1)
945
+ break
946
+
947
+ # Resume checkpoint
948
+ torch.save({
949
+ "model_state_dict": model.state_dict(),
950
+ "optimizer_state_dict": optimizer.state_dict(),
951
+ "scheduler_state_dict": scheduler.state_dict(),
952
+ "epoch": epoch + 1,
953
+ "best_f1": best_f1,
954
+ "patience_counter": patience_counter,
955
+ "history": history,
956
+ }, resume_path)
957
+
958
+ if resume_path.exists():
959
+ resume_path.unlink()
960
+
961
+ # ----- Final evaluation on internal test split -----
962
+ log.info("\n=== Internal Test Evaluation ===")
963
+ checkpoint = torch.load(best_path, map_location=device, weights_only=True)
964
+ model.load_state_dict(checkpoint["model_state_dict"])
965
+ model.eval()
966
+
967
+ if internal_test:
968
+ test_ds = SmartTurnDataset(internal_test, feature_extractor, augment=False)
969
+ test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
970
+ test_metrics = _evaluate(model, test_loader, device, criterion)
971
+ _log_metrics("Internal test", test_metrics)
972
+ else:
973
+ test_metrics = {}
974
+
975
+ # ----- Pipecat test set (baseline comparison) -----
976
+ pipecat_test_metrics = {}
977
+ if pipecat_test:
978
+ log.info("\n=== Pipecat PT Test Evaluation (baseline comparison) ===")
979
+ pipecat_ds = SmartTurnDataset(pipecat_test, feature_extractor, augment=False)
980
+ pipecat_loader = DataLoader(pipecat_ds, batch_size=batch_size, shuffle=False)
981
+ pipecat_test_metrics = _evaluate(model, pipecat_loader, device, criterion)
982
+ _log_metrics("Pipecat PT test", pipecat_test_metrics)
983
+ log.info(" Pipecat baseline: 95.42%% accuracy, 2.79%% FP, 1.79%% FN")
984
+
985
+ # ----- Threshold sweep -----
986
+ log.info("\n=== Threshold Sweep ===")
987
+ threshold_results = {}
988
+ eval_loader = test_loader if internal_test else val_loader
989
+ for thresh in [0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85]:
990
+ t_metrics = _evaluate(model, eval_loader, device, criterion, threshold=thresh)
991
+ threshold_results[str(thresh)] = t_metrics
992
+ log.info(" threshold=%.2f: prec=%.3f rec=%.3f f1=%.3f acc=%.3f FP=%.3f",
993
+ thresh, t_metrics["precision"], t_metrics["recall"],
994
+ t_metrics["f1"], t_metrics["accuracy"], t_metrics["fp_rate"])
995
+
996
+ # ----- Dual threshold recommendation (inspired by Deepgram Flux) -----
997
+ # For a language-learning avatar:
998
+ # - eager_threshold (0.3-0.5): start preparing response speculatively
999
+ # (e.g., begin LLM generation) — reduces perceived latency
1000
+ # - final_threshold (0.7+): actually take the turn and speak
1001
+ # Higher final threshold = fewer interruptions (critical for L2 learners)
1002
+ log.info("\n=== Dual Threshold Recommendation (Deepgram Flux-inspired) ===")
1003
+ # Find final threshold: maximize precision with recall >= 85%
1004
+ # (we want very few interruptions, even at the cost of some missed turns)
1005
+ final_candidates = [(k, v) for k, v in threshold_results.items()
1006
+ if v["recall"] >= 0.85 and float(k) >= 0.6]
1007
+ if final_candidates:
1008
+ best_final = min(final_candidates, key=lambda x: x[1]["fp_rate"])
1009
+ log.info(" Recommended final_threshold: %.2f (FP=%.1f%%, prec=%.3f, rec=%.3f)",
1010
+ float(best_final[0]), best_final[1]["fp_rate"] * 100,
1011
+ best_final[1]["precision"], best_final[1]["recall"])
1012
+ else:
1013
+ best_final = ("0.7", threshold_results.get("0.7", {}))
1014
+ log.info(" Default final_threshold: 0.70")
1015
+
1016
+ # Eager threshold: lower confidence where we start speculative processing
1017
+ eager_thresh = max(0.3, float(best_final[0]) - 0.3)
1018
+ log.info(" Recommended eager_threshold: %.2f (start LLM generation speculatively)",
1019
+ eager_thresh)
1020
+ log.info(" Latency savings: ~150-250ms earlier response start (Deepgram Flux benchmark)")
1021
+
1022
+ # ----- ONNX export (FP32 → INT8, per Pipecat's deploy pipeline) -----
1023
+ model = model.to("cpu")
1024
+ onnx_fp32_path = OUTPUT_DIR / "smart_turn_pt_v3_fp32.onnx"
1025
+ onnx_int8_path = OUTPUT_DIR / "smart_turn_pt_v3.onnx"
1026
+ dummy = torch.randn(1, 80, 800)
1027
+ try:
1028
+ # Step 1: Export FP32 ONNX (opset 18, per Pipecat train.py)
1029
+ torch.onnx.export(
1030
+ model, dummy, str(onnx_fp32_path),
1031
+ input_names=["input_features"],
1032
+ output_names=["logits"],
1033
+ dynamic_axes={"input_features": {0: "batch"}, "logits": {0: "batch"}},
1034
+ opset_version=18,
1035
+ )
1036
+ log.info("ONNX FP32 exported: %s (%.1f MB)", onnx_fp32_path,
1037
+ onnx_fp32_path.stat().st_size / 1024 / 1024)
1038
+
1039
+ # Step 2: INT8 static quantization (entropy calibration, per Pipecat)
1040
+ # Pipecat uses: 1024 calibration samples, QDQ format, Entropy method
1041
+ try:
1042
+ from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantFormat
1043
+
1044
+ class SmartTurnCalibrationReader(CalibrationDataReader):
1045
+ def __init__(self, dataset, n_samples=1024):
1046
+ self.data = []
1047
+ for i, sample in enumerate(dataset):
1048
+ if i >= n_samples:
1049
+ break
1050
+ self.data.append({"input_features": sample["input_features"].unsqueeze(0).numpy()})
1051
+ self.idx = 0
1052
+
1053
+ def get_next(self):
1054
+ if self.idx >= len(self.data):
1055
+ return None
1056
+ result = self.data[self.idx]
1057
+ self.idx += 1
1058
+ return result
1059
+
1060
+ # Use validation data for calibration
1061
+ calib_reader = SmartTurnCalibrationReader(val_ds, n_samples=1024)
1062
+ quantize_static(
1063
+ str(onnx_fp32_path),
1064
+ str(onnx_int8_path),
1065
+ calib_reader,
1066
+ quant_format=QuantFormat.QDQ,
1067
+ )
1068
+ log.info("ONNX INT8 exported: %s (%.1f MB)", onnx_int8_path,
1069
+ onnx_int8_path.stat().st_size / 1024 / 1024)
1070
+ except ImportError:
1071
+ log.warning("onnxruntime.quantization not available — saving FP32 only")
1072
+ import shutil
1073
+ shutil.copy2(onnx_fp32_path, onnx_int8_path)
1074
+ except Exception as e:
1075
+ log.warning("INT8 quantization failed: %s — saving FP32 only", e)
1076
+ import shutil
1077
+ shutil.copy2(onnx_fp32_path, onnx_int8_path)
1078
+
1079
+ except Exception as e:
1080
+ log.warning("ONNX export failed: %s", e)
1081
+
1082
+ # ----- Save results -----
1083
+ results = {
1084
+ "model": "smart_turn_pt_v3_finetuned",
1085
+ "whisper_model": whisper_model,
1086
+ "architecture": "SmartTurnV3Model (exact Pipecat)",
1087
+ "target_domain": "language_learning_avatar",
1088
+ "learner_profile": "francophone_learning_portuguese",
1089
+ "total_samples": len(all_samples),
1090
+ "sources": sources,
1091
+ "best_epoch": checkpoint["epoch"],
1092
+ "best_val_f1": best_f1,
1093
+ "test_metrics": test_metrics,
1094
+ "pipecat_test_metrics": pipecat_test_metrics,
1095
+ "pipecat_baseline": {"accuracy": 0.9542, "fp_rate": 0.0279, "fn_rate": 0.0179},
1096
+ "threshold_sweep": threshold_results,
1097
+ "dual_threshold": {
1098
+ "eager_threshold": eager_thresh,
1099
+ "final_threshold": float(best_final[0]),
1100
+ "rationale": "For L2 language learning: higher final threshold reduces "
1101
+ "interruptions. Eager threshold enables speculative LLM "
1102
+ "generation 150-250ms earlier (Deepgram Flux approach).",
1103
+ },
1104
+ "history": history,
1105
+ "config": {
1106
+ "epochs": epochs,
1107
+ "batch_size": batch_size,
1108
+ "lr": lr,
1109
+ "warmup_ratio": warmup_ratio,
1110
+ "label_smoothing": LABEL_SMOOTH,
1111
+ "focal_loss_gamma": 2.0,
1112
+ "focal_loss_alpha": 0.25,
1113
+ "fp_penalty": fp_penalty,
1114
+ "loss_fn": loss_fn,
1115
+ },
1116
+ }
1117
+
1118
+ results_path = OUTPUT_DIR / "training_results.json"
1119
+ with open(results_path, "w") as f:
1120
+ json.dump(results, f, indent=2)
1121
+ log.info("Results saved to %s", results_path)
1122
+
1123
+ return onnx_int8_path
1124
+
1125
+
1126
+ def _evaluate(
1127
+ model: nn.Module,
1128
+ loader: DataLoader,
1129
+ device: str,
1130
+ criterion: nn.Module,
1131
+ threshold: float = 0.5,
1132
+ ) -> dict:
1133
+ """Evaluate model at a given threshold."""
1134
+ correct = total = tp = fp = fn = tn = 0
1135
+ total_loss = 0.0
1136
+
1137
+ with torch.no_grad():
1138
+ for batch in loader:
1139
+ features = batch["input_features"].to(device)
1140
+ labels = batch["labels"].to(device)
1141
+ logits = model(features)
1142
+ loss = criterion(logits, labels)
1143
+ total_loss += loss.item() * len(labels)
1144
+
1145
+ preds = (torch.sigmoid(logits) > threshold).float()
1146
+ hard_labels = (labels > 0.5).float()
1147
+ correct += (preds == hard_labels).sum().item()
1148
+ total += len(labels)
1149
+ tp += ((preds == 1) & (hard_labels == 1)).sum().item()
1150
+ fp += ((preds == 1) & (hard_labels == 0)).sum().item()
1151
+ fn += ((preds == 0) & (hard_labels == 1)).sum().item()
1152
+ tn += ((preds == 0) & (hard_labels == 0)).sum().item()
1153
+
1154
+ accuracy = correct / max(total, 1)
1155
+ precision = tp / max(tp + fp, 1)
1156
+ recall = tp / max(tp + fn, 1)
1157
+ f1 = 2 * precision * recall / max(precision + recall, 1e-8)
1158
+
1159
+ return {
1160
+ "accuracy": round(accuracy, 4),
1161
+ "precision": round(precision, 4),
1162
+ "recall": round(recall, 4),
1163
+ "f1": round(f1, 4),
1164
+ "loss": round(total_loss / max(total, 1), 4),
1165
+ "tp": tp, "fp": fp, "fn": fn, "tn": tn,
1166
+ "fp_rate": round(fp / max(fp + tn, 1), 4),
1167
+ "fn_rate": round(fn / max(fn + tp, 1), 4),
1168
+ }
1169
+
1170
+
1171
+ def _log_metrics(name: str, metrics: dict) -> None:
1172
+ log.info("%s results:", name)
1173
+ log.info(" Accuracy: %.3f", metrics["accuracy"])
1174
+ log.info(" Precision: %.3f", metrics["precision"])
1175
+ log.info(" Recall: %.3f", metrics["recall"])
1176
+ log.info(" F1: %.3f", metrics["f1"])
1177
+ log.info(" FP rate: %.3f", metrics["fp_rate"])
1178
+ log.info(" FN rate: %.3f", metrics["fn_rate"])
1179
+ log.info(" TP=%d FP=%d FN=%d TN=%d", metrics["tp"], metrics["fp"], metrics["fn"], metrics["tn"])
1180
+
1181
+
1182
+ if __name__ == "__main__":
1183
+ logging.basicConfig(
1184
+ level=logging.INFO,
1185
+ format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
1186
+ )
1187
+ random.seed(42)
1188
+ np.random.seed(42)
1189
+ torch.manual_seed(42)
1190
+
1191
+ train(
1192
+ epochs=6,
1193
+ batch_size=128,
1194
+ lr=5e-5,
1195
+ warmup_ratio=0.2,
1196
+ max_pipecat_samples=5000,
1197
+ max_tts_samples=10000,
1198
+ max_l2_samples=2000,
1199
+ whisper_model="openai/whisper-tiny",
1200
+ loss_fn="focal", # or "bce" for Pipecat's original
1201
+ fp_penalty=2.0, # interrupting learner costs 2x more than waiting
1202
+ )
03-finetune-pipecat-pt/05_evaluate.py ADDED
@@ -0,0 +1,511 @@
1
+ """Evaluate fine-tuned model against Pipecat baseline and ONNX reference.
2
+
3
+ Comparisons:
4
+ 1. Our fine-tuned model (PyTorch) on Pipecat PT test set
5
+ 2. Pipecat original ONNX model on same test set
6
+ 3. Threshold sweep to find optimal operating point
7
+ 4. Per-source breakdown (pipecat data vs our TTS data)
8
+
9
+ Run:
10
+ python 05_evaluate.py --model results/best_model.pt
11
+ """
12
+
13
+ from __future__ import annotations
14
+
15
+ import argparse
16
+ import json
17
+ import logging
18
+ import time
19
+ from pathlib import Path
20
+
21
+ import numpy as np
22
+ import torch
23
+ import torch.nn as nn
24
+ from torch.utils.data import DataLoader
25
+
26
+ log = logging.getLogger(__name__)
27
+
28
+ DATA_DIR = Path(__file__).parent / "data"
29
+ _workspace = Path("/workspace") if Path("/workspace").exists() else Path(".")
30
+ RESULTS_DIR = _workspace / "results"
31
+ CACHE_DIR = _workspace / "hf_cache"
32
+
33
+ SAMPLE_RATE = 16000
34
+ WINDOW_SECONDS = 8
35
+ WINDOW_SAMPLES = WINDOW_SECONDS * SAMPLE_RATE
36
+
37
+ # Pipecat baseline (from their blog: Smart Turn v3 Portuguese)
38
+ PIPECAT_BASELINE = {
39
+ "accuracy": 0.9542,
40
+ "fp_rate": 0.0279,
41
+ "fn_rate": 0.0179,
42
+ "source": "https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/",
43
+ }
44
+
45
+
46
+ def load_model(model_path: Path, device: str) -> nn.Module:
47
+ """Load fine-tuned SmartTurnV3Model."""
48
+ # Import from finetune script
49
+ import importlib.util
50
+ spec = importlib.util.spec_from_file_location(
51
+ "finetune", Path(__file__).parent / "04_finetune.py"
52
+ )
53
+ finetune = importlib.util.module_from_spec(spec)
54
+ spec.loader.exec_module(finetune)
55
+
56
+ model = finetune.SmartTurnV3Model(whisper_model="openai/whisper-tiny")
57
+ checkpoint = torch.load(model_path, map_location=device, weights_only=True)
58
+ model.load_state_dict(checkpoint["model_state_dict"])
59
+ model = model.to(device)
60
+ model.eval()
61
+
62
+ log.info("Loaded model from %s (epoch %d, val_f1=%.4f)",
63
+ model_path, checkpoint.get("epoch", -1), checkpoint.get("val_f1", -1))
64
+ return model
65
+
66
+
67
+ def load_onnx_model(onnx_dir: Path | None = None):
68
+ """Load Pipecat ONNX model for comparison."""
69
+ try:
70
+ import onnxruntime as ort
71
+ except ImportError:
72
+ log.warning("onnxruntime not installed — skipping ONNX evaluation")
73
+ return None
74
+
75
+ if onnx_dir is None:
76
+ onnx_dir = DATA_DIR / "pipecat_model"
77
+
78
+ cpu_path = onnx_dir / "smart-turn-v3.2-cpu.onnx"
79
+ gpu_path = onnx_dir / "smart-turn-v3.2-gpu.onnx"
80
+
81
+ onnx_path = cpu_path if cpu_path.exists() else (gpu_path if gpu_path.exists() else None)
82
+ if onnx_path is None:
83
+ log.warning("No Pipecat ONNX model found in %s", onnx_dir)
84
+ return None
85
+
86
+ session = ort.InferenceSession(str(onnx_path))
87
+ log.info("Loaded ONNX model: %s", onnx_path)
88
+ return session
89
+
90
+
91
+ def load_test_data() -> tuple[list, list]:
92
+ """Load test datasets: Pipecat PT test + our TTS test."""
93
+ import soundfile as sf
94
+ from transformers import WhisperFeatureExtractor
95
+
96
+ feature_extractor = WhisperFeatureExtractor(chunk_length=8)
97
+
98
+ # 1. Pipecat test data
99
+ pipecat_samples = []
100
+ test_dir = DATA_DIR / "pipecat_pt_test"
101
+ if test_dir.exists():
102
+ for wav_path in sorted(test_dir.glob("*.wav")):
103
+ try:
104
+ audio, sr = sf.read(str(wav_path))
105
+ audio = np.array(audio, dtype=np.float32)
106
+ if sr != SAMPLE_RATE:
107
+ import torchaudio
108
+ tensor = torch.from_numpy(audio).float().unsqueeze(0)
109
+ audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
110
+
111
+ audio = _extract_window(audio)
112
+ if audio is not None:
113
+ label = 1.0 if "_complete" in wav_path.name else 0.0
114
+ features = feature_extractor(
115
+ audio, sampling_rate=SAMPLE_RATE,
116
+ return_tensors="np", padding="max_length",
117
+ max_length=WINDOW_SAMPLES, truncation=True,
118
+ do_normalize=True,
119
+ ).input_features.squeeze(0).astype(np.float32)
120
+
121
+ pipecat_samples.append({
122
+ "features": features,
123
+ "label": label,
124
+ "source": "pipecat_test",
125
+ "file": wav_path.name,
126
+ })
127
+ except Exception as e:
128
+ pass
129
+
130
+ log.info("Pipecat test: %d samples", len(pipecat_samples))
131
+
132
+ # 2. Our TTS test data
133
+ tts_samples = []
134
+ tts_dir = DATA_DIR / "tts_dataset"
135
+ meta_path = tts_dir / "metadata.json"
136
+ if meta_path.exists():
137
+ with open(meta_path) as f:
138
+ metadata = json.load(f)
139
+
140
+ # Use last 20% as test
141
+ test_start = int(len(metadata) * 0.8)
142
+ test_meta = metadata[test_start:]
143
+
144
+ audio_dir = tts_dir / "audio"
145
+ for meta in test_meta:
146
+ wav_path = audio_dir / meta["file"]
147
+ if not wav_path.exists():
148
+ continue
149
+ try:
150
+ audio, sr = sf.read(str(wav_path))
151
+ audio = np.array(audio, dtype=np.float32)
152
+ if sr != SAMPLE_RATE:
153
+ import torchaudio
154
+ tensor = torch.from_numpy(audio).float().unsqueeze(0)
155
+ audio = torchaudio.functional.resample(tensor, sr, SAMPLE_RATE).squeeze().numpy()
156
+
157
+ audio = _extract_window(audio)
158
+ if audio is not None:
159
+ label = 1.0 if meta["label"] == "complete" else 0.0
160
+ features = feature_extractor(
161
+ audio, sampling_rate=SAMPLE_RATE,
162
+ return_tensors="np", padding="max_length",
163
+ max_length=WINDOW_SAMPLES, truncation=True,
164
+ do_normalize=True,
165
+ ).input_features.squeeze(0).astype(np.float32)
166
+
167
+ tts_samples.append({
168
+ "features": features,
169
+ "label": label,
170
+ "source": meta.get("accent", "unknown"),
171
+ "file": meta["file"],
172
+ })
173
+ except Exception as e:
174
+ log.warning("Skipping %s: %s", wav_path.name, e)
175
+
176
+ log.info("TTS test: %d samples", len(tts_samples))
177
+ return pipecat_samples, tts_samples
178
+
179
+
180
+ def _extract_window(audio: np.ndarray) -> np.ndarray | None:
181
+ """Extract 8s window from end of audio."""
182
+ if len(audio) < SAMPLE_RATE:
183
+ return None
184
+ peak = np.max(np.abs(audio))
185
+ if peak > 0:
186
+ audio = audio / peak * 0.9
187
+ if len(audio) > WINDOW_SAMPLES:
188
+ audio = audio[-WINDOW_SAMPLES:]
189
+ elif len(audio) < WINDOW_SAMPLES:
190
+ audio = np.pad(audio, (WINDOW_SAMPLES - len(audio), 0), mode="constant")
191
+ audio[-int(0.2 * SAMPLE_RATE):] = 0.0
192
+ return audio.astype(np.float32)
193
+
194
+
195
+ # ---------------------------------------------------------------------------
196
+ # Evaluation functions
197
+ # ---------------------------------------------------------------------------
198
+
199
+ def evaluate_pytorch(
200
+ model: nn.Module,
201
+ samples: list[dict],
202
+ device: str,
203
+ threshold: float = 0.5,
204
+ ) -> dict:
205
+ """Evaluate PyTorch model on samples."""
206
+ if not samples:
207
+ return {}
208
+
209
+ tp = fp = fn = tn = 0
210
+ latencies = []
211
+
212
+ with torch.no_grad():
213
+ for s in samples:
214
+ features = torch.from_numpy(s["features"]).unsqueeze(0).to(device)
215
+
216
+ t0 = time.perf_counter()
217
+ logits = model(features)
218
+ latency_ms = (time.perf_counter() - t0) * 1000
219
+ latencies.append(latency_ms)
220
+
221
+ pred = float(torch.sigmoid(logits).item() > threshold)
222
+ label = s["label"]
223
+
224
+ if pred == 1 and label == 1:
225
+ tp += 1
226
+ elif pred == 1 and label == 0:
227
+ fp += 1
228
+ elif pred == 0 and label == 1:
229
+ fn += 1
230
+ else:
231
+ tn += 1
232
+
233
234
+ return _compute_metrics(tp, fp, fn, tn, latencies)
235
+
236
+
237
+ def evaluate_onnx(
238
+ session,
239
+ samples: list[dict],
240
+ threshold: float = 0.5,
241
+ ) -> dict:
242
+ """Evaluate ONNX model on samples."""
243
+ if not samples or session is None:
244
+ return {}
245
+
246
+ tp = fp = fn = tn = 0
247
+ latencies = []
248
+
249
+ input_name = session.get_inputs()[0].name
250
+ for s in samples:
251
+ features = s["features"][np.newaxis, ...]
252
+
253
+ t0 = time.perf_counter()
254
+ outputs = session.run(None, {input_name: features})
255
+ latency_ms = (time.perf_counter() - t0) * 1000
256
+ latencies.append(latency_ms)
257
+
258
+ logit = outputs[0].item() if outputs[0].size == 1 else outputs[0][0].item()
259
+ pred = float(1.0 / (1.0 + np.exp(-logit)) > threshold)
260
+ label = s["label"]
261
+
262
+ if pred == 1 and label == 1:
263
+ tp += 1
264
+ elif pred == 1 and label == 0:
265
+ fp += 1
266
+ elif pred == 0 and label == 1:
267
+ fn += 1
268
+ else:
269
+ tn += 1
270
+
271
+ return _compute_metrics(tp, fp, fn, tn, latencies)
272
+
273
+
274
+ def _compute_metrics(tp, fp, fn, tn, latencies=None) -> dict:
275
+ total = tp + fp + fn + tn
276
+ accuracy = (tp + tn) / max(total, 1)
277
+ precision = tp / max(tp + fp, 1)
278
+ recall = tp / max(tp + fn, 1)
279
+ f1 = 2 * precision * recall / max(precision + recall, 1e-8)
280
+ fp_rate = fp / max(fp + tn, 1)
281
+ fn_rate = fn / max(fn + tp, 1)
282
+
283
+ metrics = {
284
+ "accuracy": round(accuracy, 4),
285
+ "precision": round(precision, 4),
286
+ "recall": round(recall, 4),
287
+ "f1": round(f1, 4),
288
+ "fp_rate": round(fp_rate, 4),
289
+ "fn_rate": round(fn_rate, 4),
290
+ "tp": tp, "fp": fp, "fn": fn, "tn": tn,
291
+ "total": total,
292
+ }
293
+
294
+ if latencies:
295
+ metrics["latency_mean_ms"] = round(np.mean(latencies), 2)
296
+ metrics["latency_p50_ms"] = round(np.median(latencies), 2)
297
+ metrics["latency_p95_ms"] = round(np.percentile(latencies, 95), 2)
298
+
299
+ return metrics
300
+
301
+
302
+ # ---------------------------------------------------------------------------
303
+ # Main evaluation
304
+ # ---------------------------------------------------------------------------
305
+
306
+ def run_evaluation(model_path: Path) -> dict:
307
+ """Run full evaluation suite."""
308
+ device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
309
+ log.info("Evaluating on device: %s", device)
310
+
311
+ # Load models
312
+ model = load_model(model_path, device)
313
+ onnx_session = load_onnx_model()
314
+
315
+ # Load test data
316
+ pipecat_samples, tts_samples = load_test_data()
317
+ all_samples = pipecat_samples + tts_samples
318
+
319
+ results = {"pipecat_baseline": PIPECAT_BASELINE}
320
+
321
+ # 1. Our model on all test data
322
+ log.info("\n=== Our model (all test data) ===")
323
+ our_all = evaluate_pytorch(model, all_samples, device)
324
+ if our_all:
325
+ _log_metrics("Our model (all)", our_all)
326
+ results["our_model_all"] = our_all
327
+
328
+ # 2. Our model on Pipecat test only
329
+ if pipecat_samples:
330
+ log.info("\n=== Our model (Pipecat PT test only) ===")
331
+ our_pipecat = evaluate_pytorch(model, pipecat_samples, device)
332
+ _log_metrics("Our model (Pipecat test)", our_pipecat)
333
+ results["our_model_pipecat_test"] = our_pipecat
334
+
335
+ # Compare with baseline
336
+ diff_acc = our_pipecat["accuracy"] - PIPECAT_BASELINE["accuracy"]
337
+ diff_fp = our_pipecat["fp_rate"] - PIPECAT_BASELINE["fp_rate"]
338
+ diff_fn = our_pipecat["fn_rate"] - PIPECAT_BASELINE["fn_rate"]
339
+ log.info(" vs Pipecat baseline: accuracy %+.2f%%, FP %+.2f%%, FN %+.2f%%",
340
+ diff_acc * 100, diff_fp * 100, diff_fn * 100)
341
+
342
+ # 3. Our model per accent
343
+ for accent_name, accent_filter in [("native_pt_br", "native_pt_br"), ("french_pt", "french_pt")]:
344
+ accent_samples = [s for s in tts_samples if s["source"] == accent_filter]
345
+ if accent_samples:
346
+ log.info("\n=== Our model (%s) ===", accent_name)
347
+ m = evaluate_pytorch(model, accent_samples, device)
348
+ _log_metrics(f"Our model ({accent_name})", m)
349
+ results[f"our_model_{accent_name}"] = m
350
+
351
+ # 4. ONNX reference model
352
+ if onnx_session and pipecat_samples:
353
+ log.info("\n=== Pipecat ONNX model (reference) ===")
354
+ onnx_metrics = evaluate_onnx(onnx_session, pipecat_samples)
355
+ if onnx_metrics:
356
+ _log_metrics("Pipecat ONNX", onnx_metrics)
357
+ results["pipecat_onnx"] = onnx_metrics
358
+
359
+ # 5. Threshold sweep
360
+ log.info("\n=== Threshold Sweep (all test data) ===")
361
+ sweep = {}
362
+ for thresh in [0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]:
363
+ m = evaluate_pytorch(model, all_samples, device, threshold=thresh)
364
+ sweep[str(thresh)] = m
365
+ log.info(" threshold=%.2f: acc=%.3f prec=%.3f rec=%.3f f1=%.3f FP=%.3f FN=%.3f",
366
+ thresh, m["accuracy"], m["precision"], m["recall"],
367
+ m["f1"], m["fp_rate"], m["fn_rate"])
368
+
369
+ results["threshold_sweep"] = sweep
370
+
371
+ # Find optimal thresholds
372
+ best_f1_thresh = max(sweep.items(), key=lambda x: x[1]["f1"])
373
+ best_prec_thresh = max(
374
+ [(k, v) for k, v in sweep.items() if v["recall"] >= 0.90],
375
+ key=lambda x: x[1]["precision"],
376
+ default=(None, None),
377
+ )
378
+
379
+ log.info("\n=== Recommendations ===")
380
+ log.info(" Best F1: threshold=%.2f (F1=%.3f, prec=%.3f, rec=%.3f)",
381
+ float(best_f1_thresh[0]), best_f1_thresh[1]["f1"],
382
+ best_f1_thresh[1]["precision"], best_f1_thresh[1]["recall"])
383
+ if best_prec_thresh[0]:
384
+ log.info(" Best precision (recall>=90%%): threshold=%.2f (prec=%.3f, rec=%.3f)",
385
+ float(best_prec_thresh[0]), best_prec_thresh[1]["precision"],
386
+ best_prec_thresh[1]["recall"])
387
+
388
+ # ----- Dual Threshold for Language Learning Avatar -----
389
+ # Inspired by Deepgram Flux (eot_threshold + eager_eot_threshold)
390
+ # For L2 learners: higher final threshold to avoid interrupting mid-hesitation
391
+ log.info("\n=== Dual Threshold (Language Learning Mode) ===")
392
+ log.info(" Context: conversational avatar for francophones learning Portuguese")
393
+ log.info(" Priority: minimize interruptions (FP) over fast response (FN)")
394
+
395
+ # Final threshold: minimize FP rate while keeping recall >= 85%
396
+ final_candidates = [(k, v) for k, v in sweep.items()
397
+ if v["recall"] >= 0.85 and float(k) >= 0.6]
398
+ if final_candidates:
399
+ best_final = min(final_candidates, key=lambda x: x[1]["fp_rate"])
400
+ final_thresh = float(best_final[0])
401
+ log.info(" Final threshold: %.2f (FP=%.1f%%, rec=%.1f%%)",
402
+ final_thresh, best_final[1]["fp_rate"] * 100,
403
+ best_final[1]["recall"] * 100)
404
+ else:
405
+ final_thresh = 0.7
406
+ log.info(" Final threshold: 0.70 (default)")
407
+
408
+ # Eager threshold: lower confidence for speculative LLM generation
409
+ eager_thresh = max(0.3, final_thresh - 0.3)
410
+ log.info(" Eager threshold: %.2f (start speculative LLM prep)", eager_thresh)
411
+ log.info(" Expected latency savings: ~150-250ms on response start")
412
+
413
+ # Simulate dual-threshold behavior
414
+ if all_samples:
415
+ log.info("\n Dual-threshold simulation:")
416
+ eager_m = evaluate_pytorch(model, all_samples, device, threshold=eager_thresh)
417
+ final_m = evaluate_pytorch(model, all_samples, device, threshold=final_thresh)
418
+ log.info(" Eager (%.2f): would trigger on %.1f%% of samples (%.1f%% false triggers)",
419
+ eager_thresh, (eager_m["tp"] + eager_m["fp"]) / max(eager_m["total"], 1) * 100,
420
+ eager_m["fp_rate"] * 100)
421
+ log.info(" Final (%.2f): confirms %.1f%% of turns (%.1f%% false confirms, %.1f%% missed)",
422
+ final_thresh, final_m["recall"] * 100,
423
+ final_m["fp_rate"] * 100, final_m["fn_rate"] * 100)
424
+ wasted_speculative = eager_m["fp"] - final_m["fp"]
425
+ log.info(" Wasted speculative preps: %d (started but not confirmed)", max(0, wasted_speculative))
426
+
427
+ results["recommended"] = {
428
+ "best_f1_threshold": float(best_f1_thresh[0]),
429
+ "best_precision_threshold": float(best_prec_thresh[0]) if best_prec_thresh[0] else None,
430
+ "dual_threshold": {
431
+ "eager": eager_thresh,
432
+ "final": final_thresh,
433
+ "mode": "language_learning",
434
+ "rationale": "Higher final threshold minimizes interruptions for L2 learners "
435
+ "who pause 500-2000ms mid-sentence. Eager threshold enables "
436
+ "speculative LLM prep for lower perceived latency.",
437
+ },
438
+ }
439
+
440
+ # 6. Summary comparison table
441
+ log.info("\n" + "=" * 70)
442
+ log.info("SUMMARY COMPARISON")
443
+ log.info("=" * 70)
444
+ log.info("%-30s %8s %8s %8s %8s", "Model", "Acc", "FP%", "FN%", "F1")
445
+ log.info("-" * 70)
446
+ log.info("%-30s %7.1f%% %7.2f%% %7.2f%% %8s",
447
+ "Pipecat v3.2 (baseline)",
448
+ PIPECAT_BASELINE["accuracy"] * 100,
449
+ PIPECAT_BASELINE["fp_rate"] * 100,
450
+ PIPECAT_BASELINE["fn_rate"] * 100,
451
+ "~0.96")
452
+
453
+ if "pipecat_onnx" in results:
454
+ m = results["pipecat_onnx"]
455
+ log.info("%-30s %7.1f%% %7.2f%% %7.2f%% %7.3f",
456
+ "Pipecat ONNX (our eval)",
457
+ m["accuracy"] * 100, m["fp_rate"] * 100,
458
+ m["fn_rate"] * 100, m["f1"])
459
+
460
+ if "our_model_pipecat_test" in results:
461
+ m = results["our_model_pipecat_test"]
462
+ log.info("%-30s %7.1f%% %7.2f%% %7.2f%% %7.3f",
463
+ "Our model (Pipecat test)",
464
+ m["accuracy"] * 100, m["fp_rate"] * 100,
465
+ m["fn_rate"] * 100, m["f1"])
466
+
467
+ if "our_model_all" in results:
468
+ m = results["our_model_all"]
469
+ log.info("%-30s %7.1f%% %7.2f%% %7.2f%% %7.3f",
470
+ "Our model (all test)",
471
+ m["accuracy"] * 100, m["fp_rate"] * 100,
472
+ m["fn_rate"] * 100, m["f1"])
473
+
474
+ log.info("=" * 70)
475
+
476
+ # Save results
477
+ RESULTS_DIR.mkdir(parents=True, exist_ok=True)
478
+ results_path = RESULTS_DIR / "evaluation_results.json"
479
+ with open(results_path, "w") as f:
480
+ json.dump(results, f, indent=2)
481
+ log.info("\nResults saved to %s", results_path)
482
+
483
+ return results
484
+
485
+
486
+ def _log_metrics(name: str, m: dict) -> None:
487
+ log.info("%s:", name)
488
+ log.info(" Accuracy: %.3f (%.1f%%)", m["accuracy"], m["accuracy"] * 100)
489
+ log.info(" Precision: %.3f", m["precision"])
490
+ log.info(" Recall: %.3f", m["recall"])
491
+ log.info(" F1: %.3f", m["f1"])
492
+ log.info(" FP rate: %.3f (%.1f%%)", m["fp_rate"], m["fp_rate"] * 100)
493
+ log.info(" FN rate: %.3f (%.1f%%)", m["fn_rate"], m["fn_rate"] * 100)
494
+ log.info(" TP=%d FP=%d FN=%d TN=%d (total=%d)", m["tp"], m["fp"], m["fn"], m["tn"], m["total"])
495
+ if "latency_mean_ms" in m:
496
+ log.info(" Latency: mean=%.1fms p50=%.1fms p95=%.1fms",
497
+ m["latency_mean_ms"], m["latency_p50_ms"], m["latency_p95_ms"])
498
+
499
+
500
+ if __name__ == "__main__":
501
+ logging.basicConfig(
502
+ level=logging.INFO,
503
+ format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
504
+ )
505
+
506
+ parser = argparse.ArgumentParser()
507
+ parser.add_argument("--model", type=str, default=str(RESULTS_DIR / "best_model.pt"),
508
+ help="Path to fine-tuned model checkpoint")
509
+ args = parser.parse_args()
510
+
511
+ run_evaluation(Path(args.model))
03-finetune-pipecat-pt/06_inference.py ADDED
@@ -0,0 +1,560 @@
1
+ """Inference engine for the language-learning avatar turn-taking model.
2
+
3
+ This module implements the dual-threshold + backchannel system designed
4
+ for a conversational avatar that teaches Portuguese to French speakers.
5
+
6
+ Key insight: L2 learners pause 1-3s mid-sentence (word search, conjugation,
7
+ code-switching). A naive system either:
8
+ a) Interrupts them (bad UX — kills confidence), or
9
+ b) Stays silent too long (learner thinks avatar froze)
10
+
11
+ Solution: dual-threshold with backchannel signals.
12
+
13
+ Architecture:
14
+ Audio stream → SmartTurnV3 model → confidence score (0.0-1.0)
15
+
16
+ ├─ score < eager_threshold (0.4) → LISTENING (do nothing)
17
+ ├─ eager ≤ score < final (0.7) → PREPARING (backchannel + speculative LLM)
18
+ └─ score ≥ final_threshold (0.7) → RESPONDING (take the turn)
19
+
20
+ Silence timer (parallel):
21
+ 0-600ms → normal (no action)
22
+ 600ms-1.5s → visual backchannel (nod, eye contact)
23
+ 1.5s-3.0s → verbal backchannel ("mhm", "continue...")
24
+ 3.0s+ → encouragement ("sem pressa, pode pensar...")
25
+
26
+ References:
27
+ - Deepgram Flux: eot_threshold + eager_eot_threshold (2025)
28
+ - ConversAR: "infinite thinking period" for L2 learners (2025)
29
+ - Tavus: 600ms response latency threshold (2025)
30
+ - Kosmala (2022): L2 repair pauses average 844ms
31
+ - Cenoz (2000): L2 silent pauses range 205ms to 11,569ms
32
+
33
+ Usage:
34
+ engine = TurnTakingEngine(model_path="results/smart_turn_pt_v3.onnx")
35
+ engine.on_state_change(my_callback)
36
+
37
+ # Feed audio chunks continuously
38
+ for chunk in audio_stream:
39
+ engine.feed_audio(chunk)
40
+ """
41
+
42
+ from __future__ import annotations
43
+
44
+ import logging
45
+ import time
46
+ from dataclasses import dataclass, field
47
+ from enum import Enum
48
+ from pathlib import Path
49
+ from typing import Callable
50
+
51
+ import numpy as np
52
+
53
+ log = logging.getLogger(__name__)
54
+
55
+ SAMPLE_RATE = 16000
56
+ WINDOW_SECONDS = 8
57
+ WINDOW_SAMPLES = WINDOW_SECONDS * SAMPLE_RATE
58
+
59
+
60
+ # ---------------------------------------------------------------------------
61
+ # States and events
62
+ # ---------------------------------------------------------------------------
63
+
64
+ class TurnState(Enum):
65
+ """Avatar turn-taking state machine."""
66
+ LISTENING = "listening" # User is speaking, avatar listens
67
+ SILENCE = "silence" # User stopped, waiting to classify
68
+ BACKCHANNEL_VISUAL = "bc_visual" # Nod/eye contact (600ms-1.5s silence)
69
+ BACKCHANNEL_VERBAL = "bc_verbal" # "Mhm" / "continue" (1.5s-3.0s)
70
+ ENCOURAGEMENT = "encouragement" # "Sem pressa..." (3.0s+ silence)
71
+ PREPARING = "preparing" # Eager threshold hit, speculative LLM started
72
+ RESPONDING = "responding" # Final threshold hit, avatar speaks
73
+
74
+
75
+ @dataclass
76
+ class TurnEvent:
77
+ """Event emitted on state transitions."""
78
+ state: TurnState
79
+ confidence: float # Model's end-of-turn confidence (0-1)
80
+ silence_duration_ms: float # How long the user has been silent
81
+ timestamp: float # time.monotonic()
82
+ message: str = "" # Backchannel text (if applicable)
83
+
84
+
85
+ @dataclass
86
+ class BackchannelConfig:
87
+ """Configurable backchannel behavior for the avatar.
88
+
89
+ These can be tuned per learner profile or CEFR level:
90
+ - A1/A2: longer delays, more encouraging messages
91
+ - B1/B2: shorter delays, less frequent backchannels
92
+ """
93
+ # Silence thresholds (milliseconds)
94
+ visual_backchannel_ms: float = 600.0
95
+ verbal_backchannel_ms: float = 1500.0
96
+ encouragement_ms: float = 3000.0
97
+
98
+ # Model confidence thresholds
99
+ eager_threshold: float = 0.4 # Start speculative LLM generation
100
+ final_threshold: float = 0.7 # Confirm end-of-turn, avatar speaks
101
+
102
+ # Backchannel messages (Portuguese, with French-speaker-friendly variants)
103
+ visual_actions: list[str] = field(default_factory=lambda: [
104
+ "nod", # Aceno de cabeça
105
+ "eye_contact", # Olhar atento
106
+ "slight_smile", # Sorriso leve
107
+ ])
108
+
109
+ verbal_backchannels: list[str] = field(default_factory=lambda: [
110
+ "Mhm...",
111
+ "Uhum...",
112
+ "Sim...",
113
+ "Tá...",
114
+ "Sei...",
115
+ "Continue...",
116
+ ])
117
+
118
+ encouragement_messages: list[str] = field(default_factory=lambda: [
119
+ "Pode continuar, sem pressa...",
120
+ "Tá pensando? Tranquilo...",
121
+ "Pode pensar, eu espero...",
122
+ "Sem pressa, tá tudo bem...",
123
+ "Take your time... pode falar em português...", # Code-switch friendly
124
+ "Prenez votre temps... quando estiver pronto...", # French reassurance
125
+ ])
126
+
127
+ # Cooldowns (don't spam backchannels)
128
+ verbal_cooldown_ms: float = 3000.0 # Min time between verbal backchannels
129
+ encouragement_cooldown_ms: float = 8000.0 # Min time between encouragements
130
+
131
+
132
+ # ---------------------------------------------------------------------------
133
+ # CEFR-aware presets
134
+ # ---------------------------------------------------------------------------
135
+
136
+ CEFR_PRESETS: dict[str, BackchannelConfig] = {
137
+ "A1": BackchannelConfig(
138
+ visual_backchannel_ms=500.0,
139
+ verbal_backchannel_ms=1200.0,
140
+ encouragement_ms=2500.0,
141
+ eager_threshold=0.5, # Higher — need more confidence before preparing
142
+ final_threshold=0.8, # Much higher — very patient, rarely interrupt
143
+ ),
144
+ "A2": BackchannelConfig(
145
+ visual_backchannel_ms=550.0,
146
+ verbal_backchannel_ms=1300.0,
147
+ encouragement_ms=2800.0,
148
+ eager_threshold=0.45,
149
+ final_threshold=0.75,
150
+ ),
151
+ "B1": BackchannelConfig(
152
+ visual_backchannel_ms=600.0,
153
+ verbal_backchannel_ms=1500.0,
154
+ encouragement_ms=3000.0,
155
+ eager_threshold=0.4,
156
+ final_threshold=0.7, # Default
157
+ ),
158
+ "B2": BackchannelConfig(
159
+ visual_backchannel_ms=600.0,
160
+ verbal_backchannel_ms=1800.0,
161
+ encouragement_ms=4000.0,
162
+ eager_threshold=0.35,
163
+ final_threshold=0.65, # More responsive — B2 pauses less
164
+ ),
165
+ "C1": BackchannelConfig(
166
+ visual_backchannel_ms=600.0,
167
+ verbal_backchannel_ms=2000.0,
168
+ encouragement_ms=5000.0,
169
+ eager_threshold=0.35,
170
+ final_threshold=0.6, # Near-native responsiveness
171
+ ),
172
+ }
173
+
174
+
175
+ # ---------------------------------------------------------------------------
176
+ # Turn-taking engine
177
+ # ---------------------------------------------------------------------------
178
+
179
+ class TurnTakingEngine:
180
+ """Real-time turn-taking engine for language-learning avatar.
181
+
182
+ Combines the SmartTurnV3 model with a silence timer and backchannel
183
+ state machine to create a patient, encouraging conversational partner.
184
+
185
+ Usage:
186
+ engine = TurnTakingEngine("results/smart_turn_pt_v3.onnx")
187
+ engine.on_state_change(handle_event)
188
+
189
+ # In audio loop:
190
+ for chunk in mic_stream:
191
+ engine.feed_audio(chunk)
192
+
193
+ # Or feed pre-computed features:
194
+ engine.feed_score(model_confidence, is_speech=True)
195
+ """
196
+
197
+ def __init__(
198
+ self,
199
+ model_path: str | Path | None = None,
200
+ config: BackchannelConfig | None = None,
201
+ cefr_level: str = "B1",
202
+ ):
203
+ # Config: use CEFR preset or custom
204
+ if config is not None:
205
+ self.config = config
206
+ elif cefr_level in CEFR_PRESETS:
207
+ self.config = CEFR_PRESETS[cefr_level]
208
+ log.info("Using CEFR %s preset: eager=%.2f, final=%.2f",
209
+ cefr_level, self.config.eager_threshold, self.config.final_threshold)
210
+ else:
211
+ self.config = BackchannelConfig()
212
+
213
+ # State
214
+ self._state = TurnState.LISTENING
215
+ self._silence_start: float | None = None
216
+ self._last_speech_time: float = time.monotonic()
217
+ self._last_verbal_bc: float = 0.0
218
+ self._last_encouragement: float = 0.0
219
+ self._last_confidence: float = 0.0
220
+ self._callbacks: list[Callable[[TurnEvent], None]] = []
221
+ self._audio_buffer = np.zeros(WINDOW_SAMPLES, dtype=np.float32)
222
+ self._speculative_started = False
223
+
224
+ # Load ONNX model if provided
225
+ self._session = None
226
+ self._feature_extractor = None
227
+ if model_path is not None:
228
+ self._load_model(model_path)
229
+
230
+ def _load_model(self, model_path: str | Path) -> None:
231
+ """Load ONNX model for inference."""
232
+ try:
233
+ import onnxruntime as ort
234
+ self._session = ort.InferenceSession(
235
+ str(model_path),
236
+ providers=["CPUExecutionProvider"],
237
+ )
238
+ log.info("Loaded turn-taking model: %s", model_path)
239
+ except Exception as e:
240
+ log.warning("Failed to load model %s: %s — running in manual mode", model_path, e)
241
+
242
+ try:
243
+ from transformers import WhisperFeatureExtractor
244
+ self._feature_extractor = WhisperFeatureExtractor(chunk_length=WINDOW_SECONDS)
245
+ except ImportError:
246
+ log.warning("transformers not available — cannot extract features")
247
+
248
+ def on_state_change(self, callback: Callable[[TurnEvent], None]) -> None:
249
+ """Register callback for state change events."""
250
+ self._callbacks.append(callback)
251
+
252
+ @property
253
+ def state(self) -> TurnState:
254
+ return self._state
255
+
256
+ @property
257
+ def silence_duration_ms(self) -> float:
258
+ if self._silence_start is None:
259
+ return 0.0
260
+ return (time.monotonic() - self._silence_start) * 1000
261
+
262
+ # ----- Audio input -----
263
+
264
+ def feed_audio(self, chunk: np.ndarray, is_speech: bool | None = None) -> TurnEvent | None:
265
+ """Feed an audio chunk and get turn-taking decision.
266
+
267
+ chunk: float32 audio at 16kHz
268
+ is_speech: if None, will use simple energy-based VAD
269
+
270
+ Returns TurnEvent if state changed, None otherwise.
271
+ """
272
+ now = time.monotonic()
273
+
274
+ # Update audio buffer (sliding window)
275
+ chunk = chunk.astype(np.float32)
276
+ if len(chunk) >= WINDOW_SAMPLES:
277
+ self._audio_buffer = chunk[-WINDOW_SAMPLES:]
278
+ else:
279
+ self._audio_buffer = np.roll(self._audio_buffer, -len(chunk))
280
+ self._audio_buffer[-len(chunk):] = chunk
281
+
282
+ # Simple VAD if not provided
283
+ if is_speech is None:
284
+ rms = np.sqrt(np.mean(chunk ** 2))
285
+ is_speech = rms > 0.01 # Simple threshold
286
+
287
+ if is_speech:
288
+ return self._on_speech(now)
289
+ else:
290
+ return self._on_silence(now)
291
+
292
+ def feed_score(self, confidence: float, is_speech: bool) -> TurnEvent | None:
293
+ """Feed pre-computed model confidence score.
294
+
295
+ Use this when you run the model externally and just want
296
+ the backchannel/threshold logic.
297
+
298
+ confidence: 0.0 (definitely incomplete) to 1.0 (definitely complete)
299
+ is_speech: whether VAD detected speech in this frame
300
+ """
301
+ now = time.monotonic()
302
+ self._last_confidence = confidence
303
+
304
+ if is_speech:
305
+ return self._on_speech(now)
306
+ else:
307
+ return self._on_silence(now, confidence=confidence)
308
+
309
+ # ----- Internal state machine -----
310
+
311
+ def _on_speech(self, now: float) -> TurnEvent | None:
312
+ """User is speaking."""
313
+ self._last_speech_time = now
314
+ self._silence_start = None
315
+ self._speculative_started = False
316
+
317
+ if self._state != TurnState.LISTENING:
318
+ return self._transition(TurnState.LISTENING, confidence=0.0, now=now)
319
+ return None
320
+
321
+ def _on_silence(self, now: float, confidence: float | None = None) -> TurnEvent | None:
322
+ """User is silent — run the state machine."""
323
+ # Mark silence start
324
+ if self._silence_start is None:
325
+ self._silence_start = now
326
+
327
+ silence_ms = (now - self._silence_start) * 1000
328
+
329
+ # Get model confidence if we have a model and no external score
330
+ if confidence is None:
331
+ confidence = self._run_model()
332
+ self._last_confidence = confidence
333
+
334
+ cfg = self.config
335
+
336
+ # ----- Decision tree -----
337
+
338
+ # 1. Final threshold → RESPONDING (take the turn)
339
+ if confidence >= cfg.final_threshold:
340
+ return self._transition(TurnState.RESPONDING, confidence, now)
341
+
342
+ # 2. Eager threshold → PREPARING (speculative LLM, backchannel)
343
+ if confidence >= cfg.eager_threshold and not self._speculative_started:
344
+ self._speculative_started = True
345
+ return self._transition(TurnState.PREPARING, confidence, now)
346
+
347
+ # 3. Silence-based backchannels (even if model is unsure)
348
+ # These keep the learner engaged so they don't think the avatar froze
349
+
350
+ if silence_ms >= cfg.encouragement_ms:
351
+ if now - self._last_encouragement >= cfg.encouragement_cooldown_ms / 1000:
352
+ self._last_encouragement = now
353
+ return self._transition(TurnState.ENCOURAGEMENT, confidence, now)
354
+
355
+ if silence_ms >= cfg.verbal_backchannel_ms:
356
+ if now - self._last_verbal_bc >= cfg.verbal_cooldown_ms / 1000:
357
+ self._last_verbal_bc = now
358
+ return self._transition(TurnState.BACKCHANNEL_VERBAL, confidence, now)
359
+
360
+ if silence_ms >= cfg.visual_backchannel_ms:
361
+ if self._state not in (TurnState.BACKCHANNEL_VISUAL,
362
+ TurnState.BACKCHANNEL_VERBAL,
363
+ TurnState.ENCOURAGEMENT,
364
+ TurnState.PREPARING):
365
+ return self._transition(TurnState.BACKCHANNEL_VISUAL, confidence, now)
366
+
367
+ # Still in silence, no state change
368
+ if self._state == TurnState.LISTENING:
369
+ return self._transition(TurnState.SILENCE, confidence, now)
370
+
371
+ return None
372
+
373
+ def _transition(self, new_state: TurnState, confidence: float, now: float) -> TurnEvent:
374
+ """Transition to a new state and emit event."""
375
+ import random
376
+
377
+ old_state = self._state
378
+ self._state = new_state
379
+
380
+ silence_ms = (now - self._silence_start) * 1000 if self._silence_start else 0.0
381
+
382
+ # Pick appropriate message
383
+ message = ""
384
+ if new_state == TurnState.BACKCHANNEL_VISUAL:
385
+ message = random.choice(self.config.visual_actions)
386
+ elif new_state == TurnState.BACKCHANNEL_VERBAL:
387
+ message = random.choice(self.config.verbal_backchannels)
388
+ elif new_state == TurnState.ENCOURAGEMENT:
389
+ message = random.choice(self.config.encouragement_messages)
390
+ elif new_state == TurnState.PREPARING:
391
+ message = "speculative_llm_start"
392
+ elif new_state == TurnState.RESPONDING:
393
+ message = "take_turn"
394
+
395
+ event = TurnEvent(
396
+ state=new_state,
397
+ confidence=confidence,
398
+ silence_duration_ms=silence_ms,
399
+ timestamp=now,
400
+ message=message,
401
+ )
402
+
403
+ if old_state != new_state:
404
+ log.debug("Turn state: %s → %s (conf=%.2f, silence=%.0fms, msg=%s)",
405
+ old_state.value, new_state.value, confidence, silence_ms, message)
406
+
407
+ for cb in self._callbacks:
408
+ try:
409
+ cb(event)
410
+ except Exception as e:
411
+ log.warning("Callback error: %s", e)
412
+
413
+ return event
414
+
415
+ def _run_model(self) -> float:
416
+ """Run the ONNX model on current audio buffer."""
417
+ if self._session is None or self._feature_extractor is None:
418
+ return 0.0
419
+
420
+ try:
421
+ inputs = self._feature_extractor(
422
+ self._audio_buffer,
423
+ sampling_rate=SAMPLE_RATE,
424
+ return_tensors="np",
425
+ padding="max_length",
426
+ max_length=WINDOW_SAMPLES,
427
+ truncation=True,
428
+ do_normalize=True,
429
+ )
430
+ features = inputs.input_features.astype(np.float32)
431
+
432
+ input_name = self._session.get_inputs()[0].name
433
+ outputs = self._session.run(None, {input_name: features})
434
+ logit = outputs[0].item() if outputs[0].size == 1 else outputs[0][0].item()
435
+
436
+ # Sigmoid
437
+ confidence = 1.0 / (1.0 + np.exp(-logit))
438
+ return float(confidence)
439
+ except Exception as e:
440
+ log.warning("Model inference error: %s", e)
441
+ return 0.0
442
+
443
+ # ----- Convenience -----
444
+
445
+ def reset(self) -> None:
446
+ """Reset state (e.g., when starting a new conversation turn)."""
447
+ self._state = TurnState.LISTENING
448
+ self._silence_start = None
449
+ self._speculative_started = False
450
+ self._last_confidence = 0.0
451
+ self._audio_buffer = np.zeros(WINDOW_SAMPLES, dtype=np.float32)
452
+
453
+ def set_cefr_level(self, level: str) -> None:
454
+ """Change CEFR level at runtime (e.g., after proficiency assessment)."""
455
+ if level in CEFR_PRESETS:
456
+ self.config = CEFR_PRESETS[level]
457
+ log.info("Switched to CEFR %s: eager=%.2f, final=%.2f",
458
+ level, self.config.eager_threshold, self.config.final_threshold)
459
+ else:
460
+ log.warning("Unknown CEFR level: %s", level)
461
+
462
+
463
+ # ---------------------------------------------------------------------------
464
+ # Example usage and demo
465
+ # ---------------------------------------------------------------------------
466
+
467
+ def demo_simulation():
468
+ """Simulate a conversation to demonstrate the backchannel system.
469
+
470
+ Scenario: French speaker (B1) learning Portuguese, pausing frequently.
471
+ """
472
+ print("=" * 70)
473
+ print("DEMO: Turn-Taking Engine for Language Learning Avatar")
474
+ print("Scenario: French B1 learner speaking Portuguese")
475
+ print("=" * 70)
476
+
477
+ events_log = []
478
+
479
+ def on_event(event: TurnEvent):
480
+ events_log.append(event)
481
+ state_emoji = {
482
+ TurnState.LISTENING: "👂",
483
+ TurnState.SILENCE: "⏸️",
484
+ TurnState.BACKCHANNEL_VISUAL: "😊",
485
+ TurnState.BACKCHANNEL_VERBAL: "💬",
486
+ TurnState.ENCOURAGEMENT: "🤗",
487
+ TurnState.PREPARING: "🧠",
488
+ TurnState.RESPONDING: "🗣️",
489
+ }
490
+ emoji = state_emoji.get(event.state, "❓")
491
+ print(f" {emoji} [{event.silence_duration_ms:6.0f}ms] "
492
+ f"conf={event.confidence:.2f} → {event.state.value}: {event.message}")
493
+
494
+ engine = TurnTakingEngine(config=CEFR_PRESETS["B1"])
495
+ engine.on_state_change(on_event)
496
+
497
+ # Simulate a conversation timeline
498
+ # Each entry: (description, duration_ms, is_speech, model_confidence)
499
+ timeline = [
500
+ # Learner starts speaking
501
+ ("Learner: 'Eu fui ao...'", 1200, True, 0.1),
502
+ # Pause — searching for word (conjugation hesitation)
503
+ (" [silence — thinking about conjugation]", 300, False, 0.15),
504
+ (" [still thinking...]", 400, False, 0.20),
505
+ (" [600ms — visual backchannel]", 300, False, 0.22),
506
+ (" [1s — still silent]", 400, False, 0.25),
507
+ (" [1.5s — verbal backchannel]", 500, False, 0.28),
508
+ # Learner continues
509
+ ("Learner: '...mercado... euh...'", 800, True, 0.1),
510
+ # Another pause — code-switching hesitation
511
+ (" [silence — 'comment dit-on...']", 500, False, 0.20),
512
+ (" [1s silence]", 500, False, 0.30),
513
+ (" [1.5s — verbal backchannel]", 500, False, 0.35),
514
+ (" [2s — model getting uncertain]", 500, False, 0.42),
515
+ # Learner continues again
516
+ ("Learner: '...para comprar... uma tesoura.'", 1500, True, 0.1),
517
+ # Final silence — this time it's a real end of turn
518
+ (" [silence after complete sentence]", 300, False, 0.45),
519
+ (" [600ms]", 300, False, 0.55),
520
+ (" [model confidence rising]", 300, False, 0.65),
521
+ (" [model confident — end of turn]", 200, False, 0.75),
522
+ ]
523
+
524
+ print("\nTimeline:")
525
+ print("-" * 70)
526
+
527
+ for description, duration_ms, is_speech, confidence in timeline:
528
+ print(f"\n{description} ({duration_ms}ms)")
529
+ # Simulate in 100ms chunks
530
+ n_chunks = max(1, duration_ms // 100)
531
+ for _ in range(n_chunks):
532
+ engine.feed_score(confidence, is_speech)
533
+ time.sleep(duration_ms / n_chunks / 1000)  # Sleep in real time so the monotonic silence timer tracks the simulated timeline
534
+
535
+ print("\n" + "=" * 70)
536
+ print(f"Total events: {len(events_log)}")
537
+ print(f"Final state: {engine.state.value}")
538
+
539
+ # Show CEFR comparison
540
+ print("\n" + "=" * 70)
541
+ print("CEFR LEVEL COMPARISON")
542
+ print("=" * 70)
543
+ print(f"{'Level':<6} {'Eager':<8} {'Final':<8} {'Visual BC':<10} {'Verbal BC':<10} {'Encourage':<10}")
544
+ print("-" * 52)
545
+ for level, preset in CEFR_PRESETS.items():
546
+ print(f"{level:<6} {preset.eager_threshold:<8.2f} {preset.final_threshold:<8.2f} "
547
+ f"{preset.visual_backchannel_ms:<10.0f} {preset.verbal_backchannel_ms:<10.0f} "
548
+ f"{preset.encouragement_ms:<10.0f}")
549
+ print()
550
+ print("A1/A2: Very patient — high final threshold (0.75-0.80), early encouragement")
551
+ print("B1: Default — balanced patience and responsiveness")
552
+ print("B2/C1: More responsive — lower final threshold (0.60-0.65)")
553
+
554
+
555
+ if __name__ == "__main__":
556
+ logging.basicConfig(
557
+ level=logging.DEBUG,
558
+ format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
559
+ )
560
+ demo_simulation()
03-finetune-pipecat-pt/README.md ADDED
@@ -0,0 +1,489 @@
+ # Experiment 03: Fine-tuning Pipecat Smart Turn for Portuguese + French
+
+ > Fine-tuning of Pipecat Smart Turn v3 for a **conversational avatar that teaches Portuguese** to French speakers. To our knowledge, the first turn-taking model optimized for L2 learners.
+
+ ---
+
+ ## Contents
+
+ 1. [Objective](#objective)
+ 2. [Why start from Pipecat](#why-start-from-pipecat)
+ 3. [Method: data pipeline](#method-data-pipeline)
+    - [Data creation pipeline](#data-creation-pipeline)
+    - [Projects that validate this method](#projects-that-validate-this-method)
+ 4. [Fine-tuning](#fine-tuning)
+    - [Base model](#base-model)
+    - [Fine-tuning strategy](#fine-tuning-strategy)
+    - [Infrastructure](#infrastructure)
+ 5. [Improvements for language learning (L2)](#improvements-for-language-learning-l2)
+    - [Asymmetric cost](#asymmetric-cost)
+    - [Dual threshold (Deepgram Flux)](#dual-threshold-deepgram-flux)
+    - [Real L2 data (Speak & Improve)](#real-l2-data-speak--improve)
+    - [CEFR-aware presets](#cefr-aware-presets)
+ 6. [Backchannel system](#backchannel-system)
+    - [Problem: "did the avatar freeze?"](#problem-did-the-avatar-freeze)
+    - [Solution: active-listening signals](#solution-active-listening-signals)
+    - [Presets per CEFR level](#presets-per-cefr-level)
+ 7. [Inference engine (06_inference.py)](#inference-engine-06_inferencepy)
+    - [Turn-taking states](#turn-taking-states)
+    - [Example: B1 learner conjugating a verb](#example-b1-learner-conjugating-a-verb)
+ 8. [Specific data: French speakers speaking Portuguese](#specific-data-french-speakers-speaking-portuguese)
+    - [Hesitation types](#hesitation-types)
+    - [Generation with Claude](#generation-with-claude)
+ 9. [Reference benchmarks](#reference-benchmarks)
+    - [Pipecat Smart Turn v3.0](#pipecat-smart-turn-v30)
+    - [LiveKit v0.4.1](#livekit-v041)
+    - [Other models](#other-models)
+    - [Synthetic-to-real gap](#synthetic-to-real-gap)
+ 10. [Target metrics](#target-metrics)
+ 11. [Corrections after reference analysis](#corrections-after-reference-analysis)
+ 12. [File structure](#file-structure)
+ 13. [Timeline](#timeline)
+ 14. [Dependencies](#dependencies)
+ 15. [References](#references)
+
+ ---
+
+ ## Objective
+
+ Fine-tune the pre-trained **Pipecat Smart Turn v3** model (already trained on 23 languages, 270K samples) specifically for:
+
+ 1. **Brazilian Portuguese**: improve end-of-turn detection in PT-BR conversations
+ 2. **French speakers speaking Portuguese**: detect end of turn for native French speakers talking in Portuguese (with their typical accent, hesitations and code-switching)
+ 3. **L2 conversational avatar**: the model will power an AI avatar that teaches Portuguese to francophones, which requires extra patience with learner pauses
+
+ > **First of its kind**: no turn-taking model optimized for L2 learners exists, commercial or academic. BabelCast is the first.
+
+ ## Why start from Pipecat
+
+ In the previous experiments (`previous-experiments/02-finetune-scratch/`) we trained from scratch with Whisper Tiny + CORAA/MUPE data. Results:
+
+ | Metric | From scratch (best) | Original Pipecat (PT) |
+ |--------|---------------------|-----------------------|
+ | Accuracy | 78.2% | **95.42%** |
+ | False positive | 16.8% @0.5 | **2.79%** |
+ | False negative | 5.1% @0.5 | **1.79%** |
+ | F1 | 0.798 | ~0.96 (estimated) |
+ | Data | 15K samples | 270K samples |
+ | Labels | Punctuation + artificial cuts | LLM-curated + TTS |
+
+ The gap is stark: Pipecat reaches **95.42% on Portuguese** with LLM-curated + TTS data, while we reached 78.2% with punctuation-based labels. The problem is not the model; it is the **quality of the data**.
+
+ **Problems with training from scratch:**
+ - Low-quality labels (punctuation does not indicate a real end of turn)
+ - Artificial cuts at 30-75% of the utterance do not simulate real hesitation pauses
+ - Too little data (15K vs Pipecat's 270K)
+ - The model does not understand hesitation pauses: only 67.7% of pauses classified correctly @threshold=0.5
+
+ **Advantages of starting from Pipecat:**
+ - The model already understands turn-taking in 23 languages
+ - It needs far less data to adapt (5-10K vs 270K)
+ - Transfer learning: keeps the general knowledge, adapts it to PT-BR
+ - Much faster training (~30 min vs hours)
+
+ ## Method: data pipeline
+
+ ### Data creation pipeline
+
+ Based on the pipeline proven by Pipecat v3.1:
+
+ ```
+ Stage 1: Source sentences
+   - CORAA transcriptions (real conversational Portuguese)
+   - Sentences generated by Claude (specific contexts)
+   - Sentences typical of a French learner speaking Portuguese
+   - Speak & Improve Corpus 2025 (340h of L2 speech with disfluencies)
+     |
+ Stage 2: LLM processing (Claude Haiku, low cost)
+   - Filter sentences with errors / ambiguity (Pipecat: Gemini removed 50-80%)
+   - Classify: COMPLETO vs INCOMPLETO (complete vs incomplete; semantic, not punctuation-based)
+   - Insert Brazilian fillers: "hum", "tipo", "ne", "entao", "e..."
+   - Insert fillers of French speakers in Portuguese: "euh", "comment dit-on", "como se diz"
+   - Generate incomplete variants: cut sentences at natural points
+     |
+ Stage 3: TTS audio generation
+   - Native PT-BR voices (Kokoro, Google Chirp3)
+   - French-accented voices (XTTS voice cloning, or Chirp3 with accent)
+   - Variation in speed, pitch, background noise
+     |
+ Stage 4: Final dataset
+   - 5-10K balanced samples (50% complete / 50% incomplete)
+   - Metadata: language, accent, filler_type, label_confidence
+ ```
+
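+ Stage 2 is a batch of LLM calls. The sketch below shows what a single call could look like with Claude Haiku; the prompt wording, JSON parsing and (absent) retry logic are assumptions, and `02_generate_labels.py` is the authoritative implementation.
+
+ ```python
+ # Minimal sketch of stage 2 (semantic COMPLETO/INCOMPLETO labeling).
+ # Prompt and output format are illustrative; see 02_generate_labels.py.
+ import json
+ import anthropic
+
+ client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
+
+ def label_sentence(sentence: str) -> dict:
+     """Ask Claude Haiku whether an utterance is a complete turn."""
+     prompt = (
+         "You label Brazilian Portuguese utterances for end-of-turn detection.\n"
+         f'Utterance: "{sentence}"\n'
+         'Answer with JSON only: {"label": "COMPLETO" or "INCOMPLETO", "confidence": 0.0-1.0}'
+     )
+     response = client.messages.create(
+         model="claude-3-haiku-20240307",
+         max_tokens=100,
+         messages=[{"role": "user", "content": prompt}],
+     )
+     return json.loads(response.content[0].text)
+
+ # Example output: {"label": "INCOMPLETO", "confidence": 0.9}
+ print(label_sentence("Eu preciso de... comment dit-on..."))
+ ```
+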
+ ### Projects that validate this method
+
+ | Project | What they did | Result |
+ |---------|---------------|--------|
+ | **Pipecat v3.1** | Gemini filters sentences + inserts fillers, Claude/GPT generate filler lists, Chirp3 TTS | 270K samples, 81-97% accuracy, 23 languages |
+ | **LiveKit** | Qwen 7B teacher generates soft labels, distilled into a 0.5B model | 99.3% TP rate, -39% interruptions |
+ | **Vogent Turn 80M** | Human + synthetic data with edge cases (disfluencies, lists, pauses) | 94.1% accuracy, state of the art |
+ | **Deepgram** | 100+ hours annotated by humans, iterative label refinement | Best calibration of all |
+ | **SpeculativeETD** | MultiWOZ text → TTS + injected fillers + synthetic pauses | 120K samples, public dataset |
+ | **SODA (Allen AI)** | GPT-3.5 generates 1.5M dialogues from a knowledge graph | Preferred over BlenderBot, Vicuna |
+ | **Refuel Autolabel** | GPT-4 as annotator: 88.4% agreement (vs 86% human) | 20x faster, 7x cheaper |
+
+ **Main reference:** [Pipecat Data Generation Contribution Guide](https://github.com/pipecat-ai/smart-turn/blob/main/docs/data_generation_contribution_guide.md)
+
+ ## Fine-tuning
+
+ ### Base model
+
+ ```python
+ # Download the pre-trained model from HuggingFace
+ from huggingface_hub import hf_hub_download
+ model_path = hf_hub_download("pipecat-ai/smart-turn-v3", "model.onnx")
+
+ # Whisper Tiny encoder (39M) + linear classifier
+ # Hidden size: 384, ONNX INT8: 8MB, FP32: 32MB
+ ```
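+
+ Before fine-tuning it is worth sanity-checking the download. The snippet below only inspects the graph with onnxruntime and reuses `model_path` from the block above; it makes no assumption about tensor names or shapes, it just prints them.
+
+ ```python
+ # Print the ONNX graph's input/output names, shapes and dtypes as a sanity check.
+ import onnxruntime as ort
+
+ session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
+ for i in session.get_inputs():
+     print("input:", i.name, i.shape, i.type)
+ for o in session.get_outputs():
+     print("output:", o.name, o.shape, o.type)
+ ```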
+
+ ### Fine-tuning strategy
+
+ ```python
+ # Load the pre-trained Pipecat weights
+ model = SmartTurnModel(whisper_model="openai/whisper-tiny")
+ model.load_state_dict(pipecat_weights)
+
+ # Uniform LR of 5e-5 (same as Pipecat's train.py)
+ optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
+
+ # Focal Loss with asymmetric cost for L2 learners
+ criterion = FocalLoss(gamma=2.0, alpha=0.25, fp_penalty=2.0)
+
+ # 6 epochs, batch_size=128, cosine schedule + warmup 0.2
+ train(model, portuguese_dataset, epochs=6, batch_size=128)
+ ```
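+
+ The "cosine schedule + warmup 0.2" above can be built with the standard `transformers` helper. The sketch below assumes a `train_loader` DataLoader that yields `features`/`labels` batches and reuses `model`, `optimizer` and `criterion` from the snippet; `04_finetune.py` remains the authoritative training loop.
+
+ ```python
+ # Sketch: cosine LR schedule with 20% warmup, stepped once per batch.
+ from transformers import get_cosine_schedule_with_warmup
+
+ total_steps = len(train_loader) * 6                    # 6 epochs
+ scheduler = get_cosine_schedule_with_warmup(
+     optimizer,
+     num_warmup_steps=int(0.2 * total_steps),           # the "warmup 0.2"
+     num_training_steps=total_steps,
+ )
+
+ for epoch in range(6):
+     for batch in train_loader:
+         loss = criterion(model(batch["features"]), batch["labels"])
+         loss.backward()
+         optimizer.step()
+         scheduler.step()                               # LR update every batch
+         optimizer.zero_grad()
+ ```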
+
+ ### Infrastructure
+
+ - **GPU**: Modal A10G (~$0.50/run)
+ - **Estimated time**: 15-30 min
+ - **Alternative**: ai-gateway → TensorDock/Vast.ai
+
+ ## Improvements for language learning (L2)
+
+ > Research done on 2026-03-16 confirmed that **no turn-taking model for L2 learners exists**, neither commercial (Praktika, ELSA, Gliglish, TalkPal) nor academic. All of them use generic approaches.
+
+ ### Asymmetric cost
+
+ For language learning, **interrupting the learner (FP) is far worse than waiting too long (FN)**. A learner who gets interrupted loses confidence and stops trying.
+
+ ```python
+ # fp_penalty=2.0: false positives cost 2x more in the loss
+ criterion = FocalLoss(gamma=2.0, alpha=0.25, fp_penalty=2.0)
+ ```
+
+ Inspired by ConversAR (Meta, 2025), which uses an "infinite thinking period" for L2 learners.
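+
+ `fp_penalty` is not a parameter of the standard focal loss, so the sketch below shows one way the asymmetric variant could be written. It matches the call signature used above, but it is only a sketch; `04_finetune.py` is the authoritative implementation.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from torch import nn
+
+ class FocalLoss(nn.Module):
+     """Binary focal loss with an extra penalty on false positives (sketch)."""
+
+     def __init__(self, gamma: float = 2.0, alpha: float = 0.25, fp_penalty: float = 2.0):
+         super().__init__()
+         self.gamma, self.alpha, self.fp_penalty = gamma, alpha, fp_penalty
+
+     def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
+         p = torch.sigmoid(logits)
+         ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
+         p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
+         alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
+         focal = alpha_t * (1 - p_t) ** self.gamma * ce
+         # Asymmetric cost: upweight cases where the model leans "complete"
+         # (high p) but the learner had not finished (target == 0).
+         fp_weight = 1 + (self.fp_penalty - 1) * (1 - targets) * p.detach()
+         return (fp_weight * focal).mean()
+ ```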
+
+ ### Dual threshold (Deepgram Flux)
+
+ Two separate thresholds to trade off latency against precision:
+
+ | Threshold | Action | Purpose |
+ |-----------|--------|---------|
+ | **eager** (0.3-0.5) | Prepare the LLM response speculatively | Reduces perceived latency |
+ | **final** (0.7+) | Authorize the avatar to speak | Minimizes interruptions |
+
+ If the score drops before reaching `final`, the speculative preparation is discarded, at no cost to the learner.
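+
+ A sketch of the control flow (here `conversation_active()`, `get_eot_score()`, `speak()` and the `llm` object are placeholders; the full state machine lives in `06_inference.py`):
+
+ ```python
+ # Dual-threshold sketch: the eager score starts speculative work, the final score speaks.
+ EAGER, FINAL = 0.4, 0.7
+ draft = None
+
+ while conversation_active():
+     score = get_eot_score()                  # end-of-turn confidence, 0.0-1.0
+     if score >= FINAL:
+         speak(draft or llm.generate())       # take the turn
+         draft = None
+     elif score >= EAGER and draft is None:
+         draft = llm.generate_speculative()   # prepare a reply, do not speak yet
+     elif score < EAGER and draft is not None:
+         draft = None                         # learner kept going: discard the draft
+ ```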
+
+ ### Real L2 data (Speak & Improve)
+
+ The [Speak & Improve Corpus 2025](https://huggingface.co/datasets/speak-improve/corpus) contains **340 hours** of L2 English speech with disfluency annotations, CEFR levels A2-C1.
+
+ Although it is English (not Portuguese), **L2 hesitation patterns are cross-linguistic**: long pauses, repetitions, self-corrections. We use it as a source of hesitation patterns during fine-tuning.
+
+ ### CEFR-aware presets
+
+ The model adapts its patience to the learner's level (values match `CEFR_PRESETS` in `06_inference.py`):
+
+ | Level | final_threshold | eager_threshold | Behavior |
+ |-------|-----------------|-----------------|----------|
+ | **A1** (beginner) | 0.80 | 0.50 | Very patient, tolerates long pauses |
+ | **A2** | 0.75 | 0.45 | Patient |
+ | **B1** (intermediate) | 0.70 | 0.40 | Moderate |
+ | **B2** | 0.65 | 0.35 | Responsive |
+ | **C1** (advanced) | 0.60 | 0.35 | Near-native, fast responses |
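+
+ Selecting a preset is a constructor argument on the engine, and it can be switched at runtime after a proficiency assessment. The paths below are illustrative; since `06_inference.py` starts with a digit it is loaded via importlib, the same trick `05_evaluate.py` uses for `04_finetune.py`.
+
+ ```python
+ # Load 06_inference.py as a module and pick a CEFR preset.
+ import importlib.util
+ from pathlib import Path
+
+ spec = importlib.util.spec_from_file_location("inference", Path("06_inference.py"))
+ inference = importlib.util.module_from_spec(spec)
+ spec.loader.exec_module(inference)
+
+ engine = inference.TurnTakingEngine("results/smart_turn_pt_v3.onnx", cefr_level="A1")
+ # Later, once the learner tests into B1, make the avatar more responsive:
+ engine.set_cefr_level("B1")   # final threshold drops from 0.80 to 0.70
+ ```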
+
+ ## Backchannel system
+
+ ### Problem: "did the avatar freeze?"
+
+ L2 learners frequently pause for 1-3 seconds to think about conjugations, vocabulary or sentence structure. Without any feedback, the learner assumes the avatar froze and gives up on speaking.
+
+ > **Research**: Tavus identifies a **600ms** threshold; beyond that, the speaker expects some reaction from the system. For L2 learners this window is even more critical.
+
+ ### Solution: active-listening signals
+
+ The avatar emits progressively stronger signals that it is listening:
+
+ | Silence duration | Signal | Type | Example |
+ |------------------|--------|------|---------|
+ | **600ms** | Visual nod | Visual | Avatar nods its head |
+ | **1.5s** | Verbal backchannel | Audio | "mhm", "sim", "uhum" |
+ | **3.0s** | Encouragement | Audio | "sem pressa", "pode continuar", "estou ouvindo" |
+
+ The signals **do not interrupt the learner's turn**; they are short overlaps that indicate active listening.
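+
+ On the engine side these signals arrive as `TurnEvent` callbacks, so a front end only needs to map states to avatar actions. In the sketch below, `avatar` and `llm` are placeholders for the rendering/TTS and dialogue layers, and `inference` is the module loaded as in the CEFR example above.
+
+ ```python
+ # Map TurnTakingEngine events onto avatar behavior.
+ def handle_turn_event(event):
+     state = event.state
+     if state is inference.TurnState.BACKCHANNEL_VISUAL:
+         avatar.play_gesture(event.message)        # "nod", "eye_contact", ...
+     elif state in (inference.TurnState.BACKCHANNEL_VERBAL,
+                    inference.TurnState.ENCOURAGEMENT):
+         avatar.say_overlapping(event.message)     # "Mhm...", "Pode continuar, sem pressa..."
+     elif state is inference.TurnState.PREPARING:
+         llm.start_speculative_reply()             # eager threshold reached
+     elif state is inference.TurnState.RESPONDING:
+         avatar.speak(llm.take_reply())            # final threshold reached
+
+ engine.on_state_change(handle_turn_event)
+ ```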
+
+ ### Presets per CEFR level
+
+ | Level | Visual (ms) | Verbal (ms) | Encouragement (ms) |
+ |-------|-------------|-------------|--------------------|
+ | **A1** | 500 | 1200 | 2500 |
+ | **B1** | 600 | 1500 | 3000 |
+ | **C1** | 600 | 2000 | 5000 |
+
+ A1 learners get signals earlier; C1 learners need less support. (Values match `CEFR_PRESETS` in `06_inference.py`.)
+
+ ## Inference engine (06_inference.py)
+
+ The file `06_inference.py` implements the full turn-taking engine with backchannels.
+
+ ### Turn-taking states
+
+ ```
+ LISTENING → SILENCE → BACKCHANNEL_VISUAL → BACKCHANNEL_VERBAL → ENCOURAGEMENT
+     ↑          ↓               ↓                    ↓                  ↓
+     ←──────────←───────────────←────────────────────←──────────────────←
+                       (learner starts speaking again)
+
+ SILENCE → PREPARING → RESPONDING
+           (eager)     (final)
+ ```
+
+ - **LISTENING**: the learner is speaking, the model keeps scoring
+ - **SILENCE**: a pause was detected, the timer starts
+ - **BACKCHANNEL_***: active-listening signals (the avatar does not take the turn)
+ - **PREPARING**: the score reached eager_threshold, the LLM starts preparing a reply
+ - **RESPONDING**: the score reached final_threshold, the avatar speaks
+
+ ### Example: B1 learner conjugating a verb
+
+ ```
+ Learner: "Ontem eu... [600ms pause]"
+ Avatar:  [visual nod]
+ Learner: "... fui? fiz? [1.5s pause]"
+ Avatar:  "mhm" [verbal backchannel]
+ Learner: "... fui ao mercado."
+ Avatar:  [waits for final_threshold] → responds normally
+ ```
+
+ ## Specific data: French speakers speaking Portuguese
+
+ ### Hesitation types
+
+ | Type | Example | Label |
+ |------|---------|-------|
+ | French filler | "Eu fui... euh... ao mercado" | INCOMPLETO |
+ | Word search | "Eu preciso de... comment dit-on... uma tesoura" | INCOMPLETO |
+ | Code-switching | "Eu gosto de... enfin... tipo... de praia" | INCOMPLETO |
+ | Conjugation pause | "Eu... fui? fiz?... ontem" | INCOMPLETO |
+ | French intonation | Complete sentence but with flat pitch (no final fall) | COMPLETO (hard) |
+ | Syllable-timed rhythm | French is syllable-timed, Portuguese is stress-timed | Both |
+
+ ### Generation with Claude
+
+ ```
+ Prompt for Claude to generate learner sentences:
+
+ "Generate 100 sentences that a B1-level French speaker talking in Portuguese
+ would say in a work meeting. Include:
+ - Typical hesitations (euh, alors, comment dire)
+ - Common conjugation mistakes
+ - Natural pauses to search for a word
+ - Involuntary code-switching (French words in the middle)
+ - Complete sentences with flat intonation (no pitch fall)
+
+ For each sentence, indicate: COMPLETO or INCOMPLETO"
+ ```
+
+ ## Reference benchmarks
+
+ ### Pipecat Smart Turn v3.0
+
+ Audio-only, 8M params:
+
+ | Language | Accuracy | FP | FN |
+ |----------|----------|-----|-----|
+ | Turkish | 97.10% | 1.66% | 1.24% |
+ | Korean | 96.85% | 1.12% | 2.02% |
+ | Japanese | 96.76% | 2.04% | 1.20% |
+ | French | 96.01% | 1.60% | 2.39% |
+ | **Portuguese** | **95.42%** | **2.79%** | **1.79%** |
+ | English | 94.31% | 2.64% | 3.06% |
+ | Spanish | 91.97% | 4.48% | 3.55% |
+
+ Source: [Smart Turn v3 blog](https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/)
+
+ ### LiveKit v0.4.1
+
+ Text-only, 500M params, at a fixed 99.3% TPR:
+
+ | Language | TNR | Improvement vs previous |
+ |----------|------|------------------------|
+ | Hindi | 96.3% | +31.48% |
+ | Korean | 94.5% | +30.38% |
+ | French | 88.9% | +33.93% |
+ | **Portuguese** | **87.4%** | **+45.97%** |
+ | English | 87.0% | +21.69% |
+
+ Source: [LiveKit v0.4.1 blog](https://livekit.com/blog/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/)
+
+ ### Other models
+
+ | Model | Type | Accuracy | Languages | Note |
+ |-------|------|----------|-----------|------|
+ | Vogent Turn 80M | Audio+text | 94.1% | 1 (EN) | State of the art, multimodal |
+ | Krisp v2 | Audio | 82.3% bal. acc | "Agnostic" | Proprietary |
+ | Deepgram Flux | ASR+EoT | #1 VAQI | 1 (EN) | Proprietary |
+ | VAP | Audio | 79.6% bal. acc | 3 (EN/ZH/JP) | Academic, self-supervised |
+
+ ### Synthetic-to-real gap
+
+ SpeculativeETD showed that models trained only on synthetic (TTS) data lose **a lot** on real data:
+
+ | Dataset | Wav2vec F1 |
+ |---------|------------|
+ | Synthetic | 94.7% |
+ | Real | **30.3%** |
+
+ This means that besides generating TTS data, we need to include **real audio** in the fine-tuning mix.
+
+ ## Target metrics
+
+ | Metric | Exp 02 (from scratch) | Pipecat PT (ref) | Target (this exp) |
+ |--------|-----------------------|------------------|-------------------|
+ | Accuracy | 78.2% | 95.42% | > 96% |
+ | False positive | 16.8% | 2.79% | < 2.5% |
+ | False negative | 5.1% | 1.79% | < 2.0% |
+ | F1 | 0.798 | ~0.96 | > 0.96 |
+ | Model size | 30.5 MB | 8 MB (ONNX INT8) | ~8 MB |
+ | CPU inference | ~12ms | 12ms | ~12ms |
+
+ ## Corrections after reference analysis
+
+ *(2026-03-16)*
+
+ We downloaded 6 papers, 10 blog posts and 7 technical guides (see `references/`). Cross-checking them revealed problems in the original plan:
+
+ | Change | Before | After | Source |
+ |--------|--------|-------|--------|
+ | Focal Loss alpha | 0.6 | **0.25** | Lin et al. 2017, Table 1a |
+ | Batch size | 32 | **128** | Pipecat train.py uses 384 |
+ | Epochs | 10 | **6** | Pipecat uses 4 on 270K |
+ | Label smoothing | 0.05 | **removed** | EMNLP 2022: double regularization with FL |
+ | Learning rate | differential (0.1x encoder) | **uniform 5e-5** | Pipecat train.py uses a single LR |
+ | Noise augmentation | Gaussian | **real noise** (cafe/office) | Pipecat v3.2: -40% errors |
+ | ONNX opset | 17 | **18** | Pipecat train.py |
+ | Quantization | none | **static INT8** (entropy, 1024 calib) | Pipecat deploy: 32MB → 8MB |
+ | Alternative loss | Focal only | **Focal + BCE** compared | Original Pipecat uses BCE |
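+
+ The static INT8 quantization row corresponds roughly to the onnxruntime recipe sketched below. The input name `"input_features"` and the `calib_features` list (about 1024 pre-computed log-mel windows) are assumptions, and file names follow the `results/` layout described later in this README.
+
+ ```python
+ # Sketch: static INT8 quantization with entropy calibration (onnxruntime).
+ import numpy as np
+ from onnxruntime.quantization import (
+     CalibrationDataReader, CalibrationMethod, QuantType, quantize_static,
+ )
+
+ class MelCalibrationReader(CalibrationDataReader):
+     """Replays pre-computed feature windows for calibration."""
+     def __init__(self, features, input_name):
+         self._iter = iter(features)
+         self._input_name = input_name
+
+     def get_next(self):
+         feats = next(self._iter, None)
+         if feats is None:
+             return None
+         return {self._input_name: feats[np.newaxis, ...].astype(np.float32)}
+
+ quantize_static(
+     model_input="results/smart_turn_pt_v3_fp32.onnx",
+     model_output="results/smart_turn_pt_v3.onnx",
+     calibration_data_reader=MelCalibrationReader(calib_features, "input_features"),
+     calibrate_method=CalibrationMethod.Entropy,    # entropy calibration
+     weight_type=QuantType.QInt8,
+     activation_type=QuantType.QInt8,
+ )
+ ```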
380
+
381
+ ### Tecnicas adicionais identificadas
382
+
383
+ 1. **Knowledge distillation** (LiveKit: -39% interrupcoes, +45.97% melhoria PT)
384
+ 2. **Short utterance dataset** (Pipecat v3.2: -40% erros em respostas curtas) — implementado
385
+ 3. **Audio real misturado com TTS** (SpeculativeETD: F1 cai de 94.7% → 30.3% so com sintetico) — implementado via CORAA
386
+ 4. **Pausa de 1.5-3s apos fillers** (SpeculativeETD V3: melhor variante) — implementado
387
+ 5. **Threshold per-language** (LiveKit: languages.json com thresholds por lingua)
388
+
389
+ ## Estrutura de arquivos
390
+
391
+ ```
392
+ 03-finetune-pipecat-pt/
393
+ README.md # Este documento
394
+ 01_download_pipecat.py # Baixa modelo pre-treinado do HuggingFace
395
+ 02_generate_labels.py # Claude API gera/filtra/classifica frases PT
396
+ 03_generate_audio.py # TTS gera audio (nativo + sotaque frances)
397
+ 04_finetune.py # Fine-tune com custo assimetrico + dados L2
398
+ 05_evaluate.py # Avaliacao com dual threshold sweep
399
+ 06_inference.py # Engine de inferencia com backchannel + CEFR presets
400
+ modal_run.py # Deploy no Modal (GPU)
401
+
402
+ references/
403
+ hesitation_turn_taking_l2_review.md # Mini-artigo com 35 referencias
404
+ papers/ # 6 PDFs academicos + summaries
405
+ focal_loss_lin_2017.*
406
+ speculative_etd_2025.*
407
+ vap_turn_taking_ekstedt_2024.*
408
+ turn_taking_review_skantze_2021.*
409
+ soda_dialog_distillation_2023.*
410
+ finite_state_turn_taking_raux_2009.*
411
+ papers/hesitation-l2-french/ # 19 PDFs + 10 MD (L2 hesitacao/francofono)
412
+ papers/language-learning-turn-taking/ # Pesquisa L2 turn-taking (2026-03-16)
413
+ survey_turn_taking_iwsds2025.* # Survey IWSDS 2025
414
+ multilingual_vap_2024.* # VAP multilingual
415
+ speak_improve_corpus_2025.* # Speak & Improve L2 corpus
416
+ hesitation_tagging_l2_whisper.* # Whisper+LoRA hesitation
417
+ conversar_mixed_reality_l2.* # ConversAR (Meta Quest)
418
+ deepgram_flux.* # Deepgram Flux overview
419
+ hume_evi.* # Hume EVI overview
420
+ praktika_openai.* # Praktika case study
421
+ tavus_turn_taking_guide.* # Tavus guide
422
+ blogs/ # 10 blog posts
423
+ guides/ # 7 technical guides
424
+
425
+ data/ # (gitignored)
426
+ pipecat_pt_audio/ # PT audio from Pipecat v3.2
427
+ pipecat_pt_test/ # PT test audio from Pipecat v3.2
428
+ claude_labeled/ # Sentences processed by Claude (JSON)
429
+ tts_dataset/ # Generated audio (native + French accent)
430
+ noise_samples/ # Real CC-0 noise (cafe, office)
431
+
432
+ results/ # (gitignored)
433
+ best_model.pt # Trained model
434
+ smart_turn_pt_v3.onnx # ONNX INT8 (~8 MB)
435
+ smart_turn_pt_v3_fp32.onnx # ONNX FP32 (~32 MB)
436
+ training_results.json # Metrics
437
+ evaluation_results.json # Comparison against the baseline
438
+ ```
439
+
440
+ ## Timeline
441
+
442
+ | Step | Description | Time |
443
+ |-------|-----------|-------|
444
+ | 1 | Download and inspect the Pipecat model | 1h |
445
+ | 2 | Labeling script with the Claude API | 1 day |
446
+ | 3 | Generate TTS audio (native PT-BR) | 1 day |
447
+ | 4 | Generate French-accented audio | 1-2 days |
448
+ | 5 | Fine-tune on Modal | 30 min |
449
+ | 6 | Evaluation + dual threshold sweep | 2h |
450
+ | 7 | Test in the conversational avatar | 1 day |
451
+ | **Total** | | **~5 days** |
452
+
453
+ ## Dependencies
454
+
455
+ ```
456
+ # Python
457
+ torch, torchaudio, transformers # Model
458
+ anthropic # Claude API (labeling)
459
+ kokoro, TTS # Audio generation
460
+ datasets, soundfile, librosa # Processing
461
+ modal # GPU deployment
462
+
463
+ # APIs
464
+ ANTHROPIC_API_KEY # Claude Haiku for labeling
465
+ MODAL_TOKEN # Modal for GPU training
466
+
467
+ # Models
468
+ pipecat-ai/smart-turn-v3 # HuggingFace — pre-trained model
469
+ openai/whisper-tiny # HuggingFace — base encoder
470
+ ```
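+
+ For a local run, the Python packages above can be installed in one step (versions left unpinned here; the Modal image in `modal_run.py` pins the ones used for GPU training):
+
+ ```
+ pip install torch torchaudio transformers datasets soundfile librosa anthropic kokoro TTS modal onnxruntime huggingface-hub
+ ```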
471
+
472
+ ## References
473
+
474
+ Full research write-up with 35 references: [`references/hesitation_turn_taking_l2_review.md`](references/hesitation_turn_taking_l2_review.md)
475
+
476
+ **Key references:**
477
+
478
+ | # | Reference | Contribution |
479
+ |---|-----------|--------------|
480
+ | 1 | Pipecat Smart Turn v3 (Daily, 2025) | Base model, data pipeline |
481
+ | 2 | Lin et al. (ICCV 2017) | Focal Loss, alpha=0.25 |
482
+ | 3 | Skantze (CSL 2021) | Turn-taking survey |
483
+ | 4 | Knill et al. (2025) | Speak & Improve L2 corpus |
484
+ | 5 | Saeki et al. (2025) | Whisper+LoRA hesitation tagging |
485
+ | 6 | Gamboa et al. (2025) | ConversAR, asymmetric L2 cost |
486
+ | 7 | Deepgram Flux (2025) | Dual threshold, speculative ASR |
487
+ | 8 | Ekstedt et al. (2024) | Multilingual VAP |
488
+ | 9 | Raux & Eskenazi (NAACL 2009) | FSM turn-taking |
489
+ | 10 | LiveKit (2025) | Knowledge distillation, 500M params |
03-finetune-pipecat-pt/modal_run.py ADDED
@@ -0,0 +1,267 @@
1
+ """Run the full fine-tuning pipeline on Modal (A10G GPU).
2
+
3
+ Usage:
4
+ modal run modal_run.py
5
+
6
+ This deploys the pipeline on a Modal A10G GPU (~$0.50/run):
7
+ 1. Downloads Pipecat Portuguese data
8
+ 2. Generates labels with Claude API (if ANTHROPIC_API_KEY set)
9
+ 3. Generates TTS audio with Kokoro
10
+ 4. Fine-tunes SmartTurnV3Model
11
+ 5. Evaluates against Pipecat baseline
12
+ 6. Saves results to Modal volume
13
+
14
+ Estimated time: 30-60 min total
15
+ Estimated cost: ~$0.50-1.00
16
+ """
17
+
18
+ from __future__ import annotations
19
+
20
+ import modal
21
+
22
+ app = modal.App("babelcast-finetune-pipecat-pt")
23
+
24
+ # Persistent volume for data + results
25
+ volume = modal.Volume.from_name("finetune-pipecat-pt", create_if_missing=True)
26
+
27
+ # GPU image with all dependencies
28
+ image = (
29
+ modal.Image.debian_slim(python_version="3.11")
30
+ .pip_install(
31
+ # Core ML
32
+ "torch>=2.1",
33
+ "torchaudio>=2.1",
34
+ "transformers>=4.36",
35
+ "datasets>=2.16",
36
+ # Audio processing
37
+ "soundfile>=0.12",
38
+ "librosa>=0.10",
39
+ "numpy<2",
40
+ # TTS
41
+ "kokoro>=0.3",
42
+ # ONNX for evaluation
43
+ "onnxruntime>=1.16",
44
+ # Claude API for labeling
45
+ "anthropic>=0.40",
46
+ # Utilities
47
+ "huggingface-hub>=0.20",
48
+ )
49
+ .apt_install("ffmpeg", "libsndfile1")
50
+ )
51
+
52
+
53
+ @app.function(
54
+ image=image,
55
+ gpu="A10G",
56
+ timeout=3600, # 1 hour max
57
+ volumes={"/workspace": volume},
58
+ secrets=[
59
+ modal.Secret.from_name("anthropic-api-key", required=False),
60
+ modal.Secret.from_name("huggingface-token", required=False),
61
+ ],
62
+ )
63
+ def run_pipeline(
64
+ skip_download: bool = False,
65
+ skip_labels: bool = False,
66
+ skip_audio: bool = False,
67
+ skip_train: bool = False,
68
+ max_pipecat_samples: int = 5000,
69
+ max_tts_samples: int = 10000,
70
+ max_l2_samples: int = 2000,
71
+ epochs: int = 6,
72
+ batch_size: int = 128,
73
+ lr: float = 5e-5,
74
+ loss_fn: str = "focal",
75
+ fp_penalty: float = 2.0,
76
+ ):
77
+ """Run the full pipeline on Modal GPU."""
78
+ import logging
79
+ import os
80
+ import sys
81
+ from pathlib import Path
82
+
83
+ logging.basicConfig(
84
+ level=logging.INFO,
85
+ format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
86
+ )
87
+ log = logging.getLogger("modal_run")
88
+
89
+ # Add script directory to path
90
+ script_dir = Path(__file__).parent
91
+ sys.path.insert(0, str(script_dir))
92
+
93
+ # Symlink data dir to /workspace for persistence
94
+ data_dir = script_dir / "data"
95
+ workspace_data = Path("/workspace/data")
96
+ workspace_data.mkdir(parents=True, exist_ok=True)
97
+ if not data_dir.exists():
98
+ data_dir.symlink_to(workspace_data)
99
+
100
+ # Step 1: Download Pipecat data
101
+ if not skip_download:
102
+ log.info("=" * 60)
103
+ log.info("STEP 1: Download Pipecat Portuguese data")
104
+ log.info("=" * 60)
106
+ try:
107
+ # Use importlib since filenames start with numbers
108
+ import importlib.util
109
+ spec = importlib.util.spec_from_file_location("dl", script_dir / "01_download_pipecat.py")
110
+ dl = importlib.util.module_from_spec(spec)
111
+ spec.loader.exec_module(dl)
112
+
113
+ dl.download_pipecat_dataset(max_pt_samples=max_pipecat_samples)
114
+ dl.download_pipecat_test_data(max_pt_samples=2000)
115
+ dl.download_onnx_model()
116
+ except Exception as e:
117
+ log.error("Download failed: %s", e)
118
+ raise
119
+
120
+ volume.commit()
121
+ log.info("Step 1 complete — data saved to volume")
122
+
123
+ # Step 2: Generate labels (requires ANTHROPIC_API_KEY)
124
+ if not skip_labels:
125
+ log.info("=" * 60)
126
+ log.info("STEP 2: Generate labels with Claude API")
127
+ log.info("=" * 60)
128
+ try:
129
+ import importlib.util
130
+ spec = importlib.util.spec_from_file_location("labels", script_dir / "02_generate_labels.py")
131
+ labels = importlib.util.module_from_spec(spec)
132
+ spec.loader.exec_module(labels)
133
+
134
+ api_key = os.environ.get("ANTHROPIC_API_KEY")
135
+ if api_key:
136
+ log.info("ANTHROPIC_API_KEY found — using Claude for labeling")
137
+ else:
138
+ log.info("No ANTHROPIC_API_KEY — using rule-based fallback")
139
+
140
+ labels.run_full_pipeline(max_transcripts=5000, max_fr_sentences=500)
141
+ except Exception as e:
142
+ log.warning("Label generation failed: %s — continuing without custom labels", e)
143
+
144
+ volume.commit()
145
+
146
+ # Step 3: Generate TTS audio
147
+ if not skip_audio:
148
+ log.info("=" * 60)
149
+ log.info("STEP 3: Generate TTS audio")
150
+ log.info("=" * 60)
151
+ try:
152
+ import importlib.util
153
+ spec = importlib.util.spec_from_file_location("audio", script_dir / "03_generate_audio.py")
154
+ audio = importlib.util.module_from_spec(spec)
155
+ spec.loader.exec_module(audio)
156
+
157
+ audio.run_audio_generation(
158
+ max_native_sentences=3000,
159
+ max_french_sentences=1000,
160
+ augment_copies=2,
161
+ )
162
+ except Exception as e:
163
+ log.warning("TTS generation failed: %s — continuing with Pipecat data only", e)
164
+
165
+ volume.commit()
166
+
167
+ # Step 4: Fine-tune
168
+ if not skip_train:
169
+ log.info("=" * 60)
170
+ log.info("STEP 4: Fine-tune SmartTurnV3Model")
171
+ log.info("=" * 60)
172
+ try:
173
+ import importlib.util
174
+ spec = importlib.util.spec_from_file_location("ft", script_dir / "04_finetune.py")
175
+ ft = importlib.util.module_from_spec(spec)
176
+ spec.loader.exec_module(ft)
177
+
178
+ ft.train(
179
+ epochs=epochs,
180
+ batch_size=batch_size,
181
+ lr=lr,
182
+ warmup_ratio=0.2,
183
+ max_pipecat_samples=max_pipecat_samples,
184
+ max_tts_samples=max_tts_samples,
185
+ max_l2_samples=max_l2_samples,
186
+ loss_fn=loss_fn,
187
+ fp_penalty=fp_penalty,
188
+ )
189
+ except Exception as e:
190
+ log.error("Training failed: %s", e)
191
+ raise
192
+
193
+ volume.commit()
194
+
195
+ # Step 5: Evaluate
196
+ log.info("=" * 60)
197
+ log.info("STEP 5: Evaluate")
198
+ log.info("=" * 60)
199
+ try:
200
+ import importlib.util
201
+ spec = importlib.util.spec_from_file_location("ev", script_dir / "05_evaluate.py")
202
+ ev = importlib.util.module_from_spec(spec)
203
+ spec.loader.exec_module(ev)
204
+
205
+ results_dir = Path("/workspace/results")
206
+ model_path = results_dir / "best_model.pt"
207
+ if model_path.exists():
208
+ results = ev.run_evaluation(model_path)
209
+ log.info("Evaluation complete!")
210
+ else:
211
+ log.warning("No model found at %s — skipping evaluation", model_path)
212
+ except Exception as e:
213
+ log.error("Evaluation failed: %s", e)
214
+
215
+ volume.commit()
216
+ log.info("=" * 60)
217
+ log.info("PIPELINE COMPLETE — results saved to Modal volume 'finetune-pipecat-pt'")
218
+ log.info("=" * 60)
219
+
220
+
221
+ @app.function(
222
+ image=image,
223
+ volumes={"/workspace": volume},
224
+ )
225
+ def download_results(local_dir: str = "results") -> list[str]:
226
+ """Download results from Modal volume."""
227
+ from pathlib import Path
228
+ import shutil
229
+
230
+ results_dir = Path("/workspace/results")
231
+ if not results_dir.exists():
232
+ print("No results found on volume")
233
+ return []
234
+
235
+ files = []
236
+ for f in results_dir.rglob("*"):
237
+ if f.is_file():
238
+ files.append(str(f.relative_to(results_dir)))
239
+ print(f" {f.name}: {f.stat().st_size / 1024:.1f} KB")
240
+
241
+ return files
242
+
243
+
244
+ @app.local_entrypoint()
245
+ def main(
246
+ skip_download: bool = False,
247
+ skip_labels: bool = False,
248
+ skip_audio: bool = False,
249
+ skip_train: bool = False,
250
+ download_only: bool = False,
251
+ epochs: int = 6,
252
+ batch_size: int = 128,
253
+ ):
254
+ """Entry point for `modal run modal_run.py`."""
255
+ if download_only:
256
+ files = download_results.remote()
257
+ print(f"\nFound {len(files)} result files on volume")
258
+ return
259
+
260
+ run_pipeline.remote(
261
+ skip_download=skip_download,
262
+ skip_labels=skip_labels,
263
+ skip_audio=skip_audio,
264
+ skip_train=skip_train,
265
+ epochs=epochs,
266
+ batch_size=batch_size,
267
+ )
03-finetune-pipecat-pt/references/blogs/assemblyai_turn_detection.md ADDED
@@ -0,0 +1,137 @@
1
+ ---
2
+ title: "How Intelligent Turn Detection (Endpointing) Solves the Biggest Challenge in Voice Agent Development"
3
+ source: https://www.assemblyai.com/blog/turn-detection-endpointing-voice-agent
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # How Intelligent Turn Detection (Endpointing) Solves the Biggest Challenge in Voice Agent Development
8
+
9
+ ## Introduction
10
+
11
+ Voice agents enable natural conversations between humans and AI systems, representing one of Speech AI's fastest-growing applications. However, a persistent challenge affects all voice agent implementations: **turn detection**, or determining when a human finishes speaking so the AI can respond appropriately.
12
+
13
+ ## The Latency Problem in Voice Agents
14
+
15
+ Voice agent developers prioritize low latency for optimal user experience. Three fundamental latencies exist in streaming speech-to-text models:
16
+
17
+ 1. **Partial transcript latency** -- Speed of returning initial transcript predictions
18
+ 2. **Final transcript latency** -- Speed of returning finalized transcripts after speech ends
19
+ 3. **Endpointing latency** -- Speed of detecting when someone stops speaking
20
+
21
+ Voice agents specifically require fast endpointing latency. Developers have historically over-optimized for general latency reduction, treating end-of-turn detection as secondary. However, addressing turn detection provides "far greater improvements to the user experience than incremental latency optimizations," as endpointing delay exists at a different magnitude than millisecond-level latency gains.
22
+
23
+ ## Three Endpointing Methods
24
+
25
+ ### Manual Endpointing
26
+ Users explicitly indicate completion through button presses or voice commands. This creates poor user experience and harms adoption despite potentially fast response times.
27
+
28
+ ### Silence Detection
29
+ The current industry standard waits for specified silence duration thresholds. This approach balances responsiveness against premature interruptions but struggles finding optimal threshold values -- long enough to avoid cutoffs mid-thought, yet short enough to maintain conversational fluidity.
30
+
31
+ ### Semantic Endpointing
32
+ Advanced systems analyze what someone says rather than just silence duration. This approach understands when thoughts are complete, distinguishing natural pauses mid-sentence from genuine utterance endings. Modern implementations use language models to predict sentence boundaries and semantic completeness.
33
+
34
+ ## How Semantic Endpointing Works
35
+
36
+ Semantic endpointing encompasses multiple implementation approaches. AssemblyAI's Universal-Streaming model predicts a special token that the model learns to identify during training based on context.
37
+
38
+ ### Configuration Parameters
39
+
40
+ Several user-configurable options enable developers to tune endpointing behavior:
41
+
42
+ - **end_of_turn_confidence_threshold** (default: 0.7) -- Required confidence level for end-of-turn predictions
43
+ - **min_end_of_turn_silence_when_confident** (default: 160ms) -- Minimum silence duration after confident prediction to prevent false positives
44
+ - **max_turn_silence** (default: 2400ms) -- Fallback silence-based endpointing for unusual speech patterns
45
+
46
+ The system requires three conditions for semantic end-of-turn:
47
+
48
+ 1. Model predicts end-of-turn with sufficient confidence
49
+ 2. Minimum silence duration has passed
50
+ 3. Minimal speech present to prevent false positives from audio noise
51
+
52
+ Traditional silence-based endpointing remains enabled as a final catch-all mechanism for atypical patterns.
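+
+ As a rough sketch (not AssemblyAI's implementation), the three conditions plus the silence fallback can be combined in client-side logic like this; the parameter names mirror the configuration options above, and the residual-speech cutoff is an illustrative value:
+
+ ```
+ def should_end_turn(
+     eot_confidence: float,        # model's end-of-turn probability so far
+     silence_ms: float,            # silence observed since the last word
+     speech_in_window_ms: float,   # detected speech within the recent window
+     end_of_turn_confidence_threshold: float = 0.7,
+     min_end_of_turn_silence_when_confident: float = 160,
+     max_turn_silence: float = 2400,
+ ) -> bool:
+     # Conditions 1-3: confident prediction, enough silence, little residual speech
+     semantic_eot = (
+         eot_confidence >= end_of_turn_confidence_threshold
+         and silence_ms >= min_end_of_turn_silence_when_confident
+         and speech_in_window_ms < 50  # illustrative "minimal speech" cutoff
+     )
+     # Catch-all: fall back to plain silence-based endpointing
+     return semantic_eot or silence_ms >= max_turn_silence
+ ```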
53
+
54
+ ## Comparative Analysis: Turn Detection Approaches
55
+
56
+ ### LiveKit: Semantic-Only Approach
57
+
58
+ **Input:** Transcribed text only
59
+ **Key Dependency:** Voice Activity Detection (VAD)
60
+ **Processing:** Runs after VAD detects speech end
61
+
62
+ **Strengths:**
63
+ - Simple, focused semantic analysis
64
+ - Open source customization available
65
+
66
+ **Limitations:**
67
+ - Heavy VAD dependency creates delays if background noise triggers continuous VAD
68
+ - Ignores audio cues humans naturally use for turn boundaries
69
+ - Tends toward maximum delay periods, creating sluggish conversations
70
+
71
+ ### Pipecat: Audio-Centric Detection
72
+
73
+ **Input:** Audio features only (prosody, intonation)
74
+ **Key Dependency:** Direct audio analysis
75
+ **Processing:** Real-time audio pattern recognition
76
+
77
+ **Strengths:**
78
+ - Leverages natural prosodic and intonation cues
79
+ - Potentially faster turn detection than text-dependent models
80
+ - Open source flexibility
81
+
82
+ **Limitations:**
83
+ - Sensitive to background noise interference
84
+ - Performance varies significantly with speaker accents
85
+ - Struggles with atypical prosodic patterns
86
+ - Limited by audio quality availability
87
+
88
+ ### AssemblyAI: Hybrid Approach
89
+
90
+ **Input:** Both transcribed text and audio context
91
+ **Key Dependency:** Integrated approach with dynamic thresholds
92
+ **Processing:** Context-aware analysis enabling early turn detection during silence based on syntactic completeness
93
+
94
+ **Strengths:**
95
+ - Robust across varying acoustic conditions
96
+ - Dynamic adaptation based on sentence completeness rather than static VAD
97
+ - Better background noise handling through semantic backup
98
+ - More natural conversation flow with context-aware timing
99
+
100
+ **Limitations:**
101
+ - Closed source limits customization
102
+ - Potentially higher computational requirements
103
+ - Depends on transcription quality for optimal performance
104
+
105
+ ## Feature Comparison Matrix
106
+
107
+ | Feature | LiveKit | Pipecat | AssemblyAI |
108
+ |---------|---------|---------|-----------|
109
+ | Input Type | Text only | Audio only | Text + Audio |
110
+ | VAD Dependency | High | None | Low |
111
+ | Noise Robustness | Poor (via VAD) | Poor | Good |
112
+ | Speaker Variation | Good | Poor | Good |
113
+ | Response Speed | Slow | Fast | Adaptive |
114
+ | Complexity | Low | Medium | High |
115
+
116
+ ## Selection Guidance
117
+
118
+ **Choose LiveKit if:**
119
+ - You need simple, semantic-focused approaches
120
+ - You have clean audio environments with minimal background noise
121
+ - Response speed is less critical than accuracy
122
+
123
+ **Choose Pipecat if:**
124
+ - You have consistent speaker profiles and clean audio
125
+ - You want to experiment with pure audio-based detection
126
+
127
+ **Choose AssemblyAI if:**
128
+ - You want combined semantic and acoustic endpointing
129
+ - You handle diverse speakers and noisy environments
130
+ - Natural conversation flow is critical
131
+ - Deployment flexibility matters
132
+
133
+ ## The Future of Turn Detection
134
+
135
+ Single-modality solutions offer simplicity and specialized performance, while hybrid approaches like AssemblyAI's demonstrate benefits of combining multiple signal types for robust and natural conversational experiences. Future developments will likely incorporate conversational context, speaker identification, and multimodal cues (including visual information in relevant scenarios).
136
+
137
+ The optimal choice depends on specific use case requirements, technical constraints, and acceptable trade-offs between accuracy, speed, and robustness.
03-finetune-pipecat-pt/references/blogs/deepgram_eot_evaluation.md ADDED
@@ -0,0 +1,58 @@
1
+ ---
2
+ title: "Evaluating End-of-Turn Detection Models"
3
+ source: https://deepgram.com/learn/evaluating-end-of-turn-detection-models
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # Evaluating End-of-Turn (Turn Detection) Models
8
+
9
+ ## Overview
10
+
11
+ Natural voice agent interactions depend critically on accurate turn-taking behavior. End-of-Turn (EoT) detection -- predicting when a speaker has finished and awaits a response -- is essential for creating conversational agents that feel responsive rather than interruptive.
12
+
13
+ Deepgram developed a comprehensive evaluation methodology for turn detection after finding existing approaches inadequate. Their approach prioritizes real-world performance over proxy metrics, evaluating "complete conversations between humans" rather than isolated turns.
14
+
15
+ ## Full Conversational Evaluation
16
+
17
+ Rather than testing individual turns, Deepgram analyzes entire conversations. This methodology reflects realistic scenarios where:
18
+
19
+ - **Natural labeling**: Conversation partners provide high-quality, low-latency ground truth detection
20
+ - **Variable detection budgets**: Different turns demand faster responses based on context (simple queries versus complex reasoning)
21
+ - **Natural pause patterns**: Pre-turn silences enable evaluation of start-of-speech detection and backchannel phenomena
22
+
23
+ The team labeled over 100 hours of real conversations with ground truth transcripts and timestamped EoT labels. Notably, refining their annotation specification from "End-of-Thought" to "End-of-Turn" improved annotator confidence from 5/10 to 8.5/10 by reducing ambiguity.
24
+
25
+ ## The Challenge with Timestamps
26
+
27
+ Human annotations revealed an unexpected problem: annotators consistently left small gaps between speech completion and label placement. Deepgram discovered their system frequently detected EoT within these gaps -- appearing as false positives under naive temporal alignment.
28
+
29
+ ### Forced Alignment Solution
30
+
31
+ Rather than trusting raw human timestamps, Deepgram applied "forced alignment to update the human timestamps," extracting more precise end-of-speech markers. While conservative human placement might reflect natural pauses before responses, voice agents benefit from fastest possible detection since "their turn starts are frequently delayed relative to EoT detection due to extra latency associated with LLM and TTS generation."
32
+
33
+ ## Sequence Alignment for Improved Evaluation
34
+
35
+ Standard temporal matching proved insufficiently precise. Deepgram adopted sequence alignment -- treating turn boundaries as special tokens (`[EoT]`) within transcripts, similar to Word Error Rate (WER) calculation.
36
+
37
+ **Impact**: This shift yielded "3-5% absolute increases in precision and recall across models," with "manual investigation of results suggested the new values were more representative of the true performance."
38
+
39
+ The approach handled all detector types: all-in-one STT solutions, text-based detectors, and audio-only models (using Nova-3 for inter-turn transcription).
40
+
41
+ ## Handling Dropped Turns
42
+
43
+ A modified Levenshtein algorithm addressed cases where intermediate turns were missed. Beyond simple edit distance, the system "determines the best alignment not only based on overall edit distance but also based on the most likely alignment between EoT tokens," preventing misalignment when turns are dropped.
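+
+ A toy sketch of the underlying idea (not Deepgram's code): treat `[EoT]` as an ordinary token, align reference and hypothesis streams with edit distance, and then score only the aligned `[EoT]` tokens:
+
+ ```
+ def edit_align(ref, hyp):
+     """Levenshtein DP plus backtrace; returns exactly-matched (ref_idx, hyp_idx) pairs."""
+     m, n = len(ref), len(hyp)
+     dp = [[i + j if i * j == 0 else 0 for j in range(n + 1)] for i in range(m + 1)]
+     for i in range(1, m + 1):
+         for j in range(1, n + 1):
+             sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
+             dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
+     pairs, i, j = [], m, n
+     while i and j:
+         if ref[i - 1] == hyp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
+             pairs.append((i - 1, j - 1)); i -= 1; j -= 1
+         elif dp[i][j] == dp[i - 1][j - 1] + 1:
+             i -= 1; j -= 1
+         elif dp[i][j] == dp[i - 1][j] + 1:
+             i -= 1
+         else:
+             j -= 1
+     return pairs
+
+ ref = "sure it's one two three [EoT] thanks [EoT]".split()
+ hyp = "sure it's one too three [EoT] thanks [EoT]".split()
+ eot_hits = [(i, j) for i, j in edit_align(ref, hyp) if ref[i] == "[EoT]"]
+ precision = len(eot_hits) / hyp.count("[EoT]")  # detected EoTs aligned to a true EoT
+ recall = len(eot_hits) / ref.count("[EoT]")     # true EoTs that were detected
+ ```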
44
+
45
+ ## Start-of-Turn (SoT) Evaluation
46
+
47
+ Complementing EoT analysis, Deepgram evaluated "start-of-turn (SoT)" detection -- identifying when users resume speaking, critical for handling interruptions or "barge-in" scenarios.
48
+
49
+ **Flux performance metrics**:
50
+ - Detection within ~100-200ms of first word onset
51
+ - False positive rate: less than or equal to 1-2%
52
+ - Analysis used word start times as detection targets (representing unachievable lower bounds)
53
+
54
+ Negative latency directly indicates false positives -- detecting speech before it actually begins.
55
+
56
+ ## Future Directions
57
+
58
+ The team highlighted emerging evaluation frontiers: "one could combine the various aspects of turn detection, such as false positive rate and latency, into single quality metrics." They specifically mentioned their Voice Agent Quality Index (VAQI) framework and anticipated incorporating semantic/conversational flow considerations -- enabling "stratified or weighted metrics that reflect that some turns merit faster responses than others."
03-finetune-pipecat-pt/references/blogs/krisp_turn_taking_v1.md ADDED
@@ -0,0 +1,78 @@
1
+ ---
2
+ title: "Turn-Taking for Voice AI"
3
+ source: https://krisp.ai/blog/turn-taking-for-voice-ai/
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # Audio-only, 6M Weights Turn-Taking Model for Voice AI Agents
8
+
9
+ ## Overview
10
+
11
+ Krisp has developed a lightweight turn-taking model designed to determine when speakers should transition in voice conversations. This technology is included at no additional cost in Krisp's VIVA SDK.
12
+
13
+ ## What is Turn-Taking?
14
+
15
+ Turn-taking represents "the fundamental mechanism by which participants in a conversation coordinate who speaks when." In voice AI contexts, it manages when agents should listen, speak, or remain silent. The model addresses two primary tasks:
16
+
17
+ - **End-of-turn prediction**: Detecting when the current speaker will finish
18
+ - **Backchannel prediction**: Recognizing brief acknowledgments like "uh-huh" without speaker transitions
19
+
20
+ ## Implementation Approaches
21
+
22
+ **Audio-based methods** analyze acoustic features including pitch variations, energy levels, intonation, pauses, and speaking rate. These enable low-latency responses critical for real-time scenarios.
23
+
24
+ **Text-based approaches** examine transcribed content, identifying linguistic cues like sentence boundaries and discourse markers, though they typically require larger architectures.
25
+
26
+ **Multimodal (fusion) systems** combine both modalities, leveraging acoustic cues alongside semantic understanding.
27
+
28
+ ## Key Challenges
29
+
30
+ - **Hesitation vs. completion**: Distinguishing filler words ("um," "you know") from true turn endings
31
+ - **Natural pauses**: Differentiating conversational pauses from actual turn boundaries
32
+ - **Response speed**: Minimizing latency while maintaining accuracy
33
+ - **Speaking diversity**: Accounting for varying rhythms, accents, and intonation patterns
34
+
35
+ ## Model Architecture
36
+
37
+ The Krisp model processes 100ms audio frames, outputting confidence scores (0-1) indicating shift likelihood. A configurable threshold determines binary shift predictions, with a default 5-second maximum hold duration.
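+
+ A simplified sketch of how such frame-level scores are typically turned into a decision (threshold plus a maximum hold duration). The 100 ms frame size and 5 s hold come from the description above; the decision loop itself is illustrative, not Krisp's code:
+
+ ```
+ FRAME_MS = 100       # the model scores one 100 ms frame at a time
+ MAX_HOLD_MS = 5_000  # default maximum hold duration
+ THRESHOLD = 0.5      # configurable shift threshold
+
+ def detect_shift(frame_scores):
+     """frame_scores: iterable of per-frame shift confidences in [0, 1].
+
+     Returns the time (ms) at which a turn shift is declared, or None.
+     """
+     for idx, score in enumerate(frame_scores):
+         t_ms = (idx + 1) * FRAME_MS
+         if score >= THRESHOLD:
+             return t_ms   # model is confident the turn has ended
+         if t_ms >= MAX_HOLD_MS:
+             return t_ms   # stop holding after the maximum duration
+     return None
+ ```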
38
+
39
+ ## Performance Comparison
40
+
41
+ | Attribute | Krisp TT | SmartTurn v1 | SmartTurn v2 | VAD-based |
42
+ |-----------|----------|-------------|-------------|-----------|
43
+ | Parameters | 6.1M | 581M | 95M | 260k |
44
+ | Model Size | 65 MB | 2.3 GB | 360 MB | 2.3 MB |
45
+ | Execution | CPU | GPU | GPU | CPU |
46
+
47
+ ## Evaluation Metrics
48
+
49
+ ### Accuracy Results
50
+
51
+ | Model | Balanced Accuracy | AUC Shift | F1 Score Shift | F1 Score Hold | AUC (MST vs FPR) |
52
+ |-------|------------------|-----------|----------------|---------------|-----------------|
53
+ | Krisp TT | 0.82 | 0.89 | 0.80 | 0.83 | 0.21 |
54
+ | VAD-based | 0.59 | -- | 0.48 | 0.70 | -- |
55
+ | SmartTurn V1 | 0.78 | 0.86 | 0.73 | 0.84 | 0.39 |
56
+ | SmartTurn V2 | 0.78 | 0.83 | 0.76 | 0.78 | 0.44 |
57
+
58
+ ### Training Data
59
+
60
+ - Approximately 2,000 hours of conversational speech
61
+ - Around 700,000 speaker turns
62
+ - Test dataset: 1,875 manually labeled audio samples
63
+
64
+ ## Key Performance Findings
65
+
66
+ The Krisp model achieves "considerably faster average response time (0.9 vs. 1.3 seconds at a 0.06 FPR) compared to SmartTurn" while being 5-10 times smaller and optimized for CPU execution.
67
+
68
+ ## Future Development
69
+
70
+ **Planned enhancements include:**
71
+
72
+ - Text-based turn-taking using custom neural networks
73
+ - Multimodal audio-text fusion for improved accuracy
74
+ - Backchannel detection to distinguish meaningful interruptions from casual listening acknowledgments
75
+
76
+ ---
77
+
78
+ *Article published August 5, 2025 by Krisp Engineering Team*
03-finetune-pipecat-pt/references/blogs/krisp_turn_taking_v2.md ADDED
@@ -0,0 +1,54 @@
1
+ ---
2
+ title: "Krisp Turn-Taking v2 - Voice AI VIVA SDK"
3
+ source: https://krisp.ai/blog/krisp-turn-taking-v2-voice-ai-viva-sdk/
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # Audio-Only Turn-Taking Model v2 - Krisp
8
+
9
+ ## Overview
10
+
11
+ Krisp has released Turn-Taking v2, an updated model for detecting end-of-turns in real-time conversational AI systems. The model processes audio input only, making it suitable for "human-bot interactions" and other voice AI applications integrated through Krisp's VIVA SDK.
12
+
13
+ ## Technical Architecture
14
+
15
+ The latest iteration, **krisp-viva-tt-v2**, represents a substantial advancement over its predecessor. According to the engineering team, it was "trained on a more diverse and better-structured dataset, with richer data augmentations that help the model perform more reliably in real-world conditions."
16
+
17
+ ## Key Improvements
18
+
19
+ - Enhanced robustness in noisy environments
20
+ - Superior accuracy when combined with Krisp's Voice Isolation models
21
+ - Faster and more stable turn detection during live conversations
22
+
23
+ ## Performance Benchmarks
24
+
25
+ ### Clean Audio Testing
26
+
27
+ Testing evaluated approximately 1,800 real conversation samples (~1,000 "hold" cases and ~800 "shift" cases) with mild background noise:
28
+
29
+ | Model | Balanced Accuracy | AUC | F1 Score |
30
+ |-------|-------------------|-----|----------|
31
+ | krisp-viva-tt-v1 | 0.82 | 0.89 | 0.804 |
32
+ | **krisp-viva-tt-v2** | **0.823** | **0.904** | **0.813** |
33
+
34
+ ### Noisy Audio Testing (5-15 dB noise levels)
35
+
36
+ | Model | Balanced Accuracy | AUC | F1 Score |
37
+ |-------|-------------------|-----|----------|
38
+ | krisp-viva-tt-v1 | 0.723 | 0.799 | 0.71 |
39
+ | **krisp-viva-tt-v2** | **0.768** | **0.842** | **0.757** |
40
+
41
+ V2 demonstrated "up to a 6% improvement in F1 score under noisy conditions."
42
+
43
+ ### Post-Processing with Voice Isolation
44
+
45
+ After applying background noise and voice removal through krisp-viva-tel-v2:
46
+
47
+ | Model | Balanced Accuracy | AUC | F1 Score |
48
+ |-------|-------------------|-----|----------|
49
+ | krisp-viva-tt-v1 | 0.787 | 0.854 | 0.775 |
50
+ | **krisp-viva-tt-v2** | **0.816** | **0.885** | **0.808** |
51
+
52
+ ## Availability
53
+
54
+ The model is now available as part of Krisp's VIVA SDK, designed for developers building Voice AI agents and conversational systems.
03-finetune-pipecat-pt/references/blogs/livekit_eot_v0_4.md ADDED
@@ -0,0 +1,104 @@
1
+ ---
2
+ title: "Improved End-of-Turn Model Cuts Voice AI Interruptions 39%"
3
+ source: https://livekit.com/blog/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # Improved End-of-Turn Model Cuts Voice AI Interruptions 39%
8
+
9
+ ## Overview
10
+
11
+ LiveKit released an updated transformer-based end-of-turn detection model, `v0.4.1-intl`, featuring "a 39.23% relative reduction in false-positive interruptions" compared to the previous version. The model now emphasizes structured data handling and multilingual performance across supported languages.
12
+
13
+ ## Key Improvements
14
+
15
+ ### Performance Metrics
16
+
17
+ The benchmark demonstrates consistent gains across all tested languages:
18
+
19
+ | Language | v0.3.0 Error Rate | v0.4.1 Error Rate | Relative Improvement |
20
+ |----------|------------------|------------------|----------------------|
21
+ | Chinese | 18.70% | 13.40% | 28.34% |
22
+ | Dutch | 26.10% | 11.90% | 54.33% |
23
+ | English | 16.60% | 13.00% | 21.69% |
24
+ | French | 16.80% | 11.10% | 33.93% |
25
+ | German | 23.40% | 12.20% | 47.86% |
26
+ | Hindi | 5.40% | 3.70% | 31.48% |
27
+ | Indonesian | 17.00% | 10.60% | 37.65% |
28
+ | Italian | 20.10% | 14.90% | 25.87% |
29
+ | Japanese | 19.70% | 11.20% | 43.15% |
30
+ | Korean | 7.90% | 5.50% | 30.38% |
31
+ | Portuguese | 23.30% | 12.60% | 45.97% |
32
+ | Russian | 19.50% | 12.00% | 38.46% |
33
+ | Spanish | 21.50% | 14.00% | 33.88% |
34
+ | Turkish | 25.40% | 12.70% | 50.0% |
35
+ | **All** | **18.66%** | **11.34%** | **39.23%** |
36
+
37
+ Error rate represents the false positive rate at a fixed true positive rate of 99.3%.
38
+
39
+ ## Technical Architecture
40
+
41
+ ### The Challenge of End-of-Turn Detection
42
+
43
+ Voice AI systems must detect speech completion using three primary cues:
44
+
45
+ - **Semantic content**: Word meaning and language understanding
46
+ - **Context**: Dialogue history and conversational flow
47
+ - **Prosody**: Tone, pauses, and rhythmic patterns
48
+
49
+ The model uses an LLM backbone to effectively combine content and context information.
50
+
51
+ ### Structured Data Handling
52
+
53
+ A primary enhancement addresses structured information collection. When users provide phone numbers, emails, or addresses, natural speech markers (intonation changes, grammatical endings) are absent. The updated model "addresses this by inferring expected formats from the agent's prompt," enabling:
54
+
55
+ - Detection of complete phone numbers (typically 10 digits for US numbers)
56
+ - Email pattern recognition
57
+ - Address validation including street, city, state, and zip code
58
+
59
+ The visualization tool demonstrates the improvement: `v0.3.0-intl` incorrectly detected end points after individual digits, while `v0.4.1-intl` waited for the complete sequence.
60
+
61
+ ### Handling Speech-to-Text Variability
62
+
63
+ Training data incorporated multiple STT output formats. One engine might transcribe "forty two" as words while another outputs "42" numerically. By training across "common STT output formats," the model maintains consistent performance across different provider implementations.
64
+
65
+ ### Multilingual Generalization
66
+
67
+ Despite structured data enhancements targeting English training data, improvements transferred across languages -- particularly for phone number and address detection in Spanish, French, and other languages. This stems from the Qwen2.5 base model's multilingual pretraining, which encodes "knowledge of global formats."
68
+
69
+ ## Model Architecture & Optimization
70
+
71
+ ### Base Model Selection
72
+
73
+ The team selected **Qwen2.5-0.5B-Instruct** for its balance of performance and low-latency CPU inference capabilities.
74
+
75
+ ### Knowledge Distillation
76
+
77
+ To improve multilingual accuracy without sacrificing efficiency:
78
+
79
+ 1. A larger **Qwen2.5-7B-Instruct** teacher model was trained
80
+ 2. Knowledge was distilled into the smaller 0.5B student model
81
+ 3. The distilled model achieves "higher multilingual accuracy with the efficiency of the smaller size"
82
+
83
+ Training curves show the distilled model outperforming the baseline 0.5B version and approaching teacher performance after approximately 1,500 steps.
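+
+ A generic sketch of the distillation objective described here: a temperature-scaled KL term pulls the student's next-token distribution toward the teacher's, mixed with the usual hard-label loss. The temperature, mixing weight, and setup are assumptions, not LiveKit's released training code:
+
+ ```
+ import torch
+ import torch.nn.functional as F
+
+ def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
+     """Hard-label cross-entropy mixed with KL to the teacher's soft labels."""
+     ce = F.cross_entropy(student_logits, labels)
+     kl = F.kl_div(
+         F.log_softmax(student_logits / T, dim=-1),
+         F.softmax(teacher_logits / T, dim=-1),
+         reduction="batchmean",
+     ) * (T * T)  # standard temperature-squared scaling
+     return alpha * ce + (1 - alpha) * kl
+ ```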
84
+
85
+ ## Availability & Deprecation
86
+
87
+ - **Deployment**: Available in Agents Python 1.3.0 and Agents JS 1.0.19
88
+ - **Model**: `MultilingualModel` now recommended for all use cases
89
+ - **Deprecation**: The legacy `EnglishModel` is being deprecated as the multilingual version "not only matches but in most cases exceeds the performance"
90
+
91
+ ## Observability Integration
92
+
93
+ LiveKit Agents now include built-in observability for end-of-turn detection. When Agent Observability is enabled, "every turn detection decision is logged with the exact input the model saw." Production debugging accesses `eou_detection` traces showing full prediction context.
94
+
95
+ ## Future Directions
96
+
97
+ Current implementation relies on transcribed text. Future iterations will integrate raw audio features like pauses and emphasis through multimodal architectures, "fusing prosody directly into predictions for more precise detection."
98
+
99
+ ---
100
+
101
+ **Resources**:
102
+ - [GitHub Repository](https://github.com/livekit/agents)
103
+ - [Visualization Tool](https://huggingface.co/spaces/livekit/eot-visualization)
104
+ - [Voice AI Quickstart](https://docs.livekit.io/agents/start/voice-ai/)
03-finetune-pipecat-pt/references/blogs/livekit_transformer_turn_detection.md ADDED
1
+ ---
2
+ title: "Using a Transformer to Improve End of Turn Detection"
3
+ source: https://livekit.com/blog/using-a-transformer-to-improve-end-of-turn-detection
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # Using a Transformer to Improve End of Turn Detection
8
+
9
+ **Published:** December 20, 2024 | **Author:** Russ D'Sa | **Reading Time:** 6 minutes
10
+
11
+ ## Overview
12
+
13
+ End-of-turn detection represents one of the most challenging problems in voice AI applications. The core task involves determining when a user has finished speaking, allowing an AI model to respond without unintentionally interrupting the user.
14
+
15
+ ## Current Approach: Phrase Endpointing
16
+
17
+ The predominant technique for turn detection is phrase endpointing, typically implemented through voice activity detection (VAD). This method operates as follows:
18
+
19
+ - Audio packets are processed through a neural network to determine if human speech is present
20
+ - If speech is detected, the user continues speaking
21
+ - If silence is detected, a timer begins tracking the duration of the absence of detectable human speech
22
+ - Once a configured silence threshold passes, an end-of-turn event triggers, allowing LLM inference to begin
23
+
24
+ LiveKit Agents uses "Silero VAD to detect voice activity and provide timing parameters" to adjust sensitivity. The framework introduces a configurable delay via the `min_endpointing_delay` parameter (default: 500ms) between when VAD detects speech cessation and when LLM inference begins.
25
+
26
+ ## The VAD Limitation
27
+
28
+ VAD only detects *when* someone is speaking based on audio signal presence. Humans, however, use semantic understanding -- analyzing what is said and how it's expressed -- to determine turn-taking. For example, "I understand your point, but..." would trigger VAD's end-of-turn signal despite the human listener recognizing the speaker intends to continue.
29
+
30
+ ## The EOU (End of Utterance) Model
31
+
32
+ LiveKit released "an open source transformer model that uses the content of speech to predict when a user has" finished speaking. The model incorporates semantic understanding into turn detection.
33
+
34
+ ### Model Architecture
35
+
36
+ - **Base Model:** 135M parameter transformer based on SmolLM v2 from HuggingFace
37
+ - **Design Choice:** Small model size enables CPU-based, real-time inference
38
+ - **Context Window:** Sliding window of the last four conversational turns
39
+ - **Language Support:** Currently English transcriptions only; additional languages planned
40
+
41
+ ### How It Works
42
+
43
+ During user speech, transcriptions from a speech-to-text (STT) service are appended word-by-word to the model's context window. For each final STT transcription, the model generates a confidence-level prediction regarding whether the current context represents turn completion.
44
+
45
+ The model integrates with VAD by dynamically adjusting the silence timeout: longer silence periods are permitted when EOU indicates the user hasn't finished speaking, reducing interruptions while maintaining responsiveness.
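+
+ A simplified sketch of that integration (not the LiveKit plugin itself): the EOU probability decides how much post-VAD silence to tolerate before committing to end of turn. The 500 ms floor matches the `min_endpointing_delay` default above; the upper bound and threshold are illustrative:
+
+ ```
+ MIN_ENDPOINTING_DELAY = 0.5  # seconds, framework default mentioned above
+ MAX_ENDPOINTING_DELAY = 3.0  # illustrative cap when the turn looks unfinished
+ EOU_THRESHOLD = 0.7          # illustrative confidence cutoff
+
+ def silence_timeout(eou_probability: float) -> float:
+     """How long (s) to wait after VAD reports silence before ending the turn."""
+     if eou_probability >= EOU_THRESHOLD:
+         return MIN_ENDPOINTING_DELAY  # likely finished: respond quickly
+     # Likely unfinished: wait longer as confidence drops
+     return MIN_ENDPOINTING_DELAY + (1.0 - eou_probability) * (
+         MAX_ENDPOINTING_DELAY - MIN_ENDPOINTING_DELAY
+     )
+ ```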
46
+
47
+ ## Performance Results
48
+
49
+ Compared to VAD alone, the EOU + VAD approach achieves:
50
+
51
+ - **85% reduction** in unintentional interruptions
52
+ - **3% false negative rate** (incorrectly indicating turn continuation)
53
+
54
+ ### Use Cases
55
+
56
+ The model proves particularly valuable for conversational AI and customer support applications requiring data collection:
57
+
58
+ - Conducting interviews
59
+ - Collecting addresses for shipments
60
+ - Gathering phone numbers
61
+ - Processing payment information
62
+ - Ordering transactions (pizza ordering demonstrated)
63
+
64
+ ## Implementation
65
+
66
+ The turn detector is packaged as a LiveKit Agents plugin, enabling simple integration through a single additional parameter in the `VoicePipelineAgent` constructor:
67
+
68
+ ```
69
+ turn_detector=TurnDetector.from_plugin()
70
+ ```
71
+
72
+ Example implementation code is available in the LiveKit agents repository.
73
+
74
+ ## Future Development
75
+
76
+ Planned improvements include:
77
+
78
+ - **Multi-language Support:** Extending EOU beyond English
79
+ - **Inference Optimization:** Reducing current ~50ms inference latency
80
+ - **Context Window Expansion:** Increasing the four-turn window
81
+ - **Audio-Based Detection:** Developing models for multimodal systems that process audio directly, accounting for prosodic features like intonation and cadence
82
+
83
+ The team acknowledges that EOU's text-based training limits its use with natively multimodal models such as OpenAI's Realtime API. An audio-native model is under development to address this constraint.
84
+
85
+ ## Broader Implications
86
+
87
+ The researchers emphasize that "even humans don't get this right all the time" and that non-verbal cues improve human turn-taking performance. They consider this research essential for developing natural, humanlike AI interactions and anticipate significant community innovation in this area.
03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3.md ADDED
@@ -0,0 +1,75 @@
1
+ ---
2
+ title: "Announcing Smart Turn v3: CPU Inference in Just 12ms"
3
+ source: https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # Smart Turn v3: CPU Inference Breakthrough
8
+
9
+ ## Overview
10
+
11
+ Daily announced Smart Turn v3, a dramatically improved voice turn detection model featuring unprecedented efficiency. The system achieves "CPU inference in just 12ms" while maintaining open-source accessibility for weights, training data, and scripts.
12
+
13
+ ## Key Improvements
14
+
15
+ ### Size and Performance
16
+ - **Model size:** Reduced to 8 MB (nearly 50x smaller than v2)
17
+ - **CPU inference speed:** 12ms on modern processors; 60ms on budget AWS instances
18
+ - **No GPU required:** Runs directly within Pipecat Cloud instances
19
+
20
+ ### Language Coverage
21
+ Expanded to 23 languages including Arabic, Bengali, Chinese, Danish, Dutch, German, English, Finnish, French, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.
22
+
23
+ ### Architecture
24
+ - **Foundation:** Whisper Tiny encoder (39M parameters)
25
+ - **Classification layers:** Adapted from Smart Turn v2
26
+ - **Total parameters:** 8M
27
+ - **Optimization:** int8 quantization via static QAT
28
+ - **Export format:** ONNX
29
+
30
+ ## Competitive Comparison
31
+
32
+ | Metric | Smart Turn v3 | Krisp | Ultravox |
33
+ |--------|---------------|-------|----------|
34
+ | Size | 8 MB | 65 MB | 1.37 GB |
35
+ | Languages | 23 | English only | 26 |
36
+ | Availability | Open weights/data | Proprietary | Open weights |
37
+ | Focus | Single-inference latency | Decision confidence | Conversation context |
38
+
39
+ ## Performance Benchmarks
40
+
41
+ CPU inference results (including preprocessing):
42
+
43
+ | Platform | Speed |
44
+ |----------|-------|
45
+ | AWS c7a.2xlarge | 12.6 ms |
46
+ | AWS c8g.2xlarge | 15.2 ms |
47
+ | Modal (6 cores) | 17.7 ms |
48
+ | AWS t3.2xlarge | 33.8 ms |
49
+ | AWS c8g.medium | 59.8 ms |
50
+ | AWS t3.medium | 94.8 ms |
51
+
52
+ GPU performance shows 3.3-6.6ms latency across various NVIDIA processors.
53
+
54
+ ## Accuracy Results
55
+
56
+ Highest performers: Turkish (97.10%), Korean (96.85%), Japanese (96.76%)
57
+
58
+ Lower performers: Vietnamese (81.27%), Bengali (84.10%), Marathi (87.60%)
59
+
60
+ English achieved 94.31% accuracy across 2,846 test samples.
61
+
62
+ ## Implementation
63
+
64
+ ### With Pipecat
65
+ `LocalSmartTurnAnalyzerV3` integration available (v0.0.85+). Users download ONNX model from HuggingFace repository.
66
+
67
+ ### Standalone
68
+ Direct ONNX runtime usage via provided inference scripts. Requires accompanying VAD model (Silero recommended) for optimal results.
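+
+ A bare-bones onnxruntime sketch of the standalone path. It only shows the runtime plumbing: the input name and shape are read from the exported model rather than assumed, and the real feature extraction (and VAD segmentation) is left to the official inference scripts:
+
+ ```
+ import numpy as np
+ import onnxruntime as ort
+
+ session = ort.InferenceSession("smart-turn-v3.onnx", providers=["CPUExecutionProvider"])
+ inp = session.get_inputs()[0]
+ print("model expects:", inp.name, inp.shape, inp.type)
+
+ # Dummy input with the declared shape (dynamic axes replaced by 1), just to
+ # exercise the session; real inputs come from the model's preprocessing step.
+ shape = [d if isinstance(d, int) else 1 for d in inp.shape]
+ features = np.zeros(shape, dtype=np.float32)
+ outputs = session.run(None, {inp.name: features})
+ print("raw end-of-turn output:", outputs[0])
+ ```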
69
+
70
+ ## Resources
71
+
72
+ - Model weights on HuggingFace
73
+ - GitHub repository containing training code and inference examples
74
+ - Open test datasets available for benchmark reproduction
75
+ - Community dataset annotation available at smart-turn-dataset.pipecat.ai
03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3_1.md ADDED
1
+ ---
2
+ title: "Improved Accuracy in Smart Turn v3.1"
3
+ source: https://www.daily.co/blog/improved-accuracy-in-smart-turn-v3-1/
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # Improved Accuracy in Smart Turn v3.1
8
+
9
+ Smart Turn v3.1 represents a significant advancement in conversation turn detection, leveraging expanded training datasets to enhance model performance across languages.
10
+
11
+ ## Overview
12
+
13
+ Smart Turn v3.1 maintains the same architecture as its predecessor while offering improved accuracy through additional human audio training data and refined quantization methods. The update functions as a direct replacement for v3.0, requiring no code modifications to existing implementations.
14
+
15
+ ## Training Data Enhancement
16
+
17
+ The model benefits from contributions by three specialized data partners:
18
+
19
+ **Liva AI** provides "real human voice data to improve speech models" across multiple languages and dialects, founded by researchers with publications in IEEE and machine learning conferences.
20
+
21
+ **Midcentury** develops "multimodal-native research" datasets spanning 12+ languages, emphasizing real-world performance scenarios.
22
+
23
+ **MundoAI** constructs "the world's largest and highest quality multimodal datasets" across 16+ languages, prioritizing multilingual diversity.
24
+
25
+ These partners contributed audio samples in English and Spanish, publicly released through HuggingFace as `smart-turn-data-v3.1-train` and `smart-turn-data-v3.1-test` datasets.
26
+
27
+ ## Model Variants
28
+
29
+ Two deployment options accommodate different hardware configurations:
30
+
31
+ - **CPU Model (8MB, int8 quantized)**: Lightweight variant enabling ~12ms CPU inference, matching v3.0 sizing
32
+ - **GPU Model (32MB, unquantized)**: Enhanced variant for GPU deployment with ~1% accuracy improvement
33
+
34
+ ## Accuracy Improvements
35
+
36
+ Testing on the new v3.1 dataset demonstrates substantial gains:
37
+
38
+ | Language | v3.0 | v3.1 (8MB) | v3.1 (32MB) |
39
+ |----------|------|-----------|-----------|
40
+ | English | 88.3% | 94.7% | 95.6% |
41
+ | Spanish | 86.7% | 90.1% | 91.0% |
42
+
43
+ The model supports 23 total languages; the remaining 21 maintain parity with v3.0 performance.
44
+
45
+ ## Performance Benchmarks
46
+
47
+ Single-inference latency across representative hardware:
48
+
49
+ | Device | v3.1 (8MB) | v3.1 (32MB) | Preprocessing |
50
+ |--------|-----------|-----------|---------------|
51
+ | GPU (NVIDIA L40S) | 2 ms | 1 ms | 1 ms |
52
+ | GPU (NVIDIA T4) | 5 ms | 4 ms | 2 ms |
53
+ | CPU (AWS c7a.2xlarge) | 9 ms | 13 ms | 7 ms |
54
+ | CPU (AWS c8g.2xlarge) | 20 ms | 32 ms | 9 ms |
55
+ | CPU (AWS c7a.medium) | 37 ms | 73 ms | 7 ms |
56
+ | CPU (AWS c8g.medium) | 57 ms | 159 ms | 9 ms |
57
+
58
+ Performance optimization is achievable through environment variable configuration:
59
+
60
+ ```
61
+ OMP_NUM_THREADS=1
62
+ OMP_WAIT_POLICY="PASSIVE"
63
+ ```
64
+
65
+ ## Resources
66
+
67
+ - Model weights: https://huggingface.co/pipecat-ai/smart-turn-v3
68
+ - GitHub repository: https://github.com/pipecat-ai/smart-turn
69
+ - Datasets: https://huggingface.co/pipecat-ai
03-finetune-pipecat-pt/references/blogs/pipecat_smart_turn_v3_2.md ADDED
1
+ ---
2
+ title: "Smart Turn v3.2: Handling Noisy Environments and Short Responses"
3
+ source: https://www.daily.co/blog/smart-turn-v3-2-handling-noisy-environments-and-short-responses/
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # Smart Turn v3.2: Better Accuracy for AI Voice Agents
8
+
9
+ ## Overview
10
+
11
+ Smart Turn v3.2 represents an update to an open-source turn detection model designed to help AI voice agents identify when users finish speaking. The model listens to raw audio and determines optimal response timing -- preventing interruptions while eliminating unnecessary delays.
12
+
13
+ ## Key Improvements in v3.2
14
+
15
+ ### Short Utterances Enhancement
16
+ The model now handles brief verbal responses significantly better. Single words like "yes" or "okay" are "miscategorized 40% less often according to our public benchmarks." Two changes enabled this improvement:
17
+
18
+ - Introduction of a new dataset focused on short utterances (planned for expansion)
19
+ - Resolution of a padding issue during training that was compromising accuracy
20
+
21
+ ### Background Noise Robustness
22
+ The updated version performs better in real-world environments by incorporating realistic cafe and office noise into training and testing datasets, moving beyond studio-quality audio assumptions.
23
+
24
+ ## Technical Specifications
25
+
26
+ **Model Variants:**
27
+ - CPU version: 8MB
28
+ - GPU version: 32MB
29
+
30
+ Both serve as drop-in replacements for v3.1, maintaining compatibility with existing implementations.
31
+
32
+ ## Available Resources
33
+
34
+ **Open-source components:**
35
+ - Model weights: Available on HuggingFace
36
+ - Training code: Published on GitHub
37
+ - Datasets: Two new datasets released for training and testing purposes
38
+
39
+ ## Integration
40
+
41
+ The model integrates with Pipecat through the `LocalSmartTurnAnalyzerV3` constructor, allowing immediate use via the `smart_turn_model_path` parameter.
03-finetune-pipecat-pt/references/blogs/speechmatics_semantic_turn.md ADDED
1
+ ---
2
+ title: "How to Build Smarter Turn Detection for Voice AI"
3
+ source: https://blog.speechmatics.com/semantic-turn-detection
4
+ date_accessed: 2026-03-16
5
+ ---
6
+
7
+ # How to Build Smarter Turn Detection for Voice AI
8
+
9
+ **Published:** May 12, 2025 | **Author:** Aaron Ng, Machine Learning Engineer | **Read Time:** 13 minutes
10
+
11
+ ## Introduction
12
+
13
+ Voice AI systems face a fundamental challenge: determining when users have finished speaking. Traditional approaches relying on fixed silence periods (like 500ms) create frustrating interruptions. This article explores how semantic understanding can improve turn detection.
14
+
15
+ ## The Problem with Traditional Voice Activity Detection
16
+
17
+ VAD systems detect speech boundaries through audio patterns alone. As the article explains, "VAD only understands audio patterns. It knows _when_ there's speech and when there isn't. What it doesn't know is _why_ there's a pause."
18
+
19
+ The example demonstrates this limitation: when a customer pauses to check information ("Sure it's 123 764... (pauses to check their notes)"), VAD incorrectly interprets the silence as turn completion and triggers an interruption.
20
+
21
+ ## Semantic Turn Detection Solution
22
+
23
+ Rather than relying solely on silence detection, the proposed approach uses instruction-tuned Small Language Models (SLMs) to understand conversational context. These models contain fewer than 10 billion parameters, enabling fast local inference critical for voice interactions.
24
+
25
+ ### Key Benefits
26
+
27
+ - **Reduced Latency**: Local SLM inference avoids the 500ms+ delays of external API calls
28
+ - **Cost Savings**: Fewer false interruptions mean fewer unnecessary LLM API calls for responses that get discarded
29
+ - **Better UX**: Systems recognize contextual pauses versus actual turn completion
30
+
31
+ ## Technical Implementation
32
+
33
+ ### Core Mechanism
34
+
35
+ The approach monitors the probability of the `<|im_end|>` token -- a special ChatML marker indicating turn completion. When this token's probability is high, the model predicts the user has finished speaking.
36
+
37
+ The article illustrates this with examples:
38
+ - "Can I have two chicken McNuggets and" -> Low `<|im_end|>` probability (incomplete thought)
39
+ - "I have a problem with my card" -> Higher `<|im_end|>` probability (complete thought)
40
+
41
+ ### Architecture Components
42
+
43
+ **1. Message Tokenization**
44
+
45
+ Messages are formatted using ChatML structure with special tokens (`<|im_start|>`, `<|im_end|>`, `<|im_sep|>`). The implementation removes the final `<|im_end|>` token since the model predicts its presence.
46
+
47
+ **2. Model Inference**
48
+
49
+ The code uses `AutoModelForCausalLM` to extract logits for the final token position, computing log-softmax probabilities across the vocabulary.
50
+
51
+ **3. Token Probability Extraction**
52
+
53
+ The system identifies the `<|im_end|>` token among top-k predictions (k=20) and converts log probabilities to standard probabilities using exponential transformation.
54
+
55
+ ### Code Example Structure
56
+
57
+ The implementation includes an `EndOfTurnModel` class with:
58
+
59
+ - `_convert_messages_to_chatml()`: Formats conversations in ChatML format
60
+ - `get_next_token_logprobs()`: Performs local inference to retrieve next-token probabilities
61
+ - `process_result()`: Extracts target token probabilities
62
+ - `predict_eot_prob()`: Orchestrates the full pipeline returning probability scores
63
+
64
+ **Model Details**: The implementation uses `SmolLM2-360M-Instruct` from Hugging Face, chosen for efficiency and CPU compatibility.
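+
+ A condensed sketch of that pipeline with Hugging Face transformers (repo id assumed to be `HuggingFaceTB/SmolLM2-360M-Instruct`; prompt handling and thresholding are simplified relative to the article's full `EndOfTurnModel` class):
+
+ ```
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
+ model = AutoModelForCausalLM.from_pretrained(MODEL)
+
+ def eot_probability(messages) -> float:
+     """Probability that the next token is <|im_end|>, i.e. the turn is complete."""
+     text = tokenizer.apply_chat_template(messages, tokenize=False)
+     text = text.rsplit("<|im_end|>", 1)[0]  # drop the final end-of-turn marker
+     inputs = tokenizer(text, return_tensors="pt")
+     with torch.no_grad():
+         logits = model(**inputs).logits[0, -1]  # next-token logits
+     probs = torch.softmax(logits, dim=-1)
+     end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+     return probs[end_id].item()
+
+ print(eot_probability([{"role": "user", "content": "I have a problem with my card"}]))
+ ```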
65
+
66
+ ## Configuration Parameters
67
+
68
+ - **MAX_HISTORY**: 4 messages (recent context window)
69
+ - **DEFAULT_THRESHOLD**: 0.03 (default probability threshold)
70
+ - **Top-k consideration**: 20 tokens
71
+
72
+ ## Practical Considerations
73
+
74
+ ### Token Selection Strategy
75
+
76
+ Beyond `<|im_end|>`, punctuation marks (periods, question marks, exclamation points) can signal thought completion. Hybrid approaches monitoring multiple tokens may improve accuracy but risk introducing noise.
77
+
78
+ ### Hybrid Approach with VAD
79
+
80
+ The article recommends combining semantic detection with VAD:
81
+ - VAD provides high recall detecting speech presence
82
+ - Semantic turn detection improves precision to reduce false interruptions
83
+ - Dynamic grace periods can adjust based on speaking patterns
84
+
85
+ ### Threshold Determination
86
+
87
+ The default 0.03 threshold serves as a starting point. "The optimal threshold depends on your specific SLM and can vary significantly between models." Precision-focused optimization using representative test sets is recommended.
88
+
89
+ ### Multilingual Support
90
+
91
+ Supporting multiple languages requires instruction-tuned SLMs trained on diverse multilingual data, accounting for varied grammar and conversational norms.
92
+
93
+ ### Fine-Tuning Opportunities
94
+
95
+ While general SLMs provide solid conversational understanding, task-specific fine-tuning on domain conversational data can improve accuracy for specialized terminology or industry-specific use cases.
96
+
97
+ ## Future Directions
98
+
99
+ The article acknowledges limitations: "True conversational understanding requires more than just reading text or listening for gaps -- it needs an audio-native model" that processes tone, cadence, hesitation, and emphasis directly from audio signals.
100
+
101
+ ## Resource Reference
102
+
103
+ A complete implementation is available on GitHub at the repository referenced in the article.
03-finetune-pipecat-pt/references/guides/focal_loss_calibration_emnlp_2022.md ADDED
@@ -0,0 +1,209 @@
1
+ ---
2
+ title: "Calibrating Imbalanced Classifiers with Focal Loss: An Empirical Study"
3
+ source: https://aclanthology.org/2022.emnlp-industry.14/
4
+ date: 2026-03-16
5
+ ---
6
+
7
+ # Calibrating Imbalanced Classifiers with Focal Loss: An Empirical Study
8
+
9
+ **Authors:** Cheng Wang, Jorge Balazs, Gyuri Szarvas, Patrick Ernst, Lahari Poddar, Pavel Danchenko (Amazon)
10
+
11
+ **Venue:** EMNLP 2022 Industry Track, December 9-11, Abu Dhabi, UAE, pages 155-163
12
+
13
+ **DOI:** 10.18653/v1/2022.emnlp-industry.14
14
+
15
+ **PDF:** https://aclanthology.org/2022.emnlp-industry.14.pdf
16
+
17
+ ## Abstract
18
+
19
+ Imbalanced data distribution is a practical and common challenge in building ML models in industry, where data usually exhibits long-tail distributions. For instance, in virtual AI Assistants (Google Assistant, Amazon Alexa, Apple Siri), the "play music" or "set timer" utterance is exposed to an order of magnitude more traffic than other skills. This can easily cause trained models to overfit to the majority classes, leading to model miscalibration. The uncalibrated models output unreliable (mostly overconfident) predictions, which are at high risk of affecting downstream decision-making systems. The authors empirically show the effectiveness of model training with focal loss in learning better calibrated models, as compared to standard cross-entropy loss. Better calibration enables better control of the precision-recall trade-off for trained models.
20
+
21
+ ## 1. Introduction
22
+
23
+ Building ML models in industry faces practical challenges from imbalanced data distributions, particularly long-tail distributions, which make models overfit to majority data classes and lead to miscalibration -- the model-predicted probability fails to estimate the likelihood of true correctness and provides over- or under-confident predictions.
24
+
25
+ The study focuses on **return reason code prediction** in customer service chatbots as a practical application of focal loss for calibration.
26
+
27
+ ### Contributions
28
+
29
+ - Empirically examine the effectiveness of using focal loss in handling model miscalibration in a practical application setting
30
+ - Show that good calibration is important to achieve a desired precision or recall target by tuning classification thresholds (standard cross-entropy loss is incapable of this due to skewed predicted probability distribution)
31
+ - Demonstrate performance of calibrated models through a chatbot serving customers across three conversational bot use-cases
32
+
33
+ ## 2. Focal Loss Formulation
34
+
35
+ Focal loss is defined as:
36
+
37
+ ```
38
+ L_f = - sum_{i=1}^{N} (1 - p_{i,y_i})^gamma * log(p_{i,y_i})
39
+ ```
40
+
41
+ where `p_{i,y_i}` is the predicted probability of the true class `y_i` for the i-th sample and `gamma` is a hyper-parameter typically set to `gamma = 1`.
42
+
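+ A minimal PyTorch rendering of this loss is sketched below (illustrative, not the authors' code); with `gamma = 0` it reduces to standard cross-entropy.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def focal_loss(logits, targets, gamma=1.0):
+     """Multi-class focal loss: -(1 - p_y)^gamma * log(p_y), summed over the batch."""
+     log_p = F.log_softmax(logits, dim=-1)                       # (N, C)
+     log_p_y = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_{i, y_i}
+     p_y = log_p_y.exp()
+     return -((1.0 - p_y) ** gamma * log_p_y).sum()
+
+ logits = torch.randn(4, 5)            # e.g. 5 return-reason classes
+ targets = torch.tensor([0, 4, 4, 2])
+ print(focal_loss(logits, targets, gamma=5.0))
+ ```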
43
+ ### Theoretical Interpretation
44
+
45
+ Focal loss can be interpreted as a trade-off between minimizing KL divergence and maximizing entropy:
46
+
47
+ ```
48
+ L_f >= KL(q || p) + H(q) - gamma * H(p)
49
+ ```
50
+
51
+ The rationale: we learn a probability `p` to have a high value (confident) due to the KL term, but not too high (overconfident) due to the entropy regularization term.
52
+
53
+ ### Practical Advantages
54
+
55
+ Compared to other calibration methods (temperature scaling, Bayesian methods, label smoothing, kernel-based methods), focal loss:
56
+ - Neither increases computational overhead nor requires architectural modifications
57
+ - Offers **in-training implicit calibration** (unlike temperature scaling which requires post-training calibration)
58
+
59
+ ## 3. Calibration Metrics
60
+
61
+ ### Reliability Diagrams
62
+
63
+ Predictions are grouped into N interval bins; accuracy vs. confidence is computed per bin:
64
+
65
+ ```
66
+ acc(b_n) = (1/I_n) * sum_i 1(y_hat_i = y_i)
67
+ conf(b_n) = (1/I_n) * sum_i p_hat_i
68
+ ```
69
+
70
+ A perfectly calibrated model has `acc(b_n) = conf(b_n)` for all n.
71
+
72
+ ### Expected Calibration Error (ECE)
73
+
74
+ ```
75
+ ECE = (1/I) * sum_{n=1}^{N} I_n * |acc(b_n) - conf(b_n)|
76
+ ```
77
+
78
+ ### Maximum Calibration Error (MCE)
79
+
80
+ ```
81
+ MCE = max_n |acc(b_n) - conf(b_n)|
82
+ ```
83
+
84
+ Particularly important in high-risk applications. For a perfectly calibrated classifier, both ECE and MCE equal 0.
85
+
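+ Both metrics can be computed directly from per-sample confidence and correctness, as in the short sketch below (equal-width bins; binning details vary across papers).
+
+ ```python
+ import numpy as np
+
+ def ece_mce(confidences, correct, n_bins=10):
+     """Expected / maximum calibration error from confidence and correctness arrays."""
+     confidences = np.asarray(confidences)
+     correct = np.asarray(correct, dtype=float)
+     bins = np.linspace(0.0, 1.0, n_bins + 1)
+     ece, mce = 0.0, 0.0
+     for lo, hi in zip(bins[:-1], bins[1:]):
+         in_bin = (confidences > lo) & (confidences <= hi)
+         if not in_bin.any():
+             continue
+         gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
+         ece += in_bin.mean() * gap      # weight by bin mass I_n / I
+         mce = max(mce, gap)
+     return ece, mce
+
+ print(ece_mce([0.95, 0.9, 0.8, 0.6], [1, 1, 0, 1]))
+ ```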
86
+ ## 4. Datasets and Implementation
87
+
88
+ ### Task
89
+
90
+ Binary and multi-class return reason code prediction in an online retail store:
91
+ - **Binary:** "item is defective or does not work" (Label 0) vs. OTHERS (Label 1)
92
+ - **Multi-class:** 4 specific return reasons + OTHERS (5 total classes)
93
+
94
+ Both datasets exhibit class imbalance with OTHERS as the most frequent class.
95
+
96
+ ### Model Architecture
97
+
98
+ - 2 bidirectional LSTM layers + 2 dense layers
99
+ - Embedding dimension: 1024
100
+ - Hidden layer dimension: 128 (binary), 512 (multi-class)
101
+ - Dropout: 0.1 (embedding), 0.2 (dense)
102
+ - Optimizer: Adam
103
+ - Framework: PyTorch
104
+
105
+ ### Golden Dataset
106
+
107
+ - Binary model: 1,013 human-annotated samples
108
+ - Multi-class model: 1,839 human-annotated samples
109
+ - Data split: 8:1:1 (train/val/test)
110
+
111
+ ## 5. Results
112
+
113
+ ### Binary Reason Code Prediction (Table 1)
114
+
115
+ | Metric | CE | FL1 | FL3 | FL5 | FL10 |
116
+ |--------|------|------|------|------|------|
117
+ | Accuracy | **0.836** | 0.824 | 0.831 | 0.834 | 0.816 |
118
+ | Precision | **0.838** | 0.822 | 0.830 | 0.834 | 0.807 |
119
+ | Recall | **0.823** | 0.814 | 0.821 | 0.823 | 0.805 |
120
+ | F1 | **0.828** | 0.817 | 0.824 | 0.827 | 0.806 |
121
+ | NLL | 2.159 | 1.438 | 0.608 | 0.258 | **0.178** |
122
+ | ECE | 0.168 | 0.166 | 0.139 | 0.080 | **0.078** |
123
+ | MCE | 0.720 | 0.730 | 0.236 | **0.134** | 0.143 |
124
+
125
+ **Key finding:** CE achieves best predictive performance, but FL significantly outperforms on calibration metrics (NLL, ECE, MCE). Higher gamma yields better calibration with modest predictive performance loss.
126
+
127
+ ### Multi-Reason Code Prediction (Table 2)
128
+
129
+ | Metric | CE | FL1 | FL5 |
130
+ |--------|------|------|------|
131
+ | Accuracy | 0.751 | **0.760** | 0.751 |
132
+ | Precision | **0.814** | 0.807 | **0.814** |
133
+ | Recall | **0.757** | 0.755 | **0.757** |
134
+ | F1 | **0.764** | 0.760 | **0.764** |
135
+ | NLL | 0.599 | 0.429 | **0.309** |
136
+ | ECE | 0.037 | **0.023** | 0.037 |
137
+ | MCE | 0.296 | **0.197** | 0.299 |
138
+
139
+ ### Reliability Diagrams
140
+
141
+ - CE model: ECE = 16.78%, MCE = 72.00% (binary)
142
+ - FL5 model: ECE = 7.98%, MCE = 13.40% (binary)
143
+ - FL10 model: ECE = 7.76%, MCE = 14.34% (binary)
144
+
145
+ As gamma increases from CE to FL10, probability distributions shift from "spiking" (overconfident, p close to 0 or 1) to flatter distributions (e.g., p = {0.6, 0.4}).
146
+
147
+ ### Precision-Recall Trade-Off
148
+
149
+ CE model produces polarized probabilities, making it difficult to tune precision based on a given recall or vice versa. FL models learn better-distributed probabilities across [0, 1], enabling effective threshold-based precision/recall tuning.
150
+
151
+ ## 6. Deployment Results (Online A/B Test)
152
+
153
+ Model deployed in three conversational chatbot use-cases with threshold=0.512 for 85% target precision:
154
+
155
+ ### Online Evaluation (Table 3) -- Relative Improvements
156
+
157
+ | Application | Metric | Treatment vs. Control |
158
+ |------------|--------|----------------------|
159
+ | Use-case A | AR | +2.13% |
160
+ | Use-case A | PRR | +3.18% |
161
+ | Use-case A | 24RR | -0.65% |
162
+ | Use-case B | AR | +2.10% |
163
+ | Use-case B | PRR | +0.97% |
164
+ | Use-case B | 24RR | -0.68% |
165
+ | Use-case C | AR | +3.98% |
166
+ | Use-case C | PRR | +12.85% |
167
+ | Use-case C | 24RR | -1.02% |
168
+
169
+ **Metrics:**
170
+ - **AR (Automation Rate):** % contacts resolved without human involvement (higher = better)
171
+ - **PRR (Positive Response Rate):** % positive customer responses to chatbot resolution (higher = better)
172
+ - **24RR (Repeat Rate):** % customers contacting again within 24h for same issue (lower = better)
173
+
174
+ ### Intrinsic Evaluation
175
+
176
+ - Precision on deployed model: 384/485 = 83.8% (aligns with offline 81.4%)
177
+ - Negative predictive value: 194/200 = 97% for OTHERS class
178
+
179
+ ## 7. Key Takeaways
180
+
181
+ 1. Focal loss provides simple, effective in-training calibration via entropy regularization
182
+ 2. Higher gamma values improve calibration without significantly hurting predictive performance
183
+ 3. Well-calibrated models enable practical precision/recall threshold tuning that CE-trained models cannot achieve
184
+ 4. The discrimination-calibration trade-off is modest -- best model balances both (FL5 recommended)
185
+ 5. Better calibration directly translates to improved downstream application metrics
186
+
187
+ ## Citation
188
+
189
+ ```bibtex
190
+ @inproceedings{wang-etal-2022-calibrating,
191
+ title = "Calibrating Imbalanced Classifiers with Focal Loss: An Empirical Study",
192
+ author = {Wang, Cheng and Balazs, Jorge and Szarvas, Gy{\"o}rgy and Ernst, Patrick and Poddar, Lahari and Danchenko, Pavel},
193
+ booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track",
194
+ month = dec,
195
+ year = "2022",
196
+ address = "Abu Dhabi, UAE",
197
+ publisher = "Association for Computational Linguistics",
198
+ pages = "155--163",
199
+ doi = "10.18653/v1/2022.emnlp-industry.14"
200
+ }
201
+ ```
202
+
203
+ ## References (Selected)
204
+
205
+ - Lin et al., 2017 -- Focal loss for dense object detection (original focal loss paper)
206
+ - Mukhoti et al., 2020 -- Calibrating deep neural networks using focal loss (NeurIPS)
207
+ - Guo et al., 2017 -- On calibration of modern neural networks (ICML)
208
+ - Pereyra et al., 2017 -- Regularizing neural networks by penalizing confident output distributions
209
+ - Naeini et al., 2015 -- Obtaining well calibrated probabilities using Bayesian binning (AAAI)
03-finetune-pipecat-pt/references/guides/livekit_turn_detector.md ADDED
@@ -0,0 +1,170 @@
1
+ ---
2
+ title: "LiveKit Turn Detector Model Card"
3
+ source: https://huggingface.co/livekit/turn-detector
4
+ date: 2026-03-16
5
+ ---
6
+
7
+ # LiveKit Turn Detector
8
+
9
+ An open-weights language model for contextually-aware end-of-utterance (EOU) detection in voice AI applications. The model predicts whether a user has finished speaking based on the semantic content of their transcribed speech, providing a critical complement to voice activity detection (VAD) systems.
10
+
11
+ > For installation, usage examples, and integration guides, see the [LiveKit documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/).
12
+
13
+ ## Overview
14
+
15
+ Traditional voice agents rely on voice activity detection (VAD) to determine when a user has finished speaking. VAD works by detecting the presence or absence of speech in an audio signal and applying a silence timer. While effective for detecting pauses, VAD lacks language understanding and frequently causes false positives.
16
+
17
+ **Example:** A user who says *"I need to think about that for a moment..."* and then pauses will be interrupted by a VAD-only system, even though they clearly intend to continue.
18
+
19
+ This model adds semantic understanding to the turn detection process by:
20
+ - Analyzing transcribed text of conversations in real time
21
+ - Predicting the probability that the user has completed their turn
22
+ - Reducing unwanted interruptions when integrated with VAD
23
+ - Handling structured data input effectively (addresses, phone numbers, email addresses, credit card numbers)
24
+
25
+ ## Model Variants
26
+
27
+ **Multilingual** (recommended) and **English-only** (deprecated) are distributed as INT8 quantized ONNX models (`model_q8.onnx`) optimized for CPU inference.
28
+
29
+ > **The English-only model (`EnglishModel`) is deprecated.** Use the **multilingual model (`MultilingualModel`)** for all new projects. The multilingual model provides better accuracy across all languages thanks to knowledge distillation from a larger teacher model and expanded training dataset.
30
+
31
+ ## How It Works
32
+
33
+ The model operates on transcribed text from a speech-to-text (STT) system, not raw audio.
34
+
35
+ ### Process Flow
36
+
37
+ 1. **Input**: Recent conversation history (up to **6 turns**, truncated to **128 tokens**) is formatted using the [Qwen chat template](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) with `<|im_start|>` / `<|im_end|>` delimiters. The final user message is left _without_ the closing `<|im_end|>` token.
38
+
39
+ 2. **Prediction**: The model predicts the probability of the `<|im_end|>` token appearing next:
40
+ - **High probability** -> user has likely finished their utterance
41
+ - **Low probability** -> user is likely to continue
42
+
43
+ 3. **Thresholding**: Per-language thresholds (stored in `languages.json`) convert raw probability into a binary decision, tuned to balance responsiveness and accuracy for each supported language.
44
+
45
+ 4. **Integration with VAD**: Works alongside the [Silero VAD](https://docs.livekit.io/agents/logic/turns/vad/) plugin. VAD handles speech presence detection and interruption triggering, while this model provides the semantic signal for when to commit a turn.
46
+
47
+ ### Text Preprocessing
48
+
49
+ **Multilingual variant** applies:
50
+ - NFKC unicode normalization
51
+ - Lowercasing
52
+ - Punctuation removal (preserving apostrophes and hyphens)
53
+ - Whitespace collapsing
54
+
55
+ **English-only variant** passes raw transcribed text without normalization.
56
+
57
+ ## Architecture and Training
58
+
59
+ ### Base Model
60
+
61
+ Both variants are fine-tuned from [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct), selected for strong performance on this task while enabling low-latency CPU inference.
62
+
63
+ **Model Specifications:**
64
+ - **Parameters**: 0.1B
65
+ - **Tensor Type**: F32
66
+ - **Chat Template**: Qwen format
67
+
68
+ ### Knowledge Distillation
69
+
70
+ A **Qwen2.5-7B-Instruct** model was first fine-tuned as a teacher on end-of-turn prediction. Its knowledge was then distilled into the 0.5B student model:
71
+ - Distilled model approaches teacher-level accuracy
72
+ - Maintains efficiency of smaller architecture
73
+ - Converges after approximately 1,500 training steps
74
+
75
+ ### Training Data
76
+
77
+ The training dataset is a mix of:
78
+
79
+ - **Real call center transcripts** covering diverse conversational patterns
80
+ - **Synthetic dialogues** emphasizing structured data input:
81
+ - Addresses
82
+ - Email addresses
83
+ - Phone numbers
84
+ - Credit card numbers
85
+ - **Multi-format STT outputs** to handle provider variation (e.g., "forty two" vs. "42"), ensuring consistent predictions across different STT engines without runtime overhead
86
+
87
+ *Note: Structured data enhancements were added only to the English training set, but performance improvements generalized across languages due to the multilingual knowledge encoded in Qwen2.5.*
88
+
89
+ ### Quantization
90
+
91
+ The trained model is exported to ONNX format and quantized to INT8 (`model_q8.onnx`), enabling efficient CPU-only inference with ONNX Runtime.
92
+
93
+ ## Supported Languages
94
+
95
+ The multilingual model supports **14 languages**. The model relies on the STT provider to report the detected language, which is then used to select the appropriate per-language threshold.
96
+
97
+ **Supported:** English, Spanish, French, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Indonesian, Turkish, Russian, Hindi
98
+
99
+ ## Benchmarks
100
+
101
+ ### Detection Accuracy (Multilingual Variant)
102
+
103
+ - **True positive** -- the model correctly identifies the user has finished speaking
104
+ - **True negative** -- the model correctly identifies the user will continue speaking
105
+
106
+ | Language | True Positive Rate | True Negative Rate |
107
+ |----------|-------------------|-------------------|
108
+ | Hindi | 99.4% | 96.3% |
109
+ | Korean | 99.3% | 94.5% |
110
+ | French | 99.3% | 88.9% |
111
+ | Indonesian | 99.3% | 89.4% |
112
+ | Japanese | 99.3% | 88.8% |
113
+ | Dutch | 99.3% | 88.1% |
114
+ | Russian | 99.3% | 88.0% |
115
+ | German | 99.3% | 87.8% |
116
+ | Portuguese | 99.4% | 87.4% |
117
+ | Turkish | 99.3% | 87.3% |
118
+ | English | 99.3% | 87.0% |
119
+ | Chinese | 99.3% | 86.6% |
120
+ | Spanish | 99.3% | 86.0% |
121
+ | Italian | 99.3% | 85.1% |
122
+
123
+ ### Improvement Over Prior Version
124
+
125
+ The multilingual v0.4.1 release achieved a **39.23% relative improvement** in handling structured inputs (emails, addresses, phone numbers, credit card numbers) compared to the prior version, reducing premature interruptions during data collection scenarios.
126
+
127
+ ## Usage
128
+
129
+ The model is designed for use as a turn detection plugin within the [LiveKit Agents](https://github.com/livekit/agents) framework.
130
+
131
+ For complete installation instructions, code examples (Python and Node.js), and configuration options, see the **[LiveKit turn detector plugin documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/)**.
132
+
133
+ For broader context on how turn detection fits into the voice pipeline -- including VAD configuration, interruption handling, and manual turn control -- see the **[Turns overview](https://docs.livekit.io/agents/logic/turns/)**.
134
+
135
+ ## Deployment Requirements
136
+
137
+ - **Runtime**: CPU-only (no GPU required). Uses [ONNX Runtime](https://onnxruntime.ai/) with the `CPUExecutionProvider`.
138
+ - **RAM**: <500 MB for the multilingual model
139
+ - **Instance type**: Use compute-optimized instances (e.g., AWS c6i, c7i). Avoid burstable instances (e.g., AWS t3, t4g) to prevent inference timeouts from CPU credit exhaustion.
140
+ - **LiveKit Cloud**: The model is deployed globally on LiveKit Cloud. Agents running there automatically use the optimized remote inference service with no local resource requirements.
141
+
142
+ ## Limitations
143
+
144
+ - **Text-only input**: The model operates on STT-transcribed text and cannot incorporate prosodic cues such as pauses, intonation, or emphasis. Future versions may integrate multimodal audio features.
145
+ - **STT dependency**: Prediction quality depends on the accuracy and output format of the upstream STT provider. Mismatches between training and deployment STT formats may degrade performance.
146
+ - **Context window**: Limited to 128 tokens across a maximum of 6 conversation turns.
147
+ - **Language coverage**: Currently supports 14 languages. Performance on unsupported languages is undefined.
148
+ - **Realtime model compatibility**: Cannot be used with audio-native realtime models (e.g., OpenAI Realtime API) without adding a separate STT service, which incurs additional cost and latency.
149
+
150
+ ## License
151
+
152
+ This model is released under the [LiveKit Model License](https://huggingface.co/livekit/turn-detector/blob/main/LICENSE).
153
+
154
+ ## Resources
155
+
156
+ - **[Documentation](https://docs.livekit.io/agents/logic/turns/turn-detector/)**: Full plugin documentation, installation, and integration guide
157
+ - **[Turns Overview](https://docs.livekit.io/agents/logic/turns/)**: How turn detection fits into the LiveKit Agents voice pipeline
158
+ - **[Blog: Improved End-of-Turn Model](https://blog.livekit.io/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/)**: Technical deep dive on multilingual distillation approach and benchmarks
159
+ - **[Blog: Using a Transformer for Turn Detection](https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection/)**: Original blog post introducing the concept and architecture
160
+ - **[Video: LiveKit Turn Detector](https://youtu.be/OZG0oZKctgw)**: Overview video demonstrating the plugin
161
+ - **[GitHub: Plugin Source](https://github.com/livekit/agents/tree/main/livekit-plugins/livekit-plugins-turn-detector)**: Source code for the `livekit-plugins-turn-detector` package
162
+ - **[PyPI](https://pypi.org/project/livekit-plugins-turn-detector/)** | **[npm](https://www.npmjs.com/package/@livekit/agents-plugin-livekit)**: Package registries
163
+
164
+ ---
165
+
166
+ **Model Stats:**
167
+ - Downloads last month: 240,507
168
+ - Model size: 0.1B params
169
+ - Format: Safetensors (INT8 quantized ONNX)
170
+ - Base Model: Qwen/Qwen2.5-0.5B-Instruct
03-finetune-pipecat-pt/references/guides/pipecat_data_generation_guide.md ADDED
@@ -0,0 +1,33 @@
1
+ ---
2
+ title: "Pipecat Smart Turn - Data Generation Contribution Guide"
3
+ source: https://raw.githubusercontent.com/pipecat-ai/smart-turn/main/docs/data_generation_contribution_guide.md
4
+ date: 2026-03-16
5
+ ---
6
+
7
+ # Contributing Training Data to Smart Turn
8
+
9
+ ## Audio Format Requirements
10
+
11
+ **FLAC is the preferred format** for training data; lossy formats like MP3 or Opus are discouraged. Audio should be **mono with a bit depth of 16 bits**, and a 16kHz sample rate is optimal, though higher rates are acceptable.
12
+
13
+ ## File Organization
14
+
15
+ Contributors have flexibility in naming and directory structure. The recommended approach uses unique identifiers for files combined with directory labels by language and completeness status (e.g., `eng/incomplete/b3799254-8d6c-11f0-a90e-e7e92780240b.flac`).
16
+
17
+ ## Audio Length and Variation
18
+
19
+ Each file must contain **one speech sample, no longer than 16 seconds.** Variety in length is encouraged, ranging from single words to complex sentences.
20
+
21
+ ## Content Guidelines
22
+
23
+ Samples should represent **a single turn in the conversation** with only one speaker per file. Speech should resemble interactions with voice assistants or customer service representatives. The documentation emphasizes avoiding sentence repetition and background noise while excluding real personally identifiable information.
24
+
25
+ ## Complete vs. Incomplete Classification
26
+
27
+ Samples require binary labeling. "Complete" samples represent finished thoughts suitable for immediate response, while "Incomplete" samples suggest the speaker will continue, ending with filler words, connectives, or suggestive prosody. Critically, incomplete samples **must not be cut off in the middle of a word** but rather end with full words and approximately 200ms of silence.
28
+
29
+ A 50:50 split between categories is recommended for unbiased training.
30
+
31
+ ## Licensing and Submission
32
+
33
+ Contributors must own recordings and secure speaker consent for public release. Datasets are published via HuggingFace. Submissions can occur through various cloud storage methods, with GitHub issues as the contact point.
03-finetune-pipecat-pt/references/guides/pipecat_dataset_v3_2.md ADDED
@@ -0,0 +1,93 @@
1
+ ---
2
+ title: "Pipecat Smart Turn Data v3.2 - Training Dataset Card"
3
+ source: https://huggingface.co/datasets/pipecat-ai/smart-turn-data-v3.2-train
4
+ date: 2026-03-16
5
+ ---
6
+
7
+ # Smart Turn Data v3.2 Training Dataset
8
+
9
+ The official training dataset for **Smart Turn v3.2**, hosted on Hugging Face Datasets.
10
+
11
+ ## Key Specifications
12
+
13
+ | Attribute | Value |
14
+ |-----------|-------|
15
+ | **Organization** | Pipecat |
16
+ | **Dataset Size** | 100K - 1M rows |
17
+ | **Total Rows** | 270,946 |
18
+ | **Download Size** | 41.4 GB |
19
+ | **Parquet Size** | 41.4 GB |
20
+ | **Format** | Parquet |
21
+ | **Split** | train (271k rows) |
22
+ | **Subset** | default (271k rows) |
23
+
24
+ ## Data Modalities
25
+
26
+ - **Audio** - Audio samples with duration ranging from 0.36s to 32.6s
27
+ - **Text** - Language and metadata fields
28
+
29
+ ## Supported Libraries
30
+
31
+ - Datasets
32
+ - Dask
33
+ - Polars
34
+
35
+ ## Dataset Schema
36
+
37
+ ### Fields
38
+
39
+ | Field | Type | Description |
40
+ |-------|------|-------------|
41
+ | `audio` | AudioObject | Audio samples |
42
+ | `id` | Text | UUID identifier (36 chars) |
43
+ | `language` | Text | 23 language classes |
44
+ | `endpoint_bool` | Boolean | End-of-turn indicator |
45
+ | `midfiller` | Boolean | Mid-utterance filler detection |
46
+ | `endfiller` | Boolean | End-of-utterance filler detection |
47
+ | `synthetic` | Boolean | Synthetic data flag |
48
+ | `dataset` | Text | Source dataset identifier (12 classes) |
49
+ | `spoken_text` | Null | Skipped column |
50
+
51
+ ### Language Coverage
52
+
53
+ English, Chinese (zho), Vietnamese, Finnish, Spanish, Bengali, Hindi, Japanese, Portuguese, Russian, Korean, German, French, Dutch, Arabic, Marathi, Turkish, Icelandic, Polish, Ukrainian, Italian, Indonesian, Norwegian
54
+
55
+ ### Dataset Sources
56
+
57
+ - chirp3_1, chirp3_2
58
+ - liva_1
59
+ - midcentury_1
60
+ - mundo_1
61
+ - rime_2
62
+ - orpheus_endfiller_1
63
+
64
+ ## Metadata (Croissant Format)
65
+
66
+ The dataset uses **Croissant ML Commons** metadata schema (v1.1):
67
+
68
+ ```json
69
+ {
70
+ "@context": "https://schema.org/",
71
+ "conformsTo": "http://mlcommons.org/croissant/1.1",
72
+ "@type": "sc:Dataset",
73
+ "name": "smart-turn-data-v3.2-train"
74
+ }
75
+ ```
76
+
77
+ ## Contributors
78
+
79
+ ### Direct Contributors
80
+ - The Pipecat team
81
+ - [Liva AI](https://www.theliva.ai/)
82
+ - [Midcentury](https://www.midcentury.xyz/)
83
+ - [MundoAI](https://mundoai.world/)
84
+
85
+ ### Background Noise Attribution
86
+ CC-0 licensed background noise samples sourced from Freesound.org contributors including:
87
+ - 4team, tomhannen, craigsmith, mrmayo, martats, and 26+ additional contributors
88
+
89
+ ## Access Methods
90
+
91
+ - **Dataset Viewer:** https://huggingface.co/datasets/pipecat-ai/smart-turn-data-v3.2-train/viewer/
92
+ - **Data Studio:** https://huggingface.co/datasets/pipecat-ai/smart-turn-data-v3.2-train/viewer/default/train
93
+ - **Files Browser:** Git repository with parquet conversion available
03-finetune-pipecat-pt/references/guides/pipecat_smart_turn_readme.md ADDED
@@ -0,0 +1,115 @@
1
+ ---
2
+ title: "Pipecat Smart Turn v3.2 - README"
3
+ source: https://raw.githubusercontent.com/pipecat-ai/smart-turn/main/README.md
4
+ date: 2026-03-16
5
+ ---
6
+
7
+ # Smart Turn v3.2
8
+
9
+ An open-source, community-driven native audio turn detection model designed to improve upon traditional voice activity detection (VAD) approaches. The model determines when a voice agent should respond to human speech by analyzing linguistic and acoustic cues rather than simply detecting speech presence.
10
+
11
+ ## Key Features
12
+
13
+ **Language Support:** The system supports 23 languages including Arabic, Bengali, Chinese, Danish, Dutch, German, English, Finnish, French, Hindi, Indonesian, Italian, Japanese, Korean, Marathi, Norwegian, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.
14
+
15
+ **Performance Characteristics:**
16
+ - Inference completes in as little as 10ms on certain CPUs
17
+ - Most cloud instances see sub-100ms execution times
18
+ - Integrates efficiently with lightweight VAD models like Silero
19
+ - Two versions available: 8MB quantized CPU variant and 32MB unquantized GPU variant
20
+ - GPU version uses fp32 weights for marginally faster inference and ~1% accuracy improvement
21
+ - CPU version uses int8 quantization for reduced size and speed with minimal accuracy trade-off
22
+
23
+ **Technical Approach:** The model operates directly on PCM audio samples rather than text transcriptions, capturing prosodic nuances that inform turn-taking decisions.
24
+
25
+ ## Setup Instructions
26
+
27
+ **Environment Configuration:**
28
+ ```bash
29
+ python3.12 -m venv venv
30
+ source venv/bin/activate
31
+ pip install -r requirements.txt
32
+ ```
33
+
34
+ **Platform-Specific Dependencies:**
35
+
36
+ Ubuntu/Debian systems require:
37
+ ```bash
38
+ sudo apt-get update
39
+ sudo apt-get install portaudio19-dev python3-dev
40
+ ```
41
+
42
+ macOS with Homebrew:
43
+ ```bash
44
+ brew install portaudio
45
+ ```
46
+
47
+ **Running the Demonstration:**
48
+ ```bash
49
+ python record_and_predict.py
50
+ ```
51
+
52
+ Initial startup requires approximately 30 seconds. Test phrases include "I can't seem to, um ..." and "I can't seem to, um, find the return label."
53
+
54
+ ## Audio Input Specifications
55
+
56
+ The model accepts 16kHz mono PCM audio with maximum duration of 8 seconds. The recommended approach involves providing the complete audio of the current user turn. When audio exceeds 8 seconds, truncate from the beginning while maintaining context.
57
+
58
+ For shorter audio, prepend zero-value padding to reach the required length, ensuring actual speech content occupies the end of the input vector.
59
+
60
+ ## Model Architecture
61
+
62
+ Smart Turn v3 employs Whisper Tiny as its foundation with an added linear classification layer. The transformer-based architecture contains approximately 8 million parameters. Development involved experimentation with wav2vec2-BERT, wav2vec2, LSTM implementations, and additional transformer classifier configurations.
63
+
64
+ ## Integration Methods
65
+
66
+ **Pipecat Integration:** The framework supports local inference via `LocalSmartTurnAnalyzerV3` (version 0.0.85 and later). On Pipecat Cloud's standard 1x instance, inference typically completes in around 65ms.
67
+
68
+ **Direct Integration:** Import `model.py` and `inference.py` from the Smart Turn repository and invoke the `predict_endpoint()` function. Reference implementation available in `predict.py`.
69
+
70
+ ## Training Infrastructure
71
+
72
+ Training code resides in `train.py` and downloads datasets from the pipecat-ai HuggingFace repository. Training can execute locally or via Modal using `train_modal.py`. Training sessions log to Weights & Biases unless disabled.
73
+
74
+ **Training command for Modal:**
75
+ ```bash
76
+ modal run --detach train_modal.py
77
+ ```
78
+
79
+ **Current Datasets:**
80
+ - pipecat-ai/smart-turn-data-v3.2-train
81
+ - pipecat-ai/smart-turn-data-v3.2-test
82
+
83
+ ## Community Contributions
84
+
85
+ **Data Classification:** Manual training data categorization assistance needed at https://smart-turn-dataset.pipecat.ai/
86
+
87
+ **Human Data Contribution:** Participants can contribute through turn training games at https://turn-training.pipecat.ai/ or by following the data generation contribution guide.
88
+
89
+ **Future Development Areas:**
90
+ - Additional language support expansion
91
+ - Performance optimization and architecture refinement
92
+ - Expanded human dataset collection
93
+ - Text-conditioned inference for specialized input modes
94
+ - Training platform diversification
95
+
96
+ ## Licensing and Attribution
97
+
98
+ Smart Turn operates under the BSD 2-clause license, permitting unrestricted usage, modification, and contribution.
99
+
100
+ **Project Contributors:**
101
+ - Marcus (marcus-daily)
102
+ - Eli (ebb351)
103
+ - Mark (markbackman)
104
+ - Kwindla (kwindla)
105
+
106
+ **Data Contributors:**
107
+ - Liva AI
108
+ - Midcentury
109
+ - MundoAI
110
+
111
+ ## Resources
112
+
113
+ - [HuggingFace Model Repository](https://huggingface.co/pipecat-ai/smart-turn-v3)
114
+ - [Pipecat Documentation](https://docs.pipecat.ai/server/utilities/smart-turn/smart-turn-overview)
115
+ - [Pipecat Framework](https://pipecat.ai)
03-finetune-pipecat-pt/references/guides/pipecat_train_py.md ADDED
@@ -0,0 +1,179 @@
1
+ ---
2
+ title: "Pipecat Smart Turn - Training Code (train.py)"
3
+ source: https://raw.githubusercontent.com/pipecat-ai/smart-turn/main/train.py
4
+ date: 2026-03-16
5
+ ---
6
+
7
+ # SmartTurn V3 Speech Endpointing Model — Training Code Reference
8
+
9
+ > This document describes the training pipeline implemented in `train.py` from the
10
+ > [pipecat-ai/smart-turn](https://github.com/pipecat-ai/smart-turn) repository.
11
+ > The original file is Python source code; key architecture and configuration details
12
+ > are extracted below.
13
+
14
+ ## Core Architecture
15
+
16
+ ### SmartTurnV3Model
17
+
18
+ The model extends `WhisperPreTrainedModel` and uses Whisper's encoder as its backbone:
19
+
20
+ - **Input**: Log-mel spectrogram features with shape `(batch_size, 80, 800)`
21
+ - **Encoder**: Modified Whisper encoder with `max_source_positions=400`
22
+ - **Attention Pooling**: Neural network layer that learns weighted attention across time steps
23
+ - **Binary Classifier**: Multi-layer sequential network outputting probability scores
24
+
25
+ ```
26
+ Input Features (batch, 80, 800)
27
+ |
28
+ Whisper Encoder
29
+ |
30
+ Attention Pooling [Linear -> Tanh -> Linear -> Softmax]
31
+ |
32
+ Weighted Pooling (reduces to batch_size, hidden_size)
33
+ |
34
+ Classifier [Linear -> LayerNorm -> GELU -> Dropout -> Linear -> GELU -> Linear]
35
+ |
36
+ Sigmoid Output (probability of completion)
37
+ ```
38
+
39
+ ### Loss Function
40
+
41
+ Uses `BCEWithLogitsLoss` with dynamic positive sample weighting, clamped between 0.1 and 10.0 to handle class imbalance.
42
+
43
+ ## Training Configuration
44
+
45
+ ```
46
+ Base Model: openai/whisper-tiny
47
+ Learning Rate: 5e-5
48
+ Epochs: 4
49
+ Train Batch Size: 384
50
+ Eval Batch Size: 128
51
+ Warmup Ratio: 0.2
52
+ Weight Decay: 0.01
53
+ Eval/Save Steps: 500
54
+ Logging Steps: 100
55
+ LR Scheduler: Cosine
56
+ Dataloader Workers: 6
57
+ ```
58
+
59
+ ## Datasets
60
+
61
+ **Training Sources**:
62
+ - `pipecat-ai/smart-turn-data-v3.2-train` (split 90/10 for train/eval)
63
+
64
+ **Test Sources**:
65
+ - `pipecat-ai/smart-turn-data-v3.2-test`
66
+
67
+ The system supports stratified analysis by language, midfiller presence, and endfiller presence.
68
+
69
+ ## Data Pipeline
70
+
71
+ ### OnDemandSmartTurnDataset
72
+
73
+ On-demand feature extraction pipeline:
74
+ - Audio truncation to last 8 seconds
75
+ - 16kHz sampling rate
76
+ - Padding to max length: `8 * 16000 = 128,000 samples`
77
+ - Normalization enabled
78
+
79
+ ### SmartTurnDataCollator
80
+
81
+ Batches samples while preserving metadata (language, midfiller, endfiller flags) for downstream analysis.
82
+
83
+ ## Export and Quantization
84
+
85
+ ### ONNX FP32 Export
86
+
87
+ The wrapper ensures consistent output shapes across variable batch sizes:
88
+ - Dynamic batch dimension
89
+ - Fixed output shape: `(batch_size, 1)` for compatibility
90
+ - ONNX opset version 18
91
+ - Validation with batch sizes 1 and 2
92
+
93
+ ### INT8 Static Quantization
94
+
95
+ **Configuration**:
96
+ - Calibration dataset size: 1024 samples
97
+ - Quantization format: QDQ (Quantize-Dequantize)
98
+ - Activation type: QUInt8
99
+ - Weight type: QInt8
100
+ - Per-channel quantization enabled
101
+ - Calibration method: Entropy
102
+ - Quantized operations: Conv, MatMul, Gemm
103
+
104
+ Process:
105
+ 1. `quant_pre_process` for optimization and shape inference
106
+ 2. Entropy-based calibration on training data subset
107
+ 3. Static quantization with calibration reader
108
+
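+ The sketch below expresses this configuration with ONNX Runtime's quantization API; the file paths, the calibration reader, and the `input_features` input name are illustrative assumptions.
+
+ ```python
+ import numpy as np
+ from onnxruntime.quantization import (CalibrationDataReader, CalibrationMethod,
+                                       QuantFormat, QuantType, quantize_static)
+ from onnxruntime.quantization.shape_inference import quant_pre_process
+
+ class FeatureReader(CalibrationDataReader):
+     """Feeds calibration batches ({input_name: array}) to the quantizer."""
+     def __init__(self, batches):
+         self._it = iter(batches)
+     def get_next(self):
+         return next(self._it, None)
+
+ # Illustrative calibration data; real runs use ~1024 samples from the train split.
+ calib = [{"input_features": np.random.randn(1, 80, 800).astype(np.float32)}
+          for _ in range(8)]
+
+ quant_pre_process("model_fp32.onnx", "model_pre.onnx")   # optimization + shape inference
+ quantize_static(
+     "model_pre.onnx",
+     "model_int8.onnx",
+     calibration_data_reader=FeatureReader(calib),
+     quant_format=QuantFormat.QDQ,
+     activation_type=QuantType.QUInt8,
+     weight_type=QuantType.QInt8,
+     per_channel=True,
+     calibrate_method=CalibrationMethod.Entropy,
+     op_types_to_quantize=["Conv", "MatMul", "Gemm"],
+ )
+ ```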
109
+ ## Evaluation Metrics
110
+
111
+ Per-sample metrics computed during training and evaluation:
112
+
113
+ - Accuracy
114
+ - Precision (with zero_division warning handling)
115
+ - Recall
116
+ - F1 Score
117
+ - Confusion matrix components (TP, FP, TN, FN)
118
+ - Predicted positive/negative counts
119
+
120
+ ### External Evaluation Callback
121
+
122
+ During training checkpoints, the system evaluates on all test splits and logs:
123
+ - Dataset-specific accuracies
124
+ - Language-stratified metrics
125
+ - Midfiller-stratified metrics
126
+ - Probability distributions (Weights & Biases histograms)
127
+ - Min/max/mean accuracy across categories
128
+ - Sample count distributions
129
+
130
+ ## Key Utility Functions
131
+
132
+ - **`truncate_audio_to_last_n_seconds`**: Crops audio to preserve final n seconds (prevents padding inflation)
133
+ - **`process_predictions`**: Converts logits to probabilities and binary predictions using 0.5 threshold, with NaN validation
134
+ - **`compute_metrics`**: Scikit-learn based metric computation with detailed confusion matrix breakdown
135
+
136
+ ## Workflows
137
+
138
+ ### Training Run (`do_training_run`)
139
+
140
+ 1. Initialize Weights & Biases project "speech-endpointing"
141
+ 2. Load pretrained Whisper-tiny with custom head
142
+ 3. Prepare datasets (load, split, wrap)
143
+ 4. Train with specified hyperparameters
144
+ 5. Save final model and feature extractor
145
+ 6. Export to ONNX FP32
146
+ 7. Return export path
147
+
148
+ ### Quantization Run (`do_quantization_run`)
149
+
150
+ 1. Load FP32 ONNX model
151
+ 2. Create calibration dataset from training split
152
+ 3. Apply pre-processing (optimization)
153
+ 4. Execute static INT8 quantization
154
+ 5. Return quantized model path
155
+
156
+ ### Benchmark Run (`do_benchmark_run`)
157
+
158
+ 1. Load feature extractor
159
+ 2. Prepare test merged dataset
160
+ 3. For each model path:
161
+ - Create benchmark output directory
162
+ - Run benchmark with batch size 256
163
+ - Generate markdown report
164
+
165
+ ## Dependencies
166
+
167
+ - PyTorch with ONNX export capabilities
168
+ - Transformers (HuggingFace) for Whisper and training
169
+ - ONNX Runtime with quantization support
170
+ - Scikit-learn for metrics
171
+ - Weights & Biases for experiment tracking
172
+ - torchcodec (runtime requirement for audio decoding)
173
+
174
+ ## Error Handling
175
+
176
+ - Fast-fail check for missing `torchcodec` dependency
177
+ - ONNX model validation after export
178
+ - Non-finite value detection in predictions
179
+ - Exception logging for failed exports
03-finetune-pipecat-pt/references/guides/vogent_turn_80m.md ADDED
@@ -0,0 +1,171 @@
1
+ ---
2
+ title: "Vogent-Turn-80M Model Card"
3
+ source: https://huggingface.co/vogent/Vogent-Turn-80M
4
+ date: 2026-03-16
5
+ ---
6
+
7
+ # Vogent-Turn-80M
8
+
9
+ A state-of-the-art multimodal turn detection model for voice AI systems, achieving **94.1% accuracy** by combining acoustic and linguistic signals for real-time conversational applications.
10
+
11
+ ## Overview
12
+
13
+ | Attribute | Value |
14
+ |-----------|-------|
15
+ | **Developed by** | Vogent AI |
16
+ | **Model type** | Multimodal Turn Detection (Binary Classification) |
17
+ | **Language(s)** | English |
18
+ | **License** | Modified Apache-2.0 (horizontal voice agent platforms cannot set as default) |
19
+ | **Base model** | SmolLM2-135M (reduced to 80M parameters using first 12 layers) |
20
+ | **Model size** | 79.2M parameters (F32) |
21
+ | **Inference speed** | ~7ms on T4 GPU |
22
+
23
+ ## Model Description
24
+
25
+ Vogent-Turn-80M determines when a speaker has finished their turn in a conversation by processing both:
26
+ - **Acoustic features** (via Whisper encoder)
27
+ - **Semantic context** (via language model)
28
+
29
+ This multimodal approach enables real-time, accurate predictions where audio-only or text-only methods fail.
30
+
31
+ ### Resources
32
+
33
+ - **GitHub Repository:** https://github.com/vogent/vogent-turn
34
+ - **Technical Report:** https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents
35
+ - **Demo Space:** https://huggingface.co/spaces/vogent/vogent-turn-demo
36
+
37
+ ## Quick Install
38
+
39
+ ```bash
40
+ git clone https://github.com/vogent/vogent-turn.git
41
+ cd vogent-turn
42
+ pip install -e .
43
+ ```
44
+
45
+ ## Basic Usage
46
+
47
+ ```python
48
+ from vogent_turn import TurnDetector
49
+ import soundfile as sf
50
+ import urllib.request
51
+
52
+ # Initialize detector
53
+ detector = TurnDetector(compile_model=True, warmup=True)
54
+
55
+ # Download and load audio
56
+ audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
57
+ urllib.request.urlretrieve(audio_url, "sample.wav")
58
+ audio, sr = sf.read("sample.wav")
59
+
60
+ # Run turn detection with conversational context
61
+ result = detector.predict(
62
+ audio,
63
+ prev_line="What is your phone number",
64
+ curr_line="My number is 804",
65
+ sample_rate=sr,
66
+ return_probs=True,
67
+ )
68
+
69
+ print(f"Turn complete: {result['is_endpoint']}")
70
+ print(f"Done speaking probability: {result['prob_endpoint']:.1%}")
71
+ ```
72
+
73
+ ## Available Interfaces
74
+
75
+ - **Python Library:** Direct integration with `TurnDetector` class
76
+ - **CLI Tool:** `vogent-turn-predict speech.wav --prev "What is your phone number" --curr "My number is 804"`
77
+
78
+ ## Technical Architecture
79
+
80
+ ### Model Architecture
81
+
82
+ **Components:**
83
+ 1. **Audio Encoder:** Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
84
+ 2. **Text Model:** SmolLM-135M (12 layers, ~80M parameters)
85
+ 3. **Multimodal Fusion:** Audio embeddings projected into LLM's input space
86
+ 4. **Classifier:** Binary classification head (turn complete/incomplete)
87
+
88
+ ### Processing Flow
89
+
90
+ ```
91
+ 1. Audio (16kHz PCM) -> Whisper Encoder -> Audio Embeddings (~400 tokens)
92
+ 2. Text Context -> SmolLM Tokenizer -> Text Embeddings
93
+ 3. Concatenate embeddings -> SmolLM Transformer -> Last token hidden state
94
+ 4. Linear Classifier -> Softmax -> [P(continue), P(endpoint)]
95
+ ```
96
+
97
+ ## Training Details
98
+
99
+ ### Preprocessing
100
+
101
+ - **Audio:** Last 8 seconds extracted via Whisper-Tiny encoder -> ~400 audio tokens
102
+ - **Text:** Full conversational context (assistant and user utterances)
103
+ - **Labels:** Binary classification (turn complete/incomplete)
104
+ - **Fusion:** Audio embeddings projected into LLM's input space and concatenated with text
105
+
106
+ ### Training Hyperparameters
107
+
108
+ - **Training regime:** fp16 mixed precision
109
+ - **Base model:** SmolLM2-135M (first 12 layers)
110
+ - **Architecture:** Reduced from 135M to ~80M parameters via layer ablation
111
+
112
+ ### Training Data
113
+
114
+ Diverse dataset combining:
115
+ - Human-collected conversational data
116
+ - Synthetic conversational data
117
+
118
+ ## Evaluation Results
119
+
120
+ ### Performance Metrics
121
+
122
+ - **Accuracy:** 94.1%
123
+ - **AUPRC:** 0.975
124
+
125
+ ### Testing Data
126
+
127
+ Internal test set covering diverse conversational scenarios and edge cases where audio-only or text-only approaches fail.
128
+
129
+ ## Compute Infrastructure
130
+
131
+ ### Hardware Optimization
132
+
133
+ - `torch.compile` with max-autotune mode
134
+ - Dynamic tensor shapes without recompilation
135
+ - Pre-warmed bucket sizes (64, 128, 256, 512, 1024)
136
+
137
+ ### Software
138
+
139
+ - **Framework:** PyTorch with torch.compile
140
+ - **Audio processing:** Whisper encoder (up to 8 seconds)
141
+
142
+ ## Limitations
143
+
144
+ - **English-only** support; turn-taking conventions vary across languages and cultures
145
+ - **CPU inference** may be too slow for some real-time applications
146
+
147
+ ## Citation
148
+
149
+ ```bibtex
150
+ @misc{voturn2025,
151
+ title={Vogent-Turn-80M: State-of-the-Art Turn Detection for Voice Agents},
152
+ author={Varadarajan, Vignesh and Vytheeswaran, Jagath},
153
+ year={2025},
154
+ publisher={Vogent AI},
155
+ howpublished={\url{https://huggingface.co/vogent/Vogent-Turn-80M}},
156
+ note={Blog: \url{https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents}}
157
+ }
158
+ ```
159
+
160
+ ## Additional Resources
161
+
162
+ - **Full Documentation & Code:** https://github.com/vogent/vogent-turn
163
+ - **Platform:** https://vogent.ai
164
+ - **Enterprise Contact:** j@vogent.ai
165
+ - **Issues:** https://github.com/vogent/vogent-turn/issues
166
+
167
+ ## Upcoming Releases
168
+
169
+ - Int8 quantized model for faster CPU deployment
170
+ - Multilingual versions
171
+ - Domain-specific adaptations
03-finetune-pipecat-pt/references/hesitation_turn_taking_l2_review.md ADDED
@@ -0,0 +1,453 @@
1
+ # Pausas de Hesitacao, Deteccao de Fim de Turno e Fala L2: Uma Revisao para Fine-Tuning em Portugues
2
+
3
+ **Contexto**: BabelCast — traducao simultanea em reunioes. O modelo de deteccao de fim de turno precisa distinguir pausas de hesitacao (falante ainda vai continuar) de fim de turno real (pode comecar a traduzir). O desafio e especialmente critico para falantes de frances aprendendo portugues, que produzem pausas longas de hesitacao frequentemente confundidas com fim de turno.
4
+
5
+ **Autores**: Marcos Remar, com assistencia de Claude (Anthropic)
6
+ **Data**: Março 2026
7
+
8
+ ---
9
+
10
+ ## 1. O Problema: Silencio Nao Significa Fim de Turno
11
+
12
+ Sistemas comerciais de IA conversacional usam thresholds de silencio entre 700ms e 1000ms para detectar fim de turno (Castillo-Lopez, de Chalendar & Semmar, 2025). Esse approach e fundamentalmente inadequado por dois motivos:
13
+
14
+ 1. **Humanos sao mais rapidos**: o gap medio entre turnos em conversacao natural e de apenas ~200ms (Levinson & Torreira, 2015, citado em Skantze, 2021). Esperar 700ms+ resulta em respostas percebidas como lentas.
15
+
16
+ 2. **Pausas dentro de turnos sao frequentemente mais longas que gaps entre turnos** (Skantze, 2021). Um falante pode pausar 2-3 segundos no meio de uma frase (buscando uma palavra ou planejando o proximo trecho) e continuar normalmente. Tratar essa pausa como fim de turno causa interrupcao.
17
+
18
+ O problema se agrava dramaticamente quando o falante esta usando uma segunda lingua (L2), como demonstrado na literatura a seguir.
19
+
20
+ ---
21
+
22
+ ## 2. Pausas de Hesitacao em Falantes L2
23
+
24
+ ### 2.1 Duracao: L2 pausa 39% mais que nativos
25
+
26
+ Kosmala & Crible (2022) analisaram pausas preenchidas (filled pauses, FPs) em falantes nativos e nao-nativos de frances, usando o corpus SITAF:
27
+
28
+ | Metrica | Nativos | Nao-nativos | Diferenca |
29
+ |---------|---------|-------------|-----------|
30
+ | Duracao media FP | 378ms (SD=200) | **524ms** (SD=222) | +146ms (+39%) |
31
+ | Taxa FP/100 palavras | 4.4 | 5.3 | +0.9 (n.s.) |
32
+ | Clustering com outros fluencemas | 72% | **82%** | +10 p.p. |
33
+
34
+ A diferenca de **frequencia** nao e significativa — nao-nativos nao produzem mais pausas, mas pausas **significativamente mais longas** (p < .001). A duracao e um indicador mais confiavel de proficiencia que a frequencia.
35
+
36
+ Em situacoes de reparo (self-repair), as pausas sao ainda mais extremas. Kosmala (2025) analisou 167 reparos de alunos franceses falando ingles e encontrou:
37
+
38
+ - Pausa silenciosa media na fase de edicao: **844ms** (SD=573ms)
39
+ - Pausa preenchida media na fase de edicao: **522ms** (SD=571ms)
40
+ - 82% dos auto-reparos contem uma fase de edicao (pausa e/ou filler)
41
+
42
+ Esses 844ms estao **muito acima** de qualquer threshold de silencio usado em sistemas comerciais (700ms).
43
+
44
+ ### 2.2 Tipos de pausa: silenciosas dominam
45
+
46
+ Cenoz (2000) estudou 15 falantes nativos de espanhol falando ingles como L2, analisando 1.085 pausas nao-juntivas:
47
+
48
+ | Tipo | Proporcao | Faixa de duracao |
49
+ |------|-----------|------------------|
50
+ | Silenciosas | **64%** | 205ms a 11.569ms |
51
+ | Preenchidas | 36% | — |
52
+
53
+ Distribuicao por duracao das pausas silenciosas:
54
+ - 70% entre 200ms e 1.000ms
55
+ - 21% entre 1.001ms e 2.000ms
56
+ - 7% entre 2.001ms e 4.000ms
57
+ - 2% acima de 4.000ms
58
+
59
+ **Implicacao critica**: a maioria das hesitacoes sao **silencio puro** — nao contem fillers como "euh" ou "hum" que o modelo poderia usar como pista. Isso torna a deteccao mais dificil, pois o modelo precisa distinguir silencio de hesitacao de silencio de fim de turno usando apenas prosodia.
60
+
61
+ ### 2.3 Funcao das pausas: planejamento vs busca lexical
62
+
63
+ Cenoz (2000) classificou as funcoes das pausas:
64
+
65
+ | Funcao | Pausas silenciosas | Pausas preenchidas |
66
+ |--------|-------------------|-------------------|
67
+ | Planejamento (frases, sintaxe) | 59% | **73%** |
68
+ | Busca lexical (palavras) | 36% | 26% |
69
+ | Outros | 5% | 1% |
70
+
71
+ Pausas preenchidas sao usadas primariamente para **manter o turno** (floor-holding): 77% ocorrem isoladas, sem outros marcadores de hesitacao. Ja pausas silenciosas co-ocorrem com outras estrategias de reparo 54% das vezes.
72
+
73
+ ### 2.4 Paradoxo de proficiencia
74
+
75
+ Um achado contraintuitivo: falantes **mais proficientes** produzem **mais** pausas preenchidas, nao menos (Cenoz, 2000). Na faixa de maior proficiencia, a proporcao muda para 53% silenciosas + 46% preenchidas (vs. 69%/31% nos menos proficientes). Isso ocorre porque falantes avancados aprenderam a usar fillers como estrategia de floor-holding — sinalizam ao interlocutor que ainda estao falando.
76
+
77
+ A variacao individual e enorme: a proporcao de pausas preenchidas varia de **4% a 74.5%** entre falantes individuais.
78
+
79
+ ---
80
+
81
+ ## 3. Fillers Franceses na Fala L2
82
+
83
+ ### 3.1 Transferencia linguistica
84
+
85
+ Falantes franceses transferem seus fillers nativos para a L2. Lo (2018) analisou 15 bilingues alemao-frances e mostrou que as propriedades acusticas dos fillers sao distintas por lingua:
86
+
87
+ - **Frances "euh"**: duracao mais longa, F1-F3 mais altos (vogal central schwa)
88
+ - **Alemao "ah"**: duracao mais curta, F1-F3 mais baixos (vogal aberta)
89
+
90
+ Bilingues mantem formas foneticamente distintas para cada lingua, indicando que **a forma do filler revela qual lingua esta sendo processada**. Isso e confirmado por Kosmala (2025): alunos franceses falando ingles usam "euh" (24 instancias) e "eum" (12 instancias), os fillers franceses, nao os ingleses.
91
+
92
+ ### 3.2 Fillers sao itens lexicais, nao sons universais
93
+
94
+ Bottcher & Zellers (2024) analisaram o corpus RUEG (736 falantes, 5 linguas, 4.468 narracoes) e demonstraram que fillers sao **itens lexicais especificos de cada lingua**, nao sons universais de hesitacao. A proporcao nasal ("uhm") vs. nao-nasal ("uh") varia por lingua, genero e idade.
95
+
96
+ Falantes heritage (bilingues desde a infancia) produzem **mais fillers no total** devido a carga cognitiva da ativacao de duas linguas. A **tolerancia ao silencio** tambem se transfere da L1: falantes de japones mantem maior tolerancia ao silencio mesmo falando ingles.
97
+
98
+ ### 3.3 Code-switching: 93% nas fronteiras de unidade
99
+
100
+ Beatty-Martinez, Navarro-Torres & Dussias (2020) analisaram 10 bilingues espanhol-ingles e encontraram que **93% das trocas de codigo ocorrem em fronteiras de unidades entonacionais** — os mesmos pontos onde transicoes de turno acontecem. Antes de uma troca, ha **reorganizacao prosodica** (mudanca na velocidade de fala), que pode ser confundida com marcadores prosodicos de fim de turno.
101
+
102
+ **Implicacao para o modelo**: quando um frances falando portugues alterna para uma palavra em frances ("Eu preciso de... *comment dit-on*... uma tesoura"), a mudanca prosodica antes da troca NAO indica fim de turno.
103
+
104
+ ---
105
+
106
+ ## 4. Pistas Prosodicas de Fim de Turno
107
+
108
+ ### 4.1 O contorno de F0 e o sinal-chave
109
+
110
+ Ekstedt & Skantze (2022) conduziram experimentos de perturbacao prosodica com o modelo VAP para isolar a importancia relativa de cada pista:
111
+
112
+ | Pista prosodica | Importancia |
113
+ |----------------|-------------|
114
+ | Informacao fonetica/espectral | Mais importante no geral |
115
+ | **Contorno de F0 (pitch)** | **Mais importante para desambiguar pontos sintaticamente completos** |
116
+ | Intensidade | Comparavel ao F0 no geral |
117
+ | Normalizacao de duracao | Menos importante individualmente |
118
+
119
+ O achado central: **F0 e o principal desambiguador em pontos onde a sintaxe poderia indicar completude**. Um F0 ascendente + silaba final mais longa = sinal de completude de turno. O modelo VAP consegue prever trocas de turno **antes** da completude do enunciado a partir da dinamica prosodica.
120
+
121
+ ### 4.2 Prosodia antes de pausas preenchidas
122
+
123
+ Wu, Didirkova & Simon (2023) analisaram o corpus LOCAS-F (2h38m, >1.000 FPs, kappa inter-anotador = 0.86) e encontraram padroes prosodicos especificos antes de fillers:
124
+
125
+ - Queda de F0 de **1-2 semitons** durante pausas preenchidas (17.90 vs. 19.01 ST)
126
+ - Reset melodico **negativo significativo** antes da FP (descontinuidade de pitch)
127
+ - Efeito de preparacao: queda de **4.66 ST** entre fala nao-preparada (18.89 ST) e preparada (14.24 ST)
128
+ - **Nao ha reset significativo** entre a FP e a silaba seguinte
129
+
130
+ **Implicacao**: a queda de pitch antes de um filler "euh" e diferente da queda de pitch de fim de frase. O modelo precisa aprender essa distincao sutil — o que o encoder do Whisper pode capturar atraves de representacoes espectrais.
131
+
132
+ ### 4.3 Resumo das pistas prosodicas de turno
133
+
134
+ Skantze (2021) compilou a literatura:
135
+
136
+ | Pista | Fim de turno | Pausa de hesitacao |
137
+ |-------|-------------|-------------------|
138
+ | Pitch (F0) | Queda (declarativa) ou queda-ascensao | Suspenso ou queda menor |
139
+ | Intensidade | Diminui gradualmente | Mantem-se ou cai abruptamente |
140
+ | Velocidade de fala | Desacelera antes do fim | Pode acelerar antes de parar |
141
+ | Alongamento silabico | Silaba final alongada | Ausente |
142
+ | Completude sintatica | Frase sintaticamente completa | Frase incompleta |
143
+
144
+ ---
145
+
146
+ ## 5. Estado da Arte em Deteccao de Fim de Turno
147
+
148
+ ### 5.1 Modelos e suas acuracias
149
+
150
+ | Modelo | Tipo | Params | Tamanho | Latencia | Acuracia | Linguas |
151
+ |--------|------|--------|---------|----------|----------|---------|
152
+ | **Pipecat Smart Turn v3.1** | Audio (Whisper Tiny) | 8M | 8 MB (INT8) | 12ms CPU | 94.7% (EN) | 23 |
153
+ | **LiveKit v0.4.1** | Texto (Qwen2.5-0.5B) | 500M | — | — | TNR 87.4% (PT) | 15+ |
154
+ | **Vogent-Turn-80M** | Multimodal | 79.2M | — | 7ms T4 | 94.1% (EN) | 1 |
155
+ | **Krisp TT v1** | Audio | 6.1M | 65 MB | — | 82% bal.acc. | 1 |
156
+ | **VAP** | Audio (CPC+Transformer) | — | — | 14.6ms CPU | 76.2% bal.acc. | 3 |
157
+ | **SpeculativeETD** | Audio (GRU+Wav2vec) | 1M+94M | — | 0.26ms+server | 66% real | 1 |
158
+
159
+ Nenhum destes modelos foi especificamente otimizado para portugues ou para fala L2.
160
+
161
+ ### 5.2 Pipecat Smart Turn: resultados por lingua
162
+
163
+ | Lingua | Acuracia | FP | FN |
164
+ |--------|----------|-----|-----|
165
+ | Turco | 97.10% | 1.66% | 1.24% |
166
+ | Coreano | 96.85% | 1.12% | 2.02% |
167
+ | Japones | 96.76% | 2.04% | 1.20% |
168
+ | Frances | 96.01% | 1.60% | 2.39% |
169
+ | **Portugues** | **95.42%** | **2.79%** | **1.79%** |
170
+ | Ingles | 94.31% | 2.64% | 3.06% |
171
+ | Espanhol | 91.97% | 4.48% | 3.55% |
172
+
173
+ Portugues esta atras de frances, japones, coreano e turco. A taxa de falsos positivos (2.79%) indica que o modelo diz "terminou" quando nao terminou em quase 3% dos casos — problematico para traducao simultanea.
174
+
175
+ ### 5.3 O gap sintetico → real: o maior risco
176
+
177
+ Ok, Yoo & Lee (2025) demonstraram um colapso devastador quando modelos treinados em dados sinteticos (TTS) sao avaliados em fala real:
178
+
179
+ | Modelo | F1 sintetico | F1 real | Perda |
180
+ |--------|-------------|---------|-------|
181
+ | Wav2vec 2.0 | 94.7% | **30.3%** | -64.4 p.p. |
182
+ | SpeculativeETD | 88.9 IoU | **16.4 IoU** | -72.5 p.p. |
183
+ | VAP | 87.7 IoU | **10.7 IoU** | -77.0 p.p. |
184
+
185
+ **Todos os modelos colapsam** ao sair de dados sinteticos para conversacao real. A melhoria de v3.0 para v3.1 do Pipecat veio justamente de adicionar **audio humano real** de tres parceiros especializados (Liva AI, Midcentury, MundoAI).
186
+
187
+ ---
188
+
189
+ ## 6. Tecnicas de Treinamento Validadas
190
+
191
+ ### 6.1 Focal Loss (Lin et al., 2017)
192
+
193
+ A Focal Loss resolve o desbalanceamento extremo de classes (a maioria dos frames e "fala em andamento"; fim de turno e raro):
194
+
195
+ ```
196
+ FL(p_t) = -α_t · (1 - p_t)^γ · log(p_t)
197
+ ```
198
+
199
+ Com gamma=2, a perda para exemplos bem classificados (p_t=0.9) e **100x menor** que cross-entropy padrao; com p_t=0.968, **1.000x menor**. Os melhores hiperparametros: **gamma=2, alpha=0.25** (Lin et al., 2017, Tabela 1a).
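+
+ Um esboco minimo em PyTorch da formula acima (ilustrativo; nao e necessariamente identico ao que `04_finetune.py` implementa):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
+     """Focal Loss binaria (Lin et al., 2017): reduz o peso dos exemplos faceis."""
+     p = torch.sigmoid(logits)
+     p_t = torch.where(targets == 1, p, 1 - p)          # prob. atribuida a classe correta
+     alpha_t = torch.where(targets == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
+     ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
+     # (1 - p_t)^gamma ~ 0 para exemplos bem classificados: o gradiente passa a ser dominado pelos casos dificeis
+     return (alpha_t * (1 - p_t).pow(gamma) * ce).mean()
+ ```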
200
+
201
+ ### 6.2 Knowledge Distillation (LiveKit)
202
+
203
+ O LiveKit treinou um modelo professor (Qwen2.5-7B) e destilou para um aluno (Qwen2.5-0.5B), que converge apos ~1.500 steps. O resultado: reducao de 39% nos falsos positivos de interrupcao. Para portugues especificamente, a melhoria relativa foi de **45.97%** — a segunda maior entre todas as linguas.
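+
+ Um esboco generico da perda de destilacao com soft labels (receita classica professor→aluno; a configuracao exata do LiveKit nao e publica, entao isto e apenas ilustrativo):
+
+ ```python
+ import torch.nn.functional as F
+
+ def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
+     """Combina cross-entropy nos rotulos com KL entre professor e aluno."""
+     hard = F.cross_entropy(student_logits, labels)
+     soft = F.kl_div(
+         F.log_softmax(student_logits / T, dim=-1),
+         F.softmax(teacher_logits / T, dim=-1),
+         reduction="batchmean",
+     ) * (T * T)  # reescala padrao para compensar a temperatura
+     return lam * hard + (1 - lam) * soft
+ ```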
204
+
205
+ ### 6.3 Dados curtos e ruidosos (Pipecat v3.2)
206
+
207
+ O Pipecat v3.2 adicionou dois tipos de dados ao treinamento:
208
+ 1. **Respostas curtas** ("yes", "no", "ok"): reduziu erros de classificacao em **40%**
209
+ 2. **Ruido de fundo** (cafe, escritorio, CC-0 Freesound): melhorou robustez (esboco de mistura por SNR logo abaixo)
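+
+ Um esboco de como misturar ruido de fundo em um SNR alvo (abordagem generica de augmentacao; o pipeline do Pipecat nao detalha o metodo exato, entao os nomes e valores sao ilustrativos):
+
+ ```python
+ import numpy as np
+
+ def mix_with_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
+     """Mistura fala com ruido de fundo em um SNR alvo (dB)."""
+     noise = np.resize(noise, speech.shape)            # repete/corta o ruido para o mesmo tamanho
+     p_speech = float(np.mean(speech ** 2)) + 1e-12
+     p_noise = float(np.mean(noise ** 2)) + 1e-12
+     gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
+     return speech + gain * noise
+ ```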
210
+
211
+ ### 6.4 Injecao de fillers e pausas (SpeculativeETD)
212
+
213
+ Ok et al. (2025) testaram tres variantes de dados sinteticos:
214
+ - V1: TTS direto (baseline)
215
+ - V2: Pausas de hesitacao estendidas para 1.5-3.0 segundos
216
+ - **V3: Fillers injetados ("um", "uh") em posicoes aleatorias + pausas apos**
217
+
218
+ V3 foi a melhor variante, validando a abordagem de gerar dados com fillers inseridos por LLM.
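+
+ Um esboco de como a variante V3 pode ser reproduzida na geracao de dados (o nome da funcao e o marcador `<pause:...>` sao hipoteticos, apenas para ilustrar a ideia):
+
+ ```python
+ import random
+
+ FILLERS = ["hum", "tipo", "ne", "euh", "alors"]  # fillers brasileiros + franceses (L1 transfer)
+
+ def inject_filler(text: str, rng=random.Random(0), pause_range=(1.5, 3.0)):
+     """Insere um filler em posicao aleatoria e uma pausa longa logo apos (estilo V3)."""
+     words = text.split()
+     pos = rng.randrange(1, max(2, len(words)))       # nunca antes da primeira palavra
+     pause_s = rng.uniform(*pause_range)
+     words.insert(pos, f"{rng.choice(FILLERS)} <pause:{pause_s:.1f}s>")
+     return " ".join(words)
+
+ # ex.: "eu preciso de tipo <pause:2.4s> uma tesoura"
+ ```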
219
+
220
+ ### 6.5 Transfer learning cross-lingual
221
+
222
+ Castillo-Lopez et al. (2025) documentam que VAP pre-treinado em ingles (Switchboard) e fine-tunado em japones **supera** o treinamento direto em japones. Isso valida a abordagem de partir do Pipecat (pre-treinado em 23 linguas) e fine-tunar para portugues.
223
+
224
+ ---
225
+
226
+ ## 7. Pesquisa: Modelos de Turn-Taking para Aprendizado de Idiomas (Marco 2026)
227
+
228
+ **Data da pesquisa**: 16 de marco de 2026
229
+ **Objetivo**: identificar modelos e tecnicas especificas para turn-taking em contexto de aprendizado de idiomas com avatar conversacional
230
+
231
+ ### 7.1 Panorama: nenhum modelo especifico existe
232
+
233
+ Nao existe nenhum modelo open-source de turn-taking treinado especificamente para aprendizes L2. Os produtos comerciais usam abordagens genericas:
234
+
235
+ | Produto | Abordagem | Limitacao |
236
+ |---------|-----------|-----------|
237
+ | **Praktika** (OpenAI) | GPT-5.2 Realtime API — turn detection nativo OpenAI | Caixa preta, sem adaptacao L2 |
238
+ | **ConversAR** (Meta Quest) | Timeout fixo 2000ms + "periodo infinito de pensamento" | Nao distingue hesitacao de fim de turno |
239
+ | **Gliglish** | ChatGPT + speech recognition generico | Sem turn detection especializado |
240
+ | **ELSA Speak** | Modelo proprietario treinado com sotaques variados | Foco em pronuncia, nao turn-taking |
241
+ | **TalkPal/SpeakPal/Talkio** | GPT + NLP generico | Sem modelo de audio para end-of-turn |
242
+ | **Hume EVI** | Prosodia (tom de voz) para turn detection | Comercial, sem foco em L2 |
243
+ | **Deepgram Flux** | Modelo fusionado (ASR + turn detection) com `eot_threshold` configuravel | So ingles, sem adaptacao para proficiencia |
244
+
245
+ ### 7.2 Modelos de pesquisa relevantes
246
+
247
+ - **VAP (Voice Activity Projection)** — multilingual (EN/ZH/JA), fine-tuning por lingua melhora significativamente, mas nao cobre L2 ou non-native speakers (Inoue et al., 2024)
248
+ - **Speak & Improve Corpus 2025** — 340 horas de fala L2 ingles com anotacao de disfluencias e CEFR scores (A2-C1, maioria B1-B2) (Knill et al., 2025)
249
+ - **Whisper + LoRA para hesitation tagging** — fine-tune do Whisper Large V3 com tags de hesitacao acusticamente precisas: **11.3% melhoria no WER** para fala L2 (arXiv:2506.04076)
250
+ - **Deepgram Flux** — modelo fusionado (ASR + turn detection em um unico modelo); ~260ms end-of-turn detection, ~30% menos interrupcoes vs pipeline tradicional (Deepgram, 2025)
251
+ - **Survey IWSDS 2025** — 72% dos trabalhos de turn-taking NAO comparam com metodos anteriores, sugerindo falta de benchmarks estabelecidos (Castillo-Lopez et al., 2025)
252
+
253
+ ### 7.3 Tecnicas identificadas para melhorar o fine-tuning
254
+
255
+ **1. Threshold duplo (Eager + Final) — inspirado no Deepgram Flux**
256
+
257
+ Em vez de um unico threshold, usar dois:
258
+ - **Eager threshold (0.3-0.5)**: o avatar comeca a preparar resposta especulativamente (inicia geracao LLM)
259
+ - **Final threshold (0.7+)**: confirma fim de turno e fala
260
+
261
+ Economia de latencia estimada: 150-250ms no inicio da resposta, sem interromper o falante.
262
+
263
+ Deepgram Flux implementa isso como `eot_threshold` (padrao 0.7) e `eager_eot_threshold` (0.3-0.5), com parametro `eot_timeout_ms` para configurar sensibilidade.
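+
+ Um esboco minimo dessa logica de dois limiares (nomes e valores hipoteticos, apenas para ilustrar o fluxo):
+
+ ```python
+ class DualThresholdGate:
+     """Dois limiares: especular cedo (eager), falar so com alta confianca (final)."""
+
+     def __init__(self, eager: float = 0.45, final: float = 0.70):
+         self.eager, self.final = eager, final
+         self.speculating = False
+
+     def step(self, p_eot: float) -> str:
+         if p_eot >= self.final:
+             self.speculating = False
+             return "SPEAK"        # confirma fim de turno: fala a resposta ja gerada
+         if p_eot >= self.eager and not self.speculating:
+             self.speculating = True
+             return "START_LLM"    # comeca a gerar a resposta especulativamente
+         if p_eot < self.eager and self.speculating:
+             self.speculating = False
+             return "CANCEL_LLM"   # o falante continuou: descarta a especulacao
+         return "WAIT"
+ ```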
264
+
265
+ **2. Custo assimetrico (FP >> FN)**
266
+
267
+ Para um tutor de idiomas, **interromper o aluno e muito pior que esperar demais**:
268
+ - ConversAR (2025): "periodo infinito de pensamento" — learners controlam quando falar
269
+ - Praktika: timeout estendido para fala L2 fragmentada e com sotaque
270
+ - Implementacao: `fp_penalty=2.0` no focal loss — erros de interrupcao custam 2x mais (esboco abaixo)
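+
+ Um esboco (hipotetico) de como esse custo assimetrico pode entrar na focal loss — falsos positivos, isto e, prever COMPLETO quando o turno continua, pesam `fp_penalty` vezes mais:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def asymmetric_focal_loss(logits, targets, alpha=0.25, gamma=2.0, fp_penalty=2.0):
+     """Focal loss com penalidade extra para interrupcoes (alvo=INCOMPLETO, predicao=COMPLETO)."""
+     p = torch.sigmoid(logits)
+     p_t = torch.where(targets == 1, p, 1 - p)
+     ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
+     loss = alpha * (1 - p_t).pow(gamma) * ce
+     fp_mask = (targets == 0) & (p > 0.5)              # o modelo quer interromper sem dever
+     return torch.where(fp_mask, fp_penalty * loss, loss).mean()
+ ```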
271
+
272
+ **3. Dados L2 reais — Speak & Improve Corpus**
273
+
274
+ O corpus Speak & Improve 2025 (Knill et al., 2025) tem:
275
+ - 340 horas de audio L2 ingles
276
+ - Anotacao de disfluencias (false starts, repeticoes, hesitacoes)
277
+ - Scores CEFR: B1 (18.3%), B1+ (25.3%), B2 (25.1%), B2+ (18.3%)
278
+ - Embora seja ingles L2, padroes de hesitacao L2 sao transferiveis entre linguas (Cenoz, 2000)
279
+
280
+ Fine-tuning do Whisper com anotacao precisa de hesitacoes (tags "um"/"uh" acusticamente alinhadas) mostrou 11.3% melhoria no WER vs ignorar hesitacoes.
281
+
282
+ **4. Proficiency-aware turn-taking (futuro)**
283
+
284
+ Nenhum sistema implementa isso ainda, mas e logico:
285
+ - Aluno A1: pausa **3-5x mais** que B2
286
+ - Aluno B2: usa fillers como estrategia de floor-holding (Cenoz, 2000)
287
+ - De Jong & Bosker: threshold otimo de 250ms para pausa L2 em holandes
288
+ - Shea & Leonard (2019): threshold de **1000ms** para espanhol L2
289
+ - Implementacao futura: input de nivel CEFR ao modelo, ou adaptar threshold baseado em perfil do falante
290
+
291
+ ### 7.4 Sistema de backchannel para avatar de aprendizado
292
+
293
+ Um problema critico identificado: se o avatar espera 3 segundos em silencio, o aprendiz pensa que o sistema travou e para de falar. A solucao e um sistema de **backchannel signals** que mostra ao aprendiz que o avatar esta ouvindo, sem tomar o turno.
294
+
295
+ | Tempo de silencio | Acao do avatar | Tipo |
296
+ |-------------------|----------------|------|
297
+ | 0-600ms | Nada (normal) | — |
298
+ | 600ms-1.5s | Aceno de cabeca, olhar atento | Backchannel visual |
299
+ | 1.5s-3.0s | "Mhm...", "Continue...", "Uhum..." | Backchannel verbal |
300
+ | 3.0s+ | "Sem pressa, pode pensar...", "Prenez votre temps..." | Encorajamento |
301
+
302
+ O sistema integra o threshold duplo do modelo com os backchannels:
303
+ - **Eager threshold** atingido → avatar da backchannel + inicia geracao LLM especulativa
304
+ - **Final threshold** atingido → avatar fala a resposta
305
+
306
+ Presets por nivel CEFR:
307
+
308
+ | CEFR | Eager | Final | BC Visual | BC Verbal | Encorajamento |
309
+ |------|-------|-------|-----------|-----------|---------------|
310
+ | A1 | 0.50 | **0.80** | 500ms | 1200ms | 2500ms |
311
+ | A2 | 0.45 | 0.75 | 550ms | 1300ms | 2800ms |
312
+ | B1 | 0.40 | 0.70 | 600ms | 1500ms | 3000ms |
313
+ | B2 | 0.35 | 0.65 | 600ms | 1800ms | 4000ms |
314
+ | C1 | 0.35 | **0.60** | 600ms | 2000ms | 5000ms |
315
+
316
+ O avatar e mais paciente com A1/A2 (threshold final 0.80 = raramente interrompe) e mais responsivo com B2/C1 (0.60-0.65 = conversa mais natural). Implementado em `06_inference.py`.
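+
+ Um esboco simplificado de como os presets e a escalada de backchannel se combinam (estrutura ilustrativa; a implementacao de referencia em `06_inference.py` pode diferir nos detalhes):
+
+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class CEFRPreset:
+     eager: float
+     final: float
+     bc_visual_ms: int
+     bc_verbal_ms: int
+     encourage_ms: int
+
+ PRESETS = {
+     "A1": CEFRPreset(0.50, 0.80, 500, 1200, 2500),
+     "A2": CEFRPreset(0.45, 0.75, 550, 1300, 2800),
+     "B1": CEFRPreset(0.40, 0.70, 600, 1500, 3000),
+     "B2": CEFRPreset(0.35, 0.65, 600, 1800, 4000),
+     "C1": CEFRPreset(0.35, 0.60, 600, 2000, 5000),
+ }
+
+ def backchannel_action(silence_ms: int, preset: CEFRPreset) -> str:
+     """Escolhe o backchannel pelo tempo de silencio, sem tomar o turno."""
+     if silence_ms >= preset.encourage_ms:
+         return "ENCOURAGE"   # "Sem pressa, pode pensar..."
+     if silence_ms >= preset.bc_verbal_ms:
+         return "VERBAL_BC"   # "Mhm...", "Continue..."
+     if silence_ms >= preset.bc_visual_ms:
+         return "VISUAL_BC"   # aceno de cabeca, olhar atento
+     return "NONE"
+ ```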
317
+
318
+ ### 7.5 Conclusao da pesquisa
319
+
320
+ O trabalho do BabelCast e confirmado como **pioneiro**: nao existe modelo de turn-taking para aprendizes L2, muito menos para francofonos falando portugues. As melhorias implementadas (custo assimetrico, threshold duplo, dados L2 reais, backchannel por CEFR) trazem o modelo mais proximo do comportamento ideal de um tutor paciente.
321
+
322
+ ---
323
+
324
+ ## 8. Implicacoes para o Fine-Tuning BabelCast
325
+
326
+ ### 8.1 O que o modelo precisa aprender
327
+
328
+ Com base na literatura, os seguintes padroes sao criticos para o modelo distinguir:
329
+
330
+ | Padrao | Duracao tipica | Label | Pista prosodica |
331
+ |--------|---------------|-------|----------------|
332
+ | Fim de turno real | silencio 200-500ms | COMPLETO | F0 caindo, intensidade caindo, silaba final alongada |
333
+ | Hesitacao nativa (PT-BR) | FP 378ms + silencio 200-1000ms | INCOMPLETO | F0 suspenso, "hum"/"tipo"/"ne" |
334
+ | Hesitacao L2 (francofono) | FP **524ms** + silencio **844ms** | INCOMPLETO | F0 suspenso, "euh"/"alors"/"comment dire" |
335
+ | Busca de palavra L2 | silencio 1000-3000ms | INCOMPLETO | Silencio puro, sem pista prosodica clara |
336
+ | Code-switching | prosodic reset + silencio 500-1500ms | INCOMPLETO | Mudanca de velocidade antes da troca |
337
+ | Resposta curta completa | "Sim." + silencio 200ms | COMPLETO | F0 caindo, curta duracao |
338
+ | Resposta curta incompleta | "Sim, mas..." + silencio | INCOMPLETO | F0 suspenso ou ascendente |
339
+
340
+ ### 8.2 Dados necessarios
341
+
342
+ 1. **Audio real de portugues** (CORAA MUPE 365h, NURC-SP 239h) — evitar o gap sintetico→real
343
+ 2. **TTS com fillers** brasileiros ("hum", "tipo", "ne") e franceses ("euh", "alors") inseridos por LLM
344
+ 3. **Pausas longas de hesitacao** (1.5-3.0s) injetadas em amostras incompletas
345
+ 4. **Respostas curtas** ("sim", "nao", "ok", "tá bom") com variantes completas e incompletas
346
+ 5. **Code-switching** frances-portugues em fronteiras de unidade
347
+ 6. **Ruido de fundo** real (cafe, escritorio) — CC-0 Freesound
348
+
349
+ ### 8.3 Escolhas arquiteturais validadas
350
+
351
+ - **Whisper Tiny encoder** (encoder de ~8M params do Whisper Tiny de 39M; janela de 8s, 384-dim): mesma arquitetura do Pipecat, captura prosodia sem decodificar texto
352
+ - **Attention pooling**: aprende quais frames perto do silencio sao mais informativos (Ekstedt & Skantze, 2022); veja o esboco apos esta lista
353
+ - **Focal Loss** (gamma=2, alpha=0.25): calibracao implicita sem label smoothing (EMNLP 2022)
354
+ - **INT8 quantizacao**: 32MB → 8MB, 12ms CPU (Pipecat deploy pipeline)
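+
+ Um esboco do attention pooling sobre os frames do encoder (dim 384 do Whisper Tiny; modulo ilustrativo, nao necessariamente identico ao de `04_finetune.py`):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class AttentionPooling(nn.Module):
+     """Agrega os frames do encoder (B, T, 384) em um unico vetor com pesos aprendidos."""
+
+     def __init__(self, dim: int = 384):
+         super().__init__()
+         self.score = nn.Linear(dim, 1)                      # um escore escalar por frame
+
+     def forward(self, frames: torch.Tensor) -> torch.Tensor:
+         weights = torch.softmax(self.score(frames), dim=1)  # (B, T, 1)
+         return (weights * frames).sum(dim=1)                # (B, 384)
+
+ # cabeca de classificacao: head = nn.Sequential(AttentionPooling(384), nn.Linear(384, 1))
+ ```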
355
+
356
+ ### 8.4 Lacuna na literatura
357
+
358
+ **Nenhum paper foi encontrado especificamente sobre deteccao de fim de turno em fala L2**, e especialmente nao sobre francofonos falando portugues. A intersecao de:
359
+ - Deteccao de fim de turno (ML/audio)
360
+ - Pausas de hesitacao em L2 (linguistica)
361
+ - Fala francofona em portugues (fonologia)
362
+
363
+ ...e uma area completamente inexplorada. O trabalho do BabelCast e, ate onde sabemos, o primeiro a atacar esse problema especifico.
364
+
365
+ ---
366
+
367
+ ## Referencias
368
+
369
+ ### Papers Academicos
370
+
371
+ 1. Beatty-Martinez, A. L., Navarro-Torres, C. A., & Dussias, P. E. (2020). Codeswitching: A bilingual toolkit for opportunistic speech planning. *Frontiers in Psychology*, 11, 1699.
372
+
373
+ 2. Bottcher, A. & Zellers, M. (2024). Do you say uh or uhm? A cross-linguistic approach to filler particle use in heritage and majority speakers across three languages. *Frontiers in Psychology*, 15, 1358182.
374
+
375
+ 3. Castillo-Lopez, V., de Chalendar, G., & Semmar, N. (2025). A survey of recent advances on turn-taking modeling in spoken dialogue systems. *arXiv:2503.xxxxx*.
376
+
377
+ 4. Cenoz, J. (2000). Pauses and hesitation phenomena in second language production. *ITL - International Journal of Applied Linguistics*, 127-128, 53-69.
378
+
379
+ 5. Christodoulides, G. & Avanzi, M. (2014). DisMo: A morphosyntactic, disfluency and multi-word unit annotator. *Proceedings of LREC 2014*.
380
+
381
+ 6. Christodoulides, G. & Avanzi, M. (2015). Automatic detection and annotation of disfluencies in spoken French corpora. *Proceedings of Interspeech 2015*.
382
+
383
+ 7. Ekstedt, E. & Skantze, G. (2022). How much does prosody help turn-taking? Investigations using voice activity projection models. *Proceedings of Interspeech 2022*.
384
+
385
+ 8. Inoue, K., Jiang, D., Ekstedt, E., Kawahara, T., & Skantze, G. (2024). Real-time and continuous turn-taking prediction using voice activity projection. *arXiv:2401.04868*.
386
+
387
+ 9. Inoue, K., Jiang, D., Ekstedt, E., Kawahara, T., & Skantze, G. (2024). Multilingual turn-taking prediction using voice activity projection. *Proceedings of LREC-COLING 2024*.
388
+
389
+ 10. Kosmala, L. & Crible, L. (2022). The dual status of filled pauses: Evidence from genre, proficiency and co-occurrence. *Language and Speech*, 65(4), 1-25.
390
+
391
+ 11. Kosmala, L. (2023). Exploring the status of filled pauses as pragmatic markers: The role of gaze and gesture. *Journal of Pragmatics*, 212, 1-15.
392
+
393
+ 12. Kosmala, L. (2025). Multimodal self- and other-initiated repairs in L2 peer interactions. *Proceedings of DiSS 2025*, Lisbon.
394
+
395
+ 13. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense object detection. *Proceedings of ICCV 2017*. arXiv:1708.02002.
396
+
397
+ 14. Lo, S. L. (2018). Between ah(m) and euh(m): Filled pauses in German-French bilinguals. *BAAP 2018 Poster Presentation*.
398
+
399
+ 15. Ok, J., Yoo, I. C., & Lee, Y. (2025). Speculative end-turn detector: Revisiting an efficient design for low-latency end-of-turn detection. *arXiv:2503.23439*.
400
+
401
+ 16. Peters, M. (2017). L2 fluency development in French. *Ph.D. Thesis*.
402
+
403
+ 17. Raux, A. & Eskenazi, M. (2009). A finite-state turn-taking model for spoken dialog systems. *Proceedings of NAACL-HLT 2009*.
404
+
405
+ 18. Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: A review. *Computer Speech & Language*, 67, 101178.
406
+
407
+ 19. Wu, X., Didirkova, I., & Simon, A. C. (2023). Disfluencies in continuous speech in French: Prosodic parameters of filled pauses and vowel lengthening. *Proceedings of ICPhS 2023*.
408
+
409
+ 20. Knill, K., et al. (2025). Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback. *arXiv:2412.11986*.
410
+
411
+ 21. Saeki, T., et al. (2025). Acoustically precise hesitation tagging is essential for end-to-end verbatim transcription systems. *arXiv:2506.04076*.
412
+
413
+ 22. Gamboa, H. & Wohlgenannt, G. (2025). ConversAR: Practicing a second language without fear — Mixed reality agents for interactive group conversation. *arXiv:2510.08227*.
414
+
415
+ 23. De Jong, N. & Bosker, H. R. (2013). Choosing a threshold for silent pauses to measure second language fluency. *The 6th Workshop on Disfluency in Spontaneous Speech (DiSS)*.
416
+
417
+ 24. Shea, C. & Leonard, K. (2019). Evaluating measures of pausing for second language fluency research. *Journal of Second Language Pronunciation*, 5(2), 254-277.
418
+
419
+ ### Modelos e Blogs Tecnicos
420
+
421
+ 25. Daily/Pipecat. (2025). Announcing Smart Turn v3, with CPU inference in just 12ms. https://www.daily.co/blog/announcing-smart-turn-v3-with-cpu-inference-in-just-12ms/
422
+
423
+ 26. Daily/Pipecat. (2025). Improved accuracy in Smart Turn v3.1. https://www.daily.co/blog/improved-accuracy-in-smart-turn-v3-1/
424
+
425
+ 27. Daily/Pipecat. (2026). Smart Turn v3.2: Handling noisy environments and short responses. https://www.daily.co/blog/smart-turn-v3-2-handling-noisy-environments-and-short-responses/
426
+
427
+ 28. LiveKit. (2025). Improved end-of-turn model cuts voice AI interruptions 39%. https://livekit.com/blog/improved-end-of-turn-model-cuts-voice-ai-interruptions-39/
428
+
429
+ 29. LiveKit. (2025). Using a transformer to improve end-of-turn detection. https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection
430
+
431
+ 30. Krisp. (2025). Audio-only 6M weights turn-taking model for voice AI agents. https://krisp.ai/blog/turn-taking-for-voice-ai/
432
+
433
+ 31. Deepgram. (2025). Evaluating end-of-turn detection models. https://deepgram.com/learn/evaluating-end-of-turn-detection-models
434
+
435
+ 32. Vogent. (2025). Vogent-Turn-80M model card. https://huggingface.co/vogent/Vogent-Turn-80M
436
+
437
+ 33. Deepgram. (2025). Introducing Flux: Conversational Speech Recognition. https://deepgram.com/learn/introducing-flux-conversational-speech-recognition
438
+
439
+ 34. Hume AI. (2025). Empathic Voice Interface (EVI) documentation. https://dev.hume.ai/docs/speech-to-speech-evi/overview
440
+
441
+ 35. OpenAI. (2025). Inside Praktika's conversational approach to language learning. https://openai.com/index/praktika/
442
+
443
+ 36. Tavus. (2025). The Complete Guide to AI Turn-Taking. https://www.tavus.io/post/ai-turn-taking
444
+
445
+ ### Datasets
446
+
447
+ 37. Pipecat-AI. (2026). Smart Turn Data v3.2 Training Set. HuggingFace: `pipecat-ai/smart-turn-data-v3.2-train`. 270,946 samples, 23 languages.
448
+
449
+ 38. NILC-NLP. CORAA-MUPE-ASR. HuggingFace: `nilc-nlp/CORAA-MUPE-ASR`. 365h, Portuguese interviews.
450
+
451
+ 39. NILC-NLP. CORAA NURC-SP Audio Corpus. HuggingFace: `nilc-nlp/CORAA-NURC-SP-Audio-Corpus`. 239h, Portuguese dialogues.
452
+
453
+ 40. Cambridge English. Speak & Improve Corpus 2025. 340h, L2 English learner speech with disfluency annotations and CEFR scores.
03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.md ADDED
@@ -0,0 +1,158 @@
1
+ ---
2
+ title: "A Finite-State Turn-Taking Model for Spoken Dialog Systems"
3
+ authors:
4
+ - Antoine Raux
5
+ - Maxine Eskenazi
6
+ year: 2009
7
+ source: https://aclanthology.org/N09-1071/
8
+ date_converted: 2026-03-16
9
+ ---
10
+
11
+ ## Abstract
12
+
13
+ This paper introduces the Finite-State Turn-Taking Machine (FSTTM), a new model to control the turn-taking behavior of conversational agents. Based on a non-deterministic finite-state machine, the FSTTM uses a cost matrix and decision-theoretic principles to select a turn-taking action at any time. The authors show how the model can be applied to the problem of end-of-turn detection. Evaluation results on a deployed spoken dialog system show that the FSTTM provides significantly higher responsiveness than previous approaches.
14
+
15
+ **Note**: The PDF file `finite_state_turn_taking_raux_2009.pdf` in this directory is corrupted (contains an unrelated physics paper). Content for this summary was sourced from the actual paper PDF downloaded from ACL Anthology.
16
+
17
+ ## Key Contributions
18
+
19
+ 1. **FSTTM model**: A principled, unified framework for turn-taking control based on a 6-state non-deterministic finite-state machine with decision-theoretic action selection.
20
+ 2. **Data-driven optimization**: Unlike prior hand-coded models, the FSTTM's cost parameters can be optimized from data using logistic regression.
21
+ 3. **Anytime endpointing**: The model can make end-of-turn decisions both during pauses and during speech (before a pause is even detected by VAD), enabling faster response times.
22
+ 4. **Deployed evaluation**: Tested on a real, publicly deployed bus information system, not just in lab conditions.
23
+
24
+ ## Architecture / Method Details
25
+
26
+ ### Six-State Finite-State Model
27
+
28
+ Based on Jaffe & Feldstein (1970) and Brady (1969), the FSTTM uses six states representing who holds the conversational floor:
29
+
30
+ | State | Description |
31
+ |-------|-------------|
32
+ | **USER** | User has the floor (obligation or intention to speak) |
33
+ | **SYSTEM** | System has the floor |
34
+ | **FREES** | Floor is free, following a SYSTEM state |
35
+ | **FREEU** | Floor is free, following a USER state |
36
+ | **BOTHS** | Both claim the floor, following a SYSTEM state |
37
+ | **BOTHU** | Both claim the floor, following a USER state |
38
+
39
+ States are defined in terms of **intentions and obligations**, not surface-level speech/silence observations. For example, USER state persists during mid-turn pauses.
40
+
41
+ ### Four Turn-Taking Actions
42
+
43
+ At any time, the system can take one of four actions:
44
+ - **Grab (G)**: Claim the floor
45
+ - **Release (R)**: Relinquish the floor
46
+ - **Keep (K)**: Maintain floor claim
47
+ - **Wait (W)**: Remain silent without claiming floor
48
+
49
+ ### Turn-Taking Phenomena Captured
50
+
51
+ The model formalizes common turn-taking patterns as 2-step state transitions:
52
+
53
+ 1. **Turn transitions with gap**: `SYSTEM --(R,W)--> FREES --(W,G)--> USER` (most common)
54
+ 2. **Turn transitions with overlap**: `SYSTEM --(K,G)--> BOTHS --(R,K)--> USER` (barge-in)
55
+ 3. **Failed interruptions**: `USER --(G,K)--> BOTHU --(R,K)--> USER` (system interrupts then backs off)
56
+ 4. **Time outs**: `SYSTEM --(R,W)--> FREES --(G,W)--> SYSTEM` (user doesn't respond)
57
+
58
+ ### Decision-Theoretic Action Selection
59
+
60
+ The optimal action minimizes expected cost:
61
+
62
+ ```
63
+ C(A) = sum over S in States: P(s=S|O) * C(A, S)
64
+ ```
65
+
66
+ Where `P(s=S|O)` is estimated from observable features and `C(A,S)` comes from a cost matrix.
67
+
68
+ ### Cost Matrix Design Principles
69
+
70
+ Derived from Sacks et al. (1974) -- "participants minimize gaps and overlaps":
71
+
72
+ 1. Actions that resolve gaps/overlaps have zero cost.
73
+ 2. Actions that create unwanted gaps/overlaps have a constant cost parameter.
74
+ 3. Actions that maintain gaps/overlaps have cost proportional to time in that state.
75
+
76
+ Four cost parameters:
77
+ - **CS**: Cost of false interruption (interrupting system prompt when user is not claiming floor)
78
+ - **CO(tau)**: Cost of remaining in overlap for tau ms
79
+ - **CU**: Cost of cut-in (grabbing floor when user holds it)
80
+ - **CG(tau)**: Cost of remaining in a gap for tau ms (set as CG^p * tau)
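+
+ A minimal sketch of the decision rule above combined with these cost parameters (illustrative values, not the deployed ones; simplified to the endpointing case):
+
+ ```python
+ def action_cost(action: str, state: str, tau_ms: float,
+                 c_u: float = 5000.0, c_g: float = 1.0) -> float:
+     """C(A, S): zero for resolving actions, constant for a cut-in, proportional to tau for a gap."""
+     if state == "FREEU":                      # user has released the floor
+         return c_g * tau_ms if action == "Wait" else 0.0
+     if state == "USER":                       # user still holds the floor
+         return c_u if action == "Grab" else 0.0
+     return 0.0
+
+ def expected_cost(action: str, p_state: dict, tau_ms: float) -> float:
+     """C(A) = sum over S of P(s=S|O) * C(A, S)."""
+     return sum(p * action_cost(action, s, tau_ms) for s, p in p_state.items())
+
+ # Endpointing during a pause: Grab as soon as it becomes cheaper than Wait.
+ p_state = {"FREEU": 0.7, "USER": 0.3}
+ best = min(("Grab", "Wait"), key=lambda a: expected_cost(a, p_state, tau_ms=300))
+ ```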
81
+
82
+ ### Probability Estimation
83
+
84
+ The key estimation task is `P(FREEU | Ot)` -- the probability that the user has released the floor.
85
+
86
+ **At pauses** (VAD detects silence): Uses Bayes rule combining:
87
+ - `P(F|O)`: Prior probability of floor release from pause-onset features (logistic regression)
88
+ - `P(d >= tau | O, U)`: Probability pause lasts at least tau ms given user holds floor (exponential distribution)
89
+
90
+ **During speech** (before VAD pause detection): Separate logistic regression on each ASR partial hypothesis, enabling endpointing before a full pause is detected.
91
+
92
+ ### Features Used
93
+
94
+ All features are automatically extractable at runtime:
95
+ - **Dialog state**: Open question / Closed question / Confirmation
96
+ - **Turn-taking features**: Whether current utterance is a barge-in
97
+ - **Semantic features**: From dialog state and partial ASR hypotheses
98
+ - **Boundary LM score**: Ratio of log-likelihood of hypothesis being complete vs. incomplete (trained on ASR output, no human transcription needed). **Most informative feature across all states.**
99
+ - **Average words per utterance** so far
100
+ - **Confirmation markers**: "YES", "SURE", etc.
101
+ - **Pause duration** within partial hypothesis (0-200ms range)
102
+
103
+ ## Experimental Results
104
+
105
+ ### Logistic Regression for P(F|O)
106
+
107
+ Trained with stepwise regression, 10-fold cross-validation on 586 dialogs (2008 corpus):
108
+
109
+ | Dialog State | Pause: Class. Error | Pause: Log-Likelihood | Speech: Class. Error | Speech: Log-Likelihood |
110
+ |---|---|---|---|---|
111
+ | Open question | 35% (vs 38% baseline) | -0.61 (vs -0.66) | 17% (vs 20%) | -0.40 (vs -0.50) |
112
+ | Closed question | 26% (vs 25% baseline) | -0.50 (vs -0.56) | 22% (vs 32%) | -0.49 (vs -0.63) |
113
+ | Confirmation | 12% (vs 12% baseline) | -0.30 (vs -0.36) | 17% (vs 36%) | -0.40 (vs -0.65) |
114
+
115
+ The "in speech" model achieves much larger gains, especially for Closed questions and Confirmations -- classification error drops from 32% to 22% and 36% to 17% respectively.
116
+
117
+ ### Batch Evaluation: Latency vs. Cut-in Rate
118
+
119
+ **In-pause FSTTM vs. baselines** (2007 corpus):
120
+ - FSTTM outperforms fixed threshold baseline by up to **29.5% latency reduction**.
121
+ - Slight improvement over Ferrer et al. (2003) reimplementation.
122
+
123
+ **Anytime-FSTTM vs. in-pause FSTTM** (2008 corpus):
124
+ - At 5% cut-in rate: anytime-FSTTM yields latencies **17% shorter** than in-pause-FSTTM, and **40% shorter** than fixed threshold baseline.
125
+ - **30-40% of turns are endpointed before the pause is detected by VAD** (during speech).
126
+
127
+ ### Live Evaluation (Deployed System)
128
+
129
+ A/B test over 10 days: 171 FSTTM dialogs vs. 148 fixed-threshold control dialogs.
130
+
131
+ Settings: `CG^p = 1, CG^s = 500, CU = 5000` (FSTTM) vs. 555ms fixed threshold (control). Both calibrated for ~6.3% cut-in rate.
132
+
133
+ | Metric | FSTTM | Fixed Threshold |
134
+ |--------|-------|-----------------|
135
+ | Average latency | **320ms** | 513ms |
136
+ | Cut-in rate | **4.8%** | 6.3% |
137
+
138
+ Results:
139
+ - **193ms latency reduction** (p < 0.05, statistically significant).
140
+ - 1.5% cut-in rate reduction (not statistically significant, but directionally correct).
141
+
142
+ ## Relevance to Turn-Taking / End-of-Turn Detection
143
+
144
+ The FSTTM is a foundational reference for BabelCast's end-of-turn detection:
145
+
146
+ 1. **Decision-theoretic framework**: The cost-based approach directly applies to our system. We can define costs for premature LLM invocation (wasted compute, incorrect partial translation) vs. delayed response (poor user experience), and optimize the trade-off.
147
+
148
+ 2. **Anytime endpointing**: The insight that 30-40% of turns can be endpointed *before the pause is detected by VAD* is powerful. In our pipeline, this means we could start the translation process before the speaker's pause is even fully formed, dramatically reducing perceived latency.
149
+
150
+ 3. **Boundary LM score**: The most informative feature in Raux's system was a language model score measuring syntactic completeness. We already have ASR partial results from Whisper -- we could compute a similar feature to predict turn completeness without waiting for silence.
151
+
152
+ 4. **Cost parameter tuning**: The four-parameter cost matrix (CS, CO, CU, CG) provides a compact, interpretable way to tune system behavior. We could expose these as configuration parameters in our Pipecat pipeline, allowing adjustment of responsiveness vs. interruption rate.
153
+
154
+ 5. **Real-world validation**: This is one of the few ETD papers validated on a deployed system with real users, not just lab data. The 193ms latency improvement is a concrete, practically meaningful result.
155
+
156
+ 6. **Simplicity**: The FSTTM is mathematically elegant and computationally trivial -- just a few probability estimates and a cost comparison. It could be implemented alongside our Silero VAD as an additional decision layer with negligible overhead.
157
+
158
+ 7. **Limitation for our case**: The system was designed for a task-oriented telephone dialog (bus information), which has more predictable turn structures than the open-domain meeting conversations we handle. The boundary LM approach would need adaptation for our more varied content.
03-finetune-pipecat-pt/references/papers/finite_state_turn_taking_raux_2009.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:aaa1054cf8f25ccdfd847cd0a323e8d0473867404c5895551a1ec9889e4f7d97
3
+ size 254541
03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.md ADDED
@@ -0,0 +1,103 @@
1
+ ---
2
+ title: "Focal Loss for Dense Object Detection"
3
+ authors:
4
+ - Tsung-Yi Lin
5
+ - Priya Goyal
6
+ - Ross Girshick
7
+ - Kaiming He
8
+ - Piotr Dollar
9
+ year: 2017
10
+ source: https://arxiv.org/abs/1708.02002
11
+ date_converted: 2026-03-16
12
+ ---
13
+
14
+ ## Abstract
15
+
16
+ The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. The authors discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. They propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. The novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. Using focal loss, their one-stage RetinaNet detector matches the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.
17
+
18
+ ## Key Contributions
19
+
20
+ 1. **Focal Loss function**: A dynamically scaled cross entropy loss where the scaling factor decays to zero as confidence in the correct class increases, automatically down-weighting easy examples and focusing on hard examples.
21
+ 2. **RetinaNet**: A simple one-stage dense object detector built on Feature Pyramid Networks (FPN) that, when trained with focal loss, achieves state-of-the-art results.
22
+ 3. **Identification of class imbalance** as the central obstacle preventing one-stage detectors from matching two-stage detector accuracy.
23
+
24
+ ## Method Details
25
+
26
+ ### Focal Loss Definition
27
+
28
+ Standard cross entropy loss:
29
+
30
+ ```
31
+ CE(pt) = -log(pt)
32
+ ```
33
+
34
+ Focal loss adds a modulating factor `(1 - pt)^gamma`:
35
+
36
+ ```
37
+ FL(pt) = -alpha_t * (1 - pt)^gamma * log(pt)
38
+ ```
39
+
40
+ Key properties:
41
+ - When `gamma = 0`, FL is equivalent to CE.
42
+ - When an example is misclassified (`pt` is small), the modulating factor is near 1 and the loss is unaffected.
43
+ - As `pt -> 1`, the factor goes to 0, down-weighting well-classified examples.
44
+ - With `gamma = 2`, an example classified with `pt = 0.9` has 100x lower loss compared to CE; with `pt ~ 0.968`, 1000x lower loss (checked numerically in the snippet below).
45
+ - Best setting found: `gamma = 2`, `alpha = 0.25`.
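+
+ A quick numeric check of those down-weighting factors (not from the paper's code; the factor is simply `(1 - pt)^gamma` relative to plain CE):
+
+ ```python
+ for pt in (0.9, 0.968):
+     factor = (1 - pt) ** 2          # gamma = 2
+     print(f"pt={pt}: loss scaled by {factor:.4f} (~{1 / factor:.0f}x smaller than CE)")
+ # pt=0.9   -> 0.0100 (~100x smaller)
+ # pt=0.968 -> 0.0010 (~977x, i.e. roughly 1000x smaller)
+ ```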
46
+
47
+ ### RetinaNet Architecture
48
+
49
+ - **Backbone**: ResNet + Feature Pyramid Network (FPN), pyramid levels P3-P7 with C=256 channels.
50
+ - **Classification subnet**: Small FCN with four 3x3 conv layers (256 filters, ReLU) + final 3x3 conv with KA filters + sigmoid. Shared across all pyramid levels.
51
+ - **Box regression subnet**: Identical structure to classification subnet, terminates in 4A linear outputs per location. Class-agnostic.
52
+ - **Anchors**: 9 anchors per level (3 aspect ratios x 3 scales), covering 32-813 pixel range.
53
+ - **Initialization**: Prior probability pi=0.01 for foreground class at initialization to prevent instability from background class dominance.
54
+
55
+ ## Experimental Results
56
+
57
+ ### Ablation Studies (ResNet-50-FPN, COCO minival)
58
+
59
+ | Loss | gamma | alpha | AP | AP50 | AP75 |
60
+ |------|-------|-------|------|------|------|
61
+ | CE (balanced) | 0 | 0.75 | 31.1 | 49.4 | 33.0 |
62
+ | FL | 0.5 | 0.75 | 31.4 | 49.9 | 33.1 |
63
+ | FL | 1.0 | 0.50 | 32.9 | 51.7 | 35.2 |
64
+ | **FL** | **2.0** | **0.25** | **34.0** | **52.5** | **36.5** |
65
+ | FL | 5.0 | 0.25 | 32.2 | 49.6 | 34.8 |
66
+
67
+ ### FL vs. OHEM (ResNet-101-FPN)
68
+
69
+ | Method | AP | AP50 | AP75 |
70
+ |--------|------|------|------|
71
+ | OHEM (best) | 32.8 | 50.3 | 35.1 |
72
+ | FL | **36.0** | **54.9** | **38.7** |
73
+
74
+ FL outperforms the best OHEM variant by **3.2 AP points**.
75
+
76
+ ### State-of-the-Art Comparison (COCO test-dev)
77
+
78
+ | Method | Backbone | AP | AP50 | AP75 |
79
+ |--------|----------|------|------|------|
80
+ | Faster R-CNN w FPN | ResNet-101-FPN | 36.2 | 59.1 | 39.0 |
81
+ | Faster R-CNN by G-RMI | Inception-ResNet-v2 | 34.7 | 55.5 | 36.7 |
82
+ | DSSD513 | ResNet-101-DSSD | 33.2 | 53.3 | 35.2 |
83
+ | **RetinaNet** | **ResNet-101-FPN** | **39.1** | **59.1** | **42.3** |
84
+ | **RetinaNet** | **ResNeXt-101-FPN** | **40.8** | **61.1** | **44.1** |
85
+
86
+ ### Speed vs. Accuracy
87
+
88
+ - RetinaNet-101-600: 122ms inference, matching Faster R-CNN accuracy (36.0 AP) at similar speed.
89
+ - RetinaNet-101-800: 198ms inference, 37.8 AP -- surpassing all prior methods.
90
+
91
+ ## Relevance to Turn-Taking / End-of-Turn Detection
92
+
93
+ Focal loss is directly applicable to end-of-turn detection for BabelCast:
94
+
95
+ 1. **Class imbalance problem**: End-of-turn detection faces severe class imbalance -- the vast majority of audio frames are "continuing speech" (easy negatives), while actual turn-end boundaries are rare events. This is analogous to the foreground-background imbalance in object detection.
96
+
97
+ 2. **Down-weighting easy examples**: In a streaming ETD system, most 100ms frames are clearly mid-utterance. Focal loss would prevent these trivially classified frames from dominating the gradient, focusing learning on the ambiguous boundary cases (pauses vs. turn-ends).
98
+
99
+ 3. **Drop-in replacement**: Focal loss is a simple modification to standard cross entropy -- just adding `(1 - pt)^gamma` -- making it trivial to integrate into any binary or ternary classifier for end-of-turn detection (e.g., the GRU or Wav2Vec models in SpeculativeETD).
100
+
101
+ 4. **Hyperparameter robustness**: The paper shows `gamma = 2, alpha = 0.25` works well across settings, reducing the need for extensive tuning.
102
+
103
+ 5. **No sampling needed**: Unlike OHEM or hard negative mining, focal loss operates on all examples naturally, which is important for real-time streaming where we process every frame.
03-finetune-pipecat-pt/references/papers/focal_loss_lin_2017.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2e6f45b97007009f32b82a3c19f35638d8da7022d8d4e4416b9078de91339f71
3
+ size 1297358
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.md ADDED
@@ -0,0 +1,74 @@
1
+ ---
2
+ title: "The Dual Status of Filled Pauses: Evidence from Genre, Proficiency and Co-occurrence"
3
+ authors:
4
+ - Loulou Kosmala
5
+ - Ludivine Crible
6
+ year: 2022
7
+ source_url: "https://doi.org/10.1177/00238309211010862"
8
+ date_converted: 2026-03-16
9
+ ---
10
+
11
+ ## Abstract
12
+
13
+ A corpus study examining the lexical vs. non-lexical status of filled pauses ("euh" and "eum") in spoken French. Analyzes their distribution across communication settings (prepared monologues vs. spontaneous conversations) and language proficiency levels (native vs. non-native French). Quantitative findings reveal differences in frequency, duration, position, and co-occurrence patterns. Qualitative analysis identifies two distinct patterns: (1) initial position clustered with a discourse marker (fluent, structuring use) vs. (2) medial position clustered with other hesitation markers (disfluent, processing use). The authors argue for a **dual status** of filled pauses based on formal, functional, and contextual features.
14
+
15
+ ## Key Findings Relevant to L2 Turn-Taking
16
+
17
+ ### French Filled Pause Forms
18
+ - Two main variants in French: **"euh"** [schwa] and **"eum"** [nasal]
19
+ - "Eum" is associated with longer delays and major discourse transitions/boundaries
20
+ - "Euh" is the dominant form in both native and non-native French (84-92% of all FPs)
21
+
22
+ ### Genre Effects (DisReg Corpus -- French Native Speakers)
23
+ - **Class presentations**: 6.8 FPs per 100 words (mean duration 415ms)
24
+ - **Casual conversations**: 4.2 FPs per 100 words (mean duration 343ms)
25
+ - More filled pauses in monologues than dialogues (significant, LL = 47.02, p < .001)
26
+ - More "eum" in presentations (23%) vs. conversations (8%)
27
+ - More initial-position FPs in presentations (40%) vs. conversations (24%)
28
+ - More final-position FPs in conversations (14%) vs. presentations (3%) -- reflects **turn-yielding** function
29
+
30
+ ### Native vs. Non-Native French (SITAF Corpus)
31
+ - **Native rate**: 4.4 FPs per 100 words (mean duration **378ms**)
32
+ - **Non-native rate**: 5.3 FPs per 100 words (mean duration **524ms**)
33
+ - Rate difference not significant, but **duration difference highly significant** (p < .001)
34
+ - Both groups: ~60% medial position, ~30% initial, ~10% final
35
+ - Non-native speakers had more **standalone and interrupted** positions (exclusive to learners)
36
+ - Non-native speakers cluster FPs with other fluencemes more often (82% clustered vs. 72% for natives)
37
+ - Longer fluenceme sequences in learner speech signal higher disruption
38
+
39
+ ### Co-occurrence with Discourse Markers
40
+ - 69 instances of FP + discourse marker clusters found
41
+ - Common French discourse markers paired with FPs: "donc" (so), "mais" (but), "ben" (well), "alors" (then/well), "en fait" (actually), "enfin" (I mean)
42
+ - FP position shifts when clustered with discourse markers: **57% initial** (vs. 73% medial when isolated)
43
+ - **Two distinct patterns emerge**:
44
+ 1. **Fluent pattern**: FP + discourse marker at turn/phrase boundary (initial position) -- structuring function
45
+ 2. **Disfluent pattern**: FP in medial position clustered with repetitions, lengthenings, silent pauses -- processing difficulty
46
+
47
+ ### Functional Analysis of Discourse Marker + FP Clusters
48
+ Four discourse domains identified:
49
+ - **Ideational**: connecting facts (e.g., "alors euh" = "then euh")
50
+ - **Rhetorical**: expressing opinions (e.g., "mais euh" = "but euh")
51
+ - **Sequential**: marking transitions (e.g., "donc euh" = "so euh")
52
+ - **Interpersonal**: monitoring/agreement (e.g., "bah oui euh" = "well yes euh")
53
+
54
+ ## Specific Hesitation Pattern Data
55
+
56
+ - Native French FP mean duration: **378ms** (SD=200ms)
57
+ - Non-native French FP mean duration: **524ms** (SD=222ms) -- 146ms longer
58
+ - FP duration is a more reliable index of proficiency than frequency
59
+ - Duration of FPs with/without discourse markers: no significant difference (~433ms vs. ~467ms)
60
+ - Example extreme disfluent sequence from a learner: 8 fluencemes in a row (lengthening + 3 silent pauses + 3 filled pauses + self-interruption), with pauses up to **1,177ms**
61
+
62
+ ## Implications for Turn-Taking Detection in L2 Speech
63
+
64
+ 1. **Dual function detection is essential**: The same sound "euh" can signal either (a) turn/phrase structuring (fluent, initial position + discourse marker) or (b) processing difficulty (disfluent, medial position + hesitation cluster). Both mean "don't take the turn yet" but for different reasons.
65
+
66
+ 2. **Duration as proficiency proxy**: Non-native FPs are ~150ms longer on average. A model trained on native French data will encounter systematically longer hesitations from L2 speakers. This duration difference should NOT be interpreted as turn-yielding.
67
+
68
+ 3. **Final-position FPs signal turn-yielding**: In conversations, 14% of FPs occur in final position (vs. 3% in monologues). The combination of FP + silent pause at utterance end is a reliable turn-yielding signal (e.g., "but he has a role euh (0.924)...").
69
+
70
+ 4. **Discourse marker + FP combinations**: Patterns like "donc euh" (so euh), "mais euh" (but euh) are strong turn-HOLD signals in French. A model should recognize these common French fluenceme clusters.
71
+
72
+ 5. **Cluster length matters**: Isolated FPs may go unnoticed; clustered FPs with repetitions and silent pauses signal genuine difficulty. Longer clusters = speaker is struggling but still holding the floor.
73
+
74
+ 6. **L2 speakers underuse discourse markers**: Non-native speakers rely more heavily on filled pauses alone, whereas native speakers combine FPs with rich discourse markers. The model should expect L2 French-Portuguese speakers to show heavy FP reliance with fewer Portuguese discourse markers.
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/01-kosmala-crible-2022-dual-status-filled-pauses.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:46985a0e6f9345c0ae5a756f3bf13523e9ba2a76694c60ef2b861cf0aef67b4b
3
+ size 631099
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.md ADDED
@@ -0,0 +1,68 @@
1
+ ---
2
+ title: "How Much Does Prosody Help Turn-taking? Investigations using Voice Activity Projection Models"
3
+ authors:
4
+ - Erik Ekstedt
5
+ - Gabriel Skantze
6
+ year: 2022
7
+ source_url: "https://doi.org/10.18653/v1/2023.acl-long.304"
8
+ date_converted: 2026-03-16
9
+ ---
10
+
11
+ ## Abstract
12
+
13
+ Investigates the role of prosody in turn-taking using the Voice Activity Projection (VAP) model, which incrementally models upcoming speech activity of interlocutors in a self-supervised manner without relying on explicit annotation of turn-taking events or explicit prosodic feature modeling. Through systematic manipulation of the speech signal (F0 flattening, intensity flattening, low-pass filtering, duration averaging, F0 shifting), the authors show that VAP models learn to utilize various prosodic aspects of speech for turn-taking prediction. The study uses both aggregate quantitative metrics on long-form conversations (Switchboard corpus) and controlled utterance-level experiments with synthesized short/long phrase pairs.
14
+
15
+ ## Key Findings Relevant to L2 Turn-Taking
16
+
17
+ ### VAP Model Architecture
18
+ - Self-supervised model trained on raw waveforms (no hand-crafted features)
19
+ - Predicts future voice activity for both speakers at every time frame
20
+ - Tested at 20Hz, 50Hz, and 100Hz frame rates
21
+ - Trained on **Switchboard corpus** (English telephone conversations, 260h)
22
+ - Evaluated on 4 zero-shot tasks: shift vs. hold, shift prediction at voice activity, backchannel prediction, backchannel vs. turn-shift
23
+
24
+ ### Prosodic Perturbation Results
25
+ Five signal manipulations tested:
26
+ 1. **Low-pass filter** (removes phonetic info, keeps F0 + intensity): Largest overall impact -- phonetic information is crucial
27
+ 2. **F0 flat** (removes pitch contour): Second most impactful -- intonation is important for disambiguating turn completion
28
+ 3. **Intensity flat** (removes loudness variation): Comparable impact to F0
29
+ 4. **F0 shift** (arbitrary pitch shift): Minimal effect on slower models (50Hz), larger effect on faster models (100Hz)
30
+ 5. **Duration average** (normalizes segment durations): Least important individual cue
31
+
32
+ ### Key Prosodic Finding: F0 Contour for Turn Projection
33
+ - At **syntactically ambiguous completion points** (where lexical information is identical for both short and long utterances), the VAP model correctly uses prosody to distinguish turn-yielding from turn-holding
34
+ - **Higher F0 rise + longer duration** at the last syllable = turn completion signal
35
+ - The model predicts turn shifts **before** the utterance is actually complete -- it projects completions from prosodic dynamics
36
+ - Even when F0 is flattened, the model still partially distinguishes hold/shift, indicating **redundant information in intensity and/or duration**
37
+
38
+ ### Relative Importance of Prosodic Cues
39
+ 1. **Phonetic information** (segmental, captured by spectral content): Most important overall
40
+ 2. **F0 contour** (intonation): Most important for disambiguating syntactically equivalent completion points
41
+ 3. **Intensity**: At least as important as pitch for general turn-taking
42
+ 4. **Duration**: Less important, but contributes as redundant cue
43
+
44
+ ### Model Frame Rate Findings
45
+ - 20Hz and 50Hz models: comparable performance, more robust to perturbations
46
+ - 100Hz models: more sensitive to phonetic information and acoustic artifacts
47
+ - Lower frame rates preferred for computational efficiency and robustness
48
+
49
+ ## Specific Data Points
50
+
51
+ - Human inter-turn gap: ~200ms average (Levinson & Torreira 2015)
52
+ - Switchboard corpus: 2,400 dyadic telephone conversations, 260 hours
53
+ - Short/long phrase pairs: 9 pairs, 10 TTS voices each (5 male, 5 female)
54
+ - Three evaluation regions at short completion point: hold (start to -200ms), predictive (-200ms to end), reactive (last frame)
55
+
56
+ ## Implications for Turn-Taking Detection in L2 Speech
57
+
58
+ 1. **VAP models work without explicit prosodic features**: The self-supervised approach learns prosodic patterns from raw audio. This is crucial for L2 speech where prosodic patterns deviate from native norms -- the model can potentially learn L2-specific patterns if fine-tuned on L2 data.
59
+
60
+ 2. **F0 contour is the key disambiguation signal**: When words alone cannot determine turn boundaries (common in L2 speech with simpler syntax), the pitch contour becomes the primary cue. French speakers' intonation patterns in Portuguese will be a critical signal for the model.
61
+
62
+ 3. **Prosody is redundant and multi-dimensional**: Even removing one prosodic dimension (F0 or intensity) doesn't completely collapse turn-taking prediction. This redundancy is useful for L2 speech where some prosodic cues may be atypical.
63
+
64
+ 4. **50Hz frame rate is optimal**: For our Pipecat implementation, a 50Hz (20ms frame) VAP model provides the best balance of performance, robustness, and computational efficiency. This aligns with our current 16kHz/512-sample (32ms) Silero VAD window.
65
+
66
+ 5. **Pre-completion projection**: The VAP model predicts turn shifts BEFORE the speaker finishes, based on prosodic dynamics. This is essential for reducing response latency in real-time systems. For L2 speakers, this projection may be less reliable due to non-standard prosodic patterns, requiring L2-specific fine-tuning.
67
+
68
+ 6. **Cross-linguistic transfer needed**: The model was trained on English (Switchboard). For French speakers in Portuguese, it needs fine-tuning on Romance language conversation data. The underlying prosodic principles (F0 rise at completion, intensity patterns) likely transfer across Romance languages, but language-specific patterns must be learned.
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/02-ekstedt-skantze-2022-prosody-turn-taking-vap.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:472f85f79713f457e2ec5aa866a1e88cbd58d119a8b3e1561d50e7f1c1722661
3
+ size 2825968
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.md ADDED
@@ -0,0 +1,77 @@
1
+ ---
2
+ title: "A Survey of Recent Advances on Turn-taking Modeling in Spoken Dialogue Systems"
3
+ authors:
4
+ - Galo Castillo-Lopez
5
+ - Gael de Chalendar
6
+ - Nasredine Semmar
7
+ year: 2025
8
+ source_url: "https://aclanthology.org/2025.iwsds-1.24/"
9
+ date_converted: 2026-03-16
10
+ ---
11
+
12
+ ## Abstract
13
+
14
+ A comprehensive review of recent methods on turn-taking modeling in spoken dialogue systems. Covers end-of-turn prediction, backchannel prediction, and multi-party turn-taking. Notes that 72% of reviewed works do not compare their methods with previous efforts, and argues that the lack of well-established benchmarks is a key challenge. Provides the first detailed review of datasets used in the field, discusses overlooked limitations, and examines new ideas and approaches since 2021. Published at IWSDS 2025.
15
+
16
+ ## Key Findings Relevant to L2 Turn-Taking
17
+
18
+ ### Turn-Taking Modeling Taxonomy
19
+ Three main approaches to end-of-turn (EOU) prediction:
20
+ 1. **Silence-based**: Uses VAD + silence threshold (e.g., 700ms). Results in poor user experience due to unnaturalness. Current spoken dialogue systems wait 700-1000ms (vs. human 200ms average).
21
+ 2. **IPU-based (Inter-Pausal Unit)**: Predictions after each detected silence. Assumes turns cannot be taken while user speaks.
22
+ 3. **Continuous**: Predictions at every frame (e.g., every 50ms) regardless of silence. Most recent and promising approach.
23
+
24
+ ### Voice Activity Projection (VAP) Models -- Current State of the Art
25
+ - VAP models predict future voice activity of both interlocutors incrementally
26
+ - Self-supervised learning -- no explicit turn-taking event annotation needed
27
+ - Best results in Japanese when trained on English + fine-tuned with Japanese data (vs. direct Japanese training)
28
+ - Cross-lingual performance is poor without proper label alignment across datasets
29
+ - Emerging trend: VAP models dominate continuous methods
30
+
31
+ ### Feature Importance for Turn-Taking
32
+ - **Combined features outperform individual ones**: prosody + words > prosody alone or words alone
33
+ - **Turn-taking cues have additive effect** in human communication
34
+ - Word embeddings + acoustic features together improve backchannel detection
35
+ - Gaze, head pose, and non-verbal features enhance predictions when available
36
+ - **ASR-based linguistic features**: Fine-tuning wav2vec 2.0 for ASR outperforms acoustic-only features
37
+
38
+ ### Key Datasets
39
+ | Dataset | Language | Duration | Dialogues |
40
+ |---|---|---|---|
41
+ | Switchboard | English | 260h | 2,400 |
42
+ | Fisher Corpus | English | 1,960h | 11,700 |
43
+ | NoXi Database | en/es/fr/de/it/ar/id | 25h | 84 |
44
+ | AMI Meeting Corpus | English | 100h | 175 (multi-party) |
45
+
46
+ - **Switchboard** is the dominant benchmark: used in 69% of backchannel and 41% of EOU papers
47
+ - Most research is on **English and Japanese** -- very limited work on other languages
48
+ - IPU silence thresholds vary from **50ms to 1s** across studies -- no standard
49
+
50
+ ### Backchannel Prediction
51
+ - Backchannels are short feedback tokens ("mhm", "yeah") produced during another's speech
52
+ - Key distinction: backchannel vs. actual interruption vs. non-lexical noise
53
+ - Multi-task learning (sentiment + dialogue acts + backchannel) improves prediction
54
+ - Context-aware models using BERT text embeddings + wav2vec acoustic embeddings show strong results
55
+
56
+ ### Open Challenges Identified
57
+ 1. **No standardized benchmarks**: Only 28% of papers compare with prior work
58
+ 2. **Multilinguality**: Very limited -- most work is English/Japanese only
59
+ 3. **Multi-party conversations**: Understudied, requires visual channel
60
+ 4. **Special populations**: Senior adults, mental health disorders need adapted models
61
+ 5. **LLMs are inefficient** at detecting mid-utterance turn opportunities
62
+
63
+ ## Implications for Turn-Taking Detection in L2 Speech
64
+
65
+ 1. **Continuous VAP models are the right approach**: For our fine-tuning task, continuous frame-level prediction (not silence-threshold-based) is the current state of the art. This aligns with our Pipecat architecture.
66
+
67
+ 2. **Cross-lingual fine-tuning works**: VAP models trained on English and fine-tuned on target language data outperform direct target-language training. This suggests training on English Switchboard + fine-tuning on French-Portuguese conversation data is a viable strategy.
68
+
69
+ 3. **Multi-feature approach is essential**: For L2 speakers, combining prosodic + lexical + timing features provides the best predictions. L2 speakers may have atypical prosody but more predictable syntax, or vice versa. The additive effect of features provides robustness.
70
+
71
+ 4. **Silence threshold tuning**: The survey notes thresholds from 50ms to 1s. For L2 speakers with longer pauses, the threshold must be adjusted upward. Our current 300ms SILENCE_HANGOVER_MS may need extension for L2 French speakers in Portuguese.
72
+
73
+ 5. **French is available in NoXi**: The NoXi database includes French conversations (among 7 languages) with 25h total. This could be a valuable fine-tuning resource for our French-specific model.
74
+
75
+ 6. **Backchannel handling**: L2 speakers may produce non-standard backchannels or transfer French backchannels ("ouais", "mmh") into Portuguese conversation. The model must distinguish these from turn-taking attempts.
76
+
77
+ 7. **Benchmarking gap**: Since 72% of papers don't compare methods, we should establish clear metrics for our L2 turn-taking task from the start, enabling future comparison.
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/03-castillo-lopez-2025-survey-turn-taking-modeling.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f3fd4a6e8a77a46ad20634ba4b05c5731dbf4c692eb1acdb2fc1bb71062f12ee
3
+ size 473840
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.md ADDED
@@ -0,0 +1,76 @@
1
+ ---
2
+ title: "Pauses and Communication Strategies in Second Language Speech"
3
+ authors:
4
+ - Jasone Cenoz
5
+ year: 2000
6
+ source_url: "https://eric.ed.gov/?id=ED426630"
7
+ date_converted: 2026-03-16
8
+ ---
9
+
10
+ ## Abstract
11
+
12
+ A study of silent and filled pauses in second language speech analyzing (1) which types of pauses are produced, (2) the functions of non-juncture pauses, (3) whether pauses co-occur with other hesitation phenomena, and (4) whether the occurrence of pauses is associated with second language proficiency. Subjects were 15 intermediate and advanced learners of English as a second language (L1 Spanish, university students). Each told a story in English which was recorded and transcribed. Silent and filled pauses at non-grammatical junctures were identified and analyzed.
13
+
14
+ ## Key Findings Relevant to L2 Turn-Taking
15
+
16
+ ### Pause Distribution
17
+ - Total non-juncture pauses: **1,085** across 15 subjects
18
+ - **64% silent pauses**, **36% filled pauses**
19
+ - The most common filler was **"eh"** (transferred from L1 Spanish), demonstrating L1 transfer of hesitation markers
20
+ - Wide individual variation: filled pauses ranged from 4% to 74.5% of all pauses per individual
21
+
22
+ ### Silent Pause Durations
23
+ | Duration Range | % of Pauses | Subjects Affected | Individual Variation |
24
+ |---|---|---|---|
25
+ | 200-1000ms | 70% | 100% | 39%-96% |
26
+ | 1001-2000ms | 21% | 100% | 4%-39% |
27
+ | 2001-4000ms | 7% | 40% | 10%-18% |
28
+ | 4001ms+ | 2% | 40% | 1.5%-7% |
29
+
30
+ ### Functional Categories of Pauses
31
+ | Function | Silent Pauses | Filled Pauses |
32
+ |---|---|---|
33
+ | Lexical (retrieval) | 36% | 26% |
34
+ | Morphological | 5% | 1% |
35
+ | Planning | 59% | 73% |
36
+
37
+ - Both silent and filled pauses serve the same functions, but filled pauses are overwhelmingly used for general **planning** (73%)
38
+ - More silent pauses associated with **lexical retrieval** than filled pauses
39
+
40
+ ### Co-occurrence with Other Hesitation Phenomena
41
+ | Strategy | Silent Pauses | Filled Pauses |
42
+ |---|---|---|
43
+ | Pauses + other hesitations | 54% | 23% |
44
+ | Only pauses | 46% | 77% |
45
+
46
+ - Most common hesitation phenomena: **repetition, self-correction, and reformulation**
47
+ - Silent pauses much more likely to co-occur with other repair strategies (54%) vs. filled pauses (23%)
48
+ - Filled pauses function as **standalone repair devices**; silent pauses tend to **precede** other repairs
49
+
50
+ ### Proficiency Effects
51
+ - Higher-proficiency learners produced **more total pauses** and **more filled pauses** (53% silent + 46% filled)
52
+ - Lower-proficiency learners: 69% silent + 31% filled
53
+ - Lower-proficiency learners used **more hesitation strategies combined with pauses** (62% for silent pauses) vs. higher-proficiency (38%)
54
+ - Interpretation: high-proficiency learners need only time to retrieve information; low-proficiency learners need to vocalize different options
55
+
56
+ ## Specific Hesitation Pattern Data
57
+
58
+ - Silent pause minimum threshold: **200ms**
59
+ - Pause range: **205ms to 11,569ms**
60
+ - Most pauses (70%) are under 1 second
61
+ - Long pauses (>2s) occur in only 40% of subjects -- marking a subset of struggling speakers
62
+ - L1 Spanish filler "eh" dominates filled pauses (L1 transfer effect)
63
+
64
+ ## Implications for Turn-Taking Detection in L2 Speech
65
+
66
+ 1. **Pause duration is critical**: 70% of L2 pauses are 200-1000ms, overlapping with natural turn-transition gaps (~200ms). A turn-taking model must distinguish these within-turn hesitation pauses from actual turn boundaries (a classification sketch follows this list).
67
+
68
+ 2. **Filled pauses signal continuation**: Since filled pauses are primarily standalone floor-holding devices (77% occur without other hesitations), detecting "eh/euh/um" should strongly indicate the speaker intends to continue, NOT yield the turn.
69
+
70
+ 3. **L1 transfer of filler forms**: French speakers speaking Portuguese will likely transfer French "euh" rather than adopting Portuguese fillers. The model should recognize L1-influenced fillers as valid hold signals.
71
+
72
+ 4. **Proficiency paradox**: More proficient L2 speakers use MORE filled pauses, not fewer. The model should not interpret high filler rates as low fluency or turn-yielding intent.
73
+
74
+ 5. **Cluster detection**: When silent pauses co-occur with repetitions, self-corrections, or reformulations (54% of cases), this strongly signals ongoing speech processing, not turn completion. Detecting hesitation clusters is key to avoiding premature turn-taking.
75
+
76
+ 6. **Individual variation is massive**: Some speakers use almost no filled pauses (4%) while others fill 74.5% of their pauses. The model needs speaker adaptation or robust handling of diverse hesitation profiles.
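+
+ ### Sketch: gating end-of-turn decisions on hesitation cues
+
+ A minimal sketch of points 1, 2, and 5 above. The filler lexicon, the ~1s cutoff (70% of L2 hesitation pauses fall below it), and the `PauseContext` fields are illustrative assumptions layered on top of ASR output, not Cenoz's procedure.
+
+ ```python
+ # Sketch only: decide whether a silence is within-turn hesitation or a
+ # turn-boundary candidate. Names and thresholds are illustrative assumptions.
+ from dataclasses import dataclass
+
+ # Fillers an L1-French speaker may carry into Portuguese speech (hold signals).
+ HOLD_FILLERS = {"euh", "eh", "um", "hum", "mmh"}
+
+
+ @dataclass
+ class PauseContext:
+     silence_ms: float           # duration of the current silence
+     last_tokens: list[str]      # most recent ASR tokens before the silence
+     repeated_or_repaired: bool  # repetition / self-correction detected just before
+
+
+ def is_turn_boundary_candidate(ctx: PauseContext, max_hesitation_ms: float = 1000.0) -> bool:
+     """True only when no hesitation cue argues that the speaker will continue."""
+     # Point 2 above: a trailing filled pause is a strong floor-holding signal.
+     if ctx.last_tokens and ctx.last_tokens[-1].lower().strip(".,!?") in HOLD_FILLERS:
+         return False
+     # Point 5 above: silence co-occurring with repetition/self-correction means ongoing planning.
+     if ctx.repeated_or_repaired:
+         return False
+     # Point 1 above: most L2 hesitation pauses stay under ~1s, so treat shorter
+     # silences as potential hesitation rather than turn completion.
+     return ctx.silence_ms > max_hesitation_ms
+
+
+ if __name__ == "__main__":
+     print(is_turn_boundary_candidate(PauseContext(450, ["eu", "quero", "euh"], False)))  # False
+     print(is_turn_boundary_candidate(PauseContext(1400, ["obrigado"], False)))           # True
+ ```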
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/04-cenoz-2000-pauses-hesitation-L2-production.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3e6e1ea3fb31493034539cd59a0274c8bcc11e08ff4104b2fb882c5766106e7a
3
+ size 238029
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/05-kosmala-2023-filled-pauses-pragmatic-markers-gaze-gesture.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7b4003c9f58c976ea687d117ea25b507885d0472d440eaad0e8386880e036734
3
+ size 920978
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.md ADDED
@@ -0,0 +1,76 @@
1
+ ---
2
+ title: "Disfluencies in Continuous Speech in French: Prosodic Parameters of Filled Pauses and Vowel Lengthening"
3
+ authors:
4
+ - Yaru Wu
5
+ - Ivana Didirkova
6
+ - Anne-Catherine Simon
7
+ year: 2023
8
+ source_url: null
9
+ date_converted: 2026-03-16
10
+ ---
11
+
12
+ ## Abstract
13
+
14
+ This study examines prosodic parameters in two types of disfluencies -- vowel lengthenings and filled pauses -- in approximately 2.5 hours of continuous French speech across 11 different speech genres (prepared, semi-prepared, and unprepared). It analyzes mean fundamental frequency (F0), pitch resets, and duration to compare disfluent syllables with surrounding fluent syllables as a function of speech preparation level. Results show that F0 is lower in filled pauses and disfluent vowel lengthenings than in fluent speech, with filled pauses produced at lower F0 than vowel lengthening. Larger pitch resets occur between disfluent units and their preceding contexts.
15
+
16
+ ## Key Findings Relevant to L2 Turn-Taking
17
+
18
+ ### Prosodic Characteristics of French Filled Pauses
19
+
20
+ #### Fundamental Frequency (F0) Drop
21
+ - **Fluent speech**: mean 19.01 ST (SD=5.97)
22
+ - **Vowel lengthening**: mean 18.10 ST (SD=6.43) -- ~1 ST drop
23
+ - **Filled pauses**: mean 17.90 ST (SD=7.31) -- ~1.1 ST drop
24
+ - F0 drop in filled pauses vs. fluent speech:
25
+ - **Females**: 2.12 ST decrease (from 23.66 to 21.54 ST)
26
+ - **Males**: 1.75 ST decrease (from 16.72 to 14.97 ST)
27
+ - Both disfluency types significantly lower than fluent speech (p < 0.001)
28
+
29
+ #### F0 and Speech Preparation Level
30
+ - **Unprepared speech**: FP average F0 = 18.89 ST
31
+ - **Semi-prepared speech**: FP average F0 = 17.41 ST
32
+ - **Prepared speech**: FP average F0 = 14.24 ST
33
+ - Massive **4.66 ST drop** between unprepared and prepared speech for FPs
34
+ - Vowel lengthening does NOT follow this trend (only 0.07 ST difference)
35
+ - Fluent speech shows moderate 1.2 ST drop between unprepared and prepared
36
+
37
+ #### Pitch Reset (Melodic Discontinuity)
38
+ - Filled pauses create a **larger negative pitch reset** from the preceding syllable than vowel lengthening
39
+ - Both FPs and lengthening produce significantly more negative pitch changes than fluent speech (p < 0.001)
40
+ - No significant pitch reset difference between the disfluent unit and the FOLLOWING syllable
41
+ - Key insight: the pitch drop **before** a filled pause is a reliable acoustic cue
42
+
43
+ #### Duration
44
+ - **Vowel lengthening** is significantly LONGER than **filled pauses** (p < 0.001)
45
+ - Vowel lengthening mean: ~347ms; Filled pause mean: ~268ms (from prior French study by Grosman 2018)
46
+ - Filled pauses are shorter in prepared speech and longer in unprepared speech
47
+ - Vowel lengthening shows the opposite pattern: shorter in unprepared, longer in prepared/semi-prepared
48
+
49
+ ### Corpus Details (LOCAS-F)
50
+ - Multi-genre French oral corpus (14 speech genres)
51
+ - Analyzed: 2h38m of recordings
52
+ - Data: >1,500 vowel prolongation sequences, >1,000 filled pauses, ~59,000 fluent syllables
53
+ - Disfluency affects ~4% of all data (typical for non-pathological speech)
54
+ - Inter-annotator agreement: kappa 0.86 for FPs (near perfect), 0.64 for lengthening (substantial)
55
+
56
+ ## Specific Hesitation Pattern Data
57
+
58
+ - French filled pauses typically transcribed as **"euh"**
59
+ - Duration range for hesitation vowels (cross-linguistic): **200ms to 650ms** (from Vasilescu et al. 2004, 8 languages)
60
+ - French FP F0 value is similar to the **onset value of breath groups** for a given speaker
61
+ - Mean FP duration in French: **268.4ms** (Grosman 2018)
62
+ - Mean vowel lengthening duration in French: **347.25ms** (Grosman 2018)
63
+
64
+ ## Implications for Turn-Taking Detection in L2 Speech
65
+
66
+ 1. **F0 drop is a reliable filled pause detector**: Filled pauses in French show a consistent 1-2 ST drop in F0 compared to fluent speech. A turn-taking model can use this prosodic signature to identify FPs even without lexical recognition. This is especially useful when ASR may not reliably transcribe L2 "euh" sounds (a detection sketch follows this list).
67
+
68
+ 2. **Pitch reset before FPs marks processing onset**: The significant negative pitch reset between the preceding syllable and the filled pause signals the START of a hesitation event. This could serve as an early warning that the speaker is about to hesitate but intends to continue.
69
+
70
+ 3. **Preparation level affects FP prosody**: In more spontaneous speech (like real-time conversation), FPs have HIGHER F0 (closer to fluent speech). In prepared speech, FPs drop dramatically in F0. Since L2 speakers in conversation are in "unprepared" mode, their FPs may be harder to distinguish from fluent speech by F0 alone.
71
+
72
+ 4. **Vowel lengthening is a separate hesitation signal**: French speakers (both L1 and L2) elongate vowels at word boundaries as a hesitation strategy. These are LONGER than filled pauses (~347ms vs. ~268ms) but show a smaller F0 drop. The model should recognize elongated schwa-like sounds as hesitation, not turn completion.
73
+
74
+ 5. **4% disfluency rate in native speech**: This baseline helps calibrate expectations. L2 speakers will show significantly higher rates, potentially 8-15% or more, meaning the model will encounter disfluency markers very frequently in L2 French-to-Portuguese speech.
75
+
76
+ 6. **Cross-linguistic prosodic transfer**: French speakers speaking Portuguese will likely transfer their French FP prosodic patterns (F0 drop magnitude, pitch reset patterns). The model should accommodate French-influenced prosodic contours on Portuguese hesitations.
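+
+ ### Sketch: using the F0 drop as a filled-pause cue
+
+ A minimal sketch of point 1 above. The F0 tracks are assumed to come from any external pitch tracker (Hz values over voiced frames); the 1.0 ST threshold is an illustrative reading of the ~1-2 ST drops reported above, not a tuned value. The semitone difference 12 * log2(a/b) is reference-independent, so no ST reference frequency is needed.
+
+ ```python
+ # Sketch only: flag a candidate segment as a filled pause from its F0 drop
+ # relative to the preceding fluent context. The threshold is an assumption.
+ import numpy as np
+
+
+ def semitone_diff(f0_a_hz: np.ndarray, f0_b_hz: np.ndarray) -> float:
+     """Semitone difference between the mean F0 of two voiced segments (positive if a is higher)."""
+     return 12.0 * float(np.log2(np.nanmean(f0_a_hz) / np.nanmean(f0_b_hz)))
+
+
+ def looks_like_filled_pause(prev_f0_hz: np.ndarray,
+                             candidate_f0_hz: np.ndarray,
+                             min_drop_st: float = 1.0) -> bool:
+     """True if the candidate sits >= min_drop_st semitones below the preceding speech."""
+     return semitone_diff(prev_f0_hz, candidate_f0_hz) >= min_drop_st
+
+
+ if __name__ == "__main__":
+     prev = np.array([210.0, 205.0, 212.0])      # preceding fluent syllables (Hz)
+     cand = np.array([188.0, 186.0, 190.0])      # sustained, lower-pitched "euh" (Hz)
+     print(looks_like_filled_pause(prev, cand))  # True (~1.8 ST drop)
+ ```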
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/06-wu-2023-disfluencies-french-prosodic-params-filled-pauses.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c77a54e77105dc6ad1f3306d0a991c9730f70ed327559f2686bc5897c8f6540c
3
+ size 436423
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/07-ekstedt-2023-turn-taking-cues-speech-synthesis.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9422e0c9df73fd3e2a91bb7bea6eba9393ba5865ea91b36463e1d50c23c9215b
3
+ size 294389
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/08-inoue-2024-realtime-turn-taking-vap.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:129a900833fb9a775f6abc8869de60f3526a3a489ed6fda5f8b267136a5a8cec
3
+ size 496267
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/09-inoue-2024-multilingual-turn-taking-vap.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a12189557e7022189609c710cd8071f359ae4e05a3c755c7e1aceef1b19a86bc
3
+ size 974569
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/10-hutin-2024-filled-pauses-conversational-convergence.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e48ef19c610ebb2f2dca05458c801d11f2899563ff165aa7b1a90c17a563501a
3
+ size 951784
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.md ADDED
@@ -0,0 +1,67 @@
1
+ ---
2
+ title: "Codeswitching: A Bilingual Toolkit for Opportunistic Speech Planning"
3
+ authors:
4
+ - Anne L. Beatty-Martinez
5
+ - Christian A. Navarro-Torres
6
+ - Paola E. Dussias
7
+ year: 2020
8
+ source_url: "https://doi.org/10.3389/fpsyg.2020.01699"
9
+ date_converted: 2026-03-16
10
+ ---
11
+
12
+ ## Abstract
13
+
14
+ Reviews codeswitching as a bilingual strategy for opportunistic speech planning. Recent discoveries show that codeswitching is not haphazard but subject to unique linguistic and cognitive constraints, and that bilinguals who codeswitch exhibit usage patterns conforming to community-based norms. The paper provides corpus evidence (from the Puerto Rico Codeswitching Map Task corpus of Spanish-English bilinguals) that codeswitching serves as a tool to navigate linguistic interference during production, enabling speakers to circumvent speech planning difficulties by opportunistically drawing from whichever language is most active or accessible.
15
+
16
+ ## Key Findings Relevant to L2 Turn-Taking
17
+
18
+ ### Codeswitching as a Fluency Strategy
19
+ - Codeswitching is NOT random -- it is **structured and strategic**
20
+ - Functions as a tool for **opportunistic speech planning**: bilinguals take advantage of whichever language's words/structures are most active to achieve communicative goals
21
+ - Reduces the cost in time and resources during speech production
22
+ - **93% of codeswitches occur at Intonation Unit (IU) boundaries** (Plaistowe 2015, from NMSEB corpus)
23
+ - This means switches overwhelmingly happen at natural prosodic break points, not mid-phrase
24
+
25
+ ### Prosodic Signatures of Codeswitching
26
+ - Speech rate changes before codeswitches: momentary **reorganization of prosodic and phonetic systems**
27
+ - These changes serve a dual purpose:
28
+ 1. Help the speaker negotiate lexical competition and minimize cross-language interference
29
+ 2. Provide reliable **acoustic cues for listeners** to anticipate and process the switch
30
+ - Codeswitching affects bilingual speech at different levels: word-level (speech rate of individual words) and sentence-level (speech rate within prosodic sentences)
31
+
32
+ ### Language Control Modes
33
+ - **Single-language context**: Language control is COMPETITIVE -- one language suppressed at expense of other
34
+ - **Codeswitching context**: Language control is COOPERATIVE -- coactivation maintained, items from both languages available for selection
35
+ - Dense codeswitchers minimize language membership tagging and keep both languages active
36
+
37
+ ### Corpus Data (Puerto Rico Codeswitching Map Task)
38
+ - 10 Spanish-English bilinguals (6 female), all native Spanish speakers
39
+ - Equal self-reported proficiency in both languages (9.6/10 each)
40
+ - ~2.5 hours of unscripted, task-oriented dialogs
41
+ - Participants exposed to both languages: more Spanish with family, more English in media, equal among friends
42
+ - Paired with in-group confederate (close friend from same speech community) -- this increased codeswitching 4x vs. out-group pairing
43
+
44
+ ### Variable Equivalence and Switch Sites
45
+ - Bilinguals do NOT consistently avoid "conflict sites" between languages
46
+ - Instead, they **opportunistically use** sites of variable equivalence (partial structural overlap between languages)
47
+ - This challenges the strict "equivalence constraint" (Poplack 1980)
48
+
49
+ ## Specific Data Points
50
+
51
+ - Self-reported proficiency: Spanish 9.6/10 (SD=0.8), English 9.6/10 (SD=0.5)
52
+ - Mean age: 23.3 years (SD=1.8)
53
+ - Codeswitching at IU boundaries: 93% (from related corpus study)
54
+
55
+ ## Implications for Turn-Taking Detection in L2 Speech
56
+
57
+ 1. **Codeswitches happen at prosodic boundaries**: Since 93% of codeswitches align with intonation unit boundaries, a turn-taking model should expect language switches at natural break points -- the same locations where turn transitions might occur. The model must not interpret a language switch as a turn-ending signal.
58
+
59
+ 2. **Prosodic reorganization before switches**: The speech rate and prosodic changes before a codeswitch could be confused with turn-ending prosody. For French speakers occasionally inserting French words while speaking Portuguese, the model needs to tolerate these prosodic perturbations without triggering a false turn-shift prediction.
60
+
61
+ 3. **L1-L2 fluency trade-off**: When French speakers struggle with Portuguese lexical retrieval, they may briefly switch to French (a natural bilingual strategy). The turn-taking model should recognize this as a hold signal (the speaker is using an alternative strategy to continue) rather than a breakdown signal; a minimal sketch follows this list.
62
+
63
+ 4. **Cooperative language mode in bilingual contexts**: If the conversation partner also speaks French, the French-Portuguese speaker may engage in cooperative codeswitching. The model should handle mixed-language utterances without treating language boundaries as turn boundaries.
64
+
65
+ 5. **Community norms matter**: Codeswitching frequency and patterns depend heavily on the interactional context and community norms. The model should be configurable for different bilingual settings (e.g., high-codeswitch vs. monolingual-target contexts).
66
+
67
+ 6. **Intonation unit alignment**: Since codeswitches cluster at IU boundaries, and IU boundaries are also key sites for turn-taking decisions, the model must weigh additional cues (gaze, content completeness, prosodic finality) at these ambiguous boundary points where both a codeswitch-and-continue and a turn-yield are possible.
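+
+ ### Sketch: treating a codeswitch as a hold signal
+
+ A minimal sketch of points 1-3 above. Per-token language tags are assumed to come from an external language-ID step (hypothetical here), and the 0.5 discount on the end-of-turn probability is an illustrative assumption, not a tuned value.
+
+ ```python
+ # Sketch only: down-weight an end-of-turn probability when the most recent
+ # tokens show a French <-> Portuguese codeswitch.
+
+ def adjust_eot_probability(eot_prob: float,
+                            token_langs: list[str],
+                            switch_discount: float = 0.5) -> float:
+     """Discount the end-of-turn probability if the last few tokens mix languages.
+
+     A switch (e.g. Portuguese -> French) suggests the speaker is borrowing from
+     the L1 to keep going, so it should not be read as a turn-final signal.
+     """
+     recent = token_langs[-4:]        # look at the last few tokens only
+     switched = len(set(recent)) > 1  # more than one language tag present
+     return eot_prob * switch_discount if switched else eot_prob
+
+
+ if __name__ == "__main__":
+     # "... o documento, euh, le formulaire" -> pt, pt, fr, fr
+     print(adjust_eot_probability(0.8, ["pt", "pt", "fr", "fr"]))  # 0.4
+     print(adjust_eot_probability(0.8, ["pt", "pt", "pt", "pt"]))  # 0.8
+ ```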
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/11-beatty-martinez-2020-codeswitching-bilingual-toolkit.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:40ac9947f5813c936324bc2e266a1a574f1edb9d72a0d18f851bc3ddadf1d25b
3
+ size 1159649
03-finetune-pipecat-pt/references/papers/hesitation-l2-french/12-lingref-code-switching-bilinguals.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ca2aa2330e88de35a7f0254aefd3c024d14ce282f3f73ee28914f656f62f7820
3
+ size 874378