# Prosody Predictor

A small (682K-parameter) convolutional model that predicts pitch (F0) and volume (RMS) contours from text at 100 ms resolution.

## Model Architecture

```
Text -> CharEncoder (4x Conv1d)
     -> DurationPredictor (2x Conv1d, detached)
     -> LengthRegulator (repeat by durations)
     -> FrameDecoder (3x Conv1d)
     -> [F0, RMS]
```

- **CharEncoder**: Char embedding (51 -> 128) + sinusoidal positional encoding + 4x Conv1d(128, k=5) + ReLU + LayerNorm + Dropout
- **DurationPredictor**: Detached encoder input + 2x Conv1d(128, k=3) + ReLU + LN + Drop -> linear(1)
- **LengthRegulator**: Repeats encoder output per-character by predicted durations
- **FrameDecoder**: 3x Conv1d(128, k=5) + ReLU + LN + Drop -> linear(2) for [F0, RMS]

## Quickstart

```python
import torch

from model_prosody import ProsodyPredictor
from infer_prosody import predict_prosody

ckpt = torch.load("final_model.pt", map_location="cpu", weights_only=False)
model = ProsodyPredictor(vocab_size=ckpt["vocab_size"], d_model=128, dropout=0.0)
model.load_state_dict(ckpt["model"])
model.eval()

result = predict_prosody("Hello, I am Kobi AI", model, ckpt["norm_stats"])
# result["f0_hz"]      - pitch in Hz per 100 ms frame
# result["rms"]        - volume per 100 ms frame
# result["duration_s"] - total duration in seconds
```

## Synthesize as Sine Wave

```python
import numpy as np
import soundfile as sf
from scipy.interpolate import CubicSpline

f0 = result["f0_hz"]
rms = result["rms"]
sr = 24000
frame_dur = 0.1
n_frames = len(f0)
total_samples = int(n_frames * frame_dur * sr)

# Smooth interpolation between frame centers
frame_times = (np.arange(n_frames) + 0.5) * frame_dur
sample_times = np.arange(total_samples) / sr
f0_smooth = np.clip(CubicSpline(frame_times, f0, bc_type='clamped')(sample_times), 50, 300)
rms_smooth = np.clip(CubicSpline(frame_times, rms, bc_type='clamped')(sample_times), 0, None)

# Generate with continuous phase
phase = np.cumsum(2 * np.pi * f0_smooth / sr)
audio = (rms_smooth * np.sin(phase)).astype(np.float32)
audio \
    = audio / (np.abs(audio).max() + 1e-8) * 0.8
sf.write("output.wav", audio, sr)
```

## Files

| File | Description |
|------|-------------|
| `final_model.pt` | Fully trained model (200 epochs, 8000 steps) |
| `best_model.pt` | Best validation checkpoint (val loss 1.078) |
| `model_prosody.py` | Model definition (`ProsodyPredictor`) |
| `infer_prosody.py` | Inference helper (`predict_prosody()`) |
| `extract_features.py` | Feature extraction from WAV + text (vocab, tokenizer) |

## Training Details

- **Data**: 2000 TTS WAV samples (24 kHz mono) with text transcripts
- **Features**: F0 via librosa `pyin` (50-300 Hz) and RMS energy, both z-score normalized
- **Split**: 95/5 train/val, seed=42
- **Optimizer**: AdamW, lr 1e-3 -> 1e-5 with cosine annealing, 200-step warmup
- **Loss**: `MSE(pitch, voiced frames only) + MSE(volume, all frames) + 0.1 * MSE(log duration)`
- **Batch size**: 48, **Epochs**: 200, **Grad clip**: 1.0

## Limitations

- Duration prediction uses proportional alignment (frames / chars), not forced alignment, so the model learns positional averages rather than phoneme-specific timing.
- Output is deterministic: there is no sampling or variance prediction, so the same text always produces the same contour.
- Trained on a single TTS voice, so prosody patterns reflect that speaker's style.
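The sinusoidal positional encoding used in the CharEncoder follows the standard Transformer-style sin/cos formulation. The sketch below is illustrative only; the repo's actual implementation in `model_prosody.py` may differ in details:

```python
import math
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Standard positional encoding: PE[p, 2i] = sin(p / 10000^(2i/d))."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims: cosine
    return pe
```

The encoding is added to the character embeddings so the convolutions can distinguish otherwise-identical characters at different positions.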
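The LengthRegulator step from the architecture section can be sketched with `torch.repeat_interleave`. The tensors and duration values below are made-up illustrative inputs, not the model's actual outputs:

```python
import torch

# Hypothetical inputs: a 5-character encoder output (d_model=128) and the
# integer frame counts a duration predictor might assign to each character.
enc = torch.randn(5, 128)
durations = torch.tensor([2, 1, 3, 1, 2])

# Repeat each character's encoding along the time axis by its duration.
frames = torch.repeat_interleave(enc, durations, dim=0)
print(frames.shape)  # torch.Size([9, 128]) -- one row per 100 ms frame
```

The frame decoder then runs over this expanded sequence, so the output length in frames is simply the sum of the predicted per-character durations.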
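The composite loss from Training Details can be sketched as below. `prosody_loss` and its argument names are hypothetical stand-ins for the training script's actual code:

```python
import torch
import torch.nn.functional as F

def prosody_loss(pred, target, voiced_mask, log_dur_pred, log_dur_true):
    """pred/target: [B, T, 2] with channels (F0, RMS); voiced_mask: [B, T] bool."""
    # Pitch error only on voiced frames, where F0 is actually defined.
    f0_loss = F.mse_loss(pred[..., 0][voiced_mask], target[..., 0][voiced_mask])
    # Volume error over all frames.
    rms_loss = F.mse_loss(pred[..., 1], target[..., 1])
    # Per-character log-duration error, down-weighted by 0.1.
    dur_loss = F.mse_loss(log_dur_pred, log_dur_true)
    return f0_loss + rms_loss + 0.1 * dur_loss
```

Masking the pitch term keeps unvoiced frames (where `pyin` returns no F0) from dragging the pitch target toward an arbitrary fill value.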