# Prosody Predictor

A small (682K-parameter) convolutional model that predicts pitch (F0) and volume (RMS) contours from text at 100ms resolution.

## Model Architecture

```
Text -> CharEncoder (4x Conv1d) -> DurationPredictor (2x Conv1d, detached)
     -> LengthRegulator (repeat by durations)
     -> FrameDecoder (3x Conv1d) -> [F0, RMS]
```

- **CharEncoder**: Char embedding (51 -> 128) + sinusoidal positional encoding + 4x Conv1d(128, k=5) + ReLU + LayerNorm + Dropout
- **DurationPredictor**: Takes the detached encoder output, 2x Conv1d(128, k=3) + ReLU + LayerNorm + Dropout -> Linear(1)
- **LengthRegulator**: Repeats each character's encoder output by its predicted duration
- **FrameDecoder**: 3x Conv1d(128, k=5) + ReLU + LayerNorm + Dropout -> Linear(2) for [F0, RMS]

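The LengthRegulator is the only non-standard component above; it can be sketched in a few lines. `length_regulate` is a hypothetical helper (not part of `model_prosody.py`), assuming durations have already been rounded to positive integer frame counts:

```python
import torch

def length_regulate(encoder_out: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each character's encoding by its predicted frame count.

    encoder_out: (n_chars, d_model) character-level encodings
    durations:   (n_chars,) integer frame counts
    returns:     (n_frames, d_model) where n_frames = durations.sum()
    """
    return torch.repeat_interleave(encoder_out, durations, dim=0)

enc = torch.randn(3, 128)      # encodings for 3 characters
dur = torch.tensor([2, 1, 3])  # predicted frames per character
frames = length_regulate(enc, dur)
print(frames.shape)            # torch.Size([6, 128])
```
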
## Quickstart

```python
import torch
from model_prosody import ProsodyPredictor
from infer_prosody import predict_prosody

# weights_only=False: the checkpoint also stores vocab size and normalization stats
ckpt = torch.load("final_model.pt", map_location="cpu", weights_only=False)
model = ProsodyPredictor(vocab_size=ckpt["vocab_size"], d_model=128, dropout=0.0)
model.load_state_dict(ckpt["model"])
model.eval()

result = predict_prosody("Hello, I am Kobi AI", model, ckpt["norm_stats"])
# result["f0_hz"]      - pitch in Hz per 100ms frame
# result["rms"]        - volume per 100ms frame
# result["duration_s"] - total duration in seconds
```

## Synthesize as Sine Wave

```python
import numpy as np
import soundfile as sf
from scipy.interpolate import CubicSpline

f0 = result["f0_hz"]
rms = result["rms"]
sr = 24000
frame_dur = 0.1  # 100ms frames
n_frames = len(f0)
total_samples = int(n_frames * frame_dur * sr)

# Smooth interpolation between frame centers
frame_times = (np.arange(n_frames) + 0.5) * frame_dur
sample_times = np.arange(total_samples) / sr
f0_smooth = np.clip(CubicSpline(frame_times, f0, bc_type='clamped')(sample_times), 50, 300)
rms_smooth = np.clip(CubicSpline(frame_times, rms, bc_type='clamped')(sample_times), 0, None)

# Integrate instantaneous frequency for click-free, continuous phase
phase = np.cumsum(2 * np.pi * f0_smooth / sr)
audio = (rms_smooth * np.sin(phase)).astype(np.float32)
audio = audio / (np.abs(audio).max() + 1e-8) * 0.8  # peak-normalize to 0.8
sf.write("output.wav", audio, sr)
```

## Files

| File | Description |
|------|-------------|
| `final_model.pt` | Fully trained model (200 epochs, 8000 steps) |
| `best_model.pt` | Best validation checkpoint (val loss 1.078) |
| `model_prosody.py` | Model definition (`ProsodyPredictor`) |
| `infer_prosody.py` | Inference helper (`predict_prosody()`) |
| `extract_features.py` | Feature extraction from WAV + text (vocab, tokenizer) |

## Training Details

- **Data**: 2000 TTS WAV samples (24kHz mono) with text transcripts
- **Features**: F0 via librosa `pyin` (50-300 Hz) and RMS energy, both z-score normalized
- **Split**: 95/5 train/val, seed=42
- **Optimizer**: AdamW, lr 1e-3 -> 1e-5 via cosine annealing with 200-step warmup
- **Loss**: `MSE(pitch, voiced frames only) + MSE(volume, all frames) + 0.1 * MSE(log duration)`
- **Batch size**: 48, **Epochs**: 200, **Grad clip**: 1.0

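The loss above combines a masked pitch term with two plain MSE terms. A sketch with hypothetical tensor names (assuming `voiced_mask` marks frames where pyin detected voicing, and at least one frame is voiced):

```python
import torch
import torch.nn.functional as F

def prosody_loss(pred_f0, pred_rms, pred_log_dur,
                 tgt_f0, tgt_rms, tgt_log_dur, voiced_mask):
    """Pitch MSE on voiced frames only, volume MSE on all frames,
    plus a lightly weighted (0.1) log-duration MSE."""
    pitch_loss = F.mse_loss(pred_f0[voiced_mask], tgt_f0[voiced_mask])
    volume_loss = F.mse_loss(pred_rms, tgt_rms)
    dur_loss = F.mse_loss(pred_log_dur, tgt_log_dur)
    return pitch_loss + volume_loss + 0.1 * dur_loss
```

Masking the pitch term keeps unvoiced frames (where pyin reports no F0) from dragging predictions toward an arbitrary fill value.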
## Limitations

- Duration prediction uses proportional alignment (frames / chars), not forced alignment. The model learns positional averages rather than phoneme-specific timing.
- Deterministic output -- no sampling or variance prediction. The same text always produces the same contour.
- Trained on a single TTS voice, so prosody patterns reflect that speaker's style.
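
The proportional alignment mentioned above amounts to splitting the frame count evenly across characters. A sketch (hypothetical helper, not the exact `extract_features.py` logic):

```python
import numpy as np

def proportional_durations(n_frames: int, n_chars: int) -> np.ndarray:
    """Give each character an equal share of n_frames, rounding at
    cumulative boundaries so the integer targets still sum to n_frames."""
    bounds = np.linspace(0, n_frames, n_chars + 1).round().astype(int)
    return np.diff(bounds)

print(proportional_durations(10, 4))  # [2 3 3 2]
```

Because every character in an utterance gets the same target, the model cannot learn that, e.g., vowels last longer than stops; forced alignment would be needed for that.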