# Prosody Predictor

A small (682K-parameter) convolutional model that predicts pitch (F0) and volume (RMS) contours from text at 100ms resolution.

## Model Architecture

```
Text -> CharEncoder (4x Conv1d) -> DurationPredictor (2x Conv1d, detached)
     -> LengthRegulator (repeat by durations)
     -> FrameDecoder (3x Conv1d) -> [F0, RMS]
```

- **CharEncoder**: Char embedding (51 -> 128) + sinusoidal positional encoding + 4x Conv1d(128, k=5) + ReLU + LayerNorm + Dropout
- **DurationPredictor**: Takes the detached encoder output, 2x Conv1d(128, k=3) + ReLU + LayerNorm + Dropout -> Linear(1)
- **LengthRegulator**: Repeats each character's encoder output by its predicted duration
- **FrameDecoder**: 3x Conv1d(128, k=5) + ReLU + LayerNorm + Dropout -> Linear(2) for [F0, RMS]

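The LengthRegulator is the only non-standard component above; it can be sketched in a few lines. `length_regulate` is a hypothetical helper (not part of `model_prosody.py`), assuming durations have already been rounded to positive integer frame counts:

```python
import torch

def length_regulate(encoder_out: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each character's encoding by its predicted frame count.

    encoder_out: (n_chars, d_model) character-level encodings
    durations:   (n_chars,) integer frame counts
    returns:     (n_frames, d_model) where n_frames = durations.sum()
    """
    return torch.repeat_interleave(encoder_out, durations, dim=0)

enc = torch.randn(3, 128)      # encodings for 3 characters
dur = torch.tensor([2, 1, 3])  # predicted frames per character
frames = length_regulate(enc, dur)
print(frames.shape)            # torch.Size([6, 128])
```
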
## Quickstart

```python
import torch
from model_prosody import ProsodyPredictor
from infer_prosody import predict_prosody

# weights_only=False: the checkpoint also stores vocab size and normalization stats
ckpt = torch.load("final_model.pt", map_location="cpu", weights_only=False)
model = ProsodyPredictor(vocab_size=ckpt["vocab_size"], d_model=128, dropout=0.0)
model.load_state_dict(ckpt["model"])
model.eval()

result = predict_prosody("Hello, I am Kobi AI", model, ckpt["norm_stats"])
# result["f0_hz"]      - pitch in Hz per 100ms frame
# result["rms"]        - volume per 100ms frame
# result["duration_s"] - total duration in seconds
```

## Synthesize as Sine Wave

```python
import numpy as np
import soundfile as sf
from scipy.interpolate import CubicSpline

f0 = result["f0_hz"]
rms = result["rms"]
sr = 24000
frame_dur = 0.1  # 100ms frames
n_frames = len(f0)
total_samples = int(n_frames * frame_dur * sr)

# Smooth interpolation between frame centers
frame_times = (np.arange(n_frames) + 0.5) * frame_dur
sample_times = np.arange(total_samples) / sr
f0_smooth = np.clip(CubicSpline(frame_times, f0, bc_type='clamped')(sample_times), 50, 300)
rms_smooth = np.clip(CubicSpline(frame_times, rms, bc_type='clamped')(sample_times), 0, None)

# Integrate instantaneous frequency for click-free, continuous phase
phase = np.cumsum(2 * np.pi * f0_smooth / sr)
audio = (rms_smooth * np.sin(phase)).astype(np.float32)
audio = audio / (np.abs(audio).max() + 1e-8) * 0.8  # peak-normalize to 0.8
sf.write("output.wav", audio, sr)
```

## Files

| File | Description |
|------|-------------|
| `final_model.pt` | Fully trained model (200 epochs, 8000 steps) |
| `best_model.pt` | Best validation checkpoint (val loss 1.078) |
| `model_prosody.py` | Model definition (`ProsodyPredictor`) |
| `infer_prosody.py` | Inference helper (`predict_prosody()`) |
| `extract_features.py` | Feature extraction from WAV + text (vocab, tokenizer) |

## Training Details

- **Data**: 2000 TTS WAV samples (24kHz mono) with text transcripts
- **Features**: F0 via librosa `pyin` (50-300 Hz) and RMS energy, both z-score normalized
- **Split**: 95/5 train/val, seed=42
- **Optimizer**: AdamW, lr 1e-3 -> 1e-5 via cosine annealing with 200-step warmup
- **Loss**: `MSE(pitch, voiced frames only) + MSE(volume, all frames) + 0.1 * MSE(log duration)`
- **Batch size**: 48, **Epochs**: 200, **Grad clip**: 1.0

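The loss above combines a masked pitch term with two plain MSE terms. A sketch with hypothetical tensor names (assuming `voiced_mask` marks frames where pyin detected voicing, and at least one frame is voiced):

```python
import torch
import torch.nn.functional as F

def prosody_loss(pred_f0, pred_rms, pred_log_dur,
                 tgt_f0, tgt_rms, tgt_log_dur, voiced_mask):
    """Pitch MSE on voiced frames only, volume MSE on all frames,
    plus a lightly weighted (0.1) log-duration MSE."""
    pitch_loss = F.mse_loss(pred_f0[voiced_mask], tgt_f0[voiced_mask])
    volume_loss = F.mse_loss(pred_rms, tgt_rms)
    dur_loss = F.mse_loss(pred_log_dur, tgt_log_dur)
    return pitch_loss + volume_loss + 0.1 * dur_loss
```

Masking the pitch term keeps unvoiced frames (where pyin reports no F0) from dragging predictions toward an arbitrary fill value.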
## Limitations

- Duration prediction uses proportional alignment (frames / chars), not forced alignment. The model learns positional averages rather than phoneme-specific timing.
- Deterministic output -- no sampling or variance prediction. The same text always produces the same contour.
- Trained on a single TTS voice, so prosody patterns reflect that speaker's style.
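
The proportional alignment mentioned above amounts to splitting the frame count evenly across characters. A sketch (hypothetical helper, not the exact `extract_features.py` logic):

```python
import numpy as np

def proportional_durations(n_frames: int, n_chars: int) -> np.ndarray:
    """Give each character an equal share of n_frames, rounding at
    cumulative boundaries so the integer targets still sum to n_frames."""
    bounds = np.linspace(0, n_frames, n_chars + 1).round().astype(int)
    return np.diff(bounds)

print(proportional_durations(10, 4))  # [2 3 3 2]
```

Because every character in an utterance gets the same target, the model cannot learn that, e.g., vowels last longer than stops; forced alignment would be needed for that.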