hidude562 committed · Commit 0112fb6 · verified · Parent(s): fb18897

Upload README.md with huggingface_hub

Files changed (1): README.md added (+86 lines)
# Prosody Predictor

A small (682K-parameter) convolutional model that predicts pitch (F0) and volume (RMS) contours from text at 100 ms resolution.

## Model Architecture

```
Text -> CharEncoder (4x Conv1d) -> DurationPredictor (2x Conv1d, detached)
     -> LengthRegulator (repeat by durations)
     -> FrameDecoder (3x Conv1d) -> [F0, RMS]
```

- **CharEncoder**: char embedding (51 -> 128) + sinusoidal positional encoding + 4x Conv1d(128, k=5) + ReLU + LayerNorm + Dropout
- **DurationPredictor**: detached encoder input + 2x Conv1d(128, k=3) + ReLU + LayerNorm + Dropout -> Linear(1)
- **LengthRegulator**: repeats each character's encoder output by its predicted duration
- **FrameDecoder**: 3x Conv1d(128, k=5) + ReLU + LayerNorm + Dropout -> Linear(2) for [F0, RMS]

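The LengthRegulator step is just a per-row repeat. A minimal sketch (illustrative only, not the repository's actual implementation; shown in numpy, though the model itself uses the equivalent PyTorch op):

```python
import numpy as np

def length_regulate(encoder_out: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each character's encoder vector by its predicted frame count.

    encoder_out: (n_chars, d_model) per-character features
    durations:   (n_chars,) integer frame counts
    """
    return np.repeat(encoder_out, durations, axis=0)

# Three characters with durations [2, 1, 3] expand to 6 frames.
enc = np.arange(6, dtype=np.float32).reshape(3, 2)   # (3 chars, d_model=2)
frames = length_regulate(enc, np.array([2, 1, 3]))
print(frames.shape)  # (6, 2)
```

In PyTorch the same operation is `torch.repeat_interleave(encoder_out, durations, dim=0)`.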
## Quickstart

```python
import torch

from model_prosody import ProsodyPredictor
from infer_prosody import predict_prosody

ckpt = torch.load("final_model.pt", map_location="cpu", weights_only=False)
model = ProsodyPredictor(vocab_size=ckpt["vocab_size"], d_model=128, dropout=0.0)
model.load_state_dict(ckpt["model"])
model.eval()

result = predict_prosody("Hello, I am Kobi AI", model, ckpt["norm_stats"])
# result["f0_hz"]      - pitch in Hz per 100 ms frame
# result["rms"]        - volume per 100 ms frame
# result["duration_s"] - total duration in seconds
```

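`predict_prosody` takes the checkpoint's `norm_stats` because the model was trained on z-score-normalized targets, so its raw outputs must be mapped back to physical units. A minimal sketch of that inversion, assuming `norm_stats` holds a per-feature mean and standard deviation (the exact key layout here is hypothetical):

```python
import numpy as np

def denormalize(pred: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Invert z-score normalization: z = (x - mean) / std  =>  x = z * std + mean."""
    return pred * std + mean

# A normalized prediction of 0.0 maps back to the training-set mean.
f0_hz = denormalize(np.array([0.0, 1.0, -1.0]), mean=150.0, std=40.0)
print(f0_hz)  # [150. 190. 110.]
```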
## Synthesize as Sine Wave

```python
import numpy as np
import soundfile as sf
from scipy.interpolate import CubicSpline

f0 = result["f0_hz"]
rms = result["rms"]
sr = 24000
frame_dur = 0.1
n_frames = len(f0)
total_samples = int(n_frames * frame_dur * sr)

# Smooth interpolation between frame centers
frame_times = (np.arange(n_frames) + 0.5) * frame_dur
sample_times = np.arange(total_samples) / sr
f0_smooth = np.clip(CubicSpline(frame_times, f0, bc_type='clamped')(sample_times), 50, 300)
rms_smooth = np.clip(CubicSpline(frame_times, rms, bc_type='clamped')(sample_times), 0, None)

# Generate with continuous phase
phase = np.cumsum(2 * np.pi * f0_smooth / sr)
audio = (rms_smooth * np.sin(phase)).astype(np.float32)
audio = audio / (np.abs(audio).max() + 1e-8) * 0.8
sf.write("output.wav", audio, sr)
```

## Files

| File | Description |
|------|-------------|
| `final_model.pt` | Fully trained model (200 epochs, 8000 steps) |
| `best_model.pt` | Best validation checkpoint (val loss 1.078) |
| `model_prosody.py` | Model definition (`ProsodyPredictor`) |
| `infer_prosody.py` | Inference helper (`predict_prosody()`) |
| `extract_features.py` | Feature extraction from WAV + text (vocab, tokenizer) |

## Training Details

- **Data**: 2000 TTS WAV samples (24 kHz mono) with text transcripts
- **Features**: F0 via librosa `pyin` (50-300 Hz) and RMS, both z-score normalized
- **Split**: 95/5 train/val, seed=42
- **Optimizer**: AdamW, lr 1e-3 -> 1e-5 with cosine annealing, 200-step warmup
- **Loss**: `MSE(pitch, voiced only) + MSE(volume, all frames) + 0.1 * MSE(log duration)`
- **Batch size**: 48; **Epochs**: 200; **Grad clip**: 1.0

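The composite loss, with its voiced-only pitch mask, can be sketched as follows (a numpy illustration of the formula above, not the actual training code; the voiced mask is assumed to flag frames where `pyin` detected a pitch):

```python
import numpy as np

def prosody_loss(f0_pred, f0_true, voiced, rms_pred, rms_true, logdur_pred, logdur_true):
    """MSE(pitch, voiced frames only) + MSE(volume, all frames) + 0.1 * MSE(log duration)."""
    pitch_mse = np.mean((f0_pred[voiced] - f0_true[voiced]) ** 2)   # unvoiced frames excluded
    volume_mse = np.mean((rms_pred - rms_true) ** 2)                # every frame counts
    duration_mse = np.mean((logdur_pred - logdur_true) ** 2)        # per-char log durations
    return pitch_mse + volume_mse + 0.1 * duration_mse

voiced = np.array([True, True, False])
loss = prosody_loss(
    np.array([1.0, 2.0, 9.0]), np.array([1.0, 1.0, 0.0]), voiced,
    np.array([0.5, 0.5, 0.5]), np.array([0.5, 0.5, 0.5]),
    np.array([2.0]), np.array([1.0]),
)
print(loss)  # 0.6 = pitch 0.5 (unvoiced frame ignored) + volume 0.0 + 0.1 * 1.0
```

Masking the pitch term keeps the large, meaningless F0 errors on unvoiced frames from dominating training.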
## Limitations

- Duration prediction uses proportional alignment (frames / chars), not forced alignment, so the model learns positional averages rather than phoneme-specific timing.
- Output is deterministic: there is no sampling or variance prediction, so the same text always produces the same contour.
- Trained on a single TTS voice, so prosody patterns reflect that speaker's style.
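The proportional-alignment limitation means training targets were derived by spreading each clip's frame count evenly across its characters, roughly like this (an illustrative sketch, not the repository's extraction code):

```python
def proportional_durations(n_frames: int, n_chars: int) -> list:
    """Split n_frames evenly across n_chars, spreading the remainder
    so the durations always sum exactly to n_frames."""
    base, rem = divmod(n_frames, n_chars)
    return [base + (1 if i < rem else 0) for i in range(n_chars)]

durs = proportional_durations(n_frames=23, n_chars=5)
print(durs, sum(durs))  # [5, 5, 5, 4, 4] 23
```

Every character gets (nearly) the same target duration, which is why the model cannot learn phoneme-specific timing the way a forced-aligned system would.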