# Lightweight Phoneme-Based TTS
This repository implements a lightweight phoneme-based Text-to-Speech (TTS) system trained on paired text–audio data. The model follows a NanoSpeech-inspired Conv1D encoder–decoder architecture that reconstructs mel-spectrograms from phoneme inputs; the predicted spectrograms are then converted to waveforms with the Griffin–Lim vocoder.
This project is intended for educational and research purposes, focusing on simplicity, interpretability, and core TTS fundamentals.
## Model Description
- Architecture: Lightweight Conv1D encoder–decoder (NanoSpeech-inspired)
- Input: ARPAbet phoneme sequences
- Output: Predicted mel-spectrograms
- Vocoder: Griffin–Lim (spectrogram → waveform)
- Framework: PyTorch
- Checkpoint: `tts_model.pth`
The model learns basic phoneme-to-acoustic mappings and produces intelligible speech while remaining computationally lightweight.
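For orientation, here is a minimal sketch of what such a Conv1D encoder–decoder can look like. The class name `ConvTTS`, the layer sizes, and the omission of duration/alignment modeling are illustrative assumptions, not the exact architecture stored in `tts_model.pth`:

```python
import torch
import torch.nn as nn

class ConvTTS(nn.Module):
    """Illustrative Conv1D encoder-decoder mapping phoneme IDs to mels.

    Sizes are assumptions; duration modeling is omitted, so the input
    and output share the same time axis.
    """

    def __init__(self, n_phonemes: int, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, T) -> mel: (batch, n_mels, T)
        x = self.embed(phoneme_ids).transpose(1, 2)  # (batch, hidden, T)
        return self.decoder(self.encoder(x))
```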
## Dataset
- Source: Desivocal
- Metadata: `metadata.csv` containing text and corresponding audio filenames
- Audio: `.wav` files aligned with metadata entries
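A minimal loading sketch, assuming an LJSpeech-style pipe-separated layout (`filename|text`); the actual column format of `metadata.csv` may differ:

```python
import pandas as pd

# Assumed LJSpeech-style layout: "filename|text" per line.
meta = pd.read_csv("metadata.csv", sep="|", names=["filename", "text"])

# Pair each transcript with its audio file; appending ".wav" here is an
# assumption about how filenames are stored in the metadata.
pairs = [(f"{name}.wav", text) for name, text in zip(meta["filename"], meta["text"])]
```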
## Phonemization
Text is converted to ARPAbet phonemes using the g2p-en library.
### Example

```text
Input:    "I invite you to embark on a profound journey"
Phonemes: AY1 IH2 N V AY1 T Y UW1 T UW1 EH0 M B AA1 R K AA1 N AH0 P R OW0 F AW1 N D JH ER1 N IY0
```
## Training Procedure
- Optimizer: Adam
- Loss Function: L1 reconstruction loss
- Epochs: 20 (configurable)
- Training Notebook: `train_tts.ipynb`
The model is trained to minimize reconstruction error between predicted and ground-truth mel-spectrograms.
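A sketch of the loop, reusing the `ConvTTS` class from the model sketch above; the random tensors stand in for real phoneme/mel batches, and the vocabulary size and learning rate are assumed defaults:

```python
import torch
import torch.nn.functional as F

model = ConvTTS(n_phonemes=70)  # vocabulary size is illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batches: (phoneme IDs, target mels) with matching time axes.
batches = [(torch.randint(0, 70, (8, 120)), torch.randn(8, 80, 120))] * 4

for epoch in range(20):
    for phonemes, mel_target in batches:
        mel_pred = model(phonemes)              # (batch, n_mels, T)
        loss = F.l1_loss(mel_pred, mel_target)  # L1 reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```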
## Evaluation

Evaluation was performed on three held-out samples using L1 and L2 reconstruction losses between predicted and ground-truth mel-spectrograms.
| Sample | Dataset Index | L1 Loss | L2 Loss |
|---|---|---|---|
| 1 | 73 | 3.8363 | 24.6055 |
| 2 | 415 | 4.2695 | 30.8383 |
| 3 | 392 | 3.9636 | 26.2839 |
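The two metrics can be computed as below; the notebook does not state which reduction produced the reported values, so the mean used here is one plausible choice:

```python
import torch

def eval_losses(mel_pred, mel_gt):
    """Mean L1 and L2 losses between predicted and ground-truth mels."""
    l1 = torch.abs(mel_pred - mel_gt).mean()
    l2 = ((mel_pred - mel_gt) ** 2).mean()
    return l1.item(), l2.item()
```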
## Audio Samples
Ground-truth and synthesized audio samples are available in:
```
results/audio_samples/
├── gt_73.wav
├── pred_73.wav
├── gt_415.wav
├── pred_415.wav
├── gt_392.wav
├── pred_392.wav
└── ground_truth.wav
```
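The predicted files can be rendered with Griffin–Lim via `librosa`, sketched below; the random mel stands in for a model prediction, and the sample rate and STFT parameters are assumptions that must match the settings used to extract the training mels:

```python
import numpy as np
import librosa
import soundfile as sf

# Placeholder (n_mels, T) power mel-spectrogram standing in for a prediction.
mel_pred = np.abs(np.random.randn(80, 200))

# mel_to_audio inverts the mel filterbank and runs Griffin-Lim internally.
wav = librosa.feature.inverse.mel_to_audio(
    mel_pred, sr=22050, n_fft=1024, hop_length=256
)
sf.write("pred.wav", wav, 22050)
```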
## Observations
- The model captures basic phoneme-to-mel mappings
- Synthesized speech is intelligible
- Natural prosody and expressiveness are limited
- The relatively high L1/L2 losses reflect:
- small model size
- limited training data
- absence of a pretrained neural vocoder
## Future Improvements
- Replace Griffin–Lim with a pretrained neural vocoder (e.g., HiFi-GAN, WaveGlow)
- Train with more data and a larger model
- Add style and prosody modeling for more natural speech
- Convert the model into a fully deployable TTS pipeline
## Intended Use
This model is intended for:
- Learning phoneme-based TTS systems
- Research and experimentation
- Educational demonstrations of lightweight speech synthesis
It is not intended for production or commercial deployment.
## License
This project is released under the MIT License.