# Lightweight Phoneme-Based TTS
This repository implements a lightweight phoneme-based Text-to-Speech (TTS) system trained on paired text–audio data. The model follows a NanoSpeech-inspired Conv1D encoder–decoder architecture that reconstructs mel-spectrograms from phoneme inputs; the predicted spectrograms are then converted to waveforms with the Griffin–Lim vocoder.
This project is intended for educational and research purposes, focusing on simplicity, interpretability, and core TTS fundamentals.
## Model Description
- Architecture: Lightweight Conv1D encoder–decoder (NanoSpeech-inspired)
- Input: ARPAbet phoneme sequences
- Output: Predicted mel-spectrograms
- Vocoder: Griffin–Lim (spectrogram → waveform)
- Framework: PyTorch
- Checkpoint: `tts_model.pth`
The model learns basic phoneme-to-acoustic mappings and produces intelligible speech while remaining computationally lightweight.
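For orientation, here is a minimal sketch of what such a Conv1D encoder–decoder can look like. The class name `ConvTTS`, the layer sizes, and the omission of duration/alignment modeling are illustrative assumptions, not the exact architecture stored in `tts_model.pth`:

```python
import torch
import torch.nn as nn

class ConvTTS(nn.Module):
    """Illustrative Conv1D encoder-decoder mapping phoneme IDs to mels.

    Sizes are assumptions; duration modeling is omitted, so the input
    and output share the same time axis.
    """

    def __init__(self, n_phonemes: int, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, T) -> mel: (batch, n_mels, T)
        x = self.embed(phoneme_ids).transpose(1, 2)  # (batch, hidden, T)
        return self.decoder(self.encoder(x))
```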
## Dataset
- Source: Desivocal
- Metadata: `metadata.csv` containing text and corresponding audio filenames
- Audio: `.wav` files aligned with metadata entries
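A minimal loading sketch, assuming an LJSpeech-style pipe-separated layout (`filename|text`); the actual column format of `metadata.csv` may differ:

```python
import pandas as pd

# Assumed LJSpeech-style layout: "filename|text" per line.
meta = pd.read_csv("metadata.csv", sep="|", names=["filename", "text"])

# Pair each transcript with its audio file; appending ".wav" here is an
# assumption about how filenames are stored in the metadata.
pairs = [(f"{name}.wav", text) for name, text in zip(meta["filename"], meta["text"])]
```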
## Phonemization
Text is converted to ARPAbet phonemes using the g2p-en library.
### Example

```text
Input:    "I invite you to embark on a profound journey"
Phonemes: AY1 IH2 N V AY1 T Y UW1 T UW1 EH0 M B AA1 R K AA1 N AH0 P R OW0 F AW1 N D JH ER1 N IY0
```
## Training Procedure
- Optimizer: Adam
- Loss Function: L1 reconstruction loss
- Epochs: 20 (configurable)
- Training Notebook: `train_tts.ipynb`
The model is trained to minimize reconstruction error between predicted and ground-truth mel-spectrograms.
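A sketch of the loop, reusing the `ConvTTS` class from the model sketch above; the random tensors stand in for real phoneme/mel batches, and the vocabulary size and learning rate are assumed defaults:

```python
import torch
import torch.nn.functional as F

model = ConvTTS(n_phonemes=70)  # vocabulary size is illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batches: (phoneme IDs, target mels) with matching time axes.
batches = [(torch.randint(0, 70, (8, 120)), torch.randn(8, 80, 120))] * 4

for epoch in range(20):
    for phonemes, mel_target in batches:
        mel_pred = model(phonemes)              # (batch, n_mels, T)
        loss = F.l1_loss(mel_pred, mel_target)  # L1 reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```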
## Evaluation

Evaluation was performed on three held-out samples using L1 and L2 reconstruction losses between predicted and ground-truth mel-spectrograms.
| Sample | Dataset Index | L1 Loss | L2 Loss |
|---|---|---|---|
| 1 | 73 | 3.8363 | 24.6055 |
| 2 | 415 | 4.2695 | 30.8383 |
| 3 | 392 | 3.9636 | 26.2839 |
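The two metrics can be computed as below; the notebook does not state which reduction produced the reported values, so the mean used here is one plausible choice:

```python
import torch

def eval_losses(mel_pred, mel_gt):
    """Mean L1 and L2 losses between predicted and ground-truth mels."""
    l1 = torch.abs(mel_pred - mel_gt).mean()
    l2 = ((mel_pred - mel_gt) ** 2).mean()
    return l1.item(), l2.item()
```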
## Audio Samples
Ground-truth and synthesized audio samples are available in:
```
results/audio_samples/
├── gt_73.wav
├── pred_73.wav
├── gt_415.wav
├── pred_415.wav
├── gt_392.wav
├── pred_392.wav
└── ground_truth.wav
```
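The predicted files can be rendered with Griffin–Lim via `librosa`, sketched below; the random mel stands in for a model prediction, and the sample rate and STFT parameters are assumptions that must match the settings used to extract the training mels:

```python
import numpy as np
import librosa
import soundfile as sf

# Placeholder (n_mels, T) power mel-spectrogram standing in for a prediction.
mel_pred = np.abs(np.random.randn(80, 200))

# mel_to_audio inverts the mel filterbank and runs Griffin-Lim internally.
wav = librosa.feature.inverse.mel_to_audio(
    mel_pred, sr=22050, n_fft=1024, hop_length=256
)
sf.write("pred.wav", wav, 22050)
```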
## Observations
- The model captures basic phoneme-to-mel mappings
- Synthesized speech is intelligible
- Natural prosody and expressiveness are limited
- The relatively high L1/L2 losses reflect:
- small model size
- limited training data
- absence of a pretrained neural vocoder
## Future Improvements
- Replace Griffin–Lim with a pretrained neural vocoder (e.g., HiFi-GAN, WaveGlow)
- Train with more data and a larger model
- Add style and prosody modeling for more natural speech
- Convert the model into a fully deployable TTS pipeline
## Intended Use
This model is intended for:
- Learning phoneme-based TTS systems
- Research and experimentation
- Educational demonstrations of lightweight speech synthesis
It is not intended for production or commercial deployment.
## License
This project is released under the MIT License.