Lightweight Phoneme-Based TTS

This repository implements a lightweight phoneme-based Text-to-Speech (TTS) system trained on paired text–audio data. The model follows a NanoSpeech-inspired Conv1D encoder–decoder architecture that predicts mel-spectrograms from phoneme inputs; the predicted spectrograms are then converted to waveforms with the Griffin–Lim algorithm.

This project is intended for educational and research purposes, focusing on simplicity, interpretability, and core TTS fundamentals.


Model Description

  • Architecture: Lightweight Conv1D encoder–decoder (NanoSpeech-inspired)
  • Input: ARPAbet phoneme sequences
  • Output: Predicted mel-spectrograms
  • Vocoder: Griffin–Lim (spectrogram → waveform)
  • Framework: PyTorch
  • Checkpoint: tts_model.pth

The model learns basic phoneme-to-acoustic mappings and produces intelligible speech while remaining computationally lightweight.
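The exact architecture lives in the training notebook and checkpoint; as a rough, hypothetical sketch of what a lightweight Conv1D encoder–decoder of this kind can look like in PyTorch (layer widths, kernel sizes, 80 mel bins, and the phoneme vocabulary size are illustrative assumptions, not the checkpoint's actual configuration):

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Hypothetical lightweight Conv1D encoder-decoder:
    phoneme IDs -> embeddings -> Conv1D stack -> mel-spectrogram."""

    def __init__(self, n_phonemes=80, emb_dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, emb_dim)
        self.encoder = nn.Sequential(
            nn.Conv1d(emb_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.decoder = nn.Conv1d(256, n_mels, kernel_size=5, padding=2)

    def forward(self, phoneme_ids):           # (batch, seq_len)
        x = self.embed(phoneme_ids)           # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                 # Conv1d expects (batch, channels, length)
        return self.decoder(self.encoder(x))  # (batch, n_mels, seq_len)

model = TinyTTS()
mel = model(torch.randint(0, 80, (1, 32)))
print(mel.shape)  # torch.Size([1, 80, 32])
```

A real phoneme-to-mel model also needs some form of length expansion (a duration predictor or upsampling), since mel frames outnumber phonemes; that detail is omitted from this sketch.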


Dataset

  • Source: Desivocal
  • Metadata: metadata.csv containing text and corresponding audio filenames
  • Audio: .wav files aligned with metadata entries

Phonemization

Text is converted to ARPAbet phonemes using the g2p-en library.

Example

Input:  "I invite you to embark on a profound journey"
Output: AY1 IH2 N V AY1 T Y UW1 T UW1 EH0 M B AA1 R K AA1 N AH0 P R OW0 F AW1 N D JH ER1 N IY0
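Before the phoneme strings reach the encoder, they must be mapped to integer IDs. A minimal sketch of that step, using the example sequence above (the vocabulary construction here is illustrative, not the repository's actual preprocessing):

```python
# Encode an ARPAbet phoneme sequence (as produced by g2p-en)
# into integer IDs (illustrative vocabulary only).
phonemes = ("AY1 IH2 N V AY1 T Y UW1 T UW1 EH0 M B AA1 R K "
            "AA1 N AH0 P R OW0 F AW1 N D JH ER1 N IY0").split()

# Reserve 0 for padding; assign IDs in order of first appearance.
vocab = {"<pad>": 0}
for p in phonemes:
    vocab.setdefault(p, len(vocab))

ids = [vocab[p] for p in phonemes]
print(ids[:6])  # [1, 2, 3, 4, 1, 5] -- note AY1 repeats as ID 1
```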


Training Procedure

  • Optimizer: Adam
  • Loss Function: L1 reconstruction loss
  • Epochs: 20 (configurable)
  • Training Notebook: train_tts.ipynb

The model is trained to minimize reconstruction error between predicted and ground-truth mel-spectrograms.
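The objective above can be sketched as a standard PyTorch loop (a toy stand-in model with random data; the notebook's actual loop will differ in details such as batching and the real encoder–decoder):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in: a linear map from "encoded phonemes" to mel frames.
model = nn.Linear(16, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.L1Loss()  # L1 reconstruction loss, as in training

features = torch.randn(8, 16)    # stand-in encoder outputs
target_mel = torch.randn(8, 80)  # stand-in ground-truth mel frames

losses = []
for epoch in range(20):  # 20 epochs, matching the default above
    optimizer.zero_grad()
    loss = criterion(model(features), target_mel)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```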


Evaluation

Evaluation was performed on 3 held-out samples using L1 and L2 distance metrics between predicted and ground-truth mel-spectrograms.

Sample   Dataset Index   L1 Loss   L2 Loss
1        73              3.8363    24.6055
2        415             4.2695    30.8383
3        392             3.9636    26.2839
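The L1 and L2 numbers above are distances between predicted and ground-truth mel-spectrograms. A sketch of how such metrics can be computed with NumPy (the exact reduction used in the notebook, e.g. sum versus mean over frames, is an assumption):

```python
import numpy as np

def spectrogram_errors(pred, target):
    """Mean absolute (L1) and mean squared (L2) error between
    two mel-spectrograms of the same shape."""
    diff = pred - target
    return np.abs(diff).mean(), (diff ** 2).mean()

# Tiny worked example on 2x2 "spectrograms":
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.5, 2.0], [2.0, 4.0]])
l1, l2 = spectrogram_errors(pred, target)
print(l1, l2)  # 0.375 0.3125
```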

Audio Samples

Ground-truth and synthesized audio samples are available in:

results/audio_samples/
├── gt_73.wav
├── pred_73.wav
├── gt_415.wav
├── pred_415.wav
├── gt_392.wav
├── pred_392.wav
└── ground_truth.wav
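The pred_*.wav files are produced by Griffin–Lim phase recovery. The core iteration — alternate between the time domain and the frequency domain, keeping the predicted magnitude and only updating the phase — can be sketched with SciPy (this is an illustrative re-implementation, not the repository's vocoder code):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=32, nperseg=256):
    """Estimate a waveform whose STFT magnitude matches `magnitude`
    by iteratively re-estimating phase (Griffin-Lim)."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random init
    for _ in range(n_iter):
        _, audio = istft(magnitude * phase, nperseg=nperseg)   # to time domain
        _, _, spec = stft(audio, nperseg=nperseg)              # back to STFT
        phase = np.exp(1j * np.angle(spec[:, : magnitude.shape[1]]))
    _, audio = istft(magnitude * phase, nperseg=nperseg)
    return audio

# Round trip: take the magnitude of a real signal, discard its phase,
# and reconstruct a waveform from magnitude alone.
t = np.linspace(0, 1, 4000, endpoint=False)
signal = np.sin(2 * np.pi * 220 * t)
_, _, spec = stft(signal, nperseg=256)
reconstructed = griffin_lim(np.abs(spec))
```

In the full pipeline the model predicts a mel-spectrogram, which must first be mapped back to a linear-frequency magnitude spectrogram before this phase-recovery step; that inversion is omitted here.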


Observations

  • The model captures basic phoneme-to-mel mappings
  • Synthesized speech is intelligible
  • Natural prosody and expressiveness are limited
  • Higher L1/L2 errors reflect:
    • small model size
    • limited training data
    • absence of a pretrained neural vocoder

Future Improvements

  • Replace Griffin–Lim with a pretrained neural vocoder (e.g., HiFi-GAN, WaveGlow)
  • Train with more data and a larger model
  • Add style and prosody modeling for more natural speech
  • Convert the model into a fully deployable TTS pipeline

Intended Use

This model is intended for:

  • Learning phoneme-based TTS systems
  • Research and experimentation
  • Educational demonstrations of lightweight speech synthesis

It is not intended for production or commercial deployment.


License

This project is released under the MIT License.
