Mixer-TTS Hebrew/English IPA Fine-Tune

This repository contains a Hebrew/English mixed fine-tune of Mixer-TTS, a non-autoregressive compact text-to-speech model.

The model was fine-tuned from the PyTorch implementation at nipponjo/mixer-tts-pytorch on a mixed Hebrew and English IPA-phoneme dataset of about 24 hours of speech.

Files

synthetic_ft_80_best_vocos_int8.onnx - best validation checkpoint exported as a single embedded-Vocos ONNX model with dynamic int8 quantization.
synthetic_ft_80_latest_vocos_fp32.onnx - latest checkpoint exported as a single fp32 ONNX model with embedded Vocos vocoder. Outputs waveform audio directly.
synthetic_ft_80_latest_hifigan_fp32.onnx - latest checkpoint exported as a single fp32 ONNX model with embedded official LJ HiFi-GAN V1 vocoder. Outputs waveform audio directly.
best.pth - PyTorch training checkpoint with the best (lowest) validation loss.

Inference and Training Code

See the fork and ONNX wrapper here:

ONNX runtime wrapper and examples: thewh1teagle/mixer-tts-pytorch/tree/feature/hebrew-fine-tune/mixer-tts-onnx
Fine-tuning/export scripts: thewh1teagle/mixer-tts-pytorch/tree/feature/hebrew-fine-tune

Basic embedded-ONNX usage:

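from mixer_tts_onnx import MixerTTS

# Load the fp32 ONNX model with the embedded Vocos vocoder.
tts = MixerTTS("synthetic_ft_80_latest_vocos_fp32.onnx")

# The input below is already IPA, so phonemization is skipped.
tts.create(
    "sˈimu lˈev nosʔˈim jekaʁˈim.",
    is_phonemes=True,
    output_path="sample.wav",  # synthesized audio is written here
)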

The current ONNX wrapper treats the input as IPA phonemes when is_phonemes=True. Plain English text can be phonemized by the wrapper with eSpeak; Hebrew should be passed as IPA phonemes.
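
Because Hebrew has to arrive as IPA while plain English can be phonemized by the wrapper, a small guard in front of the call can catch raw Hebrew script early. The following is a minimal sketch, not part of the wrapper: the helper names contains_hebrew_script and choose_is_phonemes are hypothetical, and it assumes that is_phonemes=False is what triggers eSpeak phonemization.

```python
import re

# Unicode "Hebrew" block (letters, niqqud, cantillation marks).
HEBREW_CHARS = re.compile(r"[\u0590-\u05FF]")


def contains_hebrew_script(text):
    """Return True if the text contains any character from the Hebrew block."""
    return HEBREW_CHARS.search(text) is not None


def choose_is_phonemes(text, already_ipa=False):
    """Pick the is_phonemes flag for a request (hypothetical helper).

    Assumption: is_phonemes=False lets the wrapper phonemize English
    text with eSpeak, as the note above suggests.
    """
    if already_ipa:
        return True
    if contains_hebrew_script(text):
        # Raw Hebrew script should be converted to IPA before synthesis.
        raise ValueError("Hebrew input should be converted to IPA phonemes first")
    return False
```

The returned flag could then be passed as is_phonemes to tts.create, so that raw Hebrew script fails before synthesis instead of going through English phonemization.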

Training Notes

Dataset: mixed Hebrew and English speech, IPA phoneme labels.
Acoustic dimension: 80 mel bins.
Sample rate: 22,050 Hz.
Checkpoint selection: best.pth is chosen by validation loss; the "latest" files are exported from the most recent training checkpoint (last.pth) at upload time.
ONNX vocoder options: embedded Vocos or embedded HiFi-GAN.
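
When an embedded-vocoder ONNX model is run without the wrapper, the raw waveform still has to be saved at the 22,050 Hz rate listed above. The sketch below uses only the Python standard library. It assumes the model output is a mono sequence of floats in the range [-1.0, 1.0], which the notes above do not state; float_to_pcm16 and save_wav are illustrative names, not part of the repository.

```python
import array
import sys
import wave

SAMPLE_RATE = 22050  # sample rate from the training notes


def float_to_pcm16(samples):
    """Clip floats to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    pcm = array.array("h")
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    return pcm


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write a mono 16-bit PCM WAV file and return its duration in seconds."""
    pcm = float_to_pcm16(samples)
    num_samples = len(pcm)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumption)
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
    return num_samples / sample_rate
```

Dividing the sample count by 22,050 gives the clip duration in seconds, which is a quick way to check that the rate matches what the model was trained on.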

Citation

Mixer-TTS paper:

@article{Tatanov2021MixerTTSNF,
  title={Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings},
  author={Oktai Tatanov and Stanislav Beliaev and Boris Ginsburg},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={7482-7486},
}


arXiv: 2110.03584 (published Oct 7, 2021)