Paper: Mixer-TTS: non-autoregressive, fast and compact text-to-speech model conditioned on language model embeddings (arXiv:2110.03584)
This repository contains a Hebrew/English mixed fine-tune of Mixer-TTS, a non-autoregressive compact text-to-speech model.
The model was fine-tuned from the PyTorch implementation at nipponjo/mixer-tts-pytorch on a mixed Hebrew and English IPA-phoneme dataset of about 24 hours of speech.
synthetic_ft_80_best_vocos_int8.onnx - best validation checkpoint exported as a single embedded-Vocos ONNX model with dynamic int8 quantization.
synthetic_ft_80_latest_vocos_fp32.onnx - latest checkpoint exported as a single fp32 ONNX model with embedded Vocos vocoder. Outputs waveform audio directly.
synthetic_ft_80_latest_hifigan_fp32.onnx - latest checkpoint exported as a single fp32 ONNX model with embedded official LJ HiFi-GAN V1 vocoder. Outputs waveform audio directly.
best.pth - PyTorch training checkpoint with the best validation loss.
See the fork and ONNX wrapper here:
Basic embedded-ONNX usage:
from mixer_tts_onnx import MixerTTS
tts = MixerTTS("synthetic_ft_80_latest_vocos_fp32.onnx")
tts.create(
    "sˈimu lˈev nosʔˈim jekaʁˈim.",
    is_phonemes=True,
    output_path="sample.wav",
)
The current ONNX wrapper assumes IPA phonemes when is_phonemes=True. For plain English text it can phonemize with eSpeak; Hebrew should be passed as IPA phonemes.
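Because Hebrew has to arrive as IPA, it can help to catch raw Hebrew script before it reaches the model. The following is a minimal sketch of such a guard. The helpers `contains_hebrew_script` and `check_tts_input` are hypothetical and are not part of `mixer_tts_onnx`; the check relies only on the Unicode Hebrew block (U+0590 to U+05FF) and the Hebrew presentation forms (U+FB1D to U+FB4F).

```python
# Hypothetical helpers, not part of the mixer_tts_onnx wrapper.
# Assumption: any character from the Hebrew Unicode ranges means the text
# is raw Hebrew script rather than IPA phonemes or plain English.
HEBREW_RANGES = (
    range(0x0590, 0x0600),  # Unicode "Hebrew" block
    range(0xFB1D, 0xFB50),  # Hebrew presentation forms
)


def contains_hebrew_script(text: str) -> bool:
    """Return True if any character falls in a Hebrew Unicode range."""
    return any(ord(ch) in r for ch in text for r in HEBREW_RANGES)


def check_tts_input(text: str, is_phonemes: bool) -> str:
    """Reject raw Hebrew script; otherwise return the text unchanged.

    is_phonemes mirrors the flag passed to the wrapper: True for IPA input,
    False for plain English text that the wrapper phonemizes itself.
    """
    if contains_hebrew_script(text):
        kind = "IPA phonemes" if is_phonemes else "plain English text"
        raise ValueError(
            f"Input was declared as {kind} but contains Hebrew script; "
            "convert Hebrew to IPA phonemes first."
        )
    return text
```

A caller could run `check_tts_input(text, is_phonemes)` first and pass the returned string to `tts.create` with the same `is_phonemes` value.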
Files marked best are exported from best.pth, the checkpoint with the lowest validation loss; files marked latest are exported from last.pth, the most recent training checkpoint at upload time.
Mixer-TTS paper:
@article{Tatanov2021MixerTTSNF,
  title={Mixer-TTS: Non-Autoregressive, Fast and Compact Text-to-Speech Model Conditioned on Language Model Embeddings},
  author={Oktai Tatanov and Stanislav Beliaev and Boris Ginsburg},
  journal={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={7482-7486},
}