---
license:
  - gemma
  - cc-by-nc-4.0
language:
  - en
  - zh
  - ja
base_model:
  - google/t5gemma-2b-2b-ul2
pipeline_tag: text-to-speech
library_name: transformers
tags:
  - speech
  - tts
datasets:
  - amphion/Emilia-Dataset
  - pkufool/libriheavy
extra_gated_heading: License & Ethics Agreement
extra_gated_description: >-
  This model is for **Non-Commercial Use Only** (CC-BY-NC 4.0) and follows the
  **Gemma Terms of Use**. Malicious use, including impersonation, is strictly
  prohibited.
extra_gated_button_content: Agree and Access
---

# T5Gemma-TTS-2b-2b


🔥 [2026/04/03] Update: Our technical report is out! Read the paper on arXiv: https://arxiv.org/abs/2604.01760

A Japanese version of this README is available here.

T5Gemma-TTS-2b-2b is a multilingual Text-to-Speech (TTS) model developed as a personal project. It utilizes an Encoder-Decoder LLM architecture, supporting English, Chinese, and Japanese.

## 🌟 Overview

This model is an Encoder-Decoder LLM-based TTS system initialized from the weights of google/t5gemma-2b-2b-ul2. While it leverages pre-trained LLM weights, the audio component has been trained from scratch specifically for TTS.

You can try the interactive demo on Hugging Face Spaces: T5Gemma-TTS Demo

### Key Features

  • Multilingual Support: Supports English, Chinese, and Japanese.
  • Voice Cloning: Capable of zero-shot voice cloning from reference audio.
  • Duration Control: Allows users to control the speed and length of the generated audio explicitly.
  • Open Source Code: Training code and inference scripts are available on GitHub.

Note: This is a hobby project. There are no formal objective evaluation metrics (WER/CER, SIM-O, etc.) available at this time.
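Since duration control is one of the model's headline features, it helps to see what a duration target means at the token level: an autoregressive codec-token TTS must emit roughly `duration × frame_rate` audio tokens. A minimal sketch, assuming a single-codebook codec at ~50 tokens per second (the rate commonly cited for XCodec2; the actual value is defined by the codec configuration):

```python
def duration_to_token_budget(seconds: float, tokens_per_second: int = 50) -> int:
    """Convert a target speech duration into an audio-token count.

    tokens_per_second=50 is an assumption for illustration; the real
    rate comes from the codec configuration.
    """
    return round(seconds * tokens_per_second)

# The three duration targets used in the samples on this card map to:
budgets = {t: duration_to_token_budget(t) for t in (3.0, 5.0, 7.0)}
# → {3.0: 150, 5.0: 250, 7.0: 350}
```

This also explains the inference-speed limitation noted below: every extra second of audio costs on the order of fifty autoregressive decoding steps.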

๐Ÿ—๏ธ Technical Details

### Architecture

The architecture is inspired by VoiceStar (arXiv:2505.19462). It adopts mechanisms such as PM-RoPE for length control.
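PM-RoPE (progress-monitoring RoPE) can be pictured as rotary embeddings whose positions encode progress toward the target length rather than the raw step index, which is what lets the decoder land on a requested duration. The sketch below is a simplified illustration of that idea only, not the actual VoiceStar or T5Gemma-TTS implementation:

```python
def rope_angles(pos: float, dim: int = 8, base: float = 10000.0) -> list[float]:
    """Standard RoPE rotation angles for a (possibly fractional) position."""
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

def progress_position(step: int, target_len: int, scale: int = 100) -> float:
    """Map a decoding step onto a shared [0, scale] position range by its
    progress toward the target length (simplified progress-monitoring idea)."""
    return scale * step / target_len

# Halfway through a short and a long utterance lands on the same rotary
# position, so the decoder always "knows" how close the end is.
assert progress_position(50, 100) == progress_position(150, 300) == 50.0
angles = rope_angles(progress_position(50, 100))
```

Because positions are normalized by the target length, a single learned position range covers utterances of any requested duration.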

### Training Data

The model was trained on approximately 170,000 hours of publicly available speech data (mainly Emilia and Libriheavy).

| Language | Approx. Hours |
| --- | --- |
| English | ~100k hours |
| Chinese | ~50k hours |
| Japanese | ~20k hours |
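The per-language figures are consistent with the ~170,000-hour total quoted above:

```python
hours = {"English": 100_000, "Chinese": 50_000, "Japanese": 20_000}
total = sum(hours.values())  # 170_000 hours
shares = {k: round(100 * v / total) for k, v in hours.items()}
# → English ≈ 59%, Chinese ≈ 29%, Japanese ≈ 12% of the mix
```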

### Training Hardware

Training was conducted on the AMD Developer Cloud using 8x MI300X GPUs for approximately 2 weeks.

  • You can check the training logs here: WandB

## 🎧 Audio Samples

Below are some samples generated by T5Gemma-TTS-2b-2b.

### 1. Multilingual TTS

Basic text-to-speech generation in supported languages.

| Language | Text Prompt | Audio |
| --- | --- | --- |
| English | "The old library was silent, save for the gentle ticking of a clock somewhere in the shadows. As I ran my fingers along the dusty spines of the books, I felt a strange sense of nostalgia, as if I had lived a thousand lives within these walls." | |
| Chinese | "那是一个宁静的夜晚，月光洒在湖面上，波光粼粼。微风轻拂，带来了远处花朵的清香。我独自坐在岸边，心中涌起一股莫名的感动，仿佛整个世界都在这一刻静止了。" | |
| Japanese | "その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。" | |

### 2. Duration Control

Examples of generating the same text with different duration constraints.

#### English Sample

Text: "This new model allows users to strictly control the duration of the generated speech."

| Target Duration | Generated Audio |
| --- | --- |
| 3.0s (Fast) | |
| 5.0s (Normal) | |
| 7.0s (Slow) | |

#### Japanese Sample

Text: "ใ“ใฎใƒขใƒ‡ใƒซใงใฏใ€็”Ÿๆˆ้Ÿณๅฃฐใฎ้•ทใ•ใ‚’่‡ช็”ฑใซ่ชฟๆ•ดใงใใพใ™ใ€‚"

| Target Duration | Generated Audio |
| --- | --- |
| 3.0s (Fast) | |
| 5.0s (Normal) | |
| 7.0s (Slow) | |

### 3. Voice Cloning (Zero-shot)

Examples of cloning a voice from a reference audio clip.

Note: The reference audio samples below were generated using NandemoGHS/Anime-Llasa-3B and gemini-2.5-pro-preview-tts.

| Case | Reference Audio | Generated Audio |
| --- | --- | --- |
| Example 1 | | |
| Example 2 | | |
| Example 3 | | |

## 🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 GitHub

โš ๏ธ Limitations

  • Inference Speed: The model is not optimized for real-time TTS applications. Autoregressive generation of audio tokens takes significant time, making it unsuitable for low-latency use cases.
  • Duration Control: While the model supports explicit duration specification, control is not perfect. Generated audio may differ from the specified duration, and even when the duration matches, the speech pacing or naturalness may not always be optimal.
  • Audio Quality: Quality depends on training data characteristics. Performance may vary for voices, accents, or speaking styles underrepresented in the training data.

## 📜 License

This model is released under a Dual License policy. Users must strictly comply with BOTH of the following sets of terms:

  1. Gemma Terms of Use: Since this model is derived from google/t5gemma-2b-2b-ul2, you must adhere to the Gemma Terms of Use.
  2. CC-BY-NC 4.0: Due to the constraints of the training datasets (such as Emilia), this model is restricted to Non-Commercial Use Only.

โš ๏ธ Important Note on Codec: The audio codec used, XCodec2, is also released under a CC-BY-NC license. Please ensure you also follow their license terms when using the generated audio.

Ethical Restrictions: Do not use this model to impersonate specific individuals (e.g., voice cloning of voice actors, celebrities, or public figures) without their explicit consent.

๐Ÿ™ Acknowledgments

I would like to thank the creators of the open-source models, datasets, and tools this project builds on, including Google's T5Gemma, the Emilia and Libriheavy datasets, VoiceStar, and XCodec2.

๐Ÿ–Š๏ธ Citation

If you use this model, please cite it as follows:

@misc{t5gemma-tts,
  author = {Chihiro Arata},
  title = {T5Gemma-TTS-2b-2b: An Encoder-Decoder LLM-based TTS Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/T5Gemma-TTS-2b-2b}}
}