---
license:
  - gemma
  - cc-by-nc-4.0
language:
  - en
  - zh
  - ja
base_model:
  - google/t5gemma-2b-2b-ul2
pipeline_tag: text-to-speech
library_name: transformers
tags:
  - speech
  - tts
datasets:
  - amphion/Emilia-Dataset
  - pkufool/libriheavy
extra_gated_heading: License & Ethics Agreement
extra_gated_description: >-
  This model is for **Non-Commercial Use Only** (CC-BY-NC 4.0) and follows the
  **Gemma Terms of Use**. Malicious use, including impersonation, is strictly
  prohibited.
extra_gated_button_content: Agree and Access
---

# T5Gemma-TTS-2b-2b


🔥 [2026/04/03] Update: Our technical report is out! Read the paper on arXiv: https://arxiv.org/abs/2604.01760

A Japanese version of this README is available here.

T5Gemma-TTS-2b-2b is a multilingual Text-to-Speech (TTS) model developed as a personal project. It utilizes an Encoder-Decoder LLM architecture, supporting English, Chinese, and Japanese.

## 🌟 Overview

This model is an Encoder-Decoder LLM-based TTS system initialized from the weights of google/t5gemma-2b-2b-ul2. While it leverages pre-trained LLM weights, the audio component has been trained from scratch specifically for TTS.

You can try the interactive demo on Hugging Face Spaces: T5Gemma-TTS Demo

### Key Features

  • Multilingual Support: Supports English, Chinese, and Japanese.
  • Voice Cloning: Capable of zero-shot voice cloning from reference audio.
  • Duration Control: Allows users to control the speed and length of the generated audio explicitly.
  • Open Source Code: Training code and inference scripts are available on GitHub.

Note: This is a hobby project. There are no formal objective evaluation metrics (WER/CER, SIM-O, etc.) available at this time.
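Since duration control is one of the model's headline features, it helps to see what a duration target means at the token level: an autoregressive codec-token TTS must emit roughly `duration × frame_rate` audio tokens. A minimal sketch, assuming a single-codebook codec at ~50 tokens per second (the rate commonly cited for XCodec2; the actual value is defined by the codec configuration):

```python
def duration_to_token_budget(seconds: float, tokens_per_second: int = 50) -> int:
    """Convert a target speech duration into an audio-token count.

    tokens_per_second=50 is an assumption for illustration; the real
    rate comes from the codec configuration.
    """
    return round(seconds * tokens_per_second)

# The three duration targets used in the samples on this card map to:
budgets = {t: duration_to_token_budget(t) for t in (3.0, 5.0, 7.0)}
# → {3.0: 150, 5.0: 250, 7.0: 350}
```

This also explains the inference-speed limitation noted below: every extra second of audio costs on the order of fifty autoregressive decoding steps.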

๐Ÿ—๏ธ Technical Details

### Architecture

The architecture is inspired by VoiceStar (arXiv:2505.19462). It adopts mechanisms such as PM-RoPE for length control.
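PM-RoPE (progress-monitoring RoPE) can be pictured as rotary embeddings whose positions encode progress toward the target length rather than the raw step index, which is what lets the decoder land on a requested duration. The sketch below is a simplified illustration of that idea only, not the actual VoiceStar or T5Gemma-TTS implementation:

```python
def rope_angles(pos: float, dim: int = 8, base: float = 10000.0) -> list[float]:
    """Standard RoPE rotation angles for a (possibly fractional) position."""
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

def progress_position(step: int, target_len: int, scale: int = 100) -> float:
    """Map a decoding step onto a shared [0, scale] position range by its
    progress toward the target length (simplified progress-monitoring idea)."""
    return scale * step / target_len

# Halfway through a short and a long utterance lands on the same rotary
# position, so the decoder always "knows" how close the end is.
assert progress_position(50, 100) == progress_position(150, 300) == 50.0
angles = rope_angles(progress_position(50, 100))
```

Because positions are normalized by the target length, a single learned position range covers utterances of any requested duration.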

### Training Data

The model was trained on approximately 170,000 hours of publicly available speech data (mainly Emilia and Libriheavy).

| Language | Approx. Hours |
| --- | --- |
| English | ~100k hours |
| Chinese | ~50k hours |
| Japanese | ~20k hours |
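The per-language figures are consistent with the ~170,000-hour total quoted above:

```python
hours = {"English": 100_000, "Chinese": 50_000, "Japanese": 20_000}
total = sum(hours.values())  # 170_000 hours
shares = {k: round(100 * v / total) for k, v in hours.items()}
# → English ≈ 59%, Chinese ≈ 29%, Japanese ≈ 12% of the mix
```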

### Training Hardware

Training was conducted on the AMD Developer Cloud using 8x MI300X GPUs for approximately 2 weeks.

  • You can check the training logs here: WandB

## 🎧 Audio Samples

Below are some samples generated by T5Gemma-TTS-2b-2b.

### 1. Multilingual TTS

Basic text-to-speech generation in supported languages.

| Language | Text Prompt | Audio |
| --- | --- | --- |
| English | "The old library was silent, save for the gentle ticking of a clock somewhere in the shadows. As I ran my fingers along the dusty spines of the books, I felt a strange sense of nostalgia, as if I had lived a thousand lives within these walls." | |
| Chinese | "那是一个宁静的夜晚，月光洒在湖面上，波光粼粼。微风轻拂，带来了远处花朵的清香。我独自坐在岸边，心中涌起一股莫名的感动，仿佛整个世界都在这一刻静止了。" | |
| Japanese | "その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。" | |

### 2. Duration Control

Examples of generating the same text with different duration constraints.

#### English Sample

Text: "This new model allows users to strictly control the duration of the generated speech."

| Target Duration | Generated Audio |
| --- | --- |
| 3.0s (Fast) | |
| 5.0s (Normal) | |
| 7.0s (Slow) | |

#### Japanese Sample

Text: "ใ“ใฎใƒขใƒ‡ใƒซใงใฏใ€็”Ÿๆˆ้Ÿณๅฃฐใฎ้•ทใ•ใ‚’่‡ช็”ฑใซ่ชฟๆ•ดใงใใพใ™ใ€‚"

| Target Duration | Generated Audio |
| --- | --- |
| 3.0s (Fast) | |
| 5.0s (Normal) | |
| 7.0s (Slow) | |

### 3. Voice Cloning (Zero-shot)

Examples of cloning a voice from a reference audio clip.

Note: The reference audio samples below were generated using NandemoGHS/Anime-Llasa-3B and gemini-2.5-pro-preview-tts.

| Case | Reference Audio | Generated Audio |
| --- | --- | --- |
| Example 1 | | |
| Example 2 | | |
| Example 3 | | |

## 🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 GitHub

โš ๏ธ Limitations

  • Inference Speed: The model is not optimized for real-time TTS applications. Autoregressive generation of audio tokens takes significant time, making it unsuitable for low-latency use cases.
  • Duration Control: While the model supports explicit duration specification, control is not perfect. Generated audio may differ from the specified duration, and even when the duration matches, the speech pacing or naturalness may not always be optimal.
  • Audio Quality: Quality depends on training data characteristics. Performance may vary for voices, accents, or speaking styles underrepresented in the training data.

## 📜 License

This model is released under a Dual License policy. Users must strictly comply with BOTH of the following sets of terms:

  1. Gemma Terms of Use: Since this model is derived from google/t5gemma-2b-2b-ul2, you must adhere to the Gemma Terms of Use.
  2. CC-BY-NC 4.0: Due to the constraints of the training datasets (such as Emilia), this model is restricted to Non-Commercial Use Only.

โš ๏ธ Important Note on Codec: The audio codec used, XCodec2, is also released under a CC-BY-NC license. Please ensure you also follow their license terms when using the generated audio.

Ethical Restrictions: Do not use this model to impersonate specific individuals (e.g., voice cloning of voice actors, celebrities, or public figures) without their explicit consent.

๐Ÿ™ Acknowledgments

I would like to thank the creators of the open-source models, datasets, and tools this project builds on, including Google's T5Gemma, the Emilia and Libriheavy datasets, VoiceStar, and XCodec2.

๐Ÿ–Š๏ธ Citation

If you use this model, please cite it as follows:

@misc{t5gemma-tts,
  author = {Chihiro Arata},
  title = {T5Gemma-TTS-2b-2b: An Encoder-Decoder LLM-based TTS Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/T5Gemma-TTS-2b-2b}}
}