drbaph/MegaTTS3-WaveVAE

@@ -1,151 +1,95 @@
 ---
-language:
-- en
-- zh
 license: apache-2.0
 pipeline_tag: text-to-speech
 ---
-Unofficial implementation, including the full set of parameters.
-# Model Description
-This is a ModelScope model card for MegaTTS 3 👋
--   Paper: [MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis](https://huggingface.co/papers/2502.18924)
--   Project Page (Audio Samples): <https://sditdemo.github.io/sditdemo/>
--   github: <https://github.com/bytedance/MegaTTS3>
--   [Demo Video](https://github.com/user-attachments/assets/0174c111-f392-4376-a34b-0b5b8164aacc)
-## Installation
-```sh
-# Clone the repository
-git clone https://github.com/bytedance/MegaTTS3
-cd MegaTTS3
-```
-**Model Download**
-```sh
-modelscope download --model ACoderPassBy/MegaTTS-SFT --local_dir ./checkpoints
-```
-**Requirements (for Linux)**
-```sh
-# Create a python 3.10 conda env (you could also use virtualenv)
-conda create -n megatts3-env python=3.10
-conda activate megatts3-env
-pip install -r requirements.txt
-# Set the root directory
-export PYTHONPATH="/path/to/MegaTTS3:$PYTHONPATH"
-# [Optional] Set GPU
-export CUDA_VISIBLE_DEVICES=0
-# If you hit pydantic errors during inference, check that your pydantic and gradio versions are compatible.
-# [Note] If you hit httpx-related errors, check whether your "no_proxy" environment variable contains patterns like "::".
-```
-**Requirements (for Windows)**
-```sh
-# [The Windows version is currently under testing]
-# Comment out the following dependency in requirements.txt:
-# # WeTextProcessing==1.0.4.1
-# Create a python 3.10 conda env (you could also use virtualenv)
-conda create -n megatts3-env python=3.10
-conda activate megatts3-env
-pip install -r requirements.txt
-conda install -y -c conda-forge pynini==2.1.5
-pip install WeTextProcessing==1.0.3
-# [Optional] For GPU inference, you may need to install a specific version of PyTorch for your GPU from https://pytorch.org/.
-pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
-# [Note] if you encounter bugs related with `ffprobe` or `ffmpeg`, you can install it through `conda install -c conda-forge ffmpeg`
-# Set environment variable for root directory
-set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # Windows
-$env:PYTHONPATH="C:\path\to\MegaTTS3;$env:PYTHONPATH" # Powershell on Windows
-conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # For conda users
-# [Optional] Set GPU
-set CUDA_VISIBLE_DEVICES=0 # Windows
-$env:CUDA_VISIBLE_DEVICES=0 # Powershell on Windows
-```
-**Requirements (for Docker)**
-```sh
-# [The Docker version is currently under testing]
-# ! You should download the pretrained checkpoint before running the following command
-docker build . -t megatts3:latest
-# For GPU inference
-docker run -it -p 7929:7929 --gpus all -e CUDA_VISIBLE_DEVICES=0 megatts3:latest
-# For CPU inference
-docker run -it -p 7929:7929  megatts3:latest
-# Visit http://0.0.0.0:7929/ for gradio.
-```
-> \[!TIP]
-> \[IMPORTANT]
-> Unofficial version
-## Inference
-**Command-Line Usage (Standard)**
 ```bash
-# p_w (intelligibility weight), t_w (similarity weight). Typically, a noisier prompt requires higher p_w and t_w.
-python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav'  --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论？'" --output_dir ./gen
-# As long as audio volume and pronunciation are appropriate, increasing --t_w within reasonable ranges (2.0~5.0)
-# will increase the generated speech's expressiveness and similarity (especially for some emotional cases).
-python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
-```
-**Command-Line Usage (for TTS with Accents)**
-```bash
-# When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker’s original accent. As p_w increases, it shifts toward standard pronunciation.
-# t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
-# Useful for accented TTS or solving the accent problems in cross-lingual TTS.
-python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0
-python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这条音频的发音标准一些了吗？' --output_dir ./gen --p_w 2.5 --t_w 2.5
 ```
-**Web UI Usage**
-```bash
-# We also support cpu inference, but it may take about 30 seconds (for 10 inference steps).
-python tts/gradio_api.py
-```
-## Security
-If you discover a potential security issue in this project, or think you may
-have discovered a security issue, we ask that you notify Bytedance Security via our [security center](https://security.bytedance.com/src) or [sec@bytedance.com](mailto:sec@bytedance.com).
-Please do **not** create a public issue.
-## License
-This project is licensed under the [Apache-2.0 License](LICENSE).
-## BibTeX Entry and Citation Info
-This repo contains a forced-align version of `Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis`, and the WavVAE is mainly based on `Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling`. Compared to the model described in the paper, the repository includes additional models. These models not only enhance the stability and cloning capabilities of the algorithm but can also be used independently to serve a wider range of scenarios.
-```
 @article{jiang2025sparse,
   title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
   author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
@@ -159,4 +103,7 @@ This repo contains forced-align version of `Sparse Alignment Enhanced Latent Dif
   journal={arXiv preprint arXiv:2408.16532},
   year={2024}
 }
-```

 ---
 license: apache-2.0
+tags:
+- text-to-speech
+- tts
+- voice-cloning
+- speech-synthesis
+- pytorch
+- audio
+- chinese
+- english
+- zero-shot
+- diffusion
+library_name: transformers
 pipeline_tag: text-to-speech
 ---
+# MegaTTS3-WaveVAE: Complete Voice Cloning Model
+<div align="center">
+  <h3>🚀 <a href="https://github.com/Saganaki22/MegaTTS3-WaveVAE">GitHub Repository</a></h3>
+  <img src="https://img.shields.io/github/stars/Saganaki22/MegaTTS3-WaveVAE?style=social" alt="GitHub Stars">
+  <img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="License">
+  <img src="https://img.shields.io/badge/Platform-Windows-blue" alt="Platform">
+  <img src="https://img.shields.io/badge/Language-Chinese%20%7C%20English-red" alt="Language">
+</div>
+## About
+This is a **complete MegaTTS3 model** with **WaveVAE support** for zero-shot voice cloning. Unlike the original ByteDance release, this includes the full WaveVAE encoder/decoder, enabling direct voice cloning from audio samples.
+**Key Features:**
+- 🎯 Zero-shot voice cloning from any 3-24 second audio sample
+- 🌍 Bilingual: Chinese, English, and code-switching
+- ⚡ Efficient: 0.45B parameter diffusion transformer
+- 🔧 Complete: Includes WaveVAE (missing from original)
+- 🎛️ Controllable: Adjustable voice similarity and clarity
+- 💻 Windows ready: One-click installer available
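
The 3-24 second range in the feature list can be checked before a clip is passed to the model. The following is a minimal sketch, not part of the repository: it assumes the reference is an uncompressed PCM WAV file readable by Python's standard `wave` module, and the helper names `wav_duration_seconds` and `is_usable_reference` are hypothetical.

```python
import wave

# Bounds taken from the feature list above (3-24 second reference clip).
MIN_SECONDS = 3.0
MAX_SECONDS = 24.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path):
    """Return True if the clip length falls inside the 3-24 second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

A clip outside the range can be trimmed or re-recorded before running inference. Whether the CLI itself rejects or truncates such clips is not stated here.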
+## Quick Start
+### Installation
+**[📥 One-Click Windows Installer](https://github.com/Saganaki22/MegaTTS3-WaveVAE/releases/tag/Installer)** - Automated setup with GPU detection
+Or see [manual installation](https://github.com/Saganaki22/MegaTTS3-WaveVAE#installation) for advanced users.
+### Usage Examples
 ```bash
+# Basic voice cloning
+python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output
+# Better quality settings
+python tts/infer_cli.py --input_wav 'reference.wav' --input_text "Your text here" --output_dir ./output --p_w 2.0 --t_w 3.0
+# Web interface (easiest)
+python tts/megatts3_gradio.py
+# Then open http://localhost:7929
 ```
+## Model Components
+- **Diffusion Transformer**: 0.45B parameter TTS model
+- **WaveVAE**: High-quality audio encoder/decoder
+- **Aligner**: Speech-text alignment model
+- **G2P**: Grapheme-to-phoneme converter
+## Parameters
+- `--p_w` (Intelligibility): 1.0-5.0, higher = clearer speech
+- `--t_w` (Similarity): 0.0-10.0, higher = more similar to reference
+- **Tip**: Set t_w 0-3 points higher than p_w
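
The ranges and the tip above can be applied programmatically when scripting batch runs. The following is a hedged sketch, not part of the repository: it only assembles an argument list for `tts/infer_cli.py` using the flags shown in the usage examples, and the helper names `build_infer_command` and `follows_tip` are hypothetical.

```python
import sys

# Ranges copied from the parameter list above.
P_W_RANGE = (1.0, 5.0)    # --p_w, intelligibility weight
T_W_RANGE = (0.0, 10.0)   # --t_w, similarity weight


def follows_tip(p_w, t_w):
    """Return True if t_w is 0 to 3 points higher than p_w."""
    return 0.0 <= t_w - p_w <= 3.0


def build_infer_command(input_wav, input_text, output_dir, p_w, t_w):
    """Validate the weights and return an argv list for tts/infer_cli.py."""
    if not P_W_RANGE[0] <= p_w <= P_W_RANGE[1]:
        raise ValueError(f"p_w must be within {P_W_RANGE}, got {p_w}")
    if not T_W_RANGE[0] <= t_w <= T_W_RANGE[1]:
        raise ValueError(f"t_w must be within {T_W_RANGE}, got {t_w}")
    return [
        sys.executable, "tts/infer_cli.py",
        "--input_wav", input_wav,
        "--input_text", input_text,
        "--output_dir", output_dir,
        "--p_w", str(p_w),
        "--t_w", str(t_w),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Passing a list rather than a shell string avoids quoting problems when the input text contains quotes or non-ASCII characters.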
+## Requirements
+- Windows 10/11 or Linux
+- Python 3.10
+- 8GB+ RAM, NVIDIA GPU recommended
+- 5GB+ storage space
+## Credits
+- **Original MegaTTS3**: [ByteDance Research](https://github.com/bytedance/MegaTTS3)
+- **WaveVAE Model**: [ACoderPassBy/MegaTTS-SFT](https://modelscope.cn/models/ACoderPassBy/MegaTTS-SFT) [Apache 2.0]
+- **Additional Components**: [mrfakename/MegaTTS3-VoiceCloning](https://huggingface.co/mrfakename/MegaTTS3-VoiceCloning)
+- **Windows Implementation & Complete Package**: [Saganaki22/MegaTTS3-WaveVAE](https://github.com/Saganaki22/MegaTTS3-WaveVAE)
+- **Special Thanks**: MysteryShack on Discord for model information
+## Citation
+If you use this model, please cite the original research:
+```bibtex
 @article{jiang2025sparse,
   title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
   author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
   journal={arXiv preprint arXiv:2408.16532},
   year={2024}
 }
+```
+---
+*High-quality voice cloning for research and creative applications. Please use responsibly.*