	---
	license: other
	license_name: nvidia-open-model-license
	license_link: >-
	  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
	language:
	- en
	- es
	- de
	- it
	- vi
	- zh
	- fr
	metrics:
	- cer
	library_name: nemo
	tags:
	- NeMo
	- TTS
	- PyTorch
	- Speech
	- Multilingual-TTS
	extra_gated_prompt: >-
	  You agree to not use the model to conduct experiments that cause harm to human
	  subjects.
	extra_gated_fields:
	  Company: text
	  Country: country
	  Specific date: date_picker
	  I want to use this model for:
	    type: select
	    options:
	      - Research
	      - Education
	      - label: Other
	        value: other
	  I agree to use this model for non-commercial use ONLY: checkbox
	extra_gated_heading: Acknowledge license to accept the repository
	extra_gated_description: Our team may take 2-3 days to process your request
	extra_gated_button_content: Acknowledge license
	pipeline_tag: text-to-speech
	---

	# MagpieTTS Multilingual 357M

	<style>
	img {
	display: inline;
	}
	</style>

	[![Model architecture](https://img.shields.io/badge/model_arch-encoder_decoder_transformer-lightgrey#model-badge)](#model-architecture)
	| [![Model size](https://img.shields.io/badge/Params-357M-lightgrey#model-badge)](#model-architecture)
	| [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#training-dataset)


	🤗 HuggingFace MagpieTTS Multilingual demo: [magpie_tts_multilingual_demo](https://huggingface.co/spaces/nvidia/magpie_tts_multilingual_demo)

	💻 NeMo Framework: [github.com/NVIDIA/NeMo](https://github.com/NVIDIA/NeMo)

	### Description:
	The model is a text-to-speech model that generates speech in the voices of five English speakers: Sofia, Aria, Jason, Leo, and [John Van Stan](https://librivox.org/reader/9017?primary_key=9017&search_category=reader&search_page=1&search_form=get_results&search_order=alpha). Each speaker can speak seven languages (En, Es, De, Fr, Vi, It, Zh). The model predicts discrete audio codec tokens autoregressively using a transformer encoder-decoder architecture. It employs multi-codebook prediction (typically 8 codebooks) with optional local transformer refinement for high-quality audio generation, and leverages techniques like attention priors, classifier-free guidance (CFG), and Group Relative Policy Optimization (GRPO) for improved alignment. The generated codec tokens are then converted to a speech waveform using [NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps).

	This model is ready for commercial use. <br>

	### Key Features of the model

	- Multilingual Support: Synthesizes natural speech in English, Spanish, German, French, Vietnamese, Italian, and Mandarin
	- Expressive Voices: Multiple voice options with emotional tones and gender variations, including 4 proprietary voices and 1 public voice
	- Text Normalization: Built-in text normalization for handling numbers, abbreviations, and special characters in all languages except Vietnamese

	### Explore more from NVIDIA:

	- For the enterprise offering, see the [MagpieTTS NIM](https://build.nvidia.com/nvidia/magpie-tts-multilingual), which includes additional native voices in the supported languages, emotional speech capabilities, and an inference pipeline optimized for batch throughput and latency.
	- What is [Nemotron](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/)?<br>
	- NVIDIA Developer [Nemotron](https://developer.nvidia.com/nemotron)<br>
	- Build a [voice-agent code repo](https://github.com/NVIDIA/voice-agent-examples/tree/riva_voice_agent_example) with the model.<br>

	### Deployment Geography:

	Global

	### Use Case: <br>
	Wherever NVIDIA’s text-to-speech (TTS) models are used, Multilingual MagpieTTS can generate multilingual speech for a given text.

	## Model Architecture:
	Architecture Type: Transformer Encoder, Transformer Decoder, Local Transformer, and feedforward layers <br>


	![MagpieTTS Model Architecture](./magpietts_architecture.png)
	Figure 1: MagpieTTS Model Architecture


	Network Architecture:
	1. Causal Transformer Encoder with 6 layers, learnable positional encoder of length 2048, and 1 Layer Normalization output layer. <br>
	2. Causal Transformer Decoder with 12 layers, learnable positional encoder of length 2048, and 1 Layer Normalization output layer. <br>

	Number of model parameters: 3.57 × 10^8 <br>


	## Input: <br>
	Input Type(s): Text <br>
	Input Format: String <br>
	Input Parameters: One-Dimensional (1D) <br>


	## Output: <br>
	Output Type(s): Audio <br>
	Output Format: .wav file <br>
	Output Parameters: One-Dimensional (1D) <br>
	Other Properties Related to Output: Audio output with dimensions (B x T), where B is batch size and T is time dimension. <br>

	Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>

	## How to Use this Model

	### NeMo Installation
	To train, fine-tune, or perform TTS with this model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo). We recommend installing it after you have installed the latest PyTorch version and Python ≥ 3.10.12.

	```
	pip install "nemo_toolkit[tts] @ git+https://github.com/NVIDIA-NeMo/NeMo.git@main"
	pip install kaldialign
	```

	The model is available for use in the [NeMo Framework](https://github.com/NVIDIA-NeMo/NeMo), and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

	### Method 1: Single TTS Inference

	This method runs inference on a single (text, language) pair. Text normalization can optionally be applied for En, Es, De, Fr, It, and Zh.
	Load the open-source MagpieTTS checkpoint from Hugging Face and call the `do_tts(transcript, language, apply_TN)` method. The example also passes `speaker_index` to select a voice. The call returns the generated audio and the length of the audio.

	```python
	from nemo.collections.tts.models import MagpieTTSModel

	# Map each built-in voice name to its speaker index.
	speaker_map = {
	    "John": 0,
	    "Sofia": 1,
	    "Aria": 2,
	    "Jason": 3,
	    "Leo": 4,
	}
	transcript = "Hello world from NeMo Text to Speech."
	language = "en"
	speaker = "Sofia"
	speaker_idx = speaker_map[speaker]

	# Download the checkpoint from Hugging Face and synthesize the transcript.
	model = MagpieTTSModel.from_pretrained("nvidia/magpie_tts_multilingual_357m")
	# Set apply_TN=True to enable text normalization (not available for Vietnamese).
	audio, audio_len = model.do_tts(transcript, language=language, apply_TN=False, speaker_index=speaker_idx)
	```
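
To listen to the result, the returned audio can be written to a WAV file. The sketch below is an illustration under stated assumptions, not part of the NeMo API: it assumes `audio` holds float samples in [-1, 1] with shape (B, T), that `audio_len` gives the number of valid samples per batch item, and that the sample rate is 22050 Hz (inferred from the 22 kHz NanoCodec name; check the model configuration for the actual value). `save_wav` is a hypothetical helper built only on the Python standard library, and a synthetic tone stands in for the model output.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(samples, num_valid, path, sample_rate=22050):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    sample_rate=22050 is an assumption based on the 22 kHz codec name.
    Returns the duration of the written audio in seconds.
    """
    trimmed = list(samples)[:num_valid]  # drop batch padding past the valid length
    pcm = bytearray()
    for s in trimmed:
        clipped = max(-1.0, min(1.0, float(s)))  # guard against overshoot
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return len(trimmed) / sample_rate


# Stand-in for model output: one second of a 440 Hz tone plus 100 padding zeros.
fake_audio = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
fake_audio += [0.0] * 100
wav_path = os.path.join(tempfile.gettempdir(), "magpie_tts_demo.wav")
duration = save_wav(fake_audio, 22050, wav_path)
```

With the real model, and assuming `audio` and `audio_len` are PyTorch tensors, the call would look roughly like `save_wav(audio[0].cpu().tolist(), int(audio_len[0]), "output.wav")`.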


	### Method 2: Batch Inference

	This section explains how to run batch inference and evaluation on MagpieTTS models using the `examples/tts/magpietts_inference.py` script.

	Key Points

	The MagpieTTS inference script supports:
	- Batch inference from `.nemo` files or `.ckpt` checkpoints
	- Optional evaluation with metrics (CER, WER, Speaker Similarity, UTMOSv2)
	- Multiple datasets in a single run

	#### Dataset Configuration (`examples/tts/evalset_config.json`)

	The script requires a JSON configuration file that defines the metadata for the datasets to process.

	Format

	```json
	{
	  "dataset_name_1": {
	    "manifest_path": "/absolute/path/to/manifest.json",
	    "audio_dir": "/",
	    "feature_dir": null
	  },
	  "dataset_name_2": {
	    "manifest_path": "/path/to/another_manifest.json",
	    "audio_dir": "/base/audio/path",
	    "feature_dir": "/path/to/features"
	  }
	}
	```

	Fields

	| Field | Required | Description |
	|-------|----------|-------------|
	| `manifest_path` | Yes | Absolute path to the NeMo manifest JSON file |
	| `audio_dir` | Yes | Base directory for audio files. Use `"/"` if manifest contains absolute paths |
	| `feature_dir` | No | Directory for pre-computed features (set to `null` if not used) |
	| `whisper_language` | No | Language code for ASR evaluation (default: `"en"`) |

	Example

	```json
	{
	  "libritts_test_clean": {
	    "manifest_path": "/data/libritts/test_clean_manifest.json",
	    "audio_dir": "/",
	    "feature_dir": null,
	    "whisper_language": "en"
	  },
	  "vctk": {
	    "manifest_path": "/data/vctk/manifest.json",
	    "audio_dir": "/data/vctk/wav48",
	    "feature_dir": null
	  }
	}
	```
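
A configuration like the one above can also be generated and sanity-checked from Python before launching the script. The following is a minimal sketch: `validate_evalset_config` is a hypothetical helper, not part of NeMo, and it only checks the two fields the table marks as required. The `broken_example` entry is invented to show what a failure looks like.

```python
import json

# Fields marked "Yes" in the Required column of the table above.
REQUIRED_FIELDS = ("manifest_path", "audio_dir")


def validate_evalset_config(config):
    """Return a list of problems; an empty list means every dataset entry has the required fields."""
    problems = []
    for name, entry in config.items():
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"{name}: missing required field '{field}'")
    return problems


config = {
    "libritts_test_clean": {
        "manifest_path": "/data/libritts/test_clean_manifest.json",
        "audio_dir": "/",
        "feature_dir": None,  # serialized as JSON null
        "whisper_language": "en",
    },
    # Invented entry with manifest_path left out on purpose.
    "broken_example": {"audio_dir": "/data/vctk/wav48"},
}

problems = validate_evalset_config(config)
config_text = json.dumps(config, indent=2)  # text that could be saved as evalset_config.json
```

Here `problems` contains a single message about `broken_example`, and `config_text` holds the JSON that would be written to disk once the entry is fixed.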

	---

	#### Manifest Format

	The manifest is a JSON-lines file where each line is a JSON object representing one utterance.

	Minimum Required Fields

	For models with fixed speaker context embeddings (no audio/text conditioning needed):

	```json
	{"audio_filepath": "/path/to/audio.wav", "text": "The transcript text.", "duration": 3.5}
	```

	| Field | Type | Description |
	|-------|------|-------------|
	| `audio_filepath` | string | Path to the target audio file |
	| `text` | string | Text transcript to synthesize |
	| `duration` | float | Audio duration in seconds |
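
As an illustration, a manifest with these three fields can be produced with a few lines of Python. This is a sketch only: `write_manifest` and `read_manifest` are hypothetical helpers, and the paths, transcripts, and durations are placeholders.

```python
import json
import os
import tempfile


def write_manifest(utterances, manifest_path):
    """Write one JSON object per line with the three minimum fields."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, text, duration in utterances:
            record = {
                "audio_filepath": audio_filepath,
                "text": text,
                "duration": float(duration),
            }
            # ensure_ascii=False keeps non-English transcripts readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_manifest(manifest_path):
    """Parse a JSON-lines manifest back into a list of dicts, skipping blank lines."""
    with open(manifest_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder utterances: (audio path, transcript, duration in seconds).
utterances = [
    ("/path/to/audio.wav", "The transcript text.", 3.5),
    ("/path/to/audio_es.wav", "Hola, ¿cómo estás?", 2.1),
]
manifest_path = os.path.join(tempfile.gettempdir(), "example_manifest.json")
write_manifest(utterances, manifest_path)
records = read_manifest(manifest_path)
```

Reading the file back, as done with `read_manifest`, is a cheap way to confirm that every line parses as standalone JSON.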

	#### Run Inference and Evaluation

	```bash
	# Basic inference (no evaluation)
	python examples/tts/magpietts_inference.py \
	--nemo_files "nvidia/magpie_tts_multilingual_357m" \
	--datasets_json_path /path/to/evalset_config.json \
	--out_dir /path/to/output \
	--codecmodel_path "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps" \
	--use_cfg \
	--cfg_scale 2.5

	# Inference with evaluation
	python examples/tts/magpietts_inference.py \
	--nemo_files "nvidia/magpie_tts_multilingual_357m" \
	--datasets_json_path /path/to/evalset_config.json \
	--out_dir /path/to/output \
	--codecmodel_path "nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps" \
	--run_evaluation \
	--use_cfg \
	--cfg_scale 2.5
	```
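
The `--use_cfg` and `--cfg_scale` flags enable classifier-free guidance. As a rough illustration of what the scale does, the sketch below uses the common CFG formulation in which conditional and unconditional logits are combined as `uncond + scale * (cond - uncond)`. This describes the general technique and is not taken from the NeMo implementation; `apply_cfg`, `softmax`, and the toy logits are hypothetical.

```python
import math


def apply_cfg(cond_logits, uncond_logits, cfg_scale):
    """Combine conditional and unconditional logits with classifier-free guidance."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Convert logits to probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 4-token codebook (made-up numbers).
cond = [2.0, 1.0, 0.5, 0.0]    # prediction given the text
uncond = [1.0, 1.0, 1.0, 1.0]  # prediction with the text conditioning dropped

plain_probs = softmax(cond)
guided_probs = softmax(apply_cfg(cond, uncond, cfg_scale=2.5))
```

In this formulation a scale of 1.0 returns the conditional logits unchanged, while larger values such as 2.5 push probability mass further toward the tokens that the text conditioning favors.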

	Check Outputs

	After running, you'll find:
	- Generated audio files in `<out_dir>/<checkpoint_name>/`
	- Evaluation metrics in `metrics.json`
	- Visualization plots (if evaluation enabled)

	---

	Evaluation Metrics

	When `--run_evaluation` is enabled, the following metrics are computed:

	| Metric | Description |
	|--------|-------------|
	| CER | Character Error Rate (lower is better) |
	| WER | Word Error Rate (lower is better) |
	| SSIM (pred-gt) | Speaker similarity between predicted and ground truth |
	| SSIM (pred-context) | Speaker similarity between predicted and context |
	| UTMOSv2 | Audio quality score (higher is better, requires `utmosv2` package) |
	| RTF | Real-time factor (processing time / audio duration) |
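
To make the error-rate and RTF definitions concrete, here is a small self-contained sketch using the standard edit-distance formulation. It is not the evaluation script's code: the script likely normalizes text (case, punctuation) before scoring, which this sketch does not, and the function names and sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def rtf(processing_seconds, audio_seconds):
    """Real-time factor: values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds


reference = "hello world"
hypothesis = "hello word"  # made-up ASR transcript of the synthesized audio
sample_cer = cer(reference, hypothesis)
sample_wer = wer(reference, hypothesis)
sample_rtf = rtf(0.5, 2.0)
```

In this example one character out of eleven is wrong, so the CER is 1/11, while one of two words is wrong, so the WER is 0.5. Generating 2 seconds of audio in 0.5 seconds gives an RTF of 0.25.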


	## Software Integration:
	Runtime Engine(s): NeMo Framework 25.11


	Supported Hardware Microarchitecture Compatibility: <br>
	* NVIDIA A10 GPU <br>
	* NVIDIA A30 GPU <br>
	* NVIDIA A100 GPU <br>
	* NVIDIA H100 GPU <br>

	Preferred/Supported Operating System(s):
	* Linux <br>
	* Linux 4 Tegra <br>

	The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment. <br>

	## Model Version(s):
	Multilingual MagpieTTS-357M <br>

	## Training and Evaluation Datasets:

	### Training Dataset:
	The following datasets were used to train the model, including additional datasets focused on speech and ASR.

	* [Hi-FiTTS En](https://www.openslr.org/109/)<br>
	* [HiFiTTS-2 A Large-Scale High Bandwidth Speech Dataset En](https://huggingface.co/datasets/nvidia/hifitts-2)<br>
	* [LibriTTS En](https://www.openslr.org/60/)<br>
	* Internal English Dataset
	* [CML-TTS Es](https://www.openslr.org/146/)
	* Internal Spanish Dataset
	* [CML-TTS Fr](https://www.openslr.org/146/)
	* Internal French Dataset
	* [CML-TTS It](https://www.openslr.org/146/)
	* [CML-TTS De](https://www.openslr.org/146/)
	* [Large-scale Vietnamese speech corpus (LSVSC) Vi](https://huggingface.co/datasets/doof-ferb/LSVSC)
	* [InfoRe-2 Vi](https://huggingface.co/datasets/doof-ferb/infore2_audiobooks)
	* [InfoRe-1 Vi](https://huggingface.co/datasets/doof-ferb/infore1_25hours)
	* Internal Vietnamese Dataset
	* Internal Mandarin Dataset


	Data Modality <br>
	* Audio <br>

	<!-- 291.6 hifi, 36K hifi2, 585 Libri -->
	Audio Training Data Size <br>
	* 60,000 Hours <br>

	Data Collection Method by dataset <br>
	* Publicly available dataset <br>
	* Human <br>

	Labeling Method by dataset <br>
	* Hybrid: Human, Synthetic - Human recorded data points were preprocessed algorithmically. <br>

	Properties: <br>
	* Number of data items in training set: 38k hours <br>
	* Modality: Audio (speech signal) <br>
	* Nature of the content: Audio books <br>
	* Language: Multilingual (En, Es, De, Fr, Vi, It, Zh) <br>
	* Sensor Type: Microphones <br>

	### Evaluation Dataset:
	* [LibriTTS test-clean](https://www.openslr.org/60/)<br>
	* [CML-TTS Es](https://www.openslr.org/146/)<br>
	* [CML-TTS Fr](https://www.openslr.org/146/)<br>
	* [CML-TTS De](https://www.openslr.org/146/)<br>

	Benchmark Score <br>

	Data Collection Method by dataset: <br>
	* Publicly available dataset <br>
	* Human <br>

	Labeling Method by dataset: <br>
	* Human <br>
	* Hybrid: Human, Synthetic - Human-labeled data points are mixed and matched to create more variability. <br>

	Properties: <br>
	* Modality: Audio (speech signal) <br>
	* Nature of the content: Audio books and Newspaper passages <br>
	* Language: Multilingual (En, Es, De, Fr) <br>
	* Sensor Type: Microphones <br>


	| Dataset | CER (%) | SV-SSIM |
	| --------------------- | ------ | ------ |
	| LibriTTS test-clean | 0.38 | 0.823 |
	| Spanish CML | 1.0 | 0.719 |
	| French CML | 2.8 | 0.708 |
	| German CML | 1.1 | 0.646 |

	* These results are based on the MagpieTTS model ([Hugging Face checkpoint](https://huggingface.co/nvidia/multilingual_magpietts_2512)).

	## Inference:
	Acceleration Engine: None <br>
	Test Hardware: <br>
	* NVIDIA H100 GPU <br>
	* NVIDIA A100 GPU <br>
	* NVIDIA A6000 GPU <br>
	* NVIDIA T4 GPU <br>

	## Technical Limitations & Mitigation:

	There are two modes of inference, namely, standard and long-form. In standard mode, this model can generate up to twenty (20) seconds of multilingual (En, Es, De, Fr, Vi, It, Zh) speech at a time. In long-form mode, the model performs optimally when the input text contains punctuation and capitalization. The model was trained on a mix of publicly available speech datasets and internally recorded datasets in seven languages. As a result, it is not suitable for speech generation in any language other than the seven languages mentioned. We have removed zero-shot capabilities of this model for this release. Text normalization is required.
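
Because standard mode is limited to about twenty seconds of speech per call, longer passages may need to be split before synthesis when long-form mode is not used. The sketch below shows one possible approach: split on sentence-ending punctuation and pack whole sentences into chunks under a character budget. `split_into_chunks` and the character budget are hypothetical. Characters are only a rough proxy for audio duration, so the budget would need tuning per language and speaker, and the pattern handles only `.`, `!`, and `?` followed by whitespace.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "First sentence. Second one is here! Third?"
chunks = split_into_chunks(text, max_chars=30)
```

Each resulting chunk could then be passed to the model separately and the audio concatenated. Keeping sentences intact matches the note above that punctuation and capitalization help the model.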


	## Ethical Considerations:

	NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

	Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/). <br>

	### License/Terms of Use

	GOVERNING TERMS: Use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

	## Reference(s):
	1. [Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment](https://arxiv.org/abs/2406.17957) <br>
	2. [Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance](https://arxiv.org/abs/2502.05236) <br>