---
license: cc-by-nc-sa-4.0
language:
- en
- hi
tags:
- text-to-speech
- haiailabs
library_tag: spark-tts
---

# Model Card for hailabsparktts

We used the Spark-TTS-0.5B model, which was originally trained on English and Chinese, and performed PEFT fine-tuning to add support for Hindi, a new language that was not part of the original training. Our objective is to build capable AI models frugally: this model showcases that a model can be fine-tuned to a new language with little money and time. You can take it as a base for Hindi and fine-tune it further for better results.

We plan to release an improved model in the future.

### Model Details and Description

Spark-TTS is an advanced text-to-speech system that uses the power of large language models (LLMs) for highly accurate and natural-sounding voice synthesis. It is designed to be efficient, flexible, and powerful for both research and production use.

### Key Features

- **Simplicity and Efficiency**: Built entirely on Qwen2.5, Spark-TTS eliminates the need for additional generation models such as flow matching. Instead of relying on separate models to generate acoustic features, it directly reconstructs audio from the codes predicted by the LLM. This approach streamlines the pipeline, improving efficiency and reducing complexity.
- **High-Quality Voice Cloning**: Supports zero-shot voice cloning, meaning it can replicate a speaker's voice even without specific training data for that voice. This is ideal for cross-lingual and code-switching scenarios, allowing seamless transitions between languages and voices without separate training for each.
- **Trilingual Support**: Supports Hindi, Chinese, and English, and is capable of zero-shot voice cloning for cross-lingual and code-switching scenarios, enabling the model to synthesize speech in multiple languages with high naturalness and accuracy.
- **Controllable Speech Generation**: Supports creating virtual speakers by adjusting parameters such as gender, pitch, and speaking rate (see the sketch below).
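
As a quick illustration of that control surface (a minimal sketch using the `SparkTTS` wrapper defined under "How to use" below; the Hindi sentence and attribute values are only examples):

```python
# Sketch: create a virtual speaker from attributes instead of a reference clip.
# `tts` is a SparkTTS instance, constructed as shown in "How to use" below.
wav = tts.inference(
    "नमस्ते, आप कैसे हैं?",  # text to synthesize (illustrative)
    gender="female",          # female | male
    pitch="moderate",         # very_low | low | moderate | high | very_high
    speed="high",             # very_low | low | moderate | high | very_high
)
```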

---

<table align="center">
<tr>
<td align="center"><b>Inference Overview of Voice Cloning</b><br><img src="src/figures/infer_voice_cloning.png" width="80%" /></td>
</tr>
<tr>
<td align="center"><b>Inference Overview of Controlled Generation</b><br><img src="src/figures/infer_control.png" width="80%" /></td>
</tr>
</table>

### Model Sources

- **Repository:** [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS)

## Misuse and Abuse Policy ⚠️

This project delivers a high-quality speech generation model intended strictly for research and educational use. Responsible and ethical usage is mandatory. The following activities are strictly prohibited:

- **Impersonation or Fraud:** Generating speech that imitates real individuals without their clear and explicit consent.
- **Misinformation or Deception:** Producing audio designed to mislead, manipulate, or deceive, such as fake news, fraudulent communication, or deceptive content.
- **Illegal or Harmful Use:** Employing the model for any activity that is unlawful, harmful, malicious, or violates safety and security standards.

## How to use

This is the inference code:

```python
import re
import sys
from pathlib import Path
from typing import Tuple

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Make the cloned Spark-TTS repository importable (see the clone step below).
sys.path.append('Spark-TTS')
from sparktts.utils.file import load_config
from sparktts.models.audio_tokenizer import BiCodecTokenizer
from sparktts.utils.token_parser import LEVELS_MAP, GENDER_MAP, TASK_TOKEN_MAP


class SparkTTS:
    """
    Spark-TTS for text-to-speech generation.
    """

    def __init__(self, model_dir: Path, device: torch.device = torch.device("cuda:0")):
        """
        Initializes the SparkTTS model with the provided configurations and device.

        Args:
            model_dir (Path): Directory containing the model and config files.
            device (torch.device): The device (CPU/GPU) to run the model on.
        """
        self.device = device
        self.model_dir = model_dir
        self.configs = load_config(f"{model_dir}/config.yaml")
        self.sample_rate = self.configs["sample_rate"]
        self._initialize_inference()

    def _initialize_inference(self):
        """Initializes the tokenizer, model, and audio tokenizer for inference."""
        self.tokenizer = AutoTokenizer.from_pretrained(f"{self.model_dir}/LLM")
        self.model = AutoModelForCausalLM.from_pretrained(f"{self.model_dir}/LLM")
        self.audio_tokenizer = BiCodecTokenizer(self.model_dir, device=self.device)
        self.model.to(self.device)

    def process_prompt(
        self,
        text: str,
        prompt_speech_path: Path,
        prompt_text: str = None,
    ) -> Tuple[str, torch.Tensor]:
        """
        Process input for voice cloning.

        Args:
            text (str): The text input to be converted to speech.
            prompt_speech_path (Path): Path to the audio file used as a prompt.
            prompt_text (str, optional): Transcript of the prompt audio.

        Return:
            Tuple[str, torch.Tensor]: Input prompt; global tokens.
        """
        # Tokenize the prompt audio into global (speaker) and semantic (content) codes.
        global_token_ids, semantic_token_ids = self.audio_tokenizer.tokenize(
            prompt_speech_path
        )
        global_tokens = "".join(
            [f"<|bicodec_global_{i}|>" for i in global_token_ids.squeeze()]
        )

        # Prepare the input tokens for the model
        if prompt_text is not None:
            semantic_tokens = "".join(
                [f"<|bicodec_semantic_{i}|>" for i in semantic_token_ids.squeeze()]
            )
            inputs = [
                TASK_TOKEN_MAP["tts"],
                "<|start_content|>",
                prompt_text,
                text,
                "<|end_content|>",
                "<|start_global_token|>",
                global_tokens,
                "<|end_global_token|>",
                "<|start_semantic_token|>",
                semantic_tokens,
            ]
        else:
            inputs = [
                TASK_TOKEN_MAP["tts"],
                "<|start_content|>",
                text,
                "<|end_content|>",
                "<|start_global_token|>",
                global_tokens,
                "<|end_global_token|>",
            ]

        inputs = "".join(inputs)

        return inputs, global_token_ids

    def process_prompt_control(
        self,
        gender: str,
        pitch: str,
        speed: str,
        text: str,
    ):
        """
        Process input for voice creation.

        Args:
            gender (str): female | male.
            pitch (str): very_low | low | moderate | high | very_high
            speed (str): very_low | low | moderate | high | very_high
            text (str): The text input to be converted to speech.

        Return:
            str: Input prompt
        """
        assert gender in GENDER_MAP.keys()
        assert pitch in LEVELS_MAP.keys()
        assert speed in LEVELS_MAP.keys()

        gender_id = GENDER_MAP[gender]
        pitch_level_id = LEVELS_MAP[pitch]
        speed_level_id = LEVELS_MAP[speed]

        pitch_label_tokens = f"<|pitch_label_{pitch_level_id}|>"
        speed_label_tokens = f"<|speed_label_{speed_level_id}|>"
        gender_tokens = f"<|gender_{gender_id}|>"

        attribute_tokens = "".join(
            [gender_tokens, pitch_label_tokens, speed_label_tokens]
        )

        control_tts_inputs = [
            TASK_TOKEN_MAP["controllable_tts"],
            "<|start_content|>",
            text,
            "<|end_content|>",
            "<|start_style_label|>",
            attribute_tokens,
            "<|end_style_label|>",
        ]

        return "".join(control_tts_inputs)

    @torch.no_grad()
    def inference(
        self,
        text: str,
        prompt_speech_path: Path = None,
        prompt_text: str = None,
        gender: str = None,
        pitch: str = None,
        speed: str = None,
        temperature: float = 0.8,
        top_k: int = 50,
        top_p: float = 0.95,
    ) -> torch.Tensor:
        """
        Performs inference to generate speech from text, incorporating prompt audio and/or text.

        Args:
            text (str): The text input to be converted to speech.
            prompt_speech_path (Path): Path to the audio file used as a prompt.
            prompt_text (str, optional): Transcript of the prompt audio.
            gender (str): female | male.
            pitch (str): very_low | low | moderate | high | very_high
            speed (str): very_low | low | moderate | high | very_high
            temperature (float, optional): Sampling temperature for controlling randomness. Default is 0.8.
            top_k (int, optional): Top-k sampling parameter. Default is 50.
            top_p (float, optional): Top-p (nucleus) sampling parameter. Default is 0.95.

        Returns:
            torch.Tensor: Generated waveform as a tensor.
        """
        if gender is not None:
            # Controllable TTS: the prompt carries style tokens instead of a voice sample.
            prompt = self.process_prompt_control(gender, pitch, speed, text)
        else:
            # Voice cloning: the prompt carries codec tokens from the reference audio.
            prompt, global_token_ids = self.process_prompt(
                text, prompt_speech_path, prompt_text
            )
        model_inputs = self.tokenizer([prompt], return_tensors="pt").to(self.device)

        # Generate speech using the model
        generated_ids = self.model.generate(
            **model_inputs,
            max_new_tokens=3000,
            do_sample=True,
            top_k=top_k,
            top_p=top_p,
            temperature=temperature,
        )

        # Trim the output tokens to remove the input tokens
        generated_ids = [
            output_ids[len(input_ids):]
            for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
        ]

        # Decode the generated tokens into text
        predicts = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

        # Extract semantic token IDs from the generated text
        pred_semantic_ids = (
            torch.tensor([int(token) for token in re.findall(r"bicodec_semantic_(\d+)", predicts)])
            .long()
            .unsqueeze(0)
        )

        if gender is not None:
            # In the controllable path the model also predicts the global (speaker) tokens.
            global_token_ids = (
                torch.tensor([int(token) for token in re.findall(r"bicodec_global_(\d+)", predicts)])
                .long()
                .unsqueeze(0)
                .unsqueeze(0)
            )

        # Convert semantic tokens back to waveform
        wav = self.audio_tokenizer.detokenize(
            global_token_ids.to(self.device).squeeze(0),
            pred_semantic_ids.to(self.device),
        )

        return wav
```
Clone the Spark-TTS GitHub repository (the inference code above imports from it):
```sh
git clone https://github.com/SparkAudio/Spark-TTS
```
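Install the Spark-TTS dependencies as well. Assuming the cloned repository ships a `requirements.txt` (check the repository if the path differs):
```sh
pip install -r Spark-TTS/requirements.txt
```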
Then download the fine-tuned checkpoint and construct the model:
```python
from huggingface_hub import snapshot_download
import torch

snapshot_download("hyperneuronAILabs/hailabsparktts", local_dir="hailabsparktts")

tts = SparkTTS("hailabsparktts", device=torch.device("cuda:0"))
# You can now call the inference method to get the audio.
```
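A minimal end-to-end sketch for voice cloning (the paths, transcript, and sentence are illustrative; `soundfile` is assumed to be installed for saving the waveform):
```python
import numpy as np
import soundfile as sf

# Voice cloning: synthesize Hindi speech in the voice of a reference clip.
wav = tts.inference(
    "नमस्ते, आप कैसे हैं?",               # text to synthesize (illustrative)
    prompt_speech_path="reference.wav",    # reference audio (illustrative path)
    prompt_text="transcript of the clip",  # transcript of the reference audio
)

# Ensure a numpy array before writing, in case a tensor is returned.
if not isinstance(wav, np.ndarray):
    wav = wav.squeeze().cpu().numpy()
sf.write("output.wav", wav, tts.sample_rate)
```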
### Training Procedure

Reach out to us (contact below) for details of the training process.

## Cost to fine-tune

It took only about ₹5,000 (INR) to fine-tune this model for a new language, Hindi.

## Model Card Authors

Hyperneuron AI Research Team

Drop an email at support@hyperneuron.in