+---
+license: mit
+language:
+- en
+- gu
+base_model:
+- Qwen/Qwen2.5-0.5B
+pipeline_tag: text-to-speech
+tags:
+- tts
+- indian-accent
+---
+# Ind-QwenTTS
+A lightweight multilingual Text-to-Speech system with accent control for English and Gujarati.
+## Features
+- Multilingual: English + Gujarati
+- Accent Control: Indian & Gujarati accents
+- 4 voices (2 male, 2 female)
+- Accent transfer capability
+- Fast inference: built on the 0.5B-parameter Qwen2.5-0.5B base model
+## Supported Voices
+| Speaker ID | Language | Accent | Gender |
+|-----------|----------|---------|---------|
+| `SPK_EN_M_001` | English | Indian | Male |
+| `SPK_EN_F_001` | English | Indian | Female |
+| `SPK_GU_M_001` | Gujarati | Gujarati | Male |
+| `SPK_GU_F_001` | Gujarati | Gujarati | Female |
+## Installation
+```bash
+pip install transformers torch torchaudio snac torchcodec
+```
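Before loading the model, it can help to confirm that the packages from the install command are importable. The snippet below is a minimal sketch: `missing_packages` is a hypothetical helper, not part of this repository, and it assumes each package's import name matches its pip name.

```python
from importlib.util import find_spec

# Package names taken from the pip command above.
# Assumption: each import name is the same as the pip name.
REQUIRED = ("transformers", "torch", "torchaudio", "snac", "torchcodec")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located as top-level modules."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```

This only checks that the modules can be located; it does not import them or verify versions.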
+## Usage
+```python
+import torch
+import torchaudio
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from snac import SNAC
+device = "cuda" if torch.cuda.is_available() else "cpu"
+tokenizer = AutoTokenizer.from_pretrained("AryanNsc/IND-QWENTTS-V1", fix_mistral_regex=True)
+model = AutoModelForCausalLM.from_pretrained("AryanNsc/IND-QWENTTS-V1", torch_dtype=torch.bfloat16).to(device).eval()
+snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device).eval()
+def generate_speech(text, language="english", accent="indian", gender="M", speaker=None, output_file="output.wav"):
+    if speaker is None:
+        speaker_map = {
+            ("english", "M"): "SPK_EN_M_001",
+            ("english", "F"): "SPK_EN_F_001",
+            ("gujarati", "M"): "SPK_GU_M_001",
+            ("gujarati", "F"): "SPK_GU_F_001"
+        }
+        speaker = speaker_map.get((language, gender), "SPK_EN_M_001")
+    prompt = f"<lang>{language}</lang><accent>{accent}</accent><gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
+    input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)
+    start_tokens = torch.tensor([
+        tokenizer.convert_tokens_to_ids("<|endoftext|>"),
+        tokenizer.convert_tokens_to_ids("<soh>"),
+        tokenizer.convert_tokens_to_ids("<soa>"),
+        tokenizer.convert_tokens_to_ids("<sos>")
+    ], device=device).unsqueeze(0)
+    full_input = torch.cat([input_ids, start_tokens], dim=1)
+    with torch.no_grad():
+        output = model.generate(
+            full_input,
+            max_new_tokens=1500,
+            temperature=0.7,
+            top_p=0.85,
+            repetition_penalty=1.15,
+            do_sample=True,
+            pad_token_id=tokenizer.pad_token_id,
+            eos_token_id=tokenizer.convert_tokens_to_ids("<eos>")
+        )
+    # Keep only the newly generated tokens and strip a trailing <eos>.
+    generated_ids = output[0, full_input.shape[1]:]
+    eos_id = tokenizer.convert_tokens_to_ids("<eos>")
+    if len(generated_ids) > 0 and generated_ids[-1] == eos_id:
+        generated_ids = generated_ids[:-1]
+    # Audio tokens come in frames of 7; drop any trailing partial frame.
+    if len(generated_ids) % 7 != 0:
+        trunc_len = (len(generated_ids) // 7) * 7
+        generated_ids = generated_ids[:trunc_len]
+    if len(generated_ids) == 0:
+        print("Error: No audio generated.")
+        return
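+    # One column per frame, then shift token IDs down into the 4096-entry code range.
+    codes = generated_ids.reshape(-1, 7).T
+    snac_offset = model.config.vocab_size - 4096
+    codes = codes - snac_offset
+    codes = torch.clamp(codes, min=0)
+    # De-interleave each frame into SNAC's three levels (1, 2, and 4 codes per frame).
+    l1 = codes[0, :]
+    l2 = torch.stack([codes[1, :], codes[4, :]], dim=1).flatten()
+    l3 = torch.stack([codes[2, :], codes[3, :], codes[5, :], codes[6, :]], dim=1).flatten()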
+    with torch.inference_mode():
+        audio = snac.decode([l1.unsqueeze(0), l2.unsqueeze(0), l3.unsqueeze(0)])
+    audio_tensor = audio.squeeze(0).cpu()
+    torchaudio.save(output_file, audio_tensor, 24000)
+    print(f"Saved to {output_file}")
+generate_speech(
+    text="The competition results will be announced tomorrow morning.",
+    language="english",
+    accent="indian",
+    gender="M",
+    output_file="test_english.wav"
+)
+```
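The frame handling in `generate_speech` can be hard to follow from tensor indexing alone. As an illustration, the sketch below reproduces the same index pattern on a plain Python list so the layout is easy to inspect without loading the model. `split_frames` is a hypothetical helper written for this example, and the input values are dummy codes, not real model output.

```python
def split_frames(codes):
    """Split a flat list of 7-code frames into three levels.

    Mirrors the indexing in generate_speech: position 0 feeds level 1,
    positions 1 and 4 feed level 2, positions 2, 3, 5, 6 feed level 3.
    """
    usable = len(codes) - len(codes) % 7  # drop a trailing partial frame
    frames = [codes[i:i + 7] for i in range(0, usable, 7)]
    l1 = [f[0] for f in frames]
    l2 = [c for f in frames for c in (f[1], f[4])]
    l3 = [c for f in frames for c in (f[2], f[3], f[5], f[6])]
    return l1, l2, l3


# Dummy codes 0..15: two complete frames plus two leftover codes.
l1, l2, l3 = split_frames(list(range(16)))
print(l1)  # [0, 7]
print(l2)  # [1, 4, 8, 11]
print(l3)  # [2, 3, 5, 6, 9, 10, 12, 13]
```

Each frame therefore contributes 1, 2, and 4 codes to the three lists passed to `snac.decode`, and the two leftover codes are discarded.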
+## Examples
+**Basic English synthesis:**
+```python
+generate_speech("Hello world, this is a test.", language="english", accent="indian", gender="M")
+```
+**Gujarati synthesis:**
+```python
+generate_speech("નમસ્તે, તમે કેમ છો?", language="gujarati", accent="gujarati", gender="F")
+```
+## Parameters
+- `text`: Text to synthesize
+- `language`: `"english"` or `"gujarati"`
+- `accent`: `"indian"` or `"gujarati"`
+- `gender`: `"M"` (male) or `"F"` (female)
+- `speaker`: Optional speaker ID from the Supported Voices table (auto-selected from `language` and `gender` if omitted)
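To show how these parameters end up in the model input, here is a minimal sketch of a prompt builder that uses the same tag format as `generate_speech`. `build_prompt`, `VOICES`, and `ACCENTS` are hypothetical names introduced for this example. Unlike `generate_speech`, which silently falls back to `SPK_EN_M_001`, this sketch raises on unsupported values. Pairing English text with the `"gujarati"` accent is shown as an assumption about how accent transfer is requested; the model card does not spell this out.

```python
# Speaker IDs copied from the Supported Voices table.
VOICES = {
    ("english", "M"): "SPK_EN_M_001",
    ("english", "F"): "SPK_EN_F_001",
    ("gujarati", "M"): "SPK_GU_M_001",
    ("gujarati", "F"): "SPK_GU_F_001",
}
ACCENTS = {"indian", "gujarati"}


def build_prompt(text, language="english", accent="indian", gender="M", speaker=None):
    """Build the tagged prompt string, validating parameters first."""
    if (language, gender) not in VOICES:
        raise ValueError(f"unsupported language/gender: {language!r}/{gender!r}")
    if accent not in ACCENTS:
        raise ValueError(f"unsupported accent: {accent!r}")
    if speaker is None:
        speaker = VOICES[(language, gender)]
    return (
        f"<lang>{language}</lang><accent>{accent}</accent>"
        f"<gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
    )


# Assumed accent-transfer request: English text with the Gujarati accent tag.
print(build_prompt("Good morning.", language="english", accent="gujarati", gender="F"))
```

Failing fast on a typo such as `language="English"` avoids synthesizing with an unintended default voice.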
+## Training Code
+Training pipeline and scripts will be open-sourced soon.
+## Citation
+```bibtex
+@misc{ind-qwentts-2025,
+  title={Ind-QwenTTS: Multilingual Accent-Aware TTS},
+  author={Aryan Purohit},
+  year={2025}
+}
+```