AryanNsc committed on
Commit
ef92654
·
verified ·
1 Parent(s): 0824b26

Update README.md

  1. README.md +161 -3
README.md CHANGED
@@ -1,3 +1,161 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - en
+ - gu
+ base_model:
+ - Qwen/Qwen2.5-0.5B
+ pipeline_tag: text-to-speech
+ tags:
+ - tts
+ - indian-accent
+ ---
+ # Ind-QwenTTS
+
+ A lightweight multilingual Text-to-Speech system with accent control for English and Gujarati.
+
+ ## Features
+
+ - Multilingual: English + Gujarati
+ - Accent Control: Indian & Gujarati accents
+ - 4 voices (2 male, 2 female)
+ - Accent transfer capability
+ - Fast inference with 0.5B parameters
+
+ ## Supported Voices
+
+ | Speaker ID | Language | Accent | Gender |
+ |------------|----------|--------|--------|
+ | `SPK_EN_M_001` | English | Indian | Male |
+ | `SPK_EN_F_001` | English | Indian | Female |
+ | `SPK_GU_M_001` | Gujarati | Gujarati | Male |
+ | `SPK_GU_F_001` | Gujarati | Gujarati | Female |
+
+ ## Installation
+
+ ```bash
+ pip install transformers torch torchaudio snac torchcodec
+ ```
+
+ ## Usage
+
+ ```python
+ import torch
+ import torchaudio
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from snac import SNAC
+
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+
+ # Load the tokenizer, the language model, and the SNAC 24 kHz audio codec.
+ tokenizer = AutoTokenizer.from_pretrained("AryanNsc/IND-QWENTTS-V1", fix_mistral_regex=True)
+ model = AutoModelForCausalLM.from_pretrained("AryanNsc/IND-QWENTTS-V1", torch_dtype=torch.bfloat16).to(device).eval()
+ snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to(device).eval()
+
+ def generate_speech(text, language="english", accent="indian", gender="M", speaker=None, output_file="output.wav"):
+     # Pick a default speaker for the (language, gender) pair if none was given.
+     if speaker is None:
+         speaker_map = {
+             ("english", "M"): "SPK_EN_M_001",
+             ("english", "F"): "SPK_EN_F_001",
+             ("gujarati", "M"): "SPK_GU_M_001",
+             ("gujarati", "F"): "SPK_GU_F_001"
+         }
+         speaker = speaker_map.get((language, gender), "SPK_EN_M_001")
+
+     # Control tags come first, followed by the text to synthesize.
+     prompt = f"<lang>{language}</lang><accent>{accent}</accent><gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
+
+     input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(device)
+
+     # Special tokens appended after the prompt to start audio generation.
+     start_tokens = torch.tensor([
+         tokenizer.convert_tokens_to_ids("<|endoftext|>"),
+         tokenizer.convert_tokens_to_ids("<soh>"),
+         tokenizer.convert_tokens_to_ids("<soa>"),
+         tokenizer.convert_tokens_to_ids("<sos>")
+     ], device=device).unsqueeze(0)
+
+     full_input = torch.cat([input_ids, start_tokens], dim=1)
+
+     with torch.no_grad():
+         output = model.generate(
+             full_input,
+             max_new_tokens=1500,
+             temperature=0.7,
+             top_p=0.85,
+             repetition_penalty=1.15,
+             do_sample=True,
+             pad_token_id=tokenizer.pad_token_id,
+             eos_token_id=tokenizer.convert_tokens_to_ids("<eos>")
+         )
+
+     # Keep only the newly generated tokens.
+     generated_ids = output[0, full_input.shape[1]:]
+
+     # Drop the trailing <eos> token, if present.
+     eos_id = tokenizer.convert_tokens_to_ids("<eos>")
+     if len(generated_ids) > 0 and generated_ids[-1] == eos_id:
+         generated_ids = generated_ids[:-1]
+
+     # Audio is emitted in frames of 7 tokens; discard any incomplete final frame.
+     if len(generated_ids) % 7 != 0:
+         trunc_len = (len(generated_ids) // 7) * 7
+         generated_ids = generated_ids[:trunc_len]
+
+     if len(generated_ids) == 0:
+         print("Error: No audio generated.")
+         return
+
+     # Shape (7, num_frames): row i holds position i of every frame.
+     codes = generated_ids.reshape(-1, 7).T
+
+     # Shift token IDs down to codec indices; negative values are clamped to 0.
+     snac_offset = model.config.vocab_size - 4096
+     codes = codes - snac_offset
+     codes = torch.clamp(codes, min=0)
+
+     # De-interleave each frame into 1 + 2 + 4 codes for the three SNAC layers.
+     l1 = codes[0, :]
+     l2 = torch.stack([codes[1, :], codes[4, :]], dim=1).flatten()
+     l3 = torch.stack([codes[2, :], codes[3, :], codes[5, :], codes[6, :]], dim=1).flatten()
+
+     with torch.inference_mode():
+         audio = snac.decode([l1.unsqueeze(0), l2.unsqueeze(0), l3.unsqueeze(0)])
+
+     # Save as a 24 kHz WAV file.
+     audio_tensor = audio.squeeze(0).cpu()
+     torchaudio.save(output_file, audio_tensor, 24000)
+     print(f"Saved to {output_file}")
+
+ generate_speech(
+     text="The competition results will be announced tomorrow morning.",
+     language="english",
+     accent="indian",
+     gender="M",
+     output_file="test_english.wav"
+ )
+ ```
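The frame handling in the usage code (truncate to a multiple of 7, then split each frame into 1 + 2 + 4 codes) can be checked without loading any model. The following is a minimal pure-Python sketch of that step only; `split_frames` and `FRAME_SIZE` are hypothetical names introduced for illustration and are not part of the README or of the `snac` package.

```python
from typing import List, Sequence, Tuple

FRAME_SIZE = 7  # tokens per audio frame, as assumed by the usage code


def split_frames(token_ids: Sequence[int]) -> Tuple[List[int], List[int], List[int]]:
    """Drop any trailing partial frame, then de-interleave into three layers.

    Mirrors the index pattern of the torch code: position 0 goes to layer 1,
    positions 1 and 4 go to layer 2, positions 2, 3, 5 and 6 go to layer 3.
    """
    usable = (len(token_ids) // FRAME_SIZE) * FRAME_SIZE
    l1: List[int] = []
    l2: List[int] = []
    l3: List[int] = []
    for start in range(0, usable, FRAME_SIZE):
        frame = token_ids[start:start + FRAME_SIZE]
        l1.append(frame[0])
        l2.extend([frame[1], frame[4]])
        l3.extend([frame[2], frame[3], frame[5], frame[6]])
    return l1, l2, l3


# Two full frames plus two leftover tokens; the leftovers are discarded.
l1, l2, l3 = split_frames(list(range(16)))
print(l1, l2, l3)
```

For every frame the three lists grow by 1, 2 and 4 entries respectively, which matches the lengths of `l1`, `l2` and `l3` passed to `snac.decode` in the usage code.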
+
+ ## Examples
+
+ **Basic English synthesis:**
+ ```python
+ generate_speech("Hello world, this is a test.", language="english", accent="indian", gender="M")
+ ```
+
+ **Gujarati synthesis:**
+ ```python
+ generate_speech("નમસ્તે, તમે કેમ છો?", language="gujarati", accent="gujarati", gender="F")
+ ```
+
+ ## Parameters
+
+ - `text`: Text to synthesize
+ - `language`: `"english"` or `"gujarati"`
+ - `accent`: `"indian"` or `"gujarati"`
+ - `gender`: `"M"` (male) or `"F"` (female)
+ - `speaker`: Optional specific speaker ID (auto-selected from `language` and `gender` if not provided)
+ - `output_file`: Path of the WAV file to write (defaults to `"output.wav"`)
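To make the parameter list concrete, here is a small sketch of how these values combine into the prompt string used in the usage code. `build_prompt`, `DEFAULT_SPEAKERS` and `ACCENTS` are hypothetical names introduced for illustration. Unlike the README code, which silently falls back to `SPK_EN_M_001` for an unknown language or gender, this sketch raises a `ValueError`; that stricter behavior is an assumption, not something the README specifies.

```python
from typing import Optional

# Default speaker per (language, gender), copied from the usage code.
DEFAULT_SPEAKERS = {
    ("english", "M"): "SPK_EN_M_001",
    ("english", "F"): "SPK_EN_F_001",
    ("gujarati", "M"): "SPK_GU_M_001",
    ("gujarati", "F"): "SPK_GU_F_001",
}
# Accent values listed in the Parameters section.
ACCENTS = {"indian", "gujarati"}


def build_prompt(
    text: str,
    language: str = "english",
    accent: str = "indian",
    gender: str = "M",
    speaker: Optional[str] = None,
) -> str:
    """Validate the documented parameter values and assemble the tagged prompt."""
    if (language, gender) not in DEFAULT_SPEAKERS:
        raise ValueError(f"unsupported language/gender: {language!r}, {gender!r}")
    if accent not in ACCENTS:
        raise ValueError(f"unsupported accent: {accent!r}")
    if speaker is None:
        speaker = DEFAULT_SPEAKERS[(language, gender)]
    return (
        f"<lang>{language}</lang><accent>{accent}</accent>"
        f"<gender>{gender}</gender><speaker>{speaker}</speaker> {text}"
    )


# English text with a Gujarati accent, using the default female English speaker.
print(build_prompt("Hello world.", language="english", accent="gujarati", gender="F"))
```

Validating up front surfaces typos such as `language="English"` before any audio is generated, at the cost of rejecting values the model might still accept.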
+
+ ## Training Code
+
+ Training pipeline and scripts will be open-sourced soon.
+
+ ## Citation
+
+ ```bibtex
+ @misc{ind-qwentts-2024,
+   title={Ind-QwenTTS: Multilingual Accent-Aware TTS},
+   author={Aryan Purohit},
+   year={2025}
+ }
+ ```