---
language:
- en
license: apache-2.0
library_name: transformers
datasets: proprietary
tags:
- text-to-speech
- audio-generation
- voice-synthesis
- voice-design
pipeline_tag: text-to-audio
base_model:
- meta-llama/Llama-3.2-3B
---

# maya-1-voice

## Model Description

At MayaResearch, we pretrained a 3B-parameter Llama backbone on a large corpus of English audio to predict **[SNAC neural codec tokens](https://github.com/hubertsiuzdak/snac)** instead of waveforms. SNAC's multi-scale structure (≈12/23/47 Hz, ~0.98 kbps) keeps autoregressive sequences compact for real-time streaming.

Describe the voice, e.g. `<description="40-year-old, warm, low pitch, conversational">`, and get consistent speech across long passages. No speaker IDs. No prompt hacks.

**Note on codecs:** SNAC works well, but **[Mimi](https://huggingface.co/kyutai/mimi)** is worth exploring for lower frame rates and tighter streaming if your use case demands it.
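The compactness claim is easy to sanity-check: at the coarse rate of about 12 frames per second, 7 codes per frame from 4096-entry codebooks (12 bits each) lands close to the quoted bitrate. A quick back-of-envelope check (the ≈12 Hz figure is the nominal coarse rate; SNAC's exact hop gives a slightly lower number, hence ~0.98 kbps):

```python
import math

FRAME_RATE_HZ = 12                 # coarse SNAC level, ~12 frames per second
TOKENS_PER_FRAME = 7               # 1 coarse + 2 medium + 4 fine codes
BITS_PER_TOKEN = math.log2(4096)   # 4096-entry codebook -> 12 bits per code

tokens_per_sec = FRAME_RATE_HZ * TOKENS_PER_FRAME
kbps = tokens_per_sec * BITS_PER_TOKEN / 1000

print(tokens_per_sec)  # 84 autoregressive tokens per second of audio
print(round(kbps, 2))  # 1.01, close to the quoted ~0.98 kbps
```

84 tokens per second of audio is what makes autoregressive generation fast enough for real-time streaming.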

## The Problem & Solution

**The problem:** Traditional TTS struggles with three things: you can't reliably control voice characteristics without per-speaker training, streaming degrades quality, and reproducing the same voice twice is inconsistent.

**What we built:** Declarative voice design through XML attributes. The model maps text descriptions to delivery. Add inline emotion tags (`<laugh>`, `<angry>`, even `<sings>`; explained further below) for moment-level control without breaking persona. SNAC's low-bitrate tokens enable real-time generation with stable quality. Works with vLLM for production deployment.

## Model Details

### Architecture

We pretrained and finetuned a 3B-parameter decoder-only transformer (Llama-style) that predicts **SNAC codec tokens** instead of waveforms.

**The flow:** `<description="..."> text` → tokenize → generate SNAC codes (7 tokens/frame) → decode → 24 kHz audio. Emotion tags like `<laugh>` and `<sigh>` are special tokens placed exactly where needed.

**Why this works:** Discrete codecs let the model focus on delivery rather than raw acoustics. SNAC's hierarchical structure (≈12/23/47 Hz) keeps sequences compact for lower latency and stable long-form generation. The model runs on standard LLM infrastructure (vLLM), which makes streaming straightforward.
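To make the 7-tokens-per-frame structure concrete, here is a small sketch. The interleave order is our reading of the unpacking loop in the usage example further down (one coarse, two medium, four fine codes per frame), not an official specification:

```python
# One SNAC frame packs 7 codes across three scales:
# 1 coarse (~12 Hz), 2 medium (~23 Hz), 4 fine (~47 Hz).
# Assumed order, mirroring the decode loop below: [c0, m0, f0, f1, m1, f2, f3].

def pack_frame(coarse, medium, fine):
    """Pack 1 coarse, 2 medium, and 4 fine codes into one 7-token frame."""
    assert len(medium) == 2 and len(fine) == 4
    return [coarse, medium[0], fine[0], fine[1], medium[1], fine[2], fine[3]]

def unpack_frame(frame):
    """Invert pack_frame: 7 tokens -> (coarse, [m0, m1], [f0, f1, f2, f3])."""
    assert len(frame) == 7
    return frame[0], [frame[1], frame[4]], [frame[2], frame[3], frame[5], frame[6]]

frame = pack_frame(10, [20, 21], [30, 31, 32, 33])
print(frame)                # [10, 20, 30, 31, 21, 32, 33]
print(unpack_frame(frame))  # (10, [20, 21], [30, 31, 32, 33])
```

Because every frame is exactly 7 tokens, the decoder can emit audio as soon as each frame completes, which is what keeps latency low.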

### Preprocessing

We built a multi-gate pipeline that standardized everything before training.

1. **Acoustic standardization** - Resample to 24 kHz mono. Normalize loudness to -23 LUFS. Trim silence with VAD. Enforce 1-14 s clip lengths.
2. **Text alignment** - Forced alignment (MFA) at sentence level for clean boundaries. Unicode normalization, number expansion, punctuation cleanup.
3. **Emotion tagging** - Map all stage directions to a closed set of special tokens (`<laugh>`, `<whisper>`). Each tag is 1 token.
4. **Deduplication** - MinHash/LSH for text near-duplicates. Chromaprint for audio duplicates.
5. **Codec prep** - SNAC encode at 24 kHz, pack into 7-token frames, discard partial frames.
6. **Labeling** - Mask description text in the loss (conditioning only). Keep emotion tags unmasked (control signals).
7. **QC** - Speaker-disjoint splits. Automated checks for LUFS, SNR, alignment confidence, per-tag coverage.

**Why it mattered:** Consistent acoustics = consistent tokens = clean learning. No duplicates = faster convergence. Clean boundaries = stable prosody.
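The loudness gate in step 1 can be sketched as a single gain applied per clip. This uses plain RMS as a simplified stand-in for true LUFS metering; a production pipeline would use an ITU-R BS.1770 / EBU R128 meter instead:

```python
import math

TARGET_DB = -23.0  # target level; a simplified stand-in for -23 LUFS

def rms_db(samples):
    """RMS level of float samples (in [-1, 1]) in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-12))

def normalize(samples, target_db=TARGET_DB):
    """Apply one gain so the clip's RMS level hits target_db."""
    gain = 10 ** ((target_db - rms_db(samples)) / 20)
    return [s * gain for s in samples]

clip = [0.5, -0.5, 0.5, -0.5]     # a loud clip at about -6 dBFS RMS
out = normalize(clip)
print(round(rms_db(out), 1))      # -23.0
```

Applying one gain per clip (rather than a compressor) preserves the clip's internal dynamics, which is what "consistent acoustics = consistent tokens" relies on.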

![Mimi adversarial reference codec](https://cdn-uploads.huggingface.co/production/uploads/642a7d4e556ab448a0701ca1/lY1xOY9xhJjI_MV_Nno_e.png)
*Mimi adversarial reference codec*

## Training Data

### Data Sourcing & Labeling

We used two data streams:

**Pretraining (50K hours)** - Internet-scale English speech. Broad acoustic coverage for general stability and coarticulation.

**SFT (proprietary curated)** - Studio recordings + handpicked clean clips. Each sample got description tags and emotion tags. We used LLMs to propose descriptors; humans approved them. Emotions mapped to a closed set of angle-bracket tags (`<laugh>`, `<sigh>`, `<whisper>`), each a single token.

**Pipeline steps:**
- Resample to 24 kHz mono, reject corrupted files
- Loudness normalization (EBU R128)
- VAD silence trimming with duration bounds
- Forced alignment (MFA) for clean phrase boundaries
- Text dedup (MinHash/LSH), audio dedup (Chromaprint)
- SNAC encode at 24 kHz, pack 7-token frames, drop partial frames

**Labeling policy:** Description text is conditioning only and is masked in the loss. The model learns to predict audio tokens, not to read descriptions back. Emotion tags stay unmasked as control signals for timing.
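The masking policy can be sketched as follows. `-100` is the label value that Hugging Face-style cross-entropy losses ignore; the token IDs and span boundary here are illustrative, not the model's real vocabulary:

```python
IGNORE_INDEX = -100  # label value skipped by HF-style cross-entropy loss

def build_labels(input_ids, desc_end):
    """Mask the description span [0, desc_end) in the loss.

    input_ids: full sequence (description tokens, then text/audio tokens).
    desc_end:  index of the first token after the description span.
    Emotion tags and audio tokens keep their real IDs (unmasked).
    """
    return [IGNORE_INDEX if i < desc_end else tok
            for i, tok in enumerate(input_ids)]

# Illustrative sequence: 3 description tokens, then 3 audio tokens.
labels = build_labels([11, 12, 13, 501, 502, 503], desc_end=3)
print(labels)  # [-100, -100, -100, 501, 502, 503]
```

Masked positions still condition the model through attention; they just contribute no gradient, which is why the description steers delivery without ever being spoken.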

## Experiments - Prompt Formats

We ran four experiments on conditioning format. Same backbone, same SNAC target—only the text side changed.

### (1) Colon format

```
{description}: {text}
```

**Issue:** Format drift broke conditioning. The model sometimes spoke the descriptor text aloud. Tokenization made ":" a soft boundary.

### (2) Angle-list attributes

```
<{age}, {pitch}, {character}> {text}
```

**Issue:** Too rigid. A missing field shifted tokens and hurt generalization. Forced discrete buckets instead of natural language.

### (3) Typed key-value tags

```
<age=40><pitch=low><timbre=warm> {text}
```

**Result:** Decent performance but token bloat. Small format mistakes broke conditioning. Pushed users to tweak knobs rather than describe personas.

### (4) XML-attribute description (final)

```
<description="40-yr old, low-pitch, warm timbre, conversational"> {text}
```

**Result:** Best trade-off. One control field that reads like English. Clear boundaries with BOS/EOT. No narration leaks. Robust to variations.

![Prompt format comparison](https://cdn-uploads.huggingface.co/production/uploads/642a7d4e556ab448a0701ca1/t8N48YfXIXxnMvwqSAowY.png)
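One practical benefit of the final format is that it is trivial to parse on the serving side. A sketch of splitting such a prefix into its conditioning span and spoken text (the regex and function name are ours, not part of the model):

```python
import re

# One control field that reads like English, then the text to speak.
PROMPT_RE = re.compile(r'^<description="([^"]*)">\s*(.*)$', re.S)

def split_prompt(prompt):
    """Return (description, text) from an XML-attribute prompt."""
    m = PROMPT_RE.match(prompt)
    if not m:
        raise ValueError('prompt is not in <description="..."> text form')
    return m.group(1), m.group(2)

desc, text = split_prompt('<description="40-yr old, low-pitch, warm timbre"> Hello there.')
print(desc)  # 40-yr old, low-pitch, warm timbre
print(text)  # Hello there.
```

A server can use this to validate requests or to key a prefix cache on the description alone.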

## What We Learned

**Boundaries matter more than syntax.** Hard fences (BOS/EOT + turn tokens) with a single description span stopped meta-text leakage.

**Natural language beats micro-knobs.** One rich descriptor outperformed many tiny control tokens. Let the embedding space do its job. Keep emotions as lightweight tags.

**Rigid templates are fragile.** Missing fields in fixed formats shifted tokens and broke style. Free-form descriptions handled partial info gracefully.

**Codec framing enables streaming.** Enforcing mod-7 SNAC frames with consistent 24 kHz preprocessing kept long-form persona stable and audio streaming clean.

**Data hygiene was the multiplier.** Loudness normalization, MFA segmentation, deduplication (MinHash + Chromaprint), and a closed emotion lexicon of single tokens—this boring pipeline work stabilized training and eliminated most prompt-reading failures.

## How to Use

### How to Prompt

**Core format:**

```
<description="natural language persona"> Your text with inline <emotion> tags.
```

The XML-attribute description gives the model one clean conditioning span. Inline tags (`<laugh>`, `<sigh>`) act as single-token controls placed exactly where needed.

**Examples:**

```
<description="35-yr old, low pitch, warm, conversational, product demo">
Our new update <laugh> finally ships with the feature you asked for.
```

```
<description="event host, energetic, clear diction, slight NY accent">
Please welcome our next speaker <sigh> who needs no introduction.
```

```
<description="dark villain, breathy, slow, ominous">
You thought the night would hide you <whisper> but the night is mine.
```

**Guidelines:**
- Write like you'd brief a voice actor. Use commas.
- Place emotions exactly where they happen.
- Don't nest quotes inside descriptions.
- Stay consistent with the format—consistency = stable persona.
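The guidelines can be encoded in a small helper. Stripping double quotes from the brief enforces the "don't nest quotes" rule, since a stray quote would close the description attribute early (the helper name is ours, for illustration):

```python
def build_prompt(description, text):
    """Build the core prompt format from a voice brief and the text to speak.

    Double quotes are stripped from the brief so they cannot close the
    description attribute early ("don't nest quotes inside descriptions").
    """
    clean = description.replace('"', "").strip()
    return f'<description="{clean}"> {text}'

print(build_prompt('event host, "energetic", clear diction',
                   "Please welcome our next speaker <sigh> who needs no introduction."))
# <description="event host, energetic, clear diction"> Please welcome our next speaker <sigh> who needs no introduction.
```

Routing all prompts through one builder like this is an easy way to keep the format consistent, which the guidelines tie directly to persona stability.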

### Available Emotions

We support a closed set of paralinguistic tags, each a single special token:

`<laugh>` `<sigh>` `<whisper>` `<angry>` `<giggle>` `<chuckle>` `<gasp>` `<cry>`

The full list is available in the repository: [emotions.txt](https://huggingface.co/maya-research/maya-1-voice/blob/main/emotions.txt)

### Inference & vLLM

**Setup:**

1. Build the text prefix as trained: `[SOH, BOS, <description> text, EOT, EOH, SOA, SOS]` → stream audio tokens.
2. Decode with SNAC 24 kHz to PCM.
3. Enable Automatic Prefix Caching (APC) in vLLM for reused descriptors—it caches KV for identical prefixes.
4. Set stopping to the audio EOS token only, via `stop_token_ids`.
5. For web playback, use WebAudio (AudioWorklet) with a ring buffer—this avoids underruns and clicks.

Runs comfortably on a single GPU. Scale out for concurrent voices.
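Step 1 can be sketched at the token level. The special-token IDs below are placeholders (the real values come from the model's tokenizer); only the ordering follows the trained layout listed above:

```python
# Placeholder IDs for the framing special tokens; real IDs come from the
# tokenizer. Only the ordering [SOH, BOS, prompt, EOT, EOH, SOA, SOS] is
# the trained layout described in step 1.
SOH, BOS, EOT, EOH, SOA, SOS = 1001, 1002, 1003, 1004, 1005, 1006

def build_prefix(prompt_ids):
    """Wrap tokenized '<description="..."> text' in the trained frame."""
    return [SOH, BOS, *prompt_ids, EOT, EOH, SOA, SOS]

prefix = build_prefix([42, 43, 44])
print(prefix)  # [1001, 1002, 42, 43, 44, 1003, 1004, 1005, 1006]
```

Because the description sits at the start of this prefix, identical voice briefs share an identical prefix, which is exactly what vLLM's Automatic Prefix Caching exploits.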

```python
import torch
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya-1-voice", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya-1-voice")

# Load SNAC decoder (24 kHz)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")

# Build prompt
description = "Realistic male voice in the 30s age with american accent. Normal pitch, warm timbre, conversational pacing."
text = "Hello! This is Maya-1-Voice, a text-to-speech model."
prompt = f'<description="{description}"> {text}'

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=500, temperature=0.4, top_p=0.9, do_sample=True)

# Extract SNAC tokens (IDs 128266-156937)
generated_ids = outputs[0, inputs["input_ids"].shape[1]:]
snac_tokens = [t.item() for t in generated_ids if 128266 <= t <= 156937]

# Unpack 7-token frames into SNAC's three codebook levels
# (1 coarse + 2 medium + 4 fine codes per frame)
frames = len(snac_tokens) // 7
codes = [[], [], []]
for i in range(frames):
    s = snac_tokens[i * 7:(i + 1) * 7]
    codes[0].append((s[0] - 128266) % 4096)
    codes[1].extend([(s[1] - 128266) % 4096, (s[4] - 128266) % 4096])
    codes[2].extend([(s[2] - 128266) % 4096, (s[3] - 128266) % 4096,
                     (s[5] - 128266) % 4096, (s[6] - 128266) % 4096])

# Decode with SNAC
codes_tensor = [torch.tensor(c, dtype=torch.long, device="cuda").unsqueeze(0) for c in codes]
with torch.inference_mode():
    audio = snac_model.decoder(snac_model.quantizer.from_codes(codes_tensor))[0, 0].cpu().numpy()

# Save audio (24 kHz, mono)
sf.write("output.wav", audio, 24000)
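For streaming, you do not need the full sequence before decoding: a frame can be decoded as soon as its 7 tokens arrive. A minimal sketch of grouping an incremental token stream into complete frames (the buffering policy is our choice, not part of the model):

```python
def frames_from_stream(token_stream, lo=128266, hi=156937):
    """Group an incremental stream of token IDs into complete 7-token
    SNAC frames, dropping any non-audio tokens along the way."""
    buf = []
    for tok in token_stream:
        if lo <= tok <= hi:       # keep only SNAC codec tokens
            buf.append(tok)
            if len(buf) == 7:     # a full frame is ready to decode
                yield buf
                buf = []

# Tokens arriving one at a time: 1 non-audio token, then 15 audio tokens
# -> 2 complete frames (the last partial frame is held back).
stream = [999] + list(range(128266, 128281))
frames = list(frames_from_stream(stream))
print(len(frames))   # 2
print(frames[0][0])  # 128266
```

Each yielded frame can be unpacked and pushed through the SNAC decoder exactly as in the batch example above, giving audio out while generation is still running.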

Use our vLLM inference script: [vllm_streaming_inference.py](https://huggingface.co/maya-research/maya-1-voice/blob/main/vllm_streaming_inference.py)

**Example presets to try:**

**Roles:** `product_demo_voice`, `event_host`, `short_form_narrator`, `dark_villain`, `spy`

**Emotions:** `<laugh>`, `<sigh>`, `<whisper>`, `<angry>`, `<giggle>`

The voice brief is yours to write. For more details on prompts, check the prompt file in this repository: [prompt.txt](https://huggingface.co/maya-research/maya-1-voice/blob/main/prompt.txt)

Please note that this model is still highly experimental with respect to descriptions. Use the same seed to reproduce the same audio. Generalization across voices is something we are still working on, so that future models produce voices that vary more strongly from one another; this checkpoint is, however, very reliable at keeping a voice consistent. Please evaluate it before considering it for production.

## References & Citations

**[1]** SNAC: Multi-Scale Neural Audio Codec: https://github.com/hubertsiuzdak/snac
**[2]** Mimi: Adversarial Reference Codec: https://huggingface.co/kyutai/mimi
**[3]** vLLM: Async Inference Engine: https://docs.vllm.ai/en/v0.6.5/dev/engine/async_llm_engine.html

---

**Model developed by:** MayaResearch
**Model type:** Text-to-Speech, Audio Generation
**Language:** English
**License:** Apache 2.0