---
language:
- en
license: apache-2.0
library_name: transformers
datasets: proprietary
tags:
- text-to-speech
- audio-generation
- voice-synthesis
- voice-design
pipeline_tag: text-to-audio
base_model:
- meta-llama/Llama-3.2-3B
---

# maya-1-voice

## Model Description

At MayaResearch, we pretrained a 3B-parameter Llama backbone on a large corpus of English audio to predict **[SNAC neural codec tokens](https://github.com/hubertsiuzdak/snac)** instead of waveforms. SNAC's multi-scale structure (≈12/23/47 Hz, ~0.98 kbps) keeps autoregressive sequences compact for real-time streaming.
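
To see why these sequences stay short, here is a back-of-the-envelope check using the figures above, assuming 4096-entry codebooks (i.e. 12 bits per code):

```python
# Rough sanity check of SNAC's sequence compactness. Assumes 4096-entry
# codebooks (12 bits per code); the rates are the approximate 12/23/47 Hz
# scales mentioned above.
frame_rates_hz = [12, 23, 47]           # coarse -> fine SNAC scales
bits_per_code = 12                      # log2(4096)

codes_per_second = sum(frame_rates_hz)  # autoregressive tokens per second
bitrate_kbps = codes_per_second * bits_per_code / 1000

print(codes_per_second)  # 82 codec tokens per second of audio
print(bitrate_kbps)      # 0.984, i.e. the ~0.98 kbps quoted above
```

Roughly 82 tokens per second of audio is what makes autoregressive generation faster than real time on ordinary LLM serving stacks.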

Describe the voice—`<description="40-year-old, warm, low pitch, conversational">`—and get consistent speech across long passages. No speaker IDs. No prompt hacks.

**Note on codecs:** SNAC works well, but **[Mimi](https://huggingface.co/kyutai/mimi)** is worth exploring for lower frame rates and tighter streaming if your use case demands it.

## The Problem & Solution

**The problem:** Traditional TTS struggles with three things—you can't reliably control voice characteristics without per-speaker training, streaming kills quality, and reproducing the same voice twice is inconsistent.

**What we built:** Declarative voice design through XML attributes. The model maps text descriptions to delivery. Add inline emotion tags (`<laugh>`, `<angry>`, and even `<sings>`; more on these below) for moment-level control without breaking persona. SNAC's low-bitrate tokens enable real-time generation with stable quality. Works with vLLM for production deployment.

## Model Details

### Architecture

We pretrained and finetuned a 3B-parameter decoder-only transformer (Llama-style) that predicts **SNAC codec tokens** instead of waveforms.

**The flow:** `<description="..."> text` → tokenize → generate SNAC codes (7 tokens/frame) → decode → 24 kHz audio. Emotion tags like `<laugh>` and `<sigh>` are special tokens placed exactly where needed.

**Why this works:** Discrete codecs let the model focus on delivery rather than raw acoustics. SNAC's hierarchical structure (≈12/23/47 Hz) keeps sequences compact for lower latency and stable long-form generation. Runs on standard LLM infrastructure (vLLM), making streaming trivial.

### Preprocessing

We built a multi-gate pipeline that standardized everything before training.

1. **Acoustic standardization** - Resample to 24 kHz mono. Normalize loudness to -23 LUFS. Trim silence with VAD. Enforce 1-14 s clip lengths.
2. **Text alignment** - Forced alignment (MFA) at sentence level for clean boundaries. Unicode normalization, number expansion, punctuation cleanup.
3. **Emotion tagging** - Map all stage directions to a closed set of special tokens (`<laugh>`, `<whisper>`). Each tag is 1 token.
4. **Deduplication** - MinHash/LSH for text near-dupes. Chromaprint for audio duplicates.
5. **Codec prep** - SNAC encode at 24 kHz, pack into 7-token frames, discard partials.
6. **Labeling** - Mask description text in loss (conditioning only). Keep emotion tags unmasked (control signals).
7. **QC** - Speaker-disjoint splits. Automated checks for LUFS, SNR, alignment confidence, per-tag coverage.

**Why it mattered:** Consistent acoustics = consistent tokens = clean learning. No dupes = faster convergence. Clean boundaries = stable prosody.
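
Step 1 can be sketched with stock scientific-Python tools. This is a sketch under assumptions, not our pipeline code: `scipy.signal.resample_poly` stands in for the resampler, a plain RMS gain stands in for a true ITU-R BS.1770 LUFS meter (e.g. pyloudnorm), and a simple energy threshold stands in for a proper VAD:

```python
import numpy as np
from scipy.signal import resample_poly

def standardize_clip(audio: np.ndarray, sr: int, target_sr: int = 24_000,
                     target_rms_db: float = -23.0, trim_db: float = -40.0):
    """Sketch of the acoustic-standardization gate.

    NOTE: true -23 LUFS normalization needs a BS.1770 meter; plain RMS
    is used here only as a stand-in, and the energy threshold likewise
    stands in for VAD.
    """
    # Downmix to mono and resample to 24 kHz.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)
    if sr != target_sr:
        audio = resample_poly(audio, target_sr, sr)

    # Trim leading/trailing low-energy samples (VAD stand-in).
    thresh = np.max(np.abs(audio)) * 10 ** (trim_db / 20)
    voiced = np.flatnonzero(np.abs(audio) > thresh)
    if voiced.size:
        audio = audio[voiced[0]:voiced[-1] + 1]

    # Gain to the target level (LUFS stand-in via RMS).
    rms = np.sqrt(np.mean(audio ** 2))
    if rms > 0:
        audio = audio * (10 ** (target_rms_db / 20) / rms)

    # Enforce the 1-14 s duration gate; reject everything else.
    dur = len(audio) / target_sr
    return audio if 1.0 <= dur <= 14.0 else None
```

The point of the gate is uniformity: every surviving clip is 24 kHz mono at a fixed level, so the codec sees consistent acoustics.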

![Pretraining pipeline](https://cdn-uploads.huggingface.co/production/uploads/64d955e74f73c3bfa7b46d1e/r9QrIOrrVq5hiAtzXvTI6.png)
*Mimi adversarial reference codec*

## Training Data

### Data Sourcing & Labeling

We used two data streams:

**Pretraining (50K hours)** - Internet-scale English speech. Broad acoustic coverage for general stability and coarticulation.

**SFT (proprietary curated)** - Studio recordings + handpicked clean clips. Each sample got description tags and emotion tags. We used LLMs to propose descriptors, humans approved. Emotions mapped to a closed set of angle-bracket tags (`<laugh>`, `<sigh>`, `<whisper>`)—each is a single token.

**Pipeline steps:**
- Resample to 24 kHz mono, reject corrupted files
- Loudness normalization (EBU R128)
- VAD silence trimming with duration bounds
- Forced alignment (MFA) for clean phrase boundaries
- Text deduplication (MinHash-LSH), audio deduplication (Chromaprint)
- SNAC encode at 24 kHz, pack 7-token frames, drop partials

**Labeling policy:** Description text is conditioning only—masked in loss. The model learns to predict audio tokens, not read descriptions. Emotion tags stay unmasked as control signals for timing.
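
The masking policy can be sketched in a few lines (a sketch, not our training code; `-100` is the standard ignore-index of PyTorch-style cross-entropy, and `build_labels` is an illustrative helper name):

```python
IGNORE_INDEX = -100  # ignore-index used by PyTorch-style cross-entropy

def build_labels(input_ids, desc_span):
    """Sketch of the labeling policy: tokens inside the description span
    are conditioning only, so they are masked out of the loss. Everything
    else (text, inline emotion tags, audio codec tokens) stays a
    prediction target."""
    start, end = desc_span
    return [IGNORE_INDEX if start <= i < end else tok
            for i, tok in enumerate(input_ids)]

# e.g. a 6-token sequence whose first 3 tokens are the description:
print(build_labels([11, 12, 13, 501, 502, 503], (0, 3)))
# [-100, -100, -100, 501, 502, 503]
```

Because the description contributes context but no loss, the model conditions on it without ever learning to read it aloud.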

## Experiments - Prompt Formats

We ran four experiments on conditioning format. Same backbone, same SNAC target—only the text side changed.

### (1) Colon format

```
{description}: {text}
```

**Issue:** Format drift broke conditioning. The model sometimes spoke the descriptor text. Tokenization made ":" a soft boundary.

### (2) Angle-list attributes

```
<{age}, {pitch}, {character}> {text}
```

**Issue:** Too rigid. Missing one field shifted tokens and hurt generalization. Forced discrete buckets instead of natural language.

### (3) Typed key-value tags

```
<age=40><pitch=low><timbre=warm> {text}
```

**Result:** Decent performance but token bloat. Small format mistakes broke conditioning. Pushed users to tweak knobs rather than describe personas.

### (4) XML-attribute description (final)

```
<description="40-yr old, low-pitch, warm timbre, conversational"> {text}
```

**Result:** Best trade-off. One control field that reads like English. Clear boundaries with BOS/EOT. No narration leaks. Robust to variations.

![Format experiments](https://cdn-uploads.huggingface.co/production/uploads/64d955e74f73c3bfa7b46d1e/HSTNLK2ebM9sK8BWbKHJm.png)

## What We Learned

**Boundaries matter more than syntax.** Hard fences (BOS/EOT + turn tokens) with a single description span stopped meta-text leakage.

**Natural language beats micro-knobs.** One rich descriptor outperformed many tiny control tokens. Let the embedding space do its job. Keep emotions as lightweight tags.

**Rigid templates are fragile.** Missing fields in fixed formats shifted tokens and broke style. Free-form descriptions handled partial info gracefully.

**Codec framing enables streaming.** Enforcing mod-7 SNAC frames with consistent 24 kHz preprocessing kept long-form persona stable and audio streaming clean.

**Data hygiene was the multiplier.** Loudness normalization, MFA segmentation, deduplication (MinHash + Chromaprint), and a closed emotion lexicon as single tokens—this boring pipeline work stabilized training and eliminated most prompt-reading failures.
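
The text-dedup idea above can be illustrated in pure Python. This sketch computes exact Jaccard similarity over character shingles; MinHash-LSH is an approximation of this same score that scales to millions of transcripts:

```python
def shingles(text: str, k: int = 5) -> set:
    """Character k-shingles of a whitespace-normalized transcript."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between two transcripts' shingle sets.
    MinHash-LSH approximates exactly this quantity at scale."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Near-duplicate transcripts score high; unrelated ones score near zero.
print(jaccard("the quick brown fox jumps", "the quick brown fox jumped"))
print(jaccard("the quick brown fox jumps", "loudness normalization helps"))
```

Pairs above a similarity threshold are dropped before training, which is what drove the faster convergence noted earlier.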

## How to Use

### How to Prompt

**Core format:**

```
<description="natural language persona"> Your text with inline <emotion> tags.
```

The XML-attribute description gives the model one clean conditioning span. Inline tags (`<laugh>`, `<sigh>`) act as single-token controls placed exactly where needed.

**Examples:**

```
<description="35-yr old, low pitch, warm, conversational, product demo">
Our new update <laugh> finally ships with the feature you asked for.
```

```
<description="event host, energetic, clear diction, slight NY accent">
Please welcome our next speaker <sigh> who needs no introduction.
```

```
<description="dark villain, breathy, slow, ominous">
You thought the night would hide you <whisper> but the night is mine.
```

**Guidelines:**
- Write like you'd brief a voice actor. Use commas.
- Place emotions exactly where they happen.
- Don't nest quotes inside descriptions.
- Stay consistent with format—consistency = stable persona.
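
A small helper can enforce these guidelines programmatically. This is our sketch; `build_prompt` is a hypothetical convenience function, not part of the model's API:

```python
def build_prompt(description: str, text: str) -> str:
    """Assemble the <description="..."> prompt while enforcing the
    guidelines: no nested double quotes inside the description, a single
    conditioning span, and the text (with inline emotion tags) appended
    after one space."""
    if '"' in description:
        raise ValueError("don't nest double quotes inside descriptions")
    return f'<description="{description}"> {text}'

print(build_prompt(
    "35-yr old, low pitch, warm, conversational, product demo",
    "Our new update <laugh> finally ships with the feature you asked for.",
))
```

Keeping prompt assembly in one place is the easiest way to stay format-consistent across requests, which is what keeps the persona stable.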

### Available Emotions

We support a closed set of paralinguistic tags as single special tokens:

`<laugh>` `<sigh>` `<whisper>` `<angry>` `<giggle>` `<chuckle>` `<gasp>` `<cry>`

Full list available in the repository: [emotions.txt](https://huggingface.co/maya-research/maya-1-voice/blob/main/emotions.txt)
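
Because the tag set is closed, unknown tags can be caught before generation. A sketch using only the tags listed above (the authoritative set is emotions.txt):

```python
import re

# Tags listed above; the authoritative set lives in emotions.txt.
KNOWN_TAGS = {"laugh", "sigh", "whisper", "angry",
              "giggle", "chuckle", "gasp", "cry"}

def unknown_tags(text: str) -> set:
    """Return angle-bracket tags in `text` that are not in the closed set.
    An unknown tag would be tokenized as plain text, not as a single
    control token, so it is worth flagging before generation."""
    return set(re.findall(r"<(\w+)>", text)) - KNOWN_TAGS

print(unknown_tags("Hello <laugh> world <smirk>"))  # {'smirk'}
```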

### Inference & vLLM

**Setup:**

1. Build the text prefix as trained: `[SOH, BOS, <description> text, EOT, EOH, SOA, SOS]` → stream audio tokens.
2. Decode with SNAC 24 kHz to PCM.
3. Enable Automatic Prefix Caching (APC) in vLLM for reused descriptors—it caches KV for identical prefixes.
4. Set the stop condition to the audio EOS token only, via `stop_token_ids`.
5. For web playback, use WebAudio (AudioWorklet) with a ring buffer—this avoids underruns and clicks.

Runs comfortably on a single GPU. Scale out for concurrent voices.

```python
import torch
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "maya-research/maya-1-voice", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/maya-1-voice")

# Load SNAC decoder (24 kHz)
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to("cuda")

# Build prompt
description = "Realistic male voice in the 30s age with american accent. Normal pitch, warm timbre, conversational pacing."
text = "Hello! This is Maya-1-Voice, a text-to-speech model."
prompt = f'<description="{description}"> {text}'

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=500, temperature=0.4, top_p=0.9, do_sample=True)

# Extract SNAC tokens (IDs 128266-156937)
generated_ids = outputs[0, inputs["input_ids"].shape[1]:]
snac_tokens = [t.item() for t in generated_ids if 128266 <= t <= 156937]

# Unpack 7-token frames into SNAC's three hierarchical code levels
# (1 coarse + 2 mid + 4 fine codes per frame)
frames = len(snac_tokens) // 7
codes = [[], [], []]
for i in range(frames):
    s = snac_tokens[i * 7:(i + 1) * 7]
    codes[0].append((s[0] - 128266) % 4096)
    codes[1].extend([(s[1] - 128266) % 4096, (s[4] - 128266) % 4096])
    codes[2].extend([(s[2] - 128266) % 4096, (s[3] - 128266) % 4096,
                     (s[5] - 128266) % 4096, (s[6] - 128266) % 4096])

# Decode with SNAC
codes_tensor = [torch.tensor(c, dtype=torch.long, device="cuda").unsqueeze(0) for c in codes]
with torch.inference_mode():
    audio = snac_model.decoder(snac_model.quantizer.from_codes(codes_tensor))[0, 0].cpu().numpy()

# Save audio (24 kHz, mono)
sf.write("output.wav", audio, 24000)
```

For vLLM inference, use our streaming script: [vllm_streaming_inference.py](https://huggingface.co/maya-research/maya-1-voice/blob/main/vllm_streaming_inference.py)

**Example presets to try:**

**Roles:** `product_demo_voice`, `event_host`, `short_form_narrator`, `dark_villain`, `spy`

**Emotions:** `<laugh>`, `<sigh>`, `<whisper>`, `<angry>`, `<giggle>`

The voice brief is yours to write. For more details on prompts, check our prompt file in this repository: [prompt.txt](https://huggingface.co/maya-research/maya-1-voice/blob/main/prompt.txt)

Please note that this model is still experimental with respect to descriptions. Use the same seed to reproduce the same audio. Voice generalisation is ongoing work; we are improving how distinct generated voices can be from one another. This checkpoint is, however, very reliable at keeping a voice consistent. Please evaluate it before considering it for production.

## References & Citations

**[1]** SNAC: Multi-Scale Neural Audio Codec https://github.com/hubertsiuzdak/snac
**[2]** Mimi: Adversarial Reference Codec https://huggingface.co/kyutai/mimi
**[3]** vLLM: Async Inference Engine https://docs.vllm.ai/en/v0.6.5/dev/engine/async_llm_engine.html

---

**Model developed by:** MayaResearch
**Model type:** Text-to-Speech, Audio Generation
**Language:** English
**License:** Apache 2.0