Vaakya‑Open
Vaakya‑Open is a high‑quality, single‑speaker Text‑to‑Speech (TTS) model developed by Voxaura Labs, designed for English and Hindi voice synthesis. It features a natural female voice, optimized for voice‑overs, audiobooks, podcasts, narration, assistants, and production‑grade applications.
Built with a strong focus on clarity, consistency, and expressiveness, Vaakya‑Open is ideal for creators and developers looking for a dependable, studio‑like voice that works seamlessly across English, Hindi, and code‑mixed inputs.
🔊 Live Demo
👉 Try it instantly using the accompanying Gradio Space:
Vaakya‑Open TTS Demo — Convert English or Hindi text into natural‑sounding speech directly in your browser.
Note: If the Space has gone dormant due to inactivity, you can restart it. A restart takes about 4 minutes on the T4 small VM.
🎧 Audio Samples
Listen to sample outputs demonstrating the model's capabilities:
| Sample | Language | Audio Player |
|---|---|---|
| English | Pure English | |
| Hindi | Pure Hindi (Devanagari) | |
| Code‑Mixed | Hindi + English (Hinglish) | |
All samples are generated at 24kHz with 16-bit PCM encoding.
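As a quick sanity check of that format, the stdlib `wave` module can confirm a file's properties. The snippet below is self-contained: it writes a short synthetic tone in the stated format (24kHz, 16-bit PCM, mono) and then verifies it the same way you would verify a real model output; the filename `sample.wav` is just an illustration.

```python
import math
import struct
import wave

# Write a short synthetic tone in the card's stated format
# (24 kHz, 16-bit PCM, mono) so the check is self-contained.
RATE, SECONDS = 24000, 0.1
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit PCM
    w.setframerate(RATE)   # 24 kHz
    for n in range(int(RATE * SECONDS)):
        value = int(32767 * 0.2 * math.sin(2 * math.pi * 440 * n / RATE))
        w.writeframes(struct.pack("<h", value))

# Verify the properties a Vaakya-Open output file should have
with wave.open("sample.wav", "rb") as w:
    assert w.getframerate() == 24000   # sample rate
    assert w.getsampwidth() == 2       # 16-bit
    assert w.getnchannels() == 1       # mono
print("format OK")
```

The same three assertions apply unchanged to any `output.wav` produced by the model.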
✨ Key Highlights
- 🎙️ Single Professional Female Voice — consistent, warm, and narration‑ready
- 🌐 Bilingual Support — English & Hindi (with natural code‑mixing)
- 🎧 Studio‑Quality Audio — trained on pristine 192kHz recordings, output at 24kHz
- ⚡ Low‑Latency Inference — suitable for real‑time and batch workflows
- 🧠 Production‑Oriented — stable voice characteristics across long passages
🧠 Model Overview
| Attribute | Details |
|---|---|
| Model Name | Vaakya‑Open |
| Model Type | Autoregressive Transformer (Speech‑LLM) |
| Base Architecture | Llama 3B |
| Speaker | Single speaker (Female) |
| Languages | English, Hindi, Code‑mixed |
| Audio Codec | SNAC @ 24kHz |
| Sampling Rate | 24 kHz (output) |
| Developed By | Voxaura Labs |
| License | Apache 2.0 |
🏗️ Architecture
Vaakya‑Open is built on the Orpheus TTS architecture pioneered by Canopy Labs, which treats speech synthesis as a language modeling task. The model generates discrete audio tokens that are decoded into high‑quality waveforms.
```
┌──────────────────────────────────────────────────────────────────────────┐
│                       VAAKYA-OPEN TTS ARCHITECTURE                       │
└──────────────────────────────────────────────────────────────────────────┘

╔══════════════════╗
║    Text Input    ║  English / Hindi / Code-Mixed
╚════════╤═════════╝
         │
         ▼
╔═══════════════════════════╗
║       Llama 3B LLM        ║  Autoregressive generation,
║   (Speech Transformer)    ║  7 tokens per audio frame
╚═════════════╤═════════════╝
              │
              ▼
╔═══════════════════════════╗
║       Audio Tokens        ║  Discrete codes,
║       (SNAC Format)       ║  hierarchical 3-level
╚═════════════╤═════════════╝
              │
              ▼
╔═══════════════════════════╗
║       SNAC Decoder        ║  Neural audio codec,
║         (24kHz)           ║  token → waveform
╚═════════════╤═════════════╝
              │
              ▼
╔═══════════════════════════╗
║   24kHz Audio Waveform    ║  High-quality speech output,
║         (Output)          ║  studio-grade
╚═══════════════════════════╝
```
How It Works
1. Text Input — your text (English, Hindi, or code‑mixed) is tokenized by the model's text tokenizer
2. Audio Token Generation — the Llama‑based LLM autoregressively generates discrete audio tokens (7 tokens per audio frame)
3. SNAC Decoding — the SNAC neural codec converts the audio tokens back into a 24kHz waveform
4. Output — high‑quality speech audio ready for playback or further processing
Key Architectural Features
| Component | Specification |
|---|---|
| LLM Backbone | Llama‑style autoregressive transformer (3B parameters) |
| Audio Tokenizer | SNAC with 7 tokens per frame (flattened sequence) |
| Tokens per Second | ~83 audio tokens/second |
| Context Length | 2048 tokens |
| Streaming Support | Yes (via sliding window decoding) |
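The ~83 tokens/second figure follows from the codec geometry. Assuming a coarse-frame hop of 2048 samples at 24kHz for the SNAC codec (an assumption about the codec configuration, not stated in this card), the arithmetic works out as:

```python
SAMPLE_RATE = 24000      # Hz, model output rate
HOP_SAMPLES = 2048       # assumed samples per coarse SNAC frame
TOKENS_PER_FRAME = 7     # flattened SNAC tokens per frame

frames_per_second = SAMPLE_RATE / HOP_SAMPLES            # ≈ 11.72 frames/s
tokens_per_second = frames_per_second * TOKENS_PER_FRAME

print(f"{tokens_per_second:.1f} audio tokens/second")    # ≈ 82.0, i.e. ~83

# With a 2048-token context, the maximum clip length is roughly:
max_seconds = 2048 / tokens_per_second
print(f"~{max_seconds:.0f} s of audio per context window")   # ~25 s
```

Under this assumption, a single context window covers roughly 25 seconds of audio, which is why longer passages require streaming via sliding-window decoding.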
Attribution: This architecture builds on Orpheus TTS by Canopy Labs, which demonstrated that LLMs can achieve human‑level speech synthesis.
🎼 Voice & Audio Quality
| Attribute | Value |
|---|---|
| Original Recording Rate | 192 kHz (studio‑grade) |
| Training / Output Rate | 24 kHz |
| Output Format | WAVE (PCM 16-bit, mono, 24kHz) |
| Recording Environment | Controlled studio conditions |
| Voice Style | Neutral, professional, voice‑over friendly |
This high‑resolution capture pipeline preserves subtle vocal textures, resulting in clean pronunciation, smooth prosody, and reduced artifacts.
🚀 Getting Started
Installation
```bash
pip install torch transformers soundfile accelerate
pip install snac  # for audio decoding
```

For optional 4-bit quantization support:

```bash
pip install bitsandbytes
```
Basic Usage
```python
import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModel

model_id = "voxaura-labs/vaakya-open"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "नमस्ते! This is Vaakya‑Open from Voxaura Labs."
with torch.no_grad():
    audio = model.generate_speech(text)

sf.write("output.wav", audio, 24000)
```
Advanced Usage (Full Pipeline)
For users who need more control over generation parameters:
```python
import torch
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC

# Load model and tokenizer (FP16 for balanced speed and quality)
model = AutoModelForCausalLM.from_pretrained(
    "voxaura-labs/vaakya-open",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("voxaura-labs/vaakya-open")

# Initialize SNAC decoder
# Note: SNAC will be moved to CPU during generation for stability
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
if torch.cuda.is_available():
    snac_model = snac_model.cuda()

# Optional: for faster inference with lower memory usage, use 4-bit quantization:
# from transformers import BitsAndBytesConfig
# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_use_double_quant=True,
# )
# Then pass quantization_config to from_pretrained() instead of torch_dtype.

# Special token IDs
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266

# Audio token range: 128266 to 156937 (28,672 tokens total)
# The 7 tokens per frame use offsets: +0, +4096, +8192, +12288, +16384, +20480, +24576
MAX_AUDIO_TOKEN = 156937


def generate_speech(text, temperature=0.5, top_p=0.9):
    """Generate speech from text.

    Args:
        text: Input text (English, Hindi, or code-mixed)
        temperature: Sampling temperature (0.4-0.7 recommended)
        top_p: Nucleus sampling parameter

    Returns:
        numpy array: Audio waveform at 24kHz
    """
    # Move SNAC to CPU for decoding (important for stability)
    snac_model.to("cpu")

    # Tokenize text (automatically adds BOS token)
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Create input sequence: START_HUMAN + [BOS + text + EOS] + END_HUMAN
    start_token = torch.tensor([[START_OF_HUMAN_TOKEN]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, END_OF_HUMAN_TOKEN]], dtype=torch.int64)  # EOS + END_HUMAN
    input_sequence = torch.cat([start_token, input_ids, end_tokens], dim=1)
    input_sequence = input_sequence.to(model.device)

    # Generate audio tokens
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_sequence,
            max_new_tokens=2048,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=END_OF_SPEECH_TOKEN,
            use_cache=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Extract generated tokens
    gen_tokens = generated_ids[0]

    # Find the last occurrence of the START_OF_SPEECH token
    sos_indices = (gen_tokens == START_OF_SPEECH_TOKEN).nonzero(as_tuple=True)[0]
    if len(sos_indices) > 0:
        # Start from after the last START_OF_SPEECH token
        start_idx = sos_indices[-1].item() + 1
        cropped_tokens = gen_tokens[start_idx:]
    else:
        cropped_tokens = gen_tokens

    # Remove END_OF_SPEECH tokens
    audio_tokens = cropped_tokens[cropped_tokens != END_OF_SPEECH_TOKEN]

    # Truncate so the length is divisible by 7 (whole frames only)
    num_frames = len(audio_tokens) // 7
    audio_tokens = audio_tokens[:num_frames * 7]

    if len(audio_tokens) == 0:
        raise ValueError("No audio tokens generated")

    # Convert to a list and subtract the base offset
    code_list = [t.item() - AUDIO_CODE_BASE_OFFSET for t in audio_tokens]

    # Decode to audio
    return decode_snac_tokens(code_list)


def decode_snac_tokens(code_list):
    """Decode SNAC tokens to an audio waveform.

    This model uses a 7-token interleaved encoding per audio frame with offsets:
        Token 1: offset +0     (level 0, coarse)
        Token 2: offset +4096  (level 1, medium)
        Token 3: offset +8192  (level 2, fine)
        Token 4: offset +12288 (level 2, fine)
        Token 5: offset +16384 (level 1, medium)
        Token 6: offset +20480 (level 2, fine)
        Token 7: offset +24576 (level 2, fine)

    Args:
        code_list: List of audio codes (already offset-adjusted)

    Returns:
        numpy array: Decoded audio waveform
    """
    if not code_list or len(code_list) % 7 != 0:
        raise ValueError("Code list must be non-empty and divisible by 7")

    # Redistribute codes into SNAC's 3-level hierarchy
    layer_1 = []  # Coarse: 1 token per frame
    layer_2 = []  # Medium: 2 tokens per frame
    layer_3 = []  # Fine: 4 tokens per frame
    for i in range(len(code_list) // 7):
        # Strip each token's slot offset before assigning it to its layer
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i + 1] - 4096)
        layer_3.append(code_list[7*i + 2] - 2 * 4096)
        layer_3.append(code_list[7*i + 3] - 3 * 4096)
        layer_2.append(code_list[7*i + 4] - 4 * 4096)
        layer_3.append(code_list[7*i + 5] - 5 * 4096)
        layer_3.append(code_list[7*i + 6] - 6 * 4096)

    # Create hierarchical code tensors for SNAC (on the same device as the SNAC model)
    snac_device = next(snac_model.parameters()).device
    codes = [
        torch.tensor(layer_1, device=snac_device).unsqueeze(0),
        torch.tensor(layer_2, device=snac_device).unsqueeze(0),
        torch.tensor(layer_3, device=snac_device).unsqueeze(0),
    ]

    # Decode to an audio waveform
    with torch.no_grad():
        audio_hat = snac_model.decode(codes)
    return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()


# Example usage
text = "आज का मौसम बहुत अच्छा है। Let's go for a walk!"
audio = generate_speech(text, temperature=0.5, top_p=0.9)
sf.write("output.wav", audio, 24000)
print(f"Generated {len(audio) / 24000:.2f} seconds of audio")
```
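The 7-token interleaving used by the decoder above can be sanity-checked in isolation with a pure-Python round trip: pack one frame's worth of hierarchical codes into the flattened slot layout with their offsets, then unpack them the same way `decode_snac_tokens` does. The helper names here (`pack_frame`, `unpack_frame`) are illustrative, not part of the model's API.

```python
# Pack/unpack one SNAC frame using the card's interleave order:
# slots [L1, L2, L3, L3, L2, L3, L3], with an offset of k*4096 for slot k.

def pack_frame(l1, l2, l3):
    """Flatten one frame (1 coarse, 2 medium, 4 fine codes) with slot offsets."""
    slots = [l1[0], l2[0], l3[0], l3[1], l2[1], l3[2], l3[3]]
    return [code + k * 4096 for k, code in enumerate(slots)]

def unpack_frame(tokens):
    """Invert pack_frame: strip slot offsets and regroup by SNAC level."""
    vals = [tok - k * 4096 for k, tok in enumerate(tokens)]
    l1 = [vals[0]]
    l2 = [vals[1], vals[4]]
    l3 = [vals[2], vals[3], vals[5], vals[6]]
    return l1, l2, l3

l1, l2, l3 = [12], [34, 56], [78, 90, 11, 22]
tokens = pack_frame(l1, l2, l3)
assert all(0 <= t < 7 * 4096 for t in tokens)   # within the 28,672-code range
assert unpack_frame(tokens) == (l1, l2, l3)     # lossless round trip
print("interleave OK")
```

Adding `AUDIO_CODE_BASE_OFFSET` (128266) to the packed values gives exactly the vocabulary range 128266-156937 quoted in the code above.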
🌍 Language Support
| Language | Status | Notes |
|---|---|---|
| English | ✅ Supported | Clear, neutral international pronunciation |
| Hindi | ✅ Supported | Natural, fluent, and expressive |
| Code‑Mixed (Hinglish) | ✅ Supported | Seamless language switching |
🧩 Use Cases
Vaakya‑Open is well‑suited for:
| Application | Description |
|---|---|
| 📚 Audiobooks & Narration | Long‑form content with consistent voice |
| 🎥 Video Voice‑Overs | Professional dubbing and narration |
| 🏥 Healthcare | Patient reminders, medication instructions |
| 📞 Voice AI Agents | IVR systems, conversational assistants |
| 🧑🏫 E‑learning | Educational content and tutorials |
| ♿ Accessibility | Screen readers for visually impaired users |
🏗️ Training Summary
| Attribute | Value |
|---|---|
| Speaker Count | 1 (professional voice‑over artist) |
| Recording Quality | Studio‑grade, 192kHz original capture |
| Output Quality | 24kHz (downsampled for efficiency) |
| Data Characteristics | Conversational, narrative, informational |
| Training Method | LoRA fine‑tuning on Orpheus base |
Training Data Sources
This model was trained using high‑quality speech data from publicly available academic datasets developed by premier Indian research institutions:
| Dataset | Institution | Description | License |
|---|---|---|---|
| IndicTTS | IIT Madras | Hindi and Indian English speech corpus for TTS | Custom (see below) |
| SYSPIN TTS Corpus | IISc Bangalore / SPIRE Lab | 900+ hours of studio-recorded TTS data in 9 Indian languages | CC‑BY‑4.0 |
| SPICOR TTS Corpus | IISc Bangalore / SPIRE Lab | 97+ hour domain-rich Indian English TTS corpus | CC‑BY‑4.0 |
We gratefully acknowledge these institutions for making their datasets available for research and development in Indian language speech synthesis.
📊 Performance
| Metric | Value |
|---|---|
| Latency (A100‑80GB) | ~120ms |
| Latency (RTX 4090) | ~200ms |
| Real‑time Factor | < 0.1x* |
| Output Sample Rate | 24kHz |
Note: Benchmarks measured with 4-bit quantization (`load_in_4bit=True`), batch size 1, average text length of ~20-30 tokens, and `max_new_tokens=512`. Performance varies significantly with hardware, precision (FP16/FP32/4-bit), text length, and generation parameters. FP16 inference may be 2-3x slower than 4-bit but provides better quality.
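The real-time factor (RTF) in the table is wall-clock generation time divided by the duration of the audio produced; values below 1.0 mean faster than real time. A quick illustration with hypothetical numbers in the ballpark of the figures above:

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = wall-clock generation time / duration of generated audio."""
    return generation_seconds / audio_seconds

# Hypothetical example: 0.45 s to generate a 5-second clip
rtf = real_time_factor(0.45, 5.0)
print(f"RTF = {rtf:.2f}")   # 0.09, under the card's < 0.1x figure
assert rtf < 0.1
```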
⚠️ Limitations
| Limitation | Description |
|---|---|
| Single Speaker | No speaker switching or voice selection |
| Language Support | English & Hindi only |
| Emotion Control | Emotion tokens not yet exposed |
| Hardware | GPU recommended for real‑time inference |
🛣️ Roadmap
We are actively working on:
- 🎭 Emotion & prosody control tokens
- 🌏 Additional Indian languages (Tamil, Telugu, Bengali, Marathi)
- 🎙️ Multiple voice variants (male, regional accents)
- ⚙️ CPU‑optimized inference paths
- 📡 Streaming inference support
📜 License
This project is released under the Apache 2.0 License.
Training Data Licenses
The training data used in this model is subject to the following licenses:
IIT Madras IndicTTS Dataset:
COPYRIGHT 2016 TTS Consortium, TDIL, Meity represented by Hema A Murthy & S Umesh, DEPARTMENT OF Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.
The IndicTTS dataset is provided under a permissive license that allows derivative works and free distribution. See the full license for details.
IISc Bangalore SYSPIN & SPICOR Datasets:
The SYSPIN and SPICOR TTS corpora are released under the Creative Commons Attribution 4.0 International License (CC‑BY‑4.0), which permits sharing and adaptation for any purpose, including commercial use, with appropriate attribution.
🙏 Acknowledgments
Training Data
We gratefully acknowledge the following institutions and projects for providing high‑quality speech datasets that made this model possible:
Indian Institute of Technology Madras (IIT Madras)
- IndicTTS Project — A comprehensive speech corpus for Indian languages developed by Prof. Hema A Murthy, Prof. S Umesh, and the TTS Consortium
- Funded by: Technology Development for Indian Languages (TDIL), Ministry of Electronics and Information Technology (MeitY), Government of India
Indian Institute of Science, Bangalore (IISc) — SPIRE Lab
- SYSPIN Project (SYnthesizing SPeech in INdian languages) — 900+ hours of studio-recorded TTS data in 9 Indian languages, led by Prof. Prasanta Kumar Ghosh
- Funded by: German Development Cooperation "FAIR Forward — AI for All" (GIZ Germany) and Bhashini AI Solutions Private Limited
- SPICOR Project — 97+ hours of domain-rich Indian English TTS corpus
- Dataset URL: https://spiredatasets.iisc.ac.in/
Architecture & Tools
- Canopy Labs — For pioneering the Orpheus TTS architecture and open‑sourcing their work
- Unsloth — For training optimizations and fine‑tuning tools
- SNAC — Hubert Siuzdak for the neural audio codec
Voice
- Voice Artist — For providing high‑quality studio recordings
🛡️ Safety, Ethics & Responsible Use
Intended Use
Vaakya‑Open is designed for legitimate applications including:
- 📚 Audiobook narration and content creation
- 🎥 Video voice‑overs and dubbing
- 🏥 Healthcare communication (patient reminders, medication instructions)
- 📞 Voice assistants and IVR systems
- 🧑🏫 Educational content and e‑learning
- ♿ Accessibility tools for visually impaired users
- 🔬 Academic research in speech synthesis
Prohibited Uses
Do not use this model for:
- ❌ Impersonation — Creating audio that falsely represents a real person without their explicit consent
- ❌ Fraud & Scams — Generating deceptive audio for phishing, vishing, or financial fraud
- ❌ Misinformation — Producing fake news, propaganda, or misleading content
- ❌ Deepfakes — Creating non‑consensual synthetic media intended to deceive
- ❌ Harassment — Generating content to bully, threaten, or demean individuals
- ❌ Illegal Activities — Any use that violates applicable laws or regulations
Transparency Recommendation
If you use Vaakya‑Open to generate speech for public‑facing applications, we strongly recommend disclosing to end users that they are listening to AI‑generated content. Transparency builds trust and helps prevent potential misuse.
Legal Compliance
Users are responsible for ensuring their use of this model complies with:
- All applicable local, national, and international laws
- Industry‑specific regulations (healthcare, finance, telecommunications, etc.)
- Platform terms of service where the generated audio is distributed
- Intellectual property and privacy rights of third parties
Disclaimer
THE DEVELOPERS OF VAAKYA‑OPEN ASSUME NO LIABILITY FOR ANY MISUSE OF THIS MODEL.
This model is provided "as is" without warranty of any kind. Voxaura Labs and its contributors are not responsible for any direct, indirect, incidental, or consequential damages arising from the use or misuse of this model or its outputs.
By using this model, you agree to follow all applicable laws, respect the rights and privacy of others, and uphold ethical standards in AI development and deployment.
Ethical AI Commitment
We at Voxaura Labs are committed to the responsible development and use of AI technologies. We believe that:
- AI should augment human capabilities, not deceive or harm
- Transparency is essential in AI‑generated content
- Privacy and consent must be respected in voice applications
- Accessibility should be a core consideration in voice technology
We encourage the community to use Vaakya‑Open as a force for good — to improve accessibility, enhance communication, and create meaningful experiences while respecting the dignity and rights of all individuals.
Reporting Misuse
If you become aware of any misuse of this model, please report it to us at hello@voxygen.ai. We take reports of misuse seriously and will take appropriate action where possible.
💬 About Voxaura Labs & Voxygen.ai
Voxaura Labs builds advanced Voice AI systems focused on realism, scalability, and multilingual accessibility — bridging the gap between human expression and artificial speech.
Voxygen is the product brand of Voxaura Labs, dedicated to making advanced Voice AI technologies accessible through practical applications and tools for creators, developers, and enterprises worldwide.
📝 Citation
```bibtex
@misc{vaakya2026,
  title={Vaakya-Open: Text-to-Speech for Hindi and English},
  author={Voxaura Labs},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/voxaura-labs/vaakya-open}
}
```
If you use this model, please also consider citing the training data sources:
```bibtex
@misc{indictts2016,
  title={IndicTTS: Text-to-Speech for Indian Languages},
  author={Murthy, Hema A and Umesh, S and {TTS Consortium}},
  year={2016},
  institution={Indian Institute of Technology Madras},
  note={TDIL, MeitY},
  url={https://www.iitm.ac.in/donlab/indictts/database}
}

@misc{syspin2025,
  title={SYSPIN_S1.0 Corpus - A TTS Corpus of 900+ hours in nine Indian Languages},
  author={Abhayjeet et al.},
  year={2025},
  institution={Indian Institute of Science, Bengaluru},
  url={https://spiredatasets.iisc.ac.in/syspinCorpus}
}

@misc{spicor2025,
  title={SPICOR TTS_1.0 Corpus - A 97+ hour domain-rich Indian English TTS Corpus},
  author={Abhayjeet et al.},
  year={2025},
  institution={Indian Institute of Science, Bengaluru},
  url={https://spiredatasets.iisc.ac.in/spicortts10}
}
```
Made in India 🇮🇳
If you use Vaakya‑Open in your work, we'd love to hear from you!
Base model: meta-llama/Llama-3.2-3B-Instruct