Vaakya‑Open

Vaakya‑Open is a high‑quality, single‑speaker Text‑to‑Speech (TTS) model developed by Voxaura Labs, designed for English and Hindi voice synthesis. It features a natural female voice, optimized for voice‑overs, audiobooks, podcasts, narration, assistants, and production‑grade applications.

Built with a strong focus on clarity, consistency, and expressiveness, Vaakya‑Open is ideal for creators and developers looking for a dependable, studio‑like voice that works seamlessly across English, Hindi, and code‑mixed inputs.


🔊 Live Demo

👉 Try it instantly using the accompanying Gradio Space:

Vaakya‑Open TTS Demo — Convert English or Hindi text into natural‑sounding speech directly in your browser.

Note: if the Space has gone dormant due to inactivity, you can restart it. A restart takes about 4 minutes on a T4 small VM.


🎧 Audio Samples

Listen to sample outputs demonstrating the model's capabilities:

| Sample | Language |
|--------|----------|
| English | Pure English |
| Hindi | Pure Hindi (Devanagari) |
| Code‑Mixed | Hindi + English (Hinglish) |

All samples are generated at 24kHz with 16-bit PCM encoding.


✨ Key Highlights

  • 🎙️ Single Professional Female Voice — consistent, warm, and narration‑ready
  • 🌐 Bilingual Support — English & Hindi (with natural code‑mixing)
  • 🎧 Studio‑Quality Audio — trained on pristine 192kHz recordings, output at 24kHz
  • Low‑Latency Inference — suitable for real‑time and batch workflows
  • 🧠 Production‑Oriented — stable voice characteristics across long passages

🧠 Model Overview

| Attribute | Details |
|-----------|---------|
| Model Name | Vaakya‑Open |
| Model Type | Autoregressive Transformer (Speech‑LLM) |
| Base Architecture | Llama 3B |
| Speaker | Single speaker (female) |
| Languages | English, Hindi, code‑mixed |
| Audio Codec | SNAC @ 24 kHz |
| Sampling Rate | 24 kHz (output) |
| Developed By | Voxaura Labs |
| License | Apache 2.0 |

🏗️ Architecture

Vaakya‑Open is built on the Orpheus TTS architecture pioneered by Canopy Labs, which treats speech synthesis as a language modeling task. The model generates discrete audio tokens that are decoded into high‑quality waveforms.

┌──────────────────────────────────────────────────────────────────────────┐
│                       VAAKYA-OPEN TTS ARCHITECTURE                       │
└──────────────────────────────────────────────────────────────────────────┘

    ╔══════════════════╗
    ║   Text Input     ║  English / Hindi / Code-Mixed
    ║                  ║
    ╚════════╤═════════╝
             │
             ▼
    ╔═══════════════════════════╗
    ║   Llama 3B LLM            ║  Autoregressive generation
    ║   (Speech Transformer)    ║  7 tokens per audio frame
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   Audio Tokens            ║  Discrete codes
    ║   (SNAC Format)           ║  Hierarchical 3-level
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   SNAC Decoder            ║  Neural audio codec
    ║   (24kHz)                 ║  Token → Waveform
    ╚═════════════╤═════════════╝
                  │
                  ▼
    ╔═══════════════════════════╗
    ║   24kHz Audio Waveform    ║  High-quality speech output
    ║   (Output)                ║  Studio-grade quality
    ╚═══════════════════════════╝

How It Works

  1. Text Input — Your text (English, Hindi, or code‑mixed) is tokenized using a text tokenizer
  2. Audio Token Generation — The Llama‑based LLM autoregressively generates discrete audio tokens (7 tokens per audio frame)
  3. SNAC Decoding — The SNAC neural codec converts audio tokens back into a 24kHz waveform
  4. Output — High‑quality speech audio ready for playback or further processing
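As a sketch of step 1, the input text is framed with the control tokens listed in the Advanced Usage section of this card before generation begins. The text token IDs below are illustrative stand-ins; only the control-token IDs come from the card itself.

```python
# Control-token IDs are taken from the Advanced Usage section of this card;
# the text token IDs passed in are placeholders, not real tokenizer output.
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
EOT_TOKEN = 128009  # Llama end-of-turn token appended after the text

def frame_prompt(text_token_ids):
    """Wrap tokenized text as: START_HUMAN + text + EOT + END_HUMAN."""
    return [START_OF_HUMAN_TOKEN] + list(text_token_ids) + [EOT_TOKEN, END_OF_HUMAN_TOKEN]

framed = frame_prompt([101, 102, 103])  # placeholder text token IDs
print(framed)  # [128259, 101, 102, 103, 128009, 128260]
```

The model then continues this sequence with audio tokens, which steps 2–4 decode into a waveform.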

Key Architectural Features

| Component | Specification |
|-----------|---------------|
| LLM Backbone | Llama‑style autoregressive transformer (3B parameters) |
| Audio Tokenizer | SNAC with 7 tokens per frame (flattened sequence) |
| Tokens per Second | ~83 audio tokens/second |
| Context Length | 2048 tokens |
| Streaming Support | Yes (via sliding‑window decoding) |
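The token rate and frame size above imply a simple back-of-the-envelope duration estimate: at roughly 83 audio tokens per second and 7 tokens per frame, a generated token count maps to seconds of audio. The helper below is a rough sketch based only on the approximate rate stated in this table.

```python
TOKENS_PER_FRAME = 7
TOKENS_PER_SECOND = 83  # approximate rate from the table above

def estimated_duration_seconds(num_audio_tokens):
    """Rough audio duration implied by a generated audio-token count."""
    frames = num_audio_tokens // TOKENS_PER_FRAME  # incomplete frames are dropped
    return frames * TOKENS_PER_FRAME / TOKENS_PER_SECOND

# With the 2048-token context, an all-audio sequence caps out near:
print(f"{estimated_duration_seconds(2048):.1f} s")  # ~24.6 s
```

This is why long passages are typically generated in chunks rather than in a single pass.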

Attribution: This architecture builds on Orpheus TTS by Canopy Labs, which demonstrated that LLMs can achieve human‑level speech synthesis.


🎼 Voice & Audio Quality

| Attribute | Value |
|-----------|-------|
| Original Recording Rate | 192 kHz (studio‑grade) |
| Training / Output Rate | 24 kHz |
| Output Format | WAV (16‑bit PCM, mono, 24 kHz) |
| Recording Environment | Controlled studio conditions |
| Voice Style | Neutral, professional, voice‑over friendly |

This high‑resolution capture pipeline preserves subtle vocal textures, resulting in clean pronunciation, smooth prosody, and reduced artifacts.
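Voxaura's actual downsampling pipeline is not published, but 192 kHz to 24 kHz is a clean 8:1 ratio, so a polyphase resampler with built-in anti-aliasing filtering is a natural fit. The snippet below is an illustrative sketch using SciPy, not the lab's pipeline.

```python
import numpy as np
from scipy.signal import resample_poly

# Illustrative only: downsample a 192 kHz capture to the model's 24 kHz
# training/output rate. resample_poly applies an anti-aliasing low-pass
# filter before the 8:1 decimation.
sr_in, sr_out = 192_000, 24_000
t = np.arange(sr_in) / sr_in                  # 1 second of audio
x = 0.5 * np.sin(2 * np.pi * 440 * t)         # 440 Hz test tone
y = resample_poly(x, up=sr_out, down=sr_in)   # factor-8 decimation

print(len(x), len(y))  # 192000 24000
```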


🚀 Getting Started

Installation

pip install torch transformers soundfile accelerate
pip install snac  # For audio decoding

For optional 4-bit quantization support:

pip install bitsandbytes

Basic Usage

import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModel

model_id = "voxaura-labs/vaakya-open"

model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "नमस्ते! This is Vaakya‑Open from Voxaura Labs."

with torch.no_grad():
    audio = model.generate_speech(text)

sf.write("output.wav", audio, 24000)

Advanced Usage (Full Pipeline)

For users who need more control over generation parameters:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from snac import SNAC
import soundfile as sf

# Load model and tokenizer (FP16 for balanced speed and quality)
model = AutoModelForCausalLM.from_pretrained(
    "voxaura-labs/vaakya-open",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("voxaura-labs/vaakya-open")

# Initialize SNAC decoder. It can stay on CPU: generate_speech() moves it
# to CPU before decoding for stability, so pre-loading it onto the GPU
# would be undone on the first call anyway.
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

# Optional: For faster inference with lower memory usage, use 4-bit quantization:
# from transformers import BitsAndBytesConfig
# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16,
#     bnb_4bit_use_double_quant=True,
# )
# Then pass quantization_config to from_pretrained() instead of torch_dtype

# Token IDs
START_OF_SPEECH_TOKEN = 128257
END_OF_SPEECH_TOKEN = 128258
START_OF_HUMAN_TOKEN = 128259
END_OF_HUMAN_TOKEN = 128260
START_OF_AI_TOKEN = 128261
END_OF_AI_TOKEN = 128262
AUDIO_CODE_BASE_OFFSET = 128266

# Audio token range: 128266 to 156937 (28,672 tokens total)
# The 7 tokens per frame use offsets: +0, +4096, +8192, +12288, +16384, +20480, +24576
MAX_AUDIO_TOKEN = 156937


def generate_speech(text, temperature=0.5, top_p=0.9):
    """Generate speech from text.

    Args:
        text: Input text (English, Hindi, or code-mixed)
        temperature: Sampling temperature (0.4-0.7 recommended)
        top_p: Nucleus sampling parameter

    Returns:
        numpy array: Audio waveform at 24kHz
    """
    # Move SNAC to CPU for decoding (important for stability)
    snac_model.to("cpu")

    # Tokenize text (automatically adds BOS token)
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Create input sequence: START_HUMAN + [BOS + Text + EOS] + END_HUMAN
    start_token = torch.tensor([[START_OF_HUMAN_TOKEN]], dtype=torch.int64)
    end_tokens = torch.tensor([[128009, END_OF_HUMAN_TOKEN]], dtype=torch.int64)  # EOS + END_HUMAN

    input_sequence = torch.cat([start_token, input_ids, end_tokens], dim=1)
    input_sequence = input_sequence.to(model.device)

    # Generate audio tokens
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_sequence,
            max_new_tokens=2048,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.1,
            num_return_sequences=1,
            eos_token_id=END_OF_SPEECH_TOKEN,
            use_cache=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Extract generated tokens
    gen_tokens = generated_ids[0]

    # Find the last occurrence of START_OF_SPEECH token
    sos_indices = (gen_tokens == START_OF_SPEECH_TOKEN).nonzero(as_tuple=True)[0]

    if len(sos_indices) > 0:
        # Start from after the last START_OF_SPEECH token
        start_idx = sos_indices[-1].item() + 1
        cropped_tokens = gen_tokens[start_idx:]
    else:
        cropped_tokens = gen_tokens

    # Remove END_OF_SPEECH tokens
    audio_tokens = cropped_tokens[cropped_tokens != END_OF_SPEECH_TOKEN]

    # Truncate to make divisible by 7
    num_tokens = len(audio_tokens)
    num_frames = num_tokens // 7
    audio_tokens = audio_tokens[:num_frames * 7]

    if len(audio_tokens) == 0:
        raise ValueError("No audio tokens generated")

    # Convert to list and subtract offset
    code_list = [t.item() - AUDIO_CODE_BASE_OFFSET for t in audio_tokens]

    # Decode to audio
    return decode_snac_tokens(code_list)


def decode_snac_tokens(code_list):
    """Decode SNAC tokens to audio waveform.

    This model uses a 7-token interleaved encoding per audio frame with offsets:
    Token 1: offset +0      (level 0, coarse)
    Token 2: offset +4096   (level 1, medium)
    Token 3: offset +8192   (level 2, fine)
    Token 4: offset +12288  (level 2, fine)
    Token 5: offset +16384  (level 1, medium)
    Token 6: offset +20480  (level 2, fine)
    Token 7: offset +24576  (level 2, fine)

    Args:
        code_list: List of audio codes (already offset-adjusted)

    Returns:
        numpy array: Decoded audio waveform
    """
    if not code_list or len(code_list) % 7 != 0:
        raise ValueError("Code list must be non-empty and divisible by 7")

    # Redistribute codes into SNAC's 3-level hierarchy
    layer_1 = []  # Coarse: 1 token per frame
    layer_2 = []  # Medium: 2 tokens per frame
    layer_3 = []  # Fine: 4 tokens per frame

    for i in range(len(code_list) // 7):
        # Extract and adjust each token based on its offset
        layer_1.append(code_list[7*i])
        layer_2.append(code_list[7*i + 1] - 4096)
        layer_3.append(code_list[7*i + 2] - (2*4096))
        layer_3.append(code_list[7*i + 3] - (3*4096))
        layer_2.append(code_list[7*i + 4] - (4*4096))
        layer_3.append(code_list[7*i + 5] - (5*4096))
        layer_3.append(code_list[7*i + 6] - (6*4096))

    # Create hierarchical code tensors for SNAC (on same device as SNAC model)
    snac_device = next(snac_model.parameters()).device
    codes = [
        torch.tensor(layer_1, device=snac_device).unsqueeze(0),
        torch.tensor(layer_2, device=snac_device).unsqueeze(0),
        torch.tensor(layer_3, device=snac_device).unsqueeze(0)
    ]

    # Decode to audio waveform
    with torch.no_grad():
        audio_hat = snac_model.decode(codes)

    return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()


# Example usage
text = "आज का मौसम बहुत अच्छा है। Let's go for a walk!"
audio = generate_speech(text, temperature=0.5, top_p=0.9)
sf.write("output.wav", audio, 24000)
print(f"Generated {len(audio) / 24000:.2f} seconds of audio")
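The architecture table mentions streaming support via sliding-window decoding. The card does not publish the actual window or hop sizes, so the sketch below is decoder-agnostic: `decode_fn`, `window_frames`, and `new_frames` are illustrative assumptions, and a stub decoder stands in for SNAC to show only the bookkeeping.

```python
TOKENS_PER_FRAME = 7

def stream_decode(code_list, decode_fn, window_frames=28, new_frames=7):
    """Sliding-window decode sketch: repeatedly decode the trailing window
    of frames but emit only the samples for the newly added frames, so
    audio can start playing before generation finishes.

    decode_fn(codes) must return samples for len(codes) // 7 frames.
    window_frames / new_frames are illustrative, not tuned values.
    """
    frames = len(code_list) // TOKENS_PER_FRAME
    emitted = 0
    while emitted < frames:
        end = min(emitted + new_frames, frames)
        start = max(0, end - window_frames)
        chunk = code_list[start * TOKENS_PER_FRAME : end * TOKENS_PER_FRAME]
        samples = decode_fn(chunk)
        per_frame = len(samples) // (end - start)
        # keep only the samples belonging to not-yet-emitted frames
        yield samples[(emitted - start) * per_frame :]
        emitted = end

# Stub decoder: 4 fake samples per frame, just to exercise the logic.
fake_decode = lambda codes: [0.0] * (len(codes) // TOKENS_PER_FRAME * 4)
total = sum(len(c) for c in stream_decode(list(range(70)), fake_decode))
print(total)  # 40 samples for 10 frames
```

In a real pipeline, `decode_fn` would wrap `decode_snac_tokens` from the example above; re-decoding a trailing window keeps the codec's receptive field intact at chunk boundaries.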

🌍 Language Support

| Language | Status | Notes |
|----------|--------|-------|
| English | ✅ Supported | Clear, neutral international pronunciation |
| Hindi | ✅ Supported | Natural, fluent, and expressive |
| Code‑Mixed (Hinglish) | ✅ Supported | Seamless language switching |

🧩 Use Cases

Vaakya‑Open is well‑suited for:

| Application | Description |
|-------------|-------------|
| 📚 Audiobooks & Narration | Long‑form content with consistent voice |
| 🎥 Video Voice‑Overs | Professional dubbing and narration |
| 🏥 Healthcare | Patient reminders, medication instructions |
| 📞 Voice AI Agents | IVR systems, conversational assistants |
| 🧑‍🏫 E‑learning | Educational content and tutorials |
| ♿ Accessibility | Screen readers for visually impaired users |

🏗️ Training Summary

| Attribute | Value |
|-----------|-------|
| Speaker Count | 1 (professional voice‑over artist) |
| Recording Quality | Studio‑grade, 192 kHz original capture |
| Output Quality | 24 kHz (downsampled for efficiency) |
| Data Characteristics | Conversational, narrative, informational |
| Training Method | LoRA fine‑tuning on Orpheus base |
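The table notes LoRA fine-tuning on the Orpheus base, but the exact recipe is not disclosed in this card. For orientation, a typical LoRA setup for a Llama-style backbone with the `peft` library looks like the sketch below; the rank, alpha, dropout, and target modules are common defaults, not Voxaura's published hyperparameters.

```python
from peft import LoraConfig

# Illustrative LoRA configuration for a Llama-style speech LLM.
# All hyperparameters here are assumptions, NOT the recipe used
# to train Vaakya-Open (which this card does not disclose).
lora_config = LoraConfig(
    r=16,                 # adapter rank (illustrative)
    lora_alpha=32,        # scaling factor (illustrative)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# peft.get_peft_model(model, lora_config) would then wrap the base model
# before fine-tuning on paired (text, audio-token) sequences.
```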

Training Data Sources

This model was trained using high‑quality speech data from publicly available academic datasets developed by premier Indian research institutions:

| Dataset | Institution | Description | License |
|---------|-------------|-------------|---------|
| IndicTTS | IIT Madras | Hindi and Indian English speech corpus for TTS | Custom (see below) |
| SYSPIN TTS Corpus | IISc Bangalore / SPIRE Lab | 900+ hours of studio‑recorded TTS data in 9 Indian languages | CC‑BY‑4.0 |
| SPICOR TTS Corpus | IISc Bangalore / SPIRE Lab | 97+ hours of domain‑rich Indian English TTS data | CC‑BY‑4.0 |

We gratefully acknowledge these institutions for making their datasets available for research and development in Indian language speech synthesis.


📊 Performance

| Metric | Value |
|--------|-------|
| Latency (A100‑80GB) | ~120 ms |
| Latency (RTX 4090) | ~200 ms |
| Real‑time Factor | < 0.1x* |
| Output Sample Rate | 24 kHz |

Note: Benchmarks measured with 4-bit quantization (load_in_4bit=True), batch size 1, average text length ~20-30 tokens, max_new_tokens=512. Performance varies significantly based on hardware, precision (FP16/FP32/4-bit), text length, and generation parameters. FP16 inference may be 2-3x slower than 4-bit but provides better quality.
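For reference, the real-time factor above is computed as generation time divided by the duration of the audio produced; values below 1.0 mean faster than real time. A minimal sketch:

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of audio produced.
    Below 1.0 means faster than real time; the table above reports
    < 0.1x under 4-bit quantization."""
    return generation_seconds / audio_seconds

# e.g. 1 second of compute for a 10-second clip:
print(real_time_factor(1.0, 10.0))  # 0.1
```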


⚠️ Limitations

| Limitation | Description |
|------------|-------------|
| Single Speaker | No speaker switching or voice selection |
| Language Support | English & Hindi only |
| Emotion Control | Emotion tokens not yet exposed |
| Hardware | GPU recommended for real‑time inference |

🛣️ Roadmap

We are actively working on:

  • 🎭 Emotion & prosody control tokens
  • 🌏 Additional Indian languages (Tamil, Telugu, Bengali, Marathi)
  • 🎙️ Multiple voice variants (male, regional accents)
  • ⚙️ CPU‑optimized inference paths
  • 📡 Streaming inference support

📜 License

This project is released under the Apache 2.0 License.

Training Data Licenses

The training data used in this model is subject to the following licenses:

IIT Madras IndicTTS Dataset:

COPYRIGHT 2016 TTS Consortium, TDIL, Meity represented by Hema A Murthy & S Umesh, DEPARTMENT OF Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.

The IndicTTS dataset is provided under a permissive license that allows derivative works and free distribution. See the full license for details.

IISc Bangalore SYSPIN & SPICOR Datasets:

The SYSPIN and SPICOR TTS corpora are released under the Creative Commons Attribution 4.0 International License (CC‑BY‑4.0), which permits sharing and adaptation for any purpose, including commercial use, with appropriate attribution.


🙏 Acknowledgments

Training Data

We gratefully acknowledge the following institutions and projects for providing high‑quality speech datasets that made this model possible:

Indian Institute of Technology Madras (IIT Madras)

  • IndicTTS Project — A comprehensive speech corpus for Indian languages developed by Prof. Hema A Murthy, Prof. S Umesh, and the TTS Consortium
  • Funded by: Technology Development for Indian Languages (TDIL), Ministry of Electronics and Information Technology (MeitY), Government of India

Indian Institute of Science, Bangalore (IISc) — SPIRE Lab

  • SYSPIN Project (SYnthesizing SPeech in INdian languages) — 900+ hours of studio-recorded TTS data in 9 Indian languages, led by Prof. Prasanta Kumar Ghosh
    • Funded by: German Development Cooperation "FAIR Forward — AI for All" (GIZ Germany) and Bhashini AI Solutions Private Limited
  • SPICOR Project — 97+ hours of domain-rich Indian English TTS corpus
  • Dataset URL: https://spiredatasets.iisc.ac.in/

Architecture & Tools

  • Canopy Labs — For pioneering the Orpheus TTS architecture and open‑sourcing their work
  • Unsloth — For training optimizations and fine‑tuning tools
  • SNAC — Hubert Siuzdak for the neural audio codec

Voice

  • Voice Artist — For providing high‑quality studio recordings

🛡️ Safety, Ethics & Responsible Use

Intended Use

Vaakya‑Open is designed for legitimate applications including:

  • 📚 Audiobook narration and content creation
  • 🎥 Video voice‑overs and dubbing
  • 🏥 Healthcare communication (patient reminders, medication instructions)
  • 📞 Voice assistants and IVR systems
  • 🧑‍🏫 Educational content and e‑learning
  • ♿ Accessibility tools for visually impaired users
  • 🔬 Academic research in speech synthesis

Prohibited Uses

Do not use this model for:

  • Impersonation — Creating audio that falsely represents a real person without their explicit consent
  • Fraud & Scams — Generating deceptive audio for phishing, vishing, or financial fraud
  • Misinformation — Producing fake news, propaganda, or misleading content
  • Deepfakes — Creating non‑consensual synthetic media intended to deceive
  • Harassment — Generating content to bully, threaten, or demean individuals
  • Illegal Activities — Any use that violates applicable laws or regulations

Transparency Recommendation

If you use Vaakya‑Open to generate speech for public‑facing applications, we strongly recommend disclosing to end users that they are listening to AI‑generated content. Transparency builds trust and helps prevent potential misuse.

Legal Compliance

Users are responsible for ensuring their use of this model complies with:

  • All applicable local, national, and international laws
  • Industry‑specific regulations (healthcare, finance, telecommunications, etc.)
  • Platform terms of service where the generated audio is distributed
  • Intellectual property and privacy rights of third parties

Disclaimer

THE DEVELOPERS OF VAAKYA‑OPEN ASSUME NO LIABILITY FOR ANY MISUSE OF THIS MODEL.

This model is provided "as is" without warranty of any kind. Voxaura Labs and its contributors are not responsible for any direct, indirect, incidental, or consequential damages arising from the use or misuse of this model or its outputs.

By using this model, you agree to follow all applicable laws, respect the rights and privacy of others, and uphold ethical standards in AI development and deployment.

Ethical AI Commitment

We at Voxaura Labs are committed to the responsible development and use of AI technologies. We believe that:

  1. AI should augment human capabilities, not deceive or harm
  2. Transparency is essential in AI‑generated content
  3. Privacy and consent must be respected in voice applications
  4. Accessibility should be a core consideration in voice technology

We encourage the community to use Vaakya‑Open as a force for good — to improve accessibility, enhance communication, and create meaningful experiences while respecting the dignity and rights of all individuals.

Reporting Misuse

If you become aware of any misuse of this model, please report it to us at hello@voxygen.ai. We take reports of misuse seriously and will take appropriate action where possible.


💬 About Voxaura Labs & Voxygen.ai

Voxaura Labs builds advanced Voice AI systems focused on realism, scalability, and multilingual accessibility — bridging the gap between human expression and artificial speech.

Voxygen is the product brand of Voxaura Labs, dedicated to making advanced Voice AI technologies accessible through practical applications and tools for creators, developers, and enterprises worldwide.

🌐 Website · 📧 Contact


📝 Citation

@misc{vaakya2026,
    title={Vaakya-Open: Text-to-Speech for Hindi and English},
    author={Voxaura Labs},
    year={2026},
    publisher={HuggingFace},
    url={https://huggingface.co/voxaura-labs/vaakya-open}
}

If you use this model, please also consider citing the training data sources:

@misc{indictts2016,
    title={IndicTTS: Text-to-Speech for Indian Languages},
    author={Murthy, Hema A and Umesh, S and {TTS Consortium}},
    year={2016},
    institution={Indian Institute of Technology Madras},
    note={TDIL, MeitY},
    url={https://www.iitm.ac.in/donlab/indictts/database}
}

@misc{syspin2025,
    title={SYSPIN_S1.0 Corpus - A TTS Corpus of 900+ hours in nine Indian Languages},
    author={Abhayjeet et al.},
    year={2025},
    institution={Indian Institute of Science, Bengaluru},
    url={https://spiredatasets.iisc.ac.in/syspinCorpus}
}

@misc{spicor2025,
    title={SPICOR TTS_1.0 Corpus - A 97+ hour domain-rich Indian English TTS Corpus},
    author={Abhayjeet et al.},
    year={2025},
    institution={Indian Institute of Science, Bengaluru},
    url={https://spiredatasets.iisc.ac.in/spicortts10}
}

Made in India 🇮🇳

If you use Vaakya‑Open in your work, we'd love to hear from you!
