---
language: en
license: cc-by-nc-4.0
library_name: transformers
tags:
- tts
- text-to-speech
- neucodec
- audio-generation
- research
- speech-synthesis
datasets:
- custom
model-index:
- name: BeigeTTS
  results: []
---

# BeigeTTS: Research Release for Neural Speech Synthesis

## Overview

BeigeTTS is a research release from BlandAI: a scaled-down version of our production Khaki TTS system. The model demonstrates state-of-the-art neural speech synthesis by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.

## Research Context & Motivation

BeigeTTS serves as a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. Khaki operates at significantly larger scale, with enhanced capabilities including:

- Multi-speaker voice cloning (10,000+ voices)
- Real-time multilingual synthesis (57 languages)
- Emotion and prosody transfer
- Sub-50ms streaming latency
- Production-grade robustness

BeigeTTS packages the core architectural innovations in a more accessible 4B-parameter model suitable for research purposes.

## Technical Architecture

### Model Foundation

- **Base Model**: Google Gemma-3 4B Instruct
- **Parameter Count**: ~4 billion parameters (Khaki uses 70B+)
- **Audio Codec**: NeuCodec (24kHz, single codebook)
- **Training Steps**: 1,435,000 steps
- **Context Length**: 2048 tokens
- **Vocabulary Size**: Extended to 327,690 tokens (includes NeuCodec token space)

### Research Implications

This release enables researchers to explore:

1. **Unified Text-Audio Modeling**: How large language models can be adapted for audio generation tasks
2. **Token-Based Audio Synthesis**: Advantages of discrete token representations over continuous methods
3. **Efficient Streaming**: Real-time generation with minimal latency
4. **Cross-Modal Learning**: Transfer learning between text and audio modalities

### Token Space Design

The model employs a unified token space combining text and audio:

```
Standard Gemma Tokens:  0-262,144
Special Audio Markers:
  - AUDIO_START: 262,145
  - AUDIO_END:   262,146
NeuCodec Audio Tokens:  262,154-327,689 (65,536 tokens)
```

## Capabilities & Limitations

### Current Capabilities (BeigeTTS)

- High-quality English speech synthesis
- Natural prosody and intonation
- Streaming generation support
- Adjustable speaking rate and style
- Context-aware generation

### Production Capabilities (Khaki - Not Released)

- **Multilingual**: 57 languages with accent control
- **Voice Cloning**: Zero-shot and few-shot speaker adaptation
- **Emotion Control**: 12 distinct emotional states
- **Ultra-Low Latency**: <50ms time-to-first-audio
- **Long-Form**: Stable generation for 30+ minute audio
- **Voice Conversion**: Real-time voice transformation
- **Singing Synthesis**: Musical vocal generation

### Research Limitations

BeigeTTS is released for non-commercial research purposes only. Key limitations include:

- English-only synthesis (multilingual reserved for Khaki)
- Single speaker (multi-speaker in Khaki)
- 10-second maximum generation (unlimited in Khaki)
- No voice cloning (available in Khaki)
- Research license only

## Installation

```bash
pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy
```

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf

# Load the model, tokenizer, and audio codec
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Text to synthesize
text = "Hello! This is BeigeTTS, a research release from BlandAI."
```
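The token-space layout described under Token Space Design can be expressed as a small post-processing helper for generated IDs. This is a sketch based on the ID ranges listed in this card; the constant and function names are ours, not part of a released BeigeTTS API:

```python
# Token-space constants taken from the Token Space Design table above.
# Names are illustrative; they are not an official BeigeTTS API.
AUDIO_START = 262_145    # marks the beginning of an audio span
AUDIO_END = 262_146      # marks the end of an audio span
NEUCODEC_BASE = 262_154  # first NeuCodec audio token id
NEUCODEC_SIZE = 65_536   # size of the single NeuCodec codebook

def is_audio_token(token_id: int) -> bool:
    """True if token_id falls in the NeuCodec audio-token range."""
    return NEUCODEC_BASE <= token_id < NEUCODEC_BASE + NEUCODEC_SIZE

def to_codec_index(token_id: int) -> int:
    """Map a unified token id back to a raw NeuCodec code (0..65535)."""
    if not is_audio_token(token_id):
        raise ValueError(f"{token_id} is not an audio token")
    return token_id - NEUCODEC_BASE

def extract_audio_codes(token_ids):
    """Collect NeuCodec codes emitted between AUDIO_START and AUDIO_END."""
    codes, in_audio = [], False
    for t in token_ids:
        if t == AUDIO_START:
            in_audio = True
        elif t == AUDIO_END:
            in_audio = False
        elif in_audio and is_audio_token(t):
            codes.append(to_codec_index(t))
    return codes
```

Codes recovered this way would then be handed to the NeuCodec decoder to reconstruct the 24kHz waveform.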
Build the prompt and generate:

```python
# Wrap the text in Gemma's chat-turn format
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n"

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=500,
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],
    )

# Decode audio (see inference script for full implementation)
```

## Research Applications

### Suggested Research Directions

1. **Prosody Modeling**: Investigating controllable prosody generation
2. **Cross-Lingual Transfer**: Adapting to new languages with minimal data
3. **Emotion Synthesis**: Fine-tuning for emotional speech generation
4. **Compression Studies**: Analyzing audio token efficiency
5. **Streaming Optimization**: Reducing latency for real-time applications
6. **Robustness Analysis**: Handling out-of-distribution text inputs

### Academic Collaborations

We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact research@bland.ai.

## Performance Characteristics

- **Inference Speed**: ~150 tokens/second on A100
- **Audio Quality**: 24kHz (Khaki supports 48kHz)
- **Latency**: <500ms first audio (Khaki: <50ms)
- **Memory Usage**: ~16GB VRAM

## Multilingual Research Notes

While BeigeTTS is English-only, the architecture supports multilingual synthesis through:

- Language-specific token embeddings
- Cross-lingual phoneme mapping
- Accent and dialect modeling
- Code-switching capabilities

The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.

## Ethical Considerations & License

### Non-Commercial Use Only

BeigeTTS is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
This means:

- ✅ Research and academic use
- ✅ Personal experimentation
- ✅ Open-source contributions
- ❌ Commercial applications
- ❌ Production deployment
- ❌ Monetized services

For commercial licensing of our full Khaki system, contact partnerships@bland.ai.

### Responsible AI Guidelines

- Always disclose AI-generated content
- Do not use for impersonation without consent
- Respect privacy and intellectual property
- Consider potential biases in synthesis
- Implement appropriate safety measures

## Citation

If you use BeigeTTS in your research, please cite:

```bibtex
@misc{blandai2024beigetts,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}
```

## Related Work

BeigeTTS builds upon:

- Gemma (Google, 2024)
- NeuCodec (Neuphonic, 2024)
- Our production Khaki TTS system (not publicly available)

## Future Research Releases

We plan to release additional research artifacts:

- **TaupeVC**: Voice conversion research model
- **EcruTTS**: Lightweight edge-deployment model
- **SandAlign**: Forced alignment for TTS training

## Support & Community

- Research inquiries: research@bland.ai
- Technical issues: GitHub Issues
- Commercial licensing: partnerships@bland.ai

## Acknowledgments

We thank the open-source community and our research partners. Special recognition to:

- Google for the Gemma foundation model
- Neuphonic for NeuCodec
- The broader TTS research community

## Disclaimer

BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI about commercial licensing options.

---

*BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai.*