---
language: en
license: cc-by-nc-4.0
library_name: transformers
tags:
- tts
- text-to-speech
- neucodec
- audio-generation
- research
- speech-synthesis
datasets:
- custom
model-index:
- name: BeigeTTS
results: []
---
# BeigeTTS: Research Release for Neural Speech Synthesis
## Overview
BeigeTTS is a research release from BlandAI, representing a scaled-down version of our production Khaki TTS system. This model demonstrates state-of-the-art neural speech synthesis capabilities by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.
## Research Context & Motivation
BeigeTTS serves as a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. While Khaki operates at significantly larger scale with enhanced capabilities including:
- Multi-speaker voice cloning (10,000+ voices)
- Real-time multilingual synthesis (57 languages)
- Emotion and prosody transfer
- Sub-50ms streaming latency
- Production-grade robustness
BeigeTTS distills the core architectural innovations into a more accessible 4B-parameter model suitable for research purposes.
## Technical Architecture
### Model Foundation
- **Base Model**: Google Gemma-3 4B Instruct
- **Parameter Count**: ~4 billion parameters (Khaki uses 70B+)
- **Audio Codec**: NeuCodec (24kHz, single codebook)
- **Training Steps**: 1,435,000 steps
- **Context Length**: 2048 tokens
- **Vocabulary Size**: Extended to 327,690 tokens (includes NeuCodec token space)
### Research Implications
This release enables researchers to explore:
1. **Unified Text-Audio Modeling**: How large language models can be adapted for audio generation tasks
2. **Token-Based Audio Synthesis**: Advantages of discrete token representations over continuous methods
3. **Efficient Streaming**: Real-time generation with minimal latency
4. **Cross-Modal Learning**: Transfer learning between text and audio modalities
### Token Space Design
The model employs a unified token space combining text and audio:
```
Standard Gemma Tokens: 0-262,144
Special Audio Markers:
- AUDIO_START: 262,145
- AUDIO_END: 262,146
NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
```
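The offsets above imply a simple arithmetic mapping between model token IDs and NeuCodec codebook indices. As a minimal sketch (the constant and function names here are illustrative, not part of the released code), the mapping could be expressed as:

```python
# Constants mirroring the token-space table above.
AUDIO_START_ID = 262_145       # AUDIO_START marker
AUDIO_END_ID = 262_146         # AUDIO_END marker
AUDIO_TOKEN_OFFSET = 262_154   # first NeuCodec audio token ID
AUDIO_VOCAB_SIZE = 65_536      # single NeuCodec codebook

def is_audio_token(token_id: int) -> bool:
    """True if the model token ID falls inside the NeuCodec range."""
    return AUDIO_TOKEN_OFFSET <= token_id < AUDIO_TOKEN_OFFSET + AUDIO_VOCAB_SIZE

def token_to_code(token_id: int) -> int:
    """Map a model token ID to a NeuCodec codebook index (0..65535)."""
    assert is_audio_token(token_id), "not an audio token"
    return token_id - AUDIO_TOKEN_OFFSET
```

For example, model token 262,154 maps to codebook index 0 and token 327,689 to index 65,535, covering the full single-codebook range.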
## Capabilities & Limitations
### Current Capabilities (BeigeTTS)
- High-quality English speech synthesis
- Natural prosody and intonation
- Streaming generation support
- Adjustable speaking rate and style
- Context-aware generation
### Production Capabilities (Khaki - Not Released)
- **Multilingual**: 57 languages with accent control
- **Voice Cloning**: Zero-shot and few-shot speaker adaptation
- **Emotion Control**: 12 distinct emotional states
- **Ultra-Low Latency**: <50ms time-to-first-audio
- **Long-Form**: Stable generation for 30+ minute audio
- **Voice Conversion**: Real-time voice transformation
- **Singing Synthesis**: Musical vocal generation
### Research Limitations
BeigeTTS is released for non-commercial research purposes only. Key limitations include:
- English-only synthesis (multilingual reserved for Khaki)
- Single speaker (multi-speaker in Khaki)
- 10-second maximum generation (unlimited in Khaki)
- No voice cloning (available in Khaki)
- Research license only
## Installation
```bash
pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy
```
## Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf
# Load the model, tokenizer, and NeuCodec audio codec
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Build the chat-style prompt; <start_of_speech> cues audio generation
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

# Tokenize and generate audio tokens
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        max_new_tokens=500,
        do_sample=True,  # required for temperature/top_p to take effect
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # 262146 = AUDIO_END
    )

# Decode audio (see the inference script for the full implementation)
```
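The generated sequence interleaves the prompt tokens, marker tokens, and NeuCodec token IDs, so the audio tokens must be isolated and shifted back to codebook indices before decoding. A minimal post-processing sketch (the offsets follow the token-space table above; `extract_audio_codes` is an illustrative helper, not part of the released inference script):

```python
AUDIO_END_ID = 262_146         # AUDIO_END marker from the token-space table
AUDIO_TOKEN_OFFSET = 262_154   # first NeuCodec audio token ID
AUDIO_VOCAB_SIZE = 65_536      # single NeuCodec codebook

def extract_audio_codes(generated_ids, prompt_len):
    """Collect NeuCodec codebook indices from the newly generated tokens,
    stopping at the AUDIO_END marker and skipping any non-audio tokens."""
    codes = []
    for token_id in generated_ids[prompt_len:]:
        if token_id == AUDIO_END_ID:
            break
        if AUDIO_TOKEN_OFFSET <= token_id < AUDIO_TOKEN_OFFSET + AUDIO_VOCAB_SIZE:
            codes.append(token_id - AUDIO_TOKEN_OFFSET)
    return codes

# The resulting code sequence would then be passed to NeuCodec's decoder
# to reconstruct a 24 kHz waveform (see the inference script for details).
```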
## Research Applications
### Suggested Research Directions
1. **Prosody Modeling**: Investigating controllable prosody generation
2. **Cross-Lingual Transfer**: Adapting to new languages with minimal data
3. **Emotion Synthesis**: Fine-tuning for emotional speech generation
4. **Compression Studies**: Analyzing audio token efficiency
5. **Streaming Optimization**: Reducing latency for real-time applications
6. **Robustness Analysis**: Handling out-of-distribution text inputs
### Academic Collaborations
We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact research@bland.ai.
## Performance Characteristics
- **Inference Speed**: ~150 tokens/second on A100
- **Audio Quality**: 24kHz (Khaki supports 48kHz)
- **Latency**: <500ms first audio (Khaki: <50ms)
- **Memory Usage**: ~16GB VRAM
## Multilingual Research Notes
While BeigeTTS is English-only, the architecture supports multilingual synthesis through:
- Language-specific token embeddings
- Cross-lingual phoneme mapping
- Accent and dialect modeling
- Code-switching capabilities
The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.
## Ethical Considerations & License
### Non-Commercial Use Only
BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:
- ✅ Research and academic use
- ✅ Personal experimentation
- ✅ Open-source contributions
- ❌ Commercial applications
- ❌ Production deployment
- ❌ Monetized services
For commercial licensing of our full Khaki system, contact partnerships@bland.ai.
### Responsible AI Guidelines
- Always disclose AI-generated content
- Do not use for impersonation without consent
- Respect privacy and intellectual property
- Consider potential biases in synthesis
- Implement appropriate safety measures
## Citation
If you use BeigeTTS in your research, please cite:
```bibtex
@misc{blandai2024beigetts,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}
```
## Related Work
BeigeTTS builds upon:
- Gemma (Google, 2024)
- NeuCodec (Neuphonic, 2024)
- Our production Khaki TTS system (not publicly available)
## Future Research Releases
We plan to release additional research artifacts:
- **TaupeVC**: Voice conversion research model
- **EcruTTS**: Lightweight edge deployment model
- **SandAlign**: Forced alignment for TTS training
## Support & Community
- Research inquiries: research@bland.ai
- Technical issues: GitHub Issues
- Commercial licensing: partnerships@bland.ai
## Acknowledgments
We thank the open-source community and our research partners. Special recognition to:
- Google for the Gemma foundation model
- Neuphonic for NeuCodec
- The broader TTS research community
## Disclaimer
BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.
---
*BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai.*