---
language: en
license: cc-by-nc-4.0
library_name: transformers
tags:
- tts
- text-to-speech
- neucodec
- audio-generation
- research
- speech-synthesis
datasets:
- custom
model-index:
- name: BeigeTTS
  results: []
---

# BeigeTTS: Research Release for Neural Speech Synthesis

## Overview

BeigeTTS is a research release from BlandAI, representing a scaled-down version of our production Khaki TTS system. This model demonstrates state-of-the-art neural speech synthesis capabilities by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.

## Research Context & Motivation

BeigeTTS is a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. Khaki operates at a significantly larger scale, with enhanced capabilities including:
- Multi-speaker voice cloning (10,000+ voices)
- Real-time multilingual synthesis (57 languages)
- Emotion and prosody transfer
- Sub-50ms streaming latency
- Production-grade robustness

BeigeTTS packages the core architectural innovations in a more accessible 4B-parameter model suitable for research purposes.

## Technical Architecture

### Model Foundation
- **Base Model**: Google Gemma-3 4B Instruct
- **Parameter Count**: ~4 billion parameters (Khaki uses 70B+)
- **Audio Codec**: NeuCodec (24 kHz, single codebook)
- **Training Steps**: 1,435,000 steps
- **Context Length**: 2048 tokens
- **Vocabulary Size**: Extended to 327,690 tokens (includes NeuCodec token space)

### Research Implications

This release enables researchers to explore:

1. **Unified Text-Audio Modeling**: How large language models can be adapted for audio generation tasks
2. **Token-Based Audio Synthesis**: Advantages of discrete token representations over continuous methods
3. **Efficient Streaming**: Real-time generation with minimal latency
4. **Cross-Modal Learning**: Transfer learning between text and audio modalities

### Token Space Design

The model employs a unified token space combining text and audio:

```
Standard Gemma Tokens: 0-262,144
Special Audio Markers:
  - AUDIO_START: 262,145
  - AUDIO_END: 262,146
NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
```

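The layout above can be expressed as a small lookup. The sketch below is illustrative and is not the released inference code: the constants are copied from the table, while the helper names (`classify_token`, `code_to_token`) are hypothetical. IDs 262,147-262,153 are not described in this card, so the sketch simply labels them `unspecified`.

```python
# Constants copied from the token-space table above.
TEXT_MAX = 262_144          # last ID listed for standard Gemma tokens
AUDIO_START = 262_145
AUDIO_END = 262_146
AUDIO_OFFSET = 262_154      # first NeuCodec audio token ID
CODEBOOK_SIZE = 65_536      # NeuCodec codes 0..65,535
VOCAB_SIZE = AUDIO_OFFSET + CODEBOOK_SIZE  # 327,690, matching the model card


def classify_token(token_id: int) -> str:
    """Return the region of the unified vocabulary that a token ID falls in."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token ID {token_id} is outside the vocabulary")
    if token_id <= TEXT_MAX:
        return "text"
    if token_id == AUDIO_START:
        return "audio_start"
    if token_id == AUDIO_END:
        return "audio_end"
    if token_id < AUDIO_OFFSET:
        return "unspecified"  # 262,147-262,153: not documented in this card
    return "audio"


def code_to_token(code: int) -> int:
    """Map a NeuCodec code (0..65,535) to its ID in the unified vocabulary."""
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"codec code {code} is out of range")
    return AUDIO_OFFSET + code
```

Keeping the audio codes in one contiguous block means the mapping in either direction is a single addition or subtraction of `AUDIO_OFFSET`.
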
## Capabilities & Limitations

### Current Capabilities (BeigeTTS)
- High-quality English speech synthesis
- Natural prosody and intonation
- Streaming generation support
- Adjustable speaking rate and style
- Context-aware generation

### Production Capabilities (Khaki - Not Released)
- **Multilingual**: 57 languages with accent control
- **Voice Cloning**: Zero-shot and few-shot speaker adaptation
- **Emotion Control**: 12 distinct emotional states
- **Ultra-Low Latency**: <50ms time-to-first-audio
- **Long-Form**: Stable generation for 30+ minute audio
- **Voice Conversion**: Real-time voice transformation
- **Singing Synthesis**: Musical vocal generation

### Research Limitations

BeigeTTS is released for non-commercial research purposes only. Key limitations include:
- English-only synthesis (multilingual reserved for Khaki)
- Single speaker (multi-speaker in Khaki)
- 10-second maximum generation (unlimited in Khaki)
- No voice cloning (available in Khaki)
- Research license only

## Installation

```bash
pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy
```

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf

# Load the language model, its tokenizer, and the NeuCodec audio codec
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Build the prompt: a user turn with the text, then a model turn opened for speech
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=500,
        do_sample=True,  # set explicitly so temperature and top_p take effect
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # stop at EOS or AUDIO_END
    )

# Decode audio (see inference script for full implementation)
```

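The decoding step is left to the inference script, but its first part, recovering NeuCodec codes from the generated IDs, follows from the token layout in this card. A minimal sketch, assuming the generated sequence holds audio tokens up to an optional `AUDIO_END` marker; `extract_audio_codes` is a hypothetical helper, not part of the release, and it works on a plain list such as `outputs[0, inputs.input_ids.shape[1]:].tolist()`.

```python
from typing import List

# Constants taken from the "Token Space Design" section of this card.
AUDIO_END = 262_146
AUDIO_OFFSET = 262_154   # first NeuCodec audio token ID
AUDIO_MAX = 327_689      # last NeuCodec audio token ID


def extract_audio_codes(generated_ids: List[int]) -> List[int]:
    """Collect NeuCodec codes from generated token IDs, stopping at AUDIO_END."""
    codes = []
    for token_id in generated_ids:
        if token_id == AUDIO_END:
            break  # anything after the end marker is not audio
        if AUDIO_OFFSET <= token_id <= AUDIO_MAX:
            codes.append(token_id - AUDIO_OFFSET)  # shift back to 0..65,535
        # Text and other special tokens are skipped.
    return codes
```

The resulting codes would then be passed to the NeuCodec decoder and written out at 24 kHz; the tensor layout that step expects depends on the NeuCodec API and is covered by the inference script.
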
## Research Applications

### Suggested Research Directions

1. **Prosody Modeling**: Investigating controllable prosody generation
2. **Cross-Lingual Transfer**: Adapting to new languages with minimal data
3. **Emotion Synthesis**: Fine-tuning for emotional speech generation
4. **Compression Studies**: Analyzing audio token efficiency
5. **Streaming Optimization**: Reducing latency for real-time applications
6. **Robustness Analysis**: Handling out-of-distribution text inputs

### Academic Collaborations

We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact research@bland.ai

## Performance Characteristics

- **Inference Speed**: ~150 tokens/second on A100
- **Audio Quality**: 24 kHz (Khaki supports 48 kHz)
- **Latency**: <500ms first audio (Khaki: <50ms)
- **Memory Usage**: ~16GB VRAM

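These figures can be cross-checked with some rough arithmetic. The sketch below assumes a codec rate of 50 audio tokens per second, which is inferred from the 500-token and 10-second limits elsewhere in this card rather than stated outright, and it counts only fp32 weights (4 bytes per parameter) for the memory estimate. The function names are illustrative.

```python
# Assumption: 500 generated tokens correspond to the 10-second limit,
# i.e. 50 audio tokens per second of speech.
TOKENS_PER_AUDIO_SECOND = 50.0


def real_time_factor(tokens_per_second: float,
                     codec_rate: float = TOKENS_PER_AUDIO_SECOND) -> float:
    """Seconds of audio produced per second of wall-clock generation."""
    return tokens_per_second / codec_rate


def max_audio_seconds(max_new_tokens: int,
                      codec_rate: float = TOKENS_PER_AUDIO_SECOND) -> float:
    """Longest clip that fits in a given generation budget."""
    return max_new_tokens / codec_rate


def weight_memory_gb(num_params: float, bytes_per_param: int = 4) -> float:
    """Memory for the weights alone, in decimal gigabytes (fp32 by default)."""
    return num_params * bytes_per_param / 1e9
```

Under these assumptions, ~150 tokens/second works out to roughly 3x real time, and 4 billion fp32 parameters take about 16 GB for the weights alone; activations and the KV cache would add to that, while half-precision loading would roughly halve the weight footprint.
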
## Multilingual Research Notes

While BeigeTTS is English-only, the architecture supports multilingual synthesis through:
- Language-specific token embeddings
- Cross-lingual phoneme mapping
- Accent and dialect modeling
- Code-switching capabilities

The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.

## Ethical Considerations & License

### Non-Commercial Use Only

BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:
- ✅ Research and academic use
- ✅ Personal experimentation
- ✅ Open-source contributions
- ❌ Commercial applications
- ❌ Production deployment
- ❌ Monetized services

For commercial licensing of our full Khaki system, contact partnerships@bland.ai

### Responsible AI Guidelines

- Always disclose AI-generated content
- Do not use for impersonation without consent
- Respect privacy and intellectual property
- Consider potential biases in synthesis
- Implement appropriate safety measures

## Citation

If you use BeigeTTS in your research, please cite:

```bibtex
@misc{blandai2024beigetss,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}
```

## Related Work

BeigeTTS builds upon:
- Gemma (Google, 2024)
- NeuCodec (Neuphonic, 2024)
- Our production Khaki TTS system (not publicly available)

## Future Research Releases

We plan to release additional research artifacts:
- **TaupeVC**: Voice conversion research model
- **EcruTTS**: Lightweight edge deployment model
- **SandAlign**: Forced alignment for TTS training

## Support & Community

- Research inquiries: research@bland.ai
- Technical issues: GitHub Issues
- Commercial licensing: partnerships@bland.ai

## Acknowledgments

We thank the open-source community and our research partners. Special recognition to:
- Google for the Gemma foundation model
- Neuphonic for NeuCodec
- The broader TTS research community

## Disclaimer

BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.

---

*BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai*