---
language: en
license: cc-by-nc-4.0
library_name: transformers
tags:
- tts
- text-to-speech
- neucodec
- audio-generation
- research
- speech-synthesis
datasets:
- custom
model-index:
- name: BeigeTTS
  results: []
---
# BeigeTTS: Research Release for Neural Speech Synthesis
## Overview
BeigeTTS is a research release from BlandAI, representing a scaled-down version of our production Khaki TTS system. This model demonstrates state-of-the-art neural speech synthesis capabilities by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.
## Research Context & Motivation
BeigeTTS is a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. Khaki operates at a significantly larger scale and adds capabilities including:
- Multi-speaker voice cloning (10,000+ voices)
- Real-time multilingual synthesis (57 languages)
- Emotion and prosody transfer
- Sub-50ms streaming latency
- Production-grade robustness
BeigeTTS represents the core architectural innovations in a more accessible 4B parameter model suitable for research purposes.
## Technical Architecture
### Model Foundation
- **Base Model**: Google Gemma-3 4B Instruct
- **Parameter Count**: ~4 billion parameters (Khaki uses 70B+)
- **Audio Codec**: NeuCodec (24kHz, single codebook)
- **Training Steps**: 1,435,000 steps
- **Context Length**: 2048 tokens
- **Vocabulary Size**: Extended to 327,690 tokens (includes NeuCodec token space)
### Research Implications
This release enables researchers to explore:
1. **Unified Text-Audio Modeling**: How large language models can be adapted for audio generation tasks
2. **Token-Based Audio Synthesis**: Advantages of discrete token representations over continuous methods
3. **Efficient Streaming**: Real-time generation with minimal latency
4. **Cross-Modal Learning**: Transfer learning between text and audio modalities
### Token Space Design
The model employs a unified token space combining text and audio:
```
Standard Gemma Tokens: 0-262,144
Special Audio Markers:
- AUDIO_START: 262,145
- AUDIO_END: 262,146
NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
```
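The layout above can be written down as a few constants and helpers. This is a minimal sketch, not the released implementation: the names (`classify_token`, `code_to_token`, `token_to_code`) are hypothetical, and the simple offset mapping between NeuCodec codes and token IDs is an assumption inferred from the ranges listed, not something the model card confirms.

```python
# Token ID layout, copied from the table in this section.
TEXT_RANGE = range(0, 262_145)       # standard Gemma tokens: 0..262,144
AUDIO_START = 262_145
AUDIO_END = 262_146
AUDIO_OFFSET = 262_154               # first NeuCodec token ID
AUDIO_CODEBOOK_SIZE = 65_536         # single codebook
VOCAB_SIZE = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE  # 327,690, matching the stated vocabulary size


def classify_token(token_id: int) -> str:
    """Return which region of the unified token space an ID falls into."""
    if token_id in TEXT_RANGE:
        return "text"
    if token_id == AUDIO_START:
        return "audio_start"
    if token_id == AUDIO_END:
        return "audio_end"
    if AUDIO_OFFSET <= token_id < VOCAB_SIZE:
        return "audio"
    # IDs 262,147..262,153 are not described in the table above.
    return "unassigned"


def code_to_token(code: int) -> int:
    """Map a NeuCodec code to a model token ID (assumed: plain offset)."""
    if not 0 <= code < AUDIO_CODEBOOK_SIZE:
        raise ValueError(f"codec code out of range: {code}")
    return AUDIO_OFFSET + code


def token_to_code(token_id: int) -> int:
    """Map a model token ID back to a NeuCodec code (assumed: plain offset)."""
    if classify_token(token_id) != "audio":
        raise ValueError(f"not an audio token: {token_id}")
    return token_id - AUDIO_OFFSET
```

Keeping the offset in a single constant means the mapping can be corrected in one place if the released inference script uses a different convention.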
## Capabilities & Limitations
### Current Capabilities (BeigeTTS)
- High-quality English speech synthesis
- Natural prosody and intonation
- Streaming generation support
- Adjustable speaking rate and style
- Context-aware generation
### Production Capabilities (Khaki - Not Released)
- **Multilingual**: 57 languages with accent control
- **Voice Cloning**: Zero-shot and few-shot speaker adaptation
- **Emotion Control**: 12 distinct emotional states
- **Ultra-Low Latency**: <50ms time-to-first-audio
- **Long-Form**: Stable generation for 30+ minute audio
- **Voice Conversion**: Real-time voice transformation
- **Singing Synthesis**: Musical vocal generation
### Research Limitations
BeigeTTS is released for non-commercial research purposes only. Key limitations include:
- English-only synthesis (multilingual reserved for Khaki)
- Single speaker (multi-speaker in Khaki)
- 10-second maximum generation (unlimited in Khaki)
- No voice cloning (available in Khaki)
- Research license only
## Installation
```bash
pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy
```
## Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf
# Load model
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")
# Generate speech
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"
# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt")
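with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=500,
        do_sample=True,  # temperature and top_p only take effect when sampling
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # stop at EOS or AUDIO_END
    )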
# Decode audio (see inference script for full implementation)
```
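The Quick Start stops before audio decoding. As a rough sketch of the step in between, the generated IDs can be filtered down to NeuCodec codes using the ranges from the Token Space Design section. The function name `extract_codec_codes` is hypothetical, and the offset mapping (code = token ID minus 262,154) is an assumption; the inference script remains the reference for the actual procedure.

```python
from typing import List, Sequence

AUDIO_END = 262_146           # from the token space table
AUDIO_OFFSET = 262_154        # first NeuCodec token ID
AUDIO_CODEBOOK_SIZE = 65_536  # number of NeuCodec tokens


def extract_codec_codes(generated_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Collect NeuCodec codes from a generated sequence.

    Skips the prompt, stops at AUDIO_END, and ignores any ID outside the
    audio range (text tokens, EOS, other markers).
    """
    codes: List[int] = []
    for token_id in generated_ids[prompt_length:]:
        if token_id == AUDIO_END:
            break
        code = token_id - AUDIO_OFFSET  # assumed mapping: plain offset
        if 0 <= code < AUDIO_CODEBOOK_SIZE:
            codes.append(code)
    return codes
```

With the Quick Start variables, this would be called as `extract_codec_codes(outputs[0].tolist(), inputs.input_ids.shape[1])`. The resulting codes still need to be shaped into the tensor layout the NeuCodec decoder expects before the waveform can be written with `soundfile`; that layout is not specified here and should be checked against the NeuCodec documentation.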
## Research Applications
### Suggested Research Directions
1. **Prosody Modeling**: Investigating controllable prosody generation
2. **Cross-Lingual Transfer**: Adapting to new languages with minimal data
3. **Emotion Synthesis**: Fine-tuning for emotional speech generation
4. **Compression Studies**: Analyzing audio token efficiency
5. **Streaming Optimization**: Reducing latency for real-time applications
6. **Robustness Analysis**: Handling out-of-distribution text inputs
### Academic Collaborations
We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact research@bland.ai
## Performance Characteristics
- **Inference Speed**: ~150 tokens/second on A100
- **Audio Quality**: 24kHz (Khaki supports 48kHz)
- **Latency**: <500ms first audio (Khaki: <50ms)
- **Memory Usage**: ~16GB VRAM
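To put the inference speed in context, the sketch below converts token counts into audio duration and a real-time factor. It assumes a codec rate of 50 audio tokens per second; that figure is not stated in this model card and should be verified against the NeuCodec documentation. Under that assumption, the Quick Start's `max_new_tokens=500` corresponds to the 10-second generation limit listed above. The function names are hypothetical.

```python
ASSUMED_CODEC_TOKENS_PER_SECOND = 50.0  # assumption, not stated in this card


def audio_seconds(num_audio_tokens: int,
                  codec_tokens_per_second: float = ASSUMED_CODEC_TOKENS_PER_SECOND) -> float:
    """Duration of audio represented by a number of codec tokens."""
    return num_audio_tokens / codec_tokens_per_second


def real_time_factor(generation_tokens_per_second: float,
                     codec_tokens_per_second: float = ASSUMED_CODEC_TOKENS_PER_SECOND) -> float:
    """Seconds of audio produced per second of generation (above 1 is faster than real time)."""
    return generation_tokens_per_second / codec_tokens_per_second


print(audio_seconds(500))       # 10.0 seconds for the Quick Start token budget
print(real_time_factor(150.0))  # 3.0 at the ~150 tokens/second reported on A100
```

These numbers cover token generation only; codec decoding time and the first-audio latency are separate costs.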
## Multilingual Research Notes
While BeigeTTS is English-only, the architecture supports multilingual synthesis through:
- Language-specific token embeddings
- Cross-lingual phoneme mapping
- Accent and dialect modeling
- Code-switching capabilities
The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.
## Ethical Considerations & License
### Non-Commercial Use Only
BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:
- ✅ Research and academic use
- ✅ Personal experimentation
- ✅ Open-source contributions
- ❌ Commercial applications
- ❌ Production deployment
- ❌ Monetized services
For commercial licensing of our full Khaki system, contact partnerships@bland.ai
### Responsible AI Guidelines
- Always disclose AI-generated content
- Do not use for impersonation without consent
- Respect privacy and intellectual property
- Consider potential biases in synthesis
- Implement appropriate safety measures
## Citation
If you use BeigeTTS in your research, please cite:
```bibtex
@misc{blandai2024beigetss,
title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
author={BlandAI Research Team},
year={2024},
publisher={HuggingFace},
note={Scaled research version of the Khaki TTS system}
}
```
## Related Work
BeigeTTS builds upon:
- Gemma (Google, 2024)
- NeuCodec (Neuphonic, 2024)
- Our production Khaki TTS system (not publicly available)
## Future Research Releases
We plan to release additional research artifacts:
- **TaupeVC**: Voice conversion research model
- **EcruTTS**: Lightweight edge deployment model
- **SandAlign**: Forced alignment for TTS training
## Support & Community
- Research inquiries: research@bland.ai
- Technical issues: GitHub Issues
- Commercial licensing: partnerships@bland.ai
## Acknowledgments
We thank the open-source community and our research partners. Special recognition to:
- Google for the Gemma foundation model
- Neuphonic for NeuCodec
- The broader TTS research community
## Disclaimer
BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.
---
*BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai*