---
language: en
license: cc-by-nc-4.0
library_name: transformers
tags:
  - tts
  - text-to-speech
  - neucodec
  - audio-generation
  - research
  - speech-synthesis
datasets:
  - custom
model-index:
  - name: BeigeTTS
    results: []
---

# BeigeTTS: Research Release for Neural Speech Synthesis

## Overview

BeigeTTS is a research release from BlandAI and a scaled-down version of our production Khaki TTS system. The model pairs Google's Gemma-3 4B architecture with NeuCodec audio token generation: the language model emits discrete NeuCodec tokens, which the codec decodes into 24kHz speech. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.

## Research Context & Motivation

BeigeTTS serves as a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. Khaki operates at significantly larger scale, with enhanced capabilities including:
- Multi-speaker voice cloning (10,000+ voices)
- Real-time multilingual synthesis (57 languages)
- Emotion and prosody transfer
- Sub-50ms streaming latency
- Production-grade robustness

BeigeTTS represents the core architectural innovations in a more accessible 4B parameter model suitable for research purposes.

## Technical Architecture

### Model Foundation
- **Base Model**: Google Gemma-3 4B Instruct
- **Parameter Count**: ~4 billion parameters (Khaki uses 70B+)
- **Audio Codec**: NeuCodec (24kHz, single codebook)
- **Training Steps**: 1,435,000
- **Context Length**: 2048 tokens
- **Vocabulary Size**: Extended to 327,690 tokens (includes NeuCodec token space)

### Research Implications

This release enables researchers to explore:

1. **Unified Text-Audio Modeling**: How large language models can be adapted for audio generation tasks
2. **Token-Based Audio Synthesis**: Advantages of discrete token representations over continuous methods
3. **Efficient Streaming**: Real-time generation with minimal latency
4. **Cross-Modal Learning**: Transfer learning between text and audio modalities

### Token Space Design

The model employs a unified token space combining text and audio:

```
Standard Gemma Tokens: 0-262,144
Special Audio Markers:
  - AUDIO_START: 262,145
  - AUDIO_END: 262,146
NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
```
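The ranges above can be captured as constants plus a small classifier. This is a minimal sketch based only on the IDs listed in this card. The names `classify_token` and `to_codec_index` are hypothetical and not part of the released code, and IDs 262,147 to 262,153 are reported as unassigned because the card does not describe them.

```python
# Token-space constants copied from the layout listed above.
TEXT_LAST = 262_144      # last standard Gemma token ID, as listed in the card
AUDIO_START = 262_145
AUDIO_END = 262_146
CODEC_FIRST = 262_154    # first NeuCodec token ID
CODEC_LAST = 327_689     # last NeuCodec token ID

CODEC_SIZE = CODEC_LAST - CODEC_FIRST + 1   # number of NeuCodec entries
VOCAB_SIZE = CODEC_LAST + 1                 # total number of token IDs


def classify_token(token_id: int) -> str:
    """Return which region of the unified token space an ID falls in."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token id {token_id} is outside the vocabulary")
    if token_id <= TEXT_LAST:
        return "text"
    if token_id == AUDIO_START:
        return "audio_start"
    if token_id == AUDIO_END:
        return "audio_end"
    if token_id >= CODEC_FIRST:
        return "codec"
    return "unassigned"  # 262,147-262,153: not described in the card


def to_codec_index(token_id: int) -> int:
    """Shift a NeuCodec token ID back to a 0-based codebook index."""
    if classify_token(token_id) != "codec":
        raise ValueError(f"token id {token_id} is not a NeuCodec token")
    return token_id - CODEC_FIRST


print(CODEC_SIZE, VOCAB_SIZE)  # 65536 327690
```

The two derived constants agree with the 65,536 audio tokens and the 327,690-token vocabulary stated in this card.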

## Capabilities & Limitations

### Current Capabilities (BeigeTTS)
- High-quality English speech synthesis
- Natural prosody and intonation
- Streaming generation support
- Adjustable speaking rate and style
- Context-aware generation

### Production Capabilities (Khaki - Not Released)
- **Multilingual**: 57 languages with accent control
- **Voice Cloning**: Zero-shot and few-shot speaker adaptation
- **Emotion Control**: 12 distinct emotional states
- **Ultra-Low Latency**: <50ms time-to-first-audio
- **Long-Form**: Stable generation for 30+ minute audio
- **Voice Conversion**: Real-time voice transformation
- **Singing Synthesis**: Musical vocal generation

### Research Limitations

BeigeTTS is released for non-commercial research purposes only. Key limitations include:
- English-only synthesis (multilingual reserved for Khaki)
- Single speaker (multi-speaker in Khaki)
- 10-second maximum generation (unlimited in Khaki)
- No voice cloning (available in Khaki)
- Research license only

## Installation

```bash
pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy
```

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf

# Load model
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Generate speech
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

# Tokenize and generate
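inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=500,
        do_sample=True,  # temperature and top_p only take effect when sampling
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # stop at EOS or AUDIO_END
    )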

# Decode audio (see inference script for full implementation)
```
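The decode step is left to the inference script, which is not reproduced in this card. As a rough sketch of what that step likely involves, the helper below pulls the NeuCodec token IDs out of a generated sequence and shifts them back to codebook indices, using the offsets from the Token Space Design section. `extract_codec_indices` is a hypothetical name. The commented NeuCodec call at the end is an assumption about that library's API (`decode_code` taking a `(batch, 1, frames)` tensor of codes), so check it against the NeuCodec documentation before relying on it.

```python
from typing import List, Sequence

AUDIO_END = 262_146      # from the Token Space Design section
CODEC_FIRST = 262_154
CODEC_LAST = 327_689


def extract_codec_indices(generated_ids: Sequence[int]) -> List[int]:
    """Collect NeuCodec tokens up to the first AUDIO_END as 0-based indices."""
    indices: List[int] = []
    for token_id in generated_ids:
        if token_id == AUDIO_END:
            break  # anything after the end marker is ignored
        if CODEC_FIRST <= token_id <= CODEC_LAST:
            indices.append(token_id - CODEC_FIRST)
        # Text tokens, markers, and unassigned IDs are skipped.
    return indices


# Made-up IDs for illustration only.
sample = [262_154, 262_155, 327_689, AUDIO_END, 262_160]
codes = extract_codec_indices(sample)
print(codes)  # [0, 1, 65535]

# With the Quick Start objects, usage might look like this (untested assumption):
#   new_ids = outputs[0, inputs.input_ids.shape[1]:].tolist()
#   code_tensor = torch.tensor(extract_codec_indices(new_ids))[None, None, :]
#   audio = neucodec.decode_code(code_tensor)  # assumed NeuCodec API
#   sf.write("output.wav", audio[0, 0].cpu().numpy(), 24_000)
```

Slicing off the prompt before extraction keeps the text tokens out of the loop, although the range check would skip them anyway.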

## Research Applications

### Suggested Research Directions

1. **Prosody Modeling**: Investigating controllable prosody generation
2. **Cross-Lingual Transfer**: Adapting to new languages with minimal data
3. **Emotion Synthesis**: Fine-tuning for emotional speech generation
4. **Compression Studies**: Analyzing audio token efficiency
5. **Streaming Optimization**: Reducing latency for real-time applications
6. **Robustness Analysis**: Handling out-of-distribution text inputs

### Academic Collaborations

We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact research@bland.ai

## Performance Characteristics

- **Inference Speed**: ~150 tokens/second on A100
- **Audio Quality**: 24kHz (Khaki supports 48kHz)
- **Latency**: <500ms first audio (Khaki: <50ms)
- **Memory Usage**: ~16GB VRAM
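
To put these figures in context, the arithmetic below relates generation speed to audio duration. It assumes NeuCodec produces 50 tokens per second of audio, which is not stated in this card and should be verified against the NeuCodec documentation. The weight-memory estimate assumes 32-bit parameters and ignores activations and the KV cache. Treat all of these as rough estimates.

```python
# Figures taken from this card.
GENERATION_TOKENS_PER_SECOND = 150   # "~150 tokens/second on A100"
MAX_NEW_TOKENS = 500                 # value used in the Quick Start
PARAMETER_COUNT = 4e9                # "~4 billion parameters"

# Assumptions that are NOT stated in this card.
TOKENS_PER_AUDIO_SECOND = 50         # assumed NeuCodec token rate
BYTES_PER_PARAMETER = 4              # assumed float32 weights

# Audio duration covered by the Quick Start token budget.
max_audio_seconds = MAX_NEW_TOKENS / TOKENS_PER_AUDIO_SECOND

# How many seconds of audio are produced per second of compute.
real_time_factor = GENERATION_TOKENS_PER_SECOND / TOKENS_PER_AUDIO_SECOND

# Wall-clock time to generate the full token budget.
generation_seconds = MAX_NEW_TOKENS / GENERATION_TOKENS_PER_SECOND

# Memory needed for the weights alone.
weight_gigabytes = PARAMETER_COUNT * BYTES_PER_PARAMETER / 1e9

print(f"max audio: {max_audio_seconds:.1f} s")
print(f"real-time factor: {real_time_factor:.1f}x")
print(f"time to generate: {generation_seconds:.2f} s")
print(f"weights: {weight_gigabytes:.1f} GB")
```

Under these assumptions, the 500-token budget corresponds to the 10-second cap listed under Research Limitations, and full-precision weights alone account for roughly the 16GB quoted above.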

## Multilingual Research Notes

While BeigeTTS is English-only, the architecture supports multilingual synthesis through:
- Language-specific token embeddings
- Cross-lingual phoneme mapping
- Accent and dialect modeling
- Code-switching capabilities

The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.

## Ethical Considerations & License

### Non-Commercial Use Only

BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:
- ✅ Research and academic use
- ✅ Personal experimentation
- ✅ Open-source contributions
- ❌ Commercial applications
- ❌ Production deployment
- ❌ Monetized services

For commercial licensing of our full Khaki system, contact partnerships@bland.ai

### Responsible AI Guidelines

- Always disclose AI-generated content
- Do not use for impersonation without consent
- Respect privacy and intellectual property
- Consider potential biases in synthesis
- Implement appropriate safety measures

## Citation

If you use BeigeTTS in your research, please cite:

```bibtex
@misc{blandai2024beigetts,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}
```

## Related Work

BeigeTTS builds upon:
- Gemma (Google, 2024)
- NeuCodec (Neuphonic, 2024)
- Our production Khaki TTS system (not publicly available)

## Future Research Releases

We plan to release additional research artifacts:
- **TaupeVC**: Voice conversion research model
- **EcruTTS**: Lightweight edge deployment model
- **SandAlign**: Forced alignment for TTS training

## Support & Community

- Research inquiries: research@bland.ai
- Technical issues: GitHub Issues
- Commercial licensing: partnerships@bland.ai

## Acknowledgments

We thank the open-source community and our research partners. Special recognition to:
- Google for the Gemma foundation model
- Neuphonic for NeuCodec
- The broader TTS research community

## Disclaimer

BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.

---

*BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai*