---
language: en
license: cc-by-nc-4.0
library_name: transformers
tags:
  - tts
  - text-to-speech
  - neucodec
  - audio-generation
  - research
  - speech-synthesis
datasets:
  - custom
model-index:
  - name: BeigeTTS
    results: []
---

# BeigeTTS: Research Release for Neural Speech Synthesis

## Overview

BeigeTTS is a research release from BlandAI and a scaled-down version of our production Khaki TTS system. The model performs neural speech synthesis by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We are releasing BeigeTTS to the research community to advance the field and enable academic exploration of large-scale TTS architectures.

## Research Context & Motivation

BeigeTTS is a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. Khaki operates at a significantly larger scale, with enhanced capabilities including:
- Multi-speaker voice cloning (10,000+ voices)
- Real-time multilingual synthesis (57 languages)
- Emotion and prosody transfer
- Sub-50ms streaming latency
- Production-grade robustness

BeigeTTS represents the core architectural innovations in a more accessible 4B parameter model suitable for research purposes.

## Technical Architecture

### Model Foundation
- **Base Model**: Google Gemma-3 4B Instruct
- **Parameter Count**: ~4 billion parameters (Khaki uses 70B+)
- **Audio Codec**: NeuCodec (24kHz, single codebook)
- **Training Steps**: 1,435,000 steps
- **Context Length**: 2048 tokens
- **Vocabulary Size**: Extended to 327,690 tokens (includes NeuCodec token space)

### Research Implications

This release enables researchers to explore:

1. **Unified Text-Audio Modeling**: How large language models can be adapted for audio generation tasks
2. **Token-Based Audio Synthesis**: Advantages of discrete token representations over continuous methods
3. **Efficient Streaming**: Real-time generation with minimal latency
4. **Cross-Modal Learning**: Transfer learning between text and audio modalities

### Token Space Design

The model employs a unified token space combining text and audio:

```
Standard Gemma Tokens: 0-262,144
Special Audio Markers:
  - AUDIO_START: 262,145
  - AUDIO_END: 262,146
NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
```
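
The layout above implies a fixed offset between a NeuCodec codebook index and its token ID. A minimal sketch of that mapping, using only the ranges listed in this card; the constant and function names are illustrative and are not part of the released code:

```python
# Token-space constants copied from the layout above.
AUDIO_START = 262_145
AUDIO_END = 262_146
AUDIO_TOKEN_OFFSET = 262_154   # first NeuCodec audio token ID
CODEBOOK_SIZE = 65_536         # number of NeuCodec audio tokens
VOCAB_SIZE = 327_690           # extended vocabulary size


def is_audio_token(token_id: int) -> bool:
    """Return True if token_id falls in the NeuCodec audio range."""
    return AUDIO_TOKEN_OFFSET <= token_id < AUDIO_TOKEN_OFFSET + CODEBOOK_SIZE


def token_to_code(token_id: int) -> int:
    """Map a unified-vocabulary token ID to a codebook index in [0, 65535]."""
    if not is_audio_token(token_id):
        raise ValueError(f"{token_id} is not an audio token")
    return token_id - AUDIO_TOKEN_OFFSET


def code_to_token(code: int) -> int:
    """Map a codebook index back to its unified-vocabulary token ID."""
    if not 0 <= code < CODEBOOK_SIZE:
        raise ValueError(f"{code} is outside the codebook")
    return code + AUDIO_TOKEN_OFFSET


# The last audio token is also the last ID in the extended vocabulary.
assert code_to_token(CODEBOOK_SIZE - 1) == VOCAB_SIZE - 1
```

IDs 262,147 through 262,153 are not described in this card, so the sketch simply treats them as non-audio tokens.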

## Capabilities & Limitations

### Current Capabilities (BeigeTTS)
- High-quality English speech synthesis
- Natural prosody and intonation
- Streaming generation support
- Adjustable speaking rate and style
- Context-aware generation
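
Streaming support is not detailed in this card. One common way to stream token-based audio is to decode fixed-size groups of audio tokens as they arrive. The generator below is a generic sketch of that buffering step, with a hypothetical `chunk_size`; it is not BeigeTTS's actual streaming implementation.

```python
from typing import Iterable, Iterator, List


def chunk_codes(codes: Iterable[int], chunk_size: int = 25) -> Iterator[List[int]]:
    """Yield lists of chunk_size codes; the final chunk may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer: List[int] = []
    for code in codes:
        buffer.append(code)
        if len(buffer) == chunk_size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer  # flush the remainder at end of stream


# Each yielded chunk could be passed to the codec decoder and played immediately.
for chunk in chunk_codes(range(7), chunk_size=3):
    print(chunk)
# [0, 1, 2]
# [3, 4, 5]
# [6]
```

Smaller chunks reduce time to first audio at the cost of more decoder calls; the right trade-off depends on the codec and hardware.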

### Production Capabilities (Khaki - Not Released)
- **Multilingual**: 57 languages with accent control
- **Voice Cloning**: Zero-shot and few-shot speaker adaptation
- **Emotion Control**: 12 distinct emotional states
- **Ultra-Low Latency**: <50ms time-to-first-audio
- **Long-Form**: Stable generation for 30+ minute audio
- **Voice Conversion**: Real-time voice transformation
- **Singing Synthesis**: Musical vocal generation

### Research Limitations

BeigeTTS is released for non-commercial research purposes only. Key limitations include:
- English-only synthesis (multilingual reserved for Khaki)
- Single speaker (multi-speaker in Khaki)
- 10-second maximum generation (unlimited in Khaki)
- No voice cloning (available in Khaki)
- Research license only

## Installation

```bash
pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy
```

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf

# Load model
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Generate speech
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
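    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=500,
        do_sample=True,  # sample explicitly so temperature and top_p are applied
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # stop at EOS or AUDIO_END
    )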

# Decode audio (see inference script for full implementation)
```
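
The decode step is not shown in Quick Start. As a rough sketch of the token-handling part only: skip the prompt, stop at `AUDIO_END`, and shift the remaining audio tokens into NeuCodec's index range. The helper name `extract_codes` is hypothetical, and the commented-out `decode_code` call and its `(batch, 1, frames)` input shape are assumptions about the NeuCodec API that should be checked against the inference script.

```python
from typing import List, Sequence

AUDIO_END = 262_146            # from the Token Space Design section
AUDIO_TOKEN_OFFSET = 262_154   # first NeuCodec audio token ID
CODEBOOK_SIZE = 65_536


def extract_codes(generated: Sequence[int], prompt_len: int) -> List[int]:
    """Return NeuCodec codebook indices from one generated sequence.

    generated  -- full output IDs (prompt followed by new tokens)
    prompt_len -- number of prompt tokens to skip
    """
    codes: List[int] = []
    for token_id in generated[prompt_len:]:
        if token_id == AUDIO_END:
            break  # end of the audio span
        code = token_id - AUDIO_TOKEN_OFFSET
        if 0 <= code < CODEBOOK_SIZE:
            codes.append(code)  # tokens outside the audio range are skipped
    return codes


# Toy example with made-up IDs: 3 prompt tokens, 2 audio tokens, AUDIO_END.
example = [2, 105, 262_145, 262_154, 262_160, AUDIO_END, 1]
print(extract_codes(example, prompt_len=3))  # [0, 6]

# With the Quick Start variables, usage might look like this (assumed API):
#   codes = extract_codes(outputs[0].tolist(), inputs.input_ids.shape[1])
#   codes_t = torch.tensor(codes).view(1, 1, -1)
#   audio = neucodec.decode_code(codes_t)   # assumed to return (B, 1, T) at 24 kHz
#   sf.write("out.wav", audio[0, 0].cpu().numpy(), 24_000)
```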

## Research Applications

### Suggested Research Directions

1. **Prosody Modeling**: Investigating controllable prosody generation
2. **Cross-Lingual Transfer**: Adapting to new languages with minimal data
3. **Emotion Synthesis**: Fine-tuning for emotional speech generation
4. **Compression Studies**: Analyzing audio token efficiency
5. **Streaming Optimization**: Reducing latency for real-time applications
6. **Robustness Analysis**: Handling out-of-distribution text inputs

### Academic Collaborations

We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact research@bland.ai.

## Performance Characteristics

- **Inference Speed**: ~150 tokens/second on A100
- **Audio Sample Rate**: 24 kHz (Khaki supports 48 kHz)
- **Latency**: <500 ms to first audio (Khaki: <50 ms)
- **Memory Usage**: ~16GB VRAM
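
These figures can be related with a little arithmetic. The sketch below assumes 50 audio tokens per second of speech, a rate inferred from the 10-second generation limit and `max_new_tokens=500` in Quick Start rather than stated in this card. Function names are illustrative.

```python
TOKENS_PER_AUDIO_SECOND = 50        # assumption: 500 tokens for 10 s of audio
GENERATION_TOKENS_PER_SECOND = 150  # "~150 tokens/second on A100"


def audio_seconds(num_audio_tokens: int) -> float:
    """Duration of speech represented by a number of audio tokens."""
    return num_audio_tokens / TOKENS_PER_AUDIO_SECOND


def real_time_factor(gen_tokens_per_s: float = GENERATION_TOKENS_PER_SECOND) -> float:
    """Seconds of audio produced per second of wall-clock generation."""
    return gen_tokens_per_s / TOKENS_PER_AUDIO_SECOND


print(audio_seconds(500))   # 10.0
print(real_time_factor())   # 3.0
```

Under that assumption, generation on an A100 would run at roughly three times real time, which is what makes streaming playback feasible.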

## Multilingual Research Notes

While BeigeTTS is English-only, the architecture supports multilingual synthesis through:
- Language-specific token embeddings
- Cross-lingual phoneme mapping
- Accent and dialect modeling
- Code-switching capabilities

The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.

## Ethical Considerations & License

### Non-Commercial Use Only

BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:
- ✅ Research and academic use
- ✅ Personal experimentation
- ✅ Open-source contributions
- ❌ Commercial applications
- ❌ Production deployment
- ❌ Monetized services

For commercial licensing of our full Khaki system, contact partnerships@bland.ai.

### Responsible AI Guidelines

- Always disclose AI-generated content
- Do not use for impersonation without consent
- Respect privacy and intellectual property
- Consider potential biases in synthesis
- Implement appropriate safety measures

## Citation

If you use BeigeTTS in your research, please cite:

```bibtex
@misc{blandai2024beigetss,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}
```

## Related Work

BeigeTTS builds upon:
- Gemma (Google, 2024)
- NeuCodec (Neuphonic, 2024)
- Our production Khaki TTS system (not publicly available)

## Future Research Releases

We plan to release additional research artifacts:
- **TaupeVC**: Voice conversion research model
- **EcruTTS**: Lightweight edge deployment model
- **SandAlign**: Forced alignment for TTS training

## Support & Community

- Research inquiries: research@bland.ai
- Technical issues: GitHub Issues
- Commercial licensing: partnerships@bland.ai

## Acknowledgments

We thank the open-source community and our research partners. Special recognition to:
- Google for the Gemma foundation model
- Neuphonic for NeuCodec
- The broader TTS research community

## Disclaimer

BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.

---

*BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai*