	---
	language: en
	license: cc-by-nc-4.0
	library_name: transformers
	tags:
	- tts
	- text-to-speech
	- neucodec
	- audio-generation
	- research
	- speech-synthesis
	datasets:
	- custom
	model-index:
	- name: BeigeTTS
	  results: []
	---

	# BeigeTTS: Research Release for Neural Speech Synthesis

	## Overview

	BeigeTTS is a research release from BlandAI, representing a scaled-down version of our production Khaki TTS system. This model demonstrates state-of-the-art neural speech synthesis capabilities by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.

	## Research Context & Motivation

	BeigeTTS serves as a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. Khaki operates at significantly larger scale, with enhanced capabilities including:
	- Multi-speaker voice cloning (10,000+ voices)
	- Real-time multilingual synthesis (57 languages)
	- Emotion and prosody transfer
	- Sub-50ms streaming latency
	- Production-grade robustness

	BeigeTTS represents the core architectural innovations in a more accessible 4B-parameter model suitable for research.

	## Technical Architecture

	### Model Foundation
	- Base Model: Google Gemma-3 4B Instruct
	- Parameter Count: ~4 billion parameters (Khaki uses 70B+)
	- Audio Codec: NeuCodec (24kHz, single codebook)
	- Training Steps: 1,435,000 steps
	- Context Length: 2048 tokens
	- Vocabulary Size: Extended to 327,690 tokens (includes NeuCodec token space)

	### Research Implications

	This release enables researchers to explore:

	1. Unified Text-Audio Modeling: How large language models can be adapted for audio generation tasks
	2. Token-Based Audio Synthesis: Advantages of discrete token representations over continuous methods
	3. Efficient Streaming: Real-time generation with minimal latency
	4. Cross-Modal Learning: Transfer learning between text and audio modalities

	### Token Space Design

	The model employs a unified token space combining text and audio:

	```
	Standard Gemma Tokens: 0-262,144
	Special Audio Markers:
	- AUDIO_START: 262,145
	- AUDIO_END: 262,146
	NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
	```
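
The layout above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical and not part of the BeigeTTS release. The IDs between the markers and the audio block (262,147-262,153) are labeled "reserved" purely as a placeholder, since their purpose is not documented here.

```python
# Hypothetical helper names; the ID boundaries are taken from the layout above.
TEXT_MAX = 262_144      # last standard Gemma token ID
AUDIO_START = 262_145   # AUDIO_START marker
AUDIO_END = 262_146     # AUDIO_END marker
AUDIO_BASE = 262_154    # first NeuCodec audio token ID
AUDIO_LAST = 327_689    # last NeuCodec audio token ID

CODEBOOK_SIZE = AUDIO_LAST - AUDIO_BASE + 1
VOCAB_SIZE = AUDIO_LAST + 1


def classify_token(token_id: int) -> str:
    """Return which region of the unified vocabulary a token ID falls in."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(f"token ID {token_id} is outside the vocabulary")
    if token_id <= TEXT_MAX:
        return "text"
    if token_id == AUDIO_START:
        return "audio_start"
    if token_id == AUDIO_END:
        return "audio_end"
    if token_id < AUDIO_BASE:
        return "reserved"  # undocumented gap; placeholder label
    return "audio"


print(CODEBOOK_SIZE)            # 65536
print(VOCAB_SIZE)               # 327690
print(classify_token(262_154))  # audio
```

The two printed sizes agree with the 65,536 audio tokens and the 327,690-token vocabulary quoted in this README.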

	## Capabilities & Limitations

	### Current Capabilities (BeigeTTS)
	- High-quality English speech synthesis
	- Natural prosody and intonation
	- Streaming generation support
	- Adjustable speaking rate and style
	- Context-aware generation

	### Production Capabilities (Khaki - Not Released)
	- Multilingual: 57 languages with accent control
	- Voice Cloning: Zero-shot and few-shot speaker adaptation
	- Emotion Control: 12 distinct emotional states
	- Ultra-Low Latency: <50ms time-to-first-audio
	- Long-Form: Stable generation for 30+ minute audio
	- Voice Conversion: Real-time voice transformation
	- Singing Synthesis: Musical vocal generation

	### Research Limitations

	BeigeTTS is released for non-commercial research purposes only. Key limitations include:
	- English-only synthesis (multilingual reserved for Khaki)
	- Single speaker (multi-speaker in Khaki)
	- 10-second maximum generation (unlimited in Khaki)
	- No voice cloning (available in Khaki)
	- Research license only

	## Installation

	```bash
	pip install torch transformers accelerate
	pip install git+https://github.com/neuphonic/neucodec.git
	pip install soundfile numpy scipy
	```

	## Quick Start

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from neucodec import NeuCodec
	import soundfile as sf

	# Load model
	model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS")
	tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
	neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

	# Generate speech
	text = "Hello! This is BeigeTTS, a research release from BlandAI."
	prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

	# Tokenize and generate
	inputs = tokenizer(prompt, return_tensors="pt")
	with torch.no_grad():
	    outputs = model.generate(
	        inputs.input_ids,
	        max_new_tokens=500,
	        do_sample=True,  # temperature/top_p only take effect when sampling
	        temperature=0.1,
	        top_p=0.97,
	        eos_token_id=[tokenizer.eos_token_id, 262146],  # stop at EOS or AUDIO_END
	    )

	# Decode audio (see inference script for full implementation)
	```
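
The decoding step is left to the inference script, which is not reproduced here. As a rough sketch of what the post-processing might involve, the function below pulls the generated audio tokens out of a sequence and shifts them back into the codec's 0-65,535 index range. `extract_codec_indices` and `prompt_len` are hypothetical names, and the offset 262,154 comes from the Token Space Design section. The call that turns the indices into a waveform is deliberately omitted because it depends on the NeuCodec API.

```python
from typing import List, Sequence

AUDIO_END = 262_146    # AUDIO_END marker from the token layout
AUDIO_BASE = 262_154   # first NeuCodec audio token ID
AUDIO_LAST = 327_689   # last NeuCodec audio token ID


def extract_codec_indices(token_ids: Sequence[int], prompt_len: int) -> List[int]:
    """Map generated token IDs to codec indices (hypothetical helper).

    Skips the first `prompt_len` tokens (the prompt), stops at AUDIO_END,
    and ignores any token outside the audio range.
    """
    indices = []
    for token_id in token_ids[prompt_len:]:
        if token_id == AUDIO_END:
            break
        if AUDIO_BASE <= token_id <= AUDIO_LAST:
            indices.append(token_id - AUDIO_BASE)
    return indices


# Toy sequence: 3 prompt tokens, 3 audio tokens, then AUDIO_END.
generated = [2, 105, 262_145, 262_154, 262_155, 327_689, 262_146]
print(extract_codec_indices(generated, prompt_len=3))  # [0, 1, 65535]
```

With the Quick Start variables, this could be called as `extract_codec_indices(outputs[0].tolist(), inputs.input_ids.shape[1])`. The resulting list still has to be reshaped into whatever tensor layout the codec's decoder expects, so consult the NeuCodec documentation for that step.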

	## Research Applications

	### Suggested Research Directions

	1. Prosody Modeling: Investigating controllable prosody generation
	2. Cross-Lingual Transfer: Adapting to new languages with minimal data
	3. Emotion Synthesis: Fine-tuning for emotional speech generation
	4. Compression Studies: Analyzing audio token efficiency
	5. Streaming Optimization: Reducing latency for real-time applications
	6. Robustness Analysis: Handling out-of-distribution text inputs

	### Academic Collaborations

	We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact research@bland.ai

	## Performance Characteristics

	- Inference Speed: ~150 tokens/second on A100
	- Audio Quality: 24kHz (Khaki supports 48kHz)
	- Latency: <500ms first audio (Khaki: <50ms)
	- Memory Usage: ~16GB VRAM
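
These figures can be related to each other with some back-of-envelope arithmetic. The sketch below assumes NeuCodec emits 50 tokens per second of audio. That rate is not stated in this README and should be treated as an assumption, although it is consistent with the Quick Start's `max_new_tokens=500` and the 10-second generation limit. The function names are hypothetical.

```python
# ASSUMPTION: NeuCodec emits 50 tokens per second of audio. This rate is not
# stated in this README; adjust it if the codec is configured differently.
CODEC_TOKENS_PER_SECOND = 50
GENERATION_TOKENS_PER_SECOND = 150  # "~150 tokens/second on A100" from the list above


def audio_seconds(num_tokens: int) -> float:
    """Duration of audio represented by `num_tokens` codec tokens."""
    return num_tokens / CODEC_TOKENS_PER_SECOND


def generation_seconds(num_tokens: int) -> float:
    """Approximate wall-clock time to generate `num_tokens` tokens."""
    return num_tokens / GENERATION_TOKENS_PER_SECOND


def real_time_factor() -> float:
    """Generation time divided by audio duration (< 1 is faster than real time)."""
    return CODEC_TOKENS_PER_SECOND / GENERATION_TOKENS_PER_SECOND


print(audio_seconds(500))                 # 10.0
print(round(generation_seconds(500), 2))  # 3.33
print(round(real_time_factor(), 2))       # 0.33
```

Under these assumptions the model generates audio roughly three times faster than real time. The estimate ignores prompt processing and codec decoding time.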

	## Multilingual Research Notes

	While BeigeTTS is English-only, the architecture supports multilingual synthesis through:
	- Language-specific token embeddings
	- Cross-lingual phoneme mapping
	- Accent and dialect modeling
	- Code-switching capabilities

	The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.

	## Ethical Considerations & License

	### Non-Commercial Use Only

	BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:
	- ✅ Research and academic use
	- ✅ Personal experimentation
	- ✅ Open-source contributions
	- ❌ Commercial applications
	- ❌ Production deployment
	- ❌ Monetized services

	For commercial licensing of our full Khaki system, contact partnerships@bland.ai

	### Responsible AI Guidelines

	- Always disclose AI-generated content
	- Do not use for impersonation without consent
	- Respect privacy and intellectual property
	- Consider potential biases in synthesis
	- Implement appropriate safety measures

	## Citation

	If you use BeigeTTS in your research, please cite:

	```bibtex
	@misc{blandai2024beigetts,
	  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
	  author={BlandAI Research Team},
	  year={2024},
	  publisher={HuggingFace},
	  note={Scaled research version of the Khaki TTS system}
	}
	```

	## Related Work

	BeigeTTS builds upon:
	- Gemma (Google, 2024)
	- NeuCodec (Neuphonic, 2024)
	- Our production Khaki TTS system (not publicly available)

	## Future Research Releases

	We plan to release additional research artifacts:
	- TaupeVC: Voice conversion research model
	- EcruTTS: Lightweight edge deployment model
	- SandAlign: Forced alignment for TTS training

	## Support & Community

	- Research inquiries: research@bland.ai
	- Technical issues: GitHub Issues
	- Commercial licensing: partnerships@bland.ai

	## Acknowledgments

	We thank the open-source community and our research partners. Special recognition to:
	- Google for the Gemma foundation model
	- Neuphonic for NeuCodec
	- The broader TTS research community

	## Disclaimer

	BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.

	---

	BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai