---
license: apache-2.0
language:
- bg
- en
pipeline_tag: text-to-speech
tags:
- tts
- bulgarian
- miocodec
- encoder-decoder
- voice-cloning
- speech-synthesis
library_name: pytorch
---
# BgTTS-38M V2 — Bulgarian Text-to-Speech with Voice Cloning
A lightweight **38M parameter** encoder-decoder TTS model for **Bulgarian and English** speech synthesis with **zero-shot voice cloning** via [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz).
**V2 improvements over V1:**
- **Speaker normalization** — stable voice quality across all reference audio files
- **Larger training dataset** — 1,537 hours (vs 1,172h in V1)
- **BF16 training** — more stable gradients, no GradScaler needed
- **Zero dropout** — better utilization of model capacity
- **20 epochs** with careful LR scheduling
## Audio Samples
Samples for three reference voices, each in both languages, are in the `samples/` directory:
- Female voice (Bulgarian, English)
- Male voice 1 (Bulgarian, English)
- Male voice 2 (Bulgarian, English)
## Key Features
- **Bilingual**: Native Bulgarian + English in a single model
- **Voice cloning**: Zero-shot — just provide 3-10 seconds of reference audio
- **Tiny footprint**: 146 MB inference checkpoint, runs on CPU
- **Fast**: RTF ~0.3 on both GPU and CPU (about 3.3× real-time speed)
- **Speaker-stable**: V2's normalized speaker embedding ensures consistent quality regardless of reference audio
## 🎙️ Voice Cloning
This model supports zero-shot voice cloning — it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.
### How it Works
1. Provide a reference audio (3-10 seconds of clear speech, WAV format, ideally 24kHz)
2. MioCodec extracts a 128-dimensional speaker embedding (`global_embedding`)
3. The embedding is **L2-normalized** and scaled by a learned parameter (`spk_scale`) before being added to the decoder
4. The same embedding is used for MioCodec waveform reconstruction
### V2 Improvement: Speaker Normalization
In V1, the speaker embedding had 7× larger norm than content tokens, causing the model to over-rely on the reference audio for pronunciation quality. V2 normalizes the speaker vector to unit norm, ensuring:
- **Consistent quality** across all reference voices
- The model learns speech patterns from data, not from speaker shortcuts
- Reference audio only affects **timbre**, not articulation
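
As a rough illustration of the normalization described above, the following pure-Python sketch L2-normalizes a speaker vector and multiplies it by a scale. The names `normalize_speaker` and `l2_norm`, the 4-dimensional toy vectors, and the scale value `2.0` are illustrative assumptions; in the model `spk_scale` is learned, the embedding has 128 dimensions, and the Linear(128 → 384) projection is omitted here.

```python
import math

def l2_norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))

def normalize_speaker(embedding, spk_scale=1.0, eps=1e-8):
    """L2-normalize a speaker vector, then multiply by a scale factor.

    `spk_scale` stands in for the learned parameter; 1.0 is a placeholder.
    """
    norm = l2_norm(embedding)
    return [spk_scale * x / (norm + eps) for x in embedding]

# Two hypothetical reference embeddings pointing the same way,
# one with 7x the magnitude of the other.
quiet = [0.3, -0.4, 1.2, 0.0]
loud = [7.0 * x for x in quiet]

quiet_out = normalize_speaker(quiet, spk_scale=2.0)
loud_out = normalize_speaker(loud, spk_scale=2.0)

# After normalization both vectors have the same norm (about spk_scale),
# so the raw magnitude of the reference embedding no longer matters.
print(round(l2_norm(quiet_out), 4), round(l2_norm(loud_out), 4))
```

Because only the direction of the embedding survives, two references that differ in magnitude but not in direction inject the same signal into the decoder.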
## Model Architecture
| Component | Details |
|---|---|
| Text Encoder | 4-layer bidirectional Transformer (d=384, 6 heads, ff=1536) |
| Audio Decoder | 8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention |
| Speaker Injection | L2-normalized Linear(128 → 384) with learned scale, additive bias |
| Audio Codec | [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) 25Hz, 1 codebook, 12800 codes, 24kHz output |
| Total Parameters | 38.2M (Encoder: 9.6M, Decoder: 28.6M) |
| Activations | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Positional Encoding | Learned (encoder), RoPE (decoder) |
| Embeddings | Tied decoder (lm_head = token_embedding) |
| KV-Cache | Yes (for fast autoregressive inference) |
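
The parameter split in the table can be sanity-checked with a back-of-envelope count. This is only an estimate under assumptions that the table does not state: bias-free linear layers, three weight matrices per SwiGLU block, and norm weights, positional embeddings, the text embedding and the speaker projection all ignored. The vocabulary size is taken from the Tokenizer section.

```python
D_MODEL = 384
D_FF = 1536
VOCAB_SIZE = 12_955       # from the Tokenizer section
ENCODER_LAYERS = 4
DECODER_LAYERS = 8

attention = 4 * D_MODEL * D_MODEL    # Q, K, V and output projections
swiglu = 3 * D_MODEL * D_FF          # gate, up and down projections

encoder_layer = attention + swiglu
decoder_layer = 2 * attention + swiglu   # self-attention + cross-attention

encoder_blocks = ENCODER_LAYERS * encoder_layer
decoder_blocks = DECODER_LAYERS * decoder_layer
tied_embedding = VOCAB_SIZE * D_MODEL    # shared by token embedding and lm_head

print(f"encoder blocks: {encoder_blocks / 1e6:.1f}M")
print(f"decoder blocks + tied embedding: {(decoder_blocks + tied_embedding) / 1e6:.1f}M")
```

Under these assumptions the estimate lands close to the reported encoder and decoder sizes; the small remaining gap on the encoder side would be covered by the components the sketch leaves out.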
### Tokenizer
Character-level tokenizer supporting 146 characters:
- Bulgarian Cyrillic (А-Я, а-я)
- English Latin (A-Z, a-z)
- Digits, punctuation, whitespace
Total vocabulary: **12,955 tokens** (9 special + 146 text + 12,800 audio codes)
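
The vocabulary arithmetic above can be written out as ID ranges. The counts come from the text; the ordering of the ranges (special, then text, then audio) and the helper names are assumptions made for illustration, so check `tokenizer.py` for the actual layout.

```python
# Counts taken from the text above; the ordering of the three ranges
# (special, then text, then audio) is an assumption for illustration.
NUM_SPECIAL = 9
NUM_TEXT = 146
NUM_AUDIO = 12_800

TEXT_OFFSET = NUM_SPECIAL
AUDIO_OFFSET = NUM_SPECIAL + NUM_TEXT
VOCAB_SIZE = NUM_SPECIAL + NUM_TEXT + NUM_AUDIO

def audio_code_to_token_id(code):
    """Map a codec code (0..12799) to a hypothetical vocabulary ID."""
    if not 0 <= code < NUM_AUDIO:
        raise ValueError(f"audio code out of range: {code}")
    return AUDIO_OFFSET + code

def token_id_to_audio_code(token_id):
    """Inverse mapping; rejects IDs outside the audio range."""
    if not AUDIO_OFFSET <= token_id < VOCAB_SIZE:
        raise ValueError(f"not an audio token: {token_id}")
    return token_id - AUDIO_OFFSET

print(VOCAB_SIZE)  # 12955
```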
## Training
| Parameter | Value |
|---|---|
| **Data** | 728K samples, **1,537 hours** total |
| Bulgarian | ~620K samples (~1,368 hours) |
| English | ~108K samples (~169 hours) |
| **Epochs** | 20 |
| **LR Schedule** | Cosine decay, peak 7e-5, warmup 2 epochs, min 5e-6 |
| **Batch Size** | 64 |
| **Optimizer** | AdamW (betas=0.9, 0.999), weight decay 0.01 |
| **Precision** | BF16 (no GradScaler) |
| **Dropout** | 0.0 (unnecessary — model is 38M, data is 1,537h) |
| **Final Loss** | 5.04 |
| **Hardware** | NVIDIA RTX 5090 (32GB VRAM) |
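
The learning-rate schedule in the table can be sketched as a function of the epoch. The peak, minimum, warmup length and epoch count come from the table; the linear shape of the warmup, the start from zero, and the function name `lr_at` are assumptions, and `train.py` may update the rate per step rather than per epoch.

```python
import math

PEAK_LR = 7e-5
MIN_LR = 5e-6
WARMUP_EPOCHS = 2
TOTAL_EPOCHS = 20

def lr_at(epoch):
    """Learning rate at a (possibly fractional) epoch in [0, TOTAL_EPOCHS]."""
    if epoch < WARMUP_EPOCHS:
        # Assumed linear warmup from 0 to the peak.
        return PEAK_LR * epoch / WARMUP_EPOCHS
    # Cosine decay from the peak down to the minimum.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 1, 2, 11, 20):
    print(epoch, f"{lr_at(epoch):.2e}")
```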
### Why Zero Dropout?
With 38.2M parameters and roughly 138M audio tokens (1,537 hours at 25 tokens per second), the model has about **0.28 parameters per token**. At this ratio the model is far more likely to underfit than to overfit, so dropout mainly slows convergence without adding useful regularization.
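
The token count and ratio quoted above follow from the dataset size and the 25 Hz codec rate. A quick check, using the 38.2M total from the architecture table:

```python
HOURS = 1_537
CODEC_RATE_HZ = 25          # MioCodec tokens per second of audio
PARAMETERS = 38_200_000     # 38.2M from the architecture table

audio_tokens = HOURS * 3600 * CODEC_RATE_HZ
params_per_token = PARAMETERS / audio_tokens

print(f"{audio_tokens:,} audio tokens")
print(f"{params_per_token:.2f} parameters per token")
```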
## Quick Start
### Requirements
```bash
pip install torch torchaudio soundfile miocodec
```
### Python API
```python
import torch
from model import load_for_inference
from tokenizer import TTSTokenizer
from codec import CodecV6
from inference import generate
# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load model
model = load_for_inference("checkpoint_inference.pt", device=device)
tokenizer = TTSTokenizer()
codec = CodecV6(device=device)
# Get speaker embedding from reference audio
ref = codec.encode("reference_speaker.wav")
speaker_emb = ref["global_embedding"].to(device)
# Generate
codes = generate(
    model, tokenizer,
    text="Здравейте, как сте днес?",
    speaker_emb=speaker_emb,
    temperature=0.3,
    top_k=250,
    max_new_tokens=512,
    device=device,
)
# Decode to audio (skipped if generation returned nothing)
if codes is not None:
    wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")
```
### CLI
```bash
python inference.py \
    --checkpoint checkpoint_inference.pt \
    --text "Здравейте, как сте днес?" \
    --speaker-wav reference.wav \
    --output output.wav \
    --temperature 0.3
```
### Web UI (Gradio)
```bash
python server.py
# Opens at http://localhost:7860
```
### Parameters
| Parameter | Default | Description |
|---|---|---|
| `--temperature` | 0.3 | Sampling temperature (lower = stable, higher = expressive) |
| `--top-k` | 250 | Top-k filtering |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--rep-penalty` | 1.1 | Repetition penalty on recent tokens |
| `--max-tokens` | 512 | Maximum decoder steps (~20 seconds) |
**Recommended temperature: 0.3** for clean, stable output. Use 0.5-0.7 for more expressive/varied speech.
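
To show how the four sampling parameters interact, here is a minimal pure-Python sketch of one decoding step. It is not the code in `inference.py`: the function name `sample_next_token`, the order in which the filters are applied, and the repetition-penalty convention (divide positive logits, multiply negative ones) are assumptions based on common practice.

```python
import math
import random

def sample_next_token(logits, recent_tokens=(), temperature=0.3, top_k=250,
                      top_p=0.95, rep_penalty=1.1, rng=random):
    """Pick one token index from raw logits; temperature must be > 0."""
    logits = list(logits)

    # Repetition penalty: make recently emitted tokens less likely.
    for token in set(recent_tokens):
        if logits[token] > 0:
            logits[token] /= rep_penalty
        else:
            logits[token] *= rep_penalty

    # Temperature: values below 1 sharpen the distribution.
    scaled = [value / temperature for value in logits]

    # Top-k: keep only the k highest-scoring candidates.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]

    # Softmax over the survivors (shifted by the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    cumulative, cutoff = 0.0, len(order)
    for position, p in enumerate(probs):
        cumulative += p
        if cumulative >= top_p:
            cutoff = position + 1
            break
    order, probs = order[:cutoff], probs[:cutoff]

    # random.choices renormalizes the remaining weights.
    return rng.choices(order, weights=probs, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0]   # toy 4-token vocabulary
print(sample_next_token(toy_logits, recent_tokens=[1], rng=rng))
```

At a low temperature such as 0.3 the distribution is sharp, so nucleus filtering often leaves very few candidates; raising the temperature to 0.5-0.7 lets more candidates through.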
## ⚠️ Important: Sentence Length
> The encoder supports up to **256 characters** (~18 seconds of audio). For longer texts, `inference.py` automatically splits by sentence boundaries and concatenates the audio. No manual splitting needed.
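
For readers who want to see what such splitting might look like, the sketch below packs sentences into chunks of at most 256 characters. It is an illustration, not the logic in `inference.py`: the names `split_text` and `_split_words`, the punctuation set used as sentence boundaries, and the fallback to word-level splitting are all assumptions.

```python
import re

MAX_CHARS = 256  # encoder limit stated above

def _split_words(sentence, max_chars):
    """Greedy word-level split; a word longer than max_chars is hard-cut."""
    parts, current = [], ""
    for word in sentence.split():
        while len(word) > max_chars:
            if current:
                parts.append(current)
                current = ""
            parts.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts

def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split after ., !, ? or … when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Fall back to word boundaries for a single over-long sentence.
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks

print(split_text("Първо изречение. Второ изречение! Трето?", max_chars=20))
```

Each chunk would then be synthesized on its own and the resulting waveforms concatenated in order.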
## Files
```
checkpoint_inference.pt # Model weights only (146 MB)
checkpoint.pt # Full checkpoint with optimizer state (438 MB, for continued training)
config.py # Model configuration
model.py # Architecture (TTSEncoderDecoder + speaker normalization)
tokenizer.py # Character-level tokenizer
codec.py # MioCodec wrapper
inference.py # Inference pipeline with KV-cache + sentence splitting
train.py # Training script (BF16)
server.py # Gradio web UI
samples/ # Audio samples (3 voices × 2 languages × 3 texts)
```
## Performance
### Benchmarks
| Hardware | RTF | Speed | Notes |
|---|---|---|---|
| **Intel i3-9100F (CPU)** | **0.30** | **3.3× real-time** | **Windows 10, CPU-only, no GPU** |
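
For reference, the real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so the speed multiple is its reciprocal. The sketch below uses made-up timings to show the relationship, plus the conversion from decoder tokens to seconds at the 25 Hz codec rate; the function names are illustrative.

```python
CODEC_RATE_HZ = 25  # MioCodec tokens per second of audio

def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

def tokens_to_seconds(num_tokens, rate_hz=CODEC_RATE_HZ):
    """Audio duration covered by a number of codec tokens."""
    return num_tokens / rate_hz

# Hypothetical timing: 3 s of compute for 10 s of audio.
rtf = real_time_factor(3.0, 10.0)
print(f"RTF {rtf:.2f}, {1 / rtf:.1f}x real-time")
print(f"512 tokens = {tokens_to_seconds(512):.2f} s of audio")
```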
### CPU-only Deployment (Tested on Windows 10)
| Component | Disk Space |
|---|---|
| Python venv (PyTorch CPU + deps) | 654 MB |
| BgTTS-38M-V2 (checkpoint + code) | 146 MB |
| MioCodec (auto-downloaded, cached) | 499 MB |
| WavLM base+ (auto-downloaded, cached) | 872 MB |
| **Total** | **2.12 GB** |
No NVIDIA GPU, no CUDA, no special drivers needed. Works on any x86-64 machine with Python 3.8+.
## Comparison with Other Models
| Model | Parameters | Size | Languages | Voice Cloning | Open Source |
|---|---|---|---|---|---|
| **BgTTS-38M V2** | **38M** | **146 MB** | BG + EN | ✅ | ✅ |
| Kokoro-82M | 82M | ~200 MB | Multi | ❌ | ✅ |
| XTTS-v2 | ~467M | ~1.8 GB | 16 | ✅ | ✅ |
| CSM-1B | 1B | ~4 GB | EN | ✅ | ✅ |
| Dia-1.6B | 1.6B | ~6.4 GB | EN | ✅ | ✅ |
BgTTS-38M V2 is the **smallest TTS model with voice cloning** we are aware of, and the **only** open-source TTS model with native Bulgarian language support.
## Limitations
- Best with sentences up to ~18 seconds. Longer texts are auto-split by `inference.py`.
- Bulgarian quality is better than English: per the Training table, about 85% of training samples (roughly 89% of hours) are Bulgarian.
- Voice cloning quality depends on reference audio clarity — use clean recordings without background noise.
- No explicit prosody control (pitch, speed) — these are implicitly learned from data.
- Character-level tokenizer may struggle with rare Unicode characters outside the supported set.
## License
Apache 2.0
## Citation
```bibtex
@misc{bgtts38mv2,
title={BgTTS-38M V2: Bulgarian Text-to-Speech with Voice Cloning and Speaker Normalization},
author={beleata74},
year={2026},
url={https://huggingface.co/beleata74/BgTTS-38M-V2}
}
```