---
license: apache-2.0
language:
- bg
- en
pipeline_tag: text-to-speech
tags:
- tts
- bulgarian
- miocodec
- encoder-decoder
- voice-cloning
- speech-synthesis
library_name: pytorch
---

# BgTTS-38M V2: Bulgarian Text-to-Speech with Voice Cloning

A lightweight **38M parameter** encoder-decoder TTS model for **Bulgarian and English** speech synthesis with **zero-shot voice cloning** via [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz).

**V2 improvements over V1:**

- **Speaker normalization**: stable voice quality across all reference audio files
- **Larger training dataset**: 1,537 hours (vs 1,172h in V1)
- **BF16 training**: more stable gradients, no GradScaler needed
- **Zero dropout**: better utilization of model capacity
- **20 epochs** with careful LR scheduling

## Audio Samples

Audio samples for each voice and language are in the `samples/` directory:

- Female Voice (Bulgarian)
- Female Voice (English)
- Male Voice 1 (Bulgarian)
- Male Voice 1 (English)
- Male Voice 2 (Bulgarian)
- Male Voice 2 (English)

## Key Features

- **Bilingual**: Native Bulgarian + English in a single model
- **Voice cloning**: Zero-shot, needs only 3-10 seconds of reference audio
- **Tiny footprint**: 146 MB inference checkpoint, runs on CPU
- **Fast**: RTF ~0.3 on both GPU and CPU (about 3.3× real-time speed)
- **Speaker-stable**: V2's normalized speaker embedding ensures consistent quality regardless of reference audio

## 🎙️ Voice Cloning

This model supports zero-shot voice cloning: it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.

### How it Works

1. Provide a reference audio (3-10 seconds of clear speech, WAV format, ideally 24kHz)
2. MioCodec extracts a 128-dimensional speaker embedding (`global_embedding`)
3. The embedding is **L2-normalized** and scaled by a learned parameter (`spk_scale`) before being added to the decoder
4. The same embedding is used for MioCodec waveform reconstruction

### V2 Improvement: Speaker Normalization

In V1, the speaker embedding had a 7× larger norm than the content tokens, causing the model to over-rely on the reference audio for pronunciation quality. V2 normalizes the speaker vector to unit norm, ensuring:

- **Consistent quality** across all reference voices
- The model learns speech patterns from data, not from speaker shortcuts
- Reference audio only affects **timbre**, not articulation

## Model Architecture

| Component | Details |
|---|---|
| Text Encoder | 4-layer bidirectional Transformer (d=384, 6 heads, ff=1536) |
| Audio Decoder | 8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention |
| Speaker Injection | L2-normalized Linear(128 → 384) with learned scale, additive bias |
| Audio Codec | [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) 25Hz, 1 codebook, 12800 codes, 24kHz output |
| Total Parameters | 38.2M (Encoder: 9.6M, Decoder: 28.6M) |
| Activations | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Positional Encoding | Learned (encoder), RoPE (decoder) |
| Embeddings | Tied decoder (lm_head = token_embedding) |
| KV-Cache | Yes (for fast autoregressive inference) |

### Tokenizer

Character-level tokenizer supporting 146 characters:

- Bulgarian Cyrillic (А-Я, а-я)
- English Latin (A-Z, a-z)
- Digits, punctuation, whitespace

Total vocabulary: **12,955 tokens** (9 special + 146 text + 12,800 audio codes)

## Training

| Parameter | Value |
|---|---|
| **Data** | 728K samples, **1,537 hours** total |
| Bulgarian | ~620K samples (~1,368 hours) |
| English | ~108K samples (~169 hours) |
| **Epochs** | 20 |
| **LR Schedule** | Cosine decay, peak 7e-5, warmup 2 epochs, min 5e-6 |
| **Batch Size** | 64 |
| **Optimizer** | AdamW (betas=0.9, 0.999), weight decay 0.01 |
| **Precision** | BF16 (no GradScaler) |
| **Dropout** | 0.0 (unnecessary: 38M parameters, 1,537h of data) |
| **Final Loss** | 5.04 |
| **Hardware** | NVIDIA RTX 5090 (32GB VRAM) |

### Why Zero Dropout?

With 38M parameters and about 138M audio tokens (1,537 hours at 25 tokens per second), the model has roughly **0.28 parameters per token**. In this regime the model is far more likely to underfit than to overfit, so dropout would mainly slow convergence without a meaningful regularization benefit.

## Quick Start

### Requirements

```bash
pip install torch torchaudio soundfile miocodec
```

### Python API

```python
import torch
from model import load_for_inference
from tokenizer import TTSTokenizer
from codec import CodecV6
from inference import generate

device = "cuda"  # or "cpu"

# Load model, tokenizer, and the MioCodec wrapper
model = load_for_inference("checkpoint_inference.pt", device=device)
tokenizer = TTSTokenizer()
codec = CodecV6(device=device)

# Get speaker embedding from reference audio
ref = codec.encode("reference_speaker.wav")
speaker_emb = ref["global_embedding"].to(device)

# Generate audio codes for the input text
codes = generate(
    model,
    tokenizer,
    text="Здравейте, как сте днес?",  # "Hello, how are you today?"
    speaker_emb=speaker_emb,
    temperature=0.3,
    top_k=250,
    max_new_tokens=512,
    device=device,
)

# Decode to audio
if codes is not None:
    wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")
```

### CLI

```bash
python inference.py \
    --checkpoint checkpoint_inference.pt \
    --text "Здравейте, как сте днес?" \
    --speaker-wav reference.wav \
    --output output.wav \
    --temperature 0.3
```

### Web UI (Gradio)

```bash
python server.py
# Opens at http://localhost:7860
```

### Parameters

| Parameter | Default | Description |
|---|---|---|
| `--temperature` | 0.3 | Sampling temperature (lower = stable, higher = expressive) |
| `--top-k` | 250 | Top-k filtering |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--rep-penalty` | 1.1 | Repetition penalty on recent tokens |
| `--max-tokens` | 512 | Maximum decoder steps (~20 seconds) |

**Recommended temperature: 0.3** for clean, stable output. Use 0.5-0.7 for more expressive/varied speech.
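
To make the interaction of the sampling parameters above more concrete, here is a minimal pure-Python sketch of one decoding step. It is not the code in `inference.py`: the function names `sample_next_token` and `max_audio_seconds`, the order in which the filters are applied, and the way the list of recent tokens is passed in are all assumptions made for illustration.

```python
import math
import random


def sample_next_token(logits, recent_tokens, temperature=0.3, top_k=250,
                      top_p=0.95, rep_penalty=1.1, rng=random):
    """Pick one token id from raw logits (hypothetical helper, not from inference.py)."""
    logits = list(logits)

    # Repetition penalty: make recently emitted tokens less likely.
    for tok in set(recent_tokens):
        if logits[tok] > 0:
            logits[tok] /= rep_penalty
        else:
            logits[tok] *= rep_penalty

    # Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring candidates.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    ranked = ranked[:top_k]

    # Softmax over the surviving candidates (shifted for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in zip(ranked, probs):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens, kept_probs = zip(*kept)
    return rng.choices(tokens, weights=kept_probs, k=1)[0]


def max_audio_seconds(max_tokens=512, frame_rate_hz=25):
    """Upper bound on audio length, assuming one codec token per 25 Hz frame."""
    return max_tokens / frame_rate_hz
```

Under these assumptions, the default of 512 decoder steps at MioCodec's 25 Hz frame rate caps a single generation at 512 / 25 = 20.48 seconds, which is where the "~20 seconds" in the table comes from. Lowering `temperature` or `top_k` narrows the set of candidate tokens, which is why low values give more stable but less varied output.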

## ⚠️ Important: Sentence Length

> The encoder supports up to **256 characters** (~18 seconds of audio). For longer texts, `inference.py` automatically splits the input at sentence boundaries and concatenates the audio. No manual splitting needed.

## Files

```
checkpoint_inference.pt   # Model weights only (146 MB)
checkpoint.pt             # Full checkpoint with optimizer state (438 MB, for continued training)
config.py                 # Model configuration
model.py                  # Architecture (TTSEncoderDecoder + speaker normalization)
tokenizer.py              # Character-level tokenizer
codec.py                  # MioCodec wrapper
inference.py              # Inference pipeline with KV-cache + sentence splitting
train.py                  # Training script (BF16)
server.py                 # Gradio web UI
samples/                  # Audio samples (3 voices × 2 languages × 3 texts)
```

## Performance

### Benchmarks

| Hardware | RTF | Speed | Notes |
|---|---|---|---|
| **Intel i3-9100F (CPU)** | **0.30** | **3.3× real-time** | **Windows 10, CPU-only, no GPU** |

### CPU-only Deployment (Tested on Windows 10)

| Component | Disk Space |
|---|---|
| Python venv (PyTorch CPU + deps) | 654 MB |
| BgTTS-38M-V2 (checkpoint + code) | 146 MB |
| MioCodec (auto-downloaded, cached) | 499 MB |
| WavLM base+ (auto-downloaded, cached) | 872 MB |
| **Total** | **2.12 GB** |

No NVIDIA GPU, no CUDA, no special drivers needed. Works on any x86-64 machine with Python 3.8+.

## Comparison with Other Models

| Model | Parameters | Size | Languages | Voice Cloning | Open Source |
|---|---|---|---|---|---|
| **BgTTS-38M V2** | **38M** | **146 MB** | BG + EN | ✅ | ✅ |
| Kokoro-82M | 82M | ~200 MB | Multi | ❌ | ✅ |
| XTTS-v2 | ~467M | ~1.8 GB | 16 | ✅ | ✅ |
| CSM-1B | 1B | ~4 GB | EN | ✅ | ✅ |
| Dia-1.6B | 1.6B | ~6.4 GB | EN | ✅ | ✅ |

To our knowledge, BgTTS-38M V2 is the **smallest TTS model with voice cloning** and the **only** open-source TTS model with native Bulgarian language support.

## Limitations

- Best with sentences up to ~18 seconds. Longer texts are auto-split by `inference.py`.
- Bulgarian quality is superior to English, since Bulgarian accounts for roughly 89% of the training audio (~1,368 of 1,537 hours).
- Voice cloning quality depends on reference audio clarity, so use clean recordings without background noise.
- No explicit prosody control (pitch, speed); these are learned implicitly from data.
- Character-level tokenizer may struggle with rare Unicode characters outside the supported set.

## License

Apache 2.0

## Citation

```bibtex
@misc{bgtts38mv2,
  title={BgTTS-38M V2: Bulgarian Text-to-Speech with Voice Cloning and Speaker Normalization},
  author={beleata74},
  year={2026},
  url={https://huggingface.co/beleata74/BgTTS-38M-V2}
}
```