| ---
|
| license: apache-2.0
|
| language:
|
| - bg
|
| - en
|
| pipeline_tag: text-to-speech
|
| tags:
|
| - tts
|
| - bulgarian
|
| - miocodec
|
| - encoder-decoder
|
| - voice-cloning
|
| - speech-synthesis
|
| library_name: pytorch
|
| ---
|
|
|
| # BgTTS-38M V2 — Bulgarian Text-to-Speech with Voice Cloning
|
|
|
| A lightweight **38M parameter** encoder-decoder TTS model for **Bulgarian and English** speech synthesis with **zero-shot voice cloning** via [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz).
|
|
|
| **V2 improvements over V1:**
|
| - **Speaker normalization** — stable voice quality across all reference audio files
|
| - **Larger training dataset** — 1,537 hours (vs 1,172h in V1)
|
| - **BF16 training** — more stable gradients, no GradScaler needed
|
| - **Zero dropout** — better utilization of model capacity
|
| - **20 epochs** with careful LR scheduling
|
|
|
| ## Audio Samples
|
|
|
| ### Female Voice (Bulgarian)
|
|
|
| <audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_female_bg1.wav"></audio>
|
|
|
| ### Female Voice (English)
|
|
|
| <audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_female_en1.wav"></audio>
|
|
|
| ### Male Voice 1 (Bulgarian)
|
|
|
| <audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_male_bg1.wav"></audio>
|
|
|
| ### Male Voice 1 (English)
|
|
|
| <audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_male_en1.wav"></audio>
|
|
|
| ### Male Voice 2 (Bulgarian)
|
|
|
| <audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_male2_bg1.wav"></audio>
|
|
|
| ### Male Voice 2 (English)
|
|
|
| <audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_male2_en1.wav"></audio>
|
|
|
| ## Key Features
|
|
|
| - **Bilingual**: Native Bulgarian + English in a single model
|
| - **Voice cloning**: Zero-shot — just provide 3-10 seconds of reference audio
|
| - **Tiny footprint**: 146 MB inference checkpoint, runs on CPU
|
| - **Fast**: RTF ~0.3 on both GPU and CPU (speech is generated about 3.3× faster than it plays back)
|
| - **Speaker-stable**: V2's normalized speaker embedding ensures consistent quality regardless of reference audio
|
|
|
| ## 🎙️ Voice Cloning
|
|
|
| This model supports zero-shot voice cloning — it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.
|
|
|
| ### How it Works
|
|
|
| 1. Provide a reference audio clip (3-10 seconds of clear speech, WAV format, ideally 24kHz)
|
| 2. MioCodec extracts a 128-dimensional speaker embedding (`global_embedding`)
|
| 3. The embedding is **L2-normalized** and scaled by a learned parameter (`spk_scale`) before being added to the decoder
|
| 4. The same embedding is used for MioCodec waveform reconstruction
|
|
|
| ### V2 Improvement: Speaker Normalization
|
|
|
| In V1, the speaker embedding had a norm about 7× larger than that of the content token embeddings, so the model over-relied on the reference audio for pronunciation quality. V2 normalizes the speaker vector to unit norm, which ensures:
|
| - **Consistent quality** across all reference voices
|
| - The model learns speech patterns from data, not from speaker shortcuts
|
| - Reference audio only affects **timbre**, not articulation
|
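The effect of the normalization can be illustrated with a minimal sketch in plain Python. The function names, the toy 4-dimensional vectors, and the scale value are illustrative assumptions; in the model the embedding is 128-dimensional and `spk_scale` is learned, and `model.py` remains the reference for the actual implementation.

```python
import math

def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def normalize_speaker(embedding, spk_scale=1.0, eps=1e-8):
    """L2-normalize a speaker embedding, then apply a scalar scale.

    spk_scale stands in for the learned parameter; 1.0 is a placeholder.
    """
    norm = l2_norm(embedding)
    return [spk_scale * x / (norm + eps) for x in embedding]

# Two hypothetical reference embeddings that differ only in magnitude.
small = [0.5, -1.0, 0.25, 2.0]
large = [7.0 * x for x in small]

small_n = normalize_speaker(small, spk_scale=2.0)
large_n = normalize_speaker(large, spk_scale=2.0)

# After normalization both vectors have (almost exactly) length spk_scale,
# so the decoder sees the same magnitude whatever the reference clip.
print(round(l2_norm(small_n), 4), round(l2_norm(large_n), 4))
```

Only the direction of the embedding survives normalization, which is consistent with the reference audio influencing timbre rather than overall signal strength.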
|
|
| ## Model Architecture
|
|
|
| | Component | Details |
|
| |---|---|
|
| | Text Encoder | 4-layer bidirectional Transformer (d=384, 6 heads, ff=1536) |
|
| | Audio Decoder | 8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention |
|
| | Speaker Injection | L2-normalized Linear(128 → 384) with learned scale, additive bias |
|
| | Audio Codec | [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) 25Hz, 1 codebook, 12800 codes, 24kHz output |
|
| | Total Parameters | 38.2M (Encoder: 9.6M, Decoder: 28.6M) |
|
| | Activations | SwiGLU |
|
| | Normalization | RMSNorm (pre-norm) |
|
| | Positional Encoding | Learned (encoder), RoPE (decoder) |
|
| | Embeddings | Tied decoder (lm_head = token_embedding) |
|
| | KV-Cache | Yes (for fast autoregressive inference) |
|
|
|
| ### Tokenizer
|
|
|
| Character-level tokenizer supporting 146 characters:
|
| - Bulgarian Cyrillic (А-Я, а-я)
|
| - English Latin (A-Z, a-z)
|
| - Digits, punctuation, whitespace
|
|
|
| Total vocabulary: **12,955 tokens** (9 special + 146 text + 12,800 audio codes)
|
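The vocabulary size follows directly from the three groups. The sketch below assumes the groups are laid out contiguously in the order special, text, audio; that ordering and the names `AUDIO_OFFSET` and `audio_code_to_token_id` are assumptions for illustration, not taken from `tokenizer.py`.

```python
NUM_SPECIAL = 9        # special tokens
NUM_TEXT = 146         # characters supported by the tokenizer
NUM_AUDIO = 12_800     # MioCodec codebook entries

VOCAB_SIZE = NUM_SPECIAL + NUM_TEXT + NUM_AUDIO

# Assumed layout: [special | text | audio]
TEXT_OFFSET = NUM_SPECIAL
AUDIO_OFFSET = NUM_SPECIAL + NUM_TEXT

def audio_code_to_token_id(code):
    """Map a codec index (0..12799) to a vocabulary ID under the assumed layout."""
    if not 0 <= code < NUM_AUDIO:
        raise ValueError(f"codec index out of range: {code}")
    return AUDIO_OFFSET + code

print(VOCAB_SIZE)                      # 12955
print(audio_code_to_token_id(12_799))  # 12954
```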
|
|
| ## Training
|
|
|
| | Parameter | Value |
|
| |---|---|
|
| | **Data** | 728K samples, **1,537 hours** total |
|
| | Bulgarian | ~620K samples (~1,368 hours) |
|
| | English | ~108K samples (~169 hours) |
|
| | **Epochs** | 20 |
|
| | **LR Schedule** | Cosine decay, peak 7e-5, warmup 2 epochs, min 5e-6 |
|
| | **Batch Size** | 64 |
|
| | **Optimizer** | AdamW (betas=(0.9, 0.999)), weight decay 0.01 |
|
| | **Precision** | BF16 (no GradScaler) |
|
| | **Dropout** | 0.0 (unnecessary — model is 38M, data is 1,537h) |
|
| | **Final Loss** | 5.04 |
|
| | **Hardware** | NVIDIA RTX 5090 (32GB VRAM) |
|
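The learning-rate schedule in the table can be sketched as a function of the optimizer step. This is a minimal sketch that assumes linear warmup from near zero and per-step updates; the function name `lr_at` is hypothetical, and `train.py` may differ in detail.

```python
import math

PEAK_LR = 7e-5
MIN_LR = 5e-6
WARMUP_EPOCHS = 2
TOTAL_EPOCHS = 20

def lr_at(step, steps_per_epoch):
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    warmup_steps = WARMUP_EPOCHS * steps_per_epoch
    total_steps = TOTAL_EPOCHS * steps_per_epoch
    if step < warmup_steps:
        # Assumed: ramp linearly up to the peak over the warmup epochs.
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

# 728K samples at batch size 64 gives the number of steps per epoch.
steps_per_epoch = math.ceil(728_000 / 64)
for epoch in (0, 2, 11, 20):
    print(epoch, lr_at(epoch * steps_per_epoch, steps_per_epoch))
```

Under these assumptions the rate reaches the peak of 7e-5 at the end of epoch 2 and decays to the minimum of 5e-6 at the end of epoch 20.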
|
|
| ### Why Zero Dropout?
|
|
|
| With 38.2M parameters and about 138M audio tokens (1,537 hours at 25 tokens per second), the model has roughly **0.28 parameters per token**. In this regime the model is limited by capacity rather than by data, so overfitting is very unlikely and the model underfits the data. Dropout would mainly slow convergence while adding little regularization benefit.
|
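The token count and the ratio follow from the training hours and the 25 Hz codec frame rate listed in the architecture table. A short check of the arithmetic:

```python
HOURS = 1_537
FRAME_RATE_HZ = 25          # MioCodec emits 25 tokens per second
PARAMS = 38.2e6             # total parameters from the architecture table

audio_tokens = HOURS * 3600 * FRAME_RATE_HZ
params_per_token = PARAMS / audio_tokens

print(f"{audio_tokens / 1e6:.1f}M audio tokens")       # 138.3M audio tokens
print(f"{params_per_token:.2f} parameters per token")  # 0.28 parameters per token
```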
|
|
| ## Quick Start
|
|
|
| ### Requirements
|
|
|
| ```bash
|
| pip install torch torchaudio soundfile miocodec
|
| ```
|
|
|
| ### Python API
|
|
|
| ```python
|
| import torch
|
| from model import load_for_inference
|
| from tokenizer import TTSTokenizer
|
| from codec import CodecV6
|
| from inference import generate
|
|
|
| device = "cuda" if torch.cuda.is_available() else "cpu"  # falls back to CPU when no GPU is present
|
|
|
| # Load model
|
| model = load_for_inference("checkpoint_inference.pt", device=device)
|
| tokenizer = TTSTokenizer()
|
| codec = CodecV6(device=device)
|
|
|
| # Get speaker embedding from reference audio
|
| ref = codec.encode("reference_speaker.wav")
|
| speaker_emb = ref["global_embedding"].to(device)
|
|
|
| # Generate
|
| codes = generate(
|
| model, tokenizer,
|
| text="Здравейте, как сте днес?",
|
| speaker_emb=speaker_emb,
|
| temperature=0.3,
|
| top_k=250,
|
| max_new_tokens=512,
|
| device=device,
|
| )
|
|
|
| # Decode to audio
|
| if codes is not None:
|
| wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")
|
| ```
|
|
|
| ### CLI
|
|
|
| ```bash
|
| python inference.py \
|
| --checkpoint checkpoint_inference.pt \
|
| --text "Здравейте, как сте днес?" \
|
| --speaker-wav reference.wav \
|
| --output output.wav \
|
| --temperature 0.3
|
| ```
|
|
|
| ### Web UI (Gradio)
|
|
|
| ```bash
|
| python server.py
|
| # Opens at http://localhost:7860
|
| ```
|
|
|
| ### Parameters
|
|
|
| | Parameter | Default | Description |
|
| |---|---|---|
|
| | `--temperature` | 0.3 | Sampling temperature (lower = more stable, higher = more expressive) |
|
| | `--top-k` | 250 | Top-k filtering |
|
| | `--top-p` | 0.95 | Nucleus sampling threshold |
|
| | `--rep-penalty` | 1.1 | Repetition penalty on recent tokens |
|
| | `--max-tokens` | 512 | Maximum decoder steps (~20 seconds) |
|
|
|
| **Recommended temperature: 0.3** for clean, stable output. Use 0.5-0.7 for more expressive/varied speech.
|
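How these parameters interact can be sketched in plain Python. This is an illustrative sampler, not the code in `inference.py`: the function name `sample_next`, the order of the steps, and the exact form of the repetition penalty (divide positive logits, multiply negative ones) are assumptions.

```python
import math
import random

def sample_next(logits, recent, temperature=0.3, top_k=250, top_p=0.95,
                rep_penalty=1.1, rng=random):
    """Pick the next token ID from raw logits (a list of floats)."""
    logits = list(logits)
    # Repetition penalty: make recently emitted tokens less likely.
    for tok in set(recent):
        if logits[tok] > 0:
            logits[tok] /= rep_penalty
        else:
            logits[tok] *= rep_penalty
    # Temperature: values below 1 sharpen the distribution.
    logits = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring candidates.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    order = order[:top_k]
    # Softmax over the surviving candidates (shifted for numerical stability).
    peak = logits[order[0]]
    weights = [math.exp(logits[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in zip(order, probs):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    tokens, kept_probs = zip(*kept)
    return rng.choices(tokens, weights=kept_probs, k=1)[0]

# Toy example with a 4-token vocabulary; token 1 was emitted recently.
toy_logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next(toy_logits, recent=[1], rng=random.Random(0)))  # 0
```

At temperature 0.3 the top token in this toy example already holds more than 95% of the probability mass, so nucleus filtering leaves it as the only candidate. Raising the temperature flattens the distribution and lets more candidates through, which matches the more varied output described above.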
|
|
| ## ⚠️ Important: Sentence Length
|
|
|
| > The encoder supports up to **256 characters** (~18 seconds of audio). For longer texts, `inference.py` automatically splits by sentence boundaries and concatenates the audio. No manual splitting needed.
|
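The idea behind the automatic splitting can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic in `inference.py`: the function name `split_text`, the punctuation set, and the choice to pack several short sentences into one chunk are all hypothetical.

```python
import re

MAX_CHARS = 256  # encoder limit stated above

def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-wrap a single sentence that is itself too long.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        # Pack sentences together while they still fit in one chunk.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

for chunk in split_text("Здравейте! Как сте днес? Времето е хубаво.", max_chars=25):
    print(len(chunk), chunk)
```

Each chunk would then be synthesized separately and the resulting waveforms concatenated.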
|
|
| ## Files
|
|
|
| ```
|
| checkpoint_inference.pt # Model weights only (146 MB)
|
| checkpoint.pt # Full checkpoint with optimizer state (438 MB, for continued training)
|
| config.py # Model configuration
|
| model.py # Architecture (TTSEncoderDecoder + speaker normalization)
|
| tokenizer.py # Character-level tokenizer
|
| codec.py # MioCodec wrapper
|
| inference.py # Inference pipeline with KV-cache + sentence splitting
|
| train.py # Training script (BF16)
|
| server.py # Gradio web UI
|
| samples/ # Audio samples (3 voices × 2 languages × 3 texts)
|
| ```
|
|
|
| ## Performance
|
|
|
| ### Benchmarks
|
|
|
| | Hardware | RTF | Speed | Notes |
|
| |---|---|---|---|
|
| | **Intel i3-9100F (CPU)** | **0.30** | **3.3× real-time** | **Windows 10, CPU-only, no GPU** |
|
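The real-time factor (RTF) is the synthesis time divided by the duration of the audio produced, so lower is faster. The helper names below are illustrative; the 25 Hz frame rate and the 512-token default come from the tables in this card.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds

def audio_seconds_from_tokens(num_tokens, frame_rate_hz=25):
    """Duration of audio represented by a sequence of 25 Hz codec tokens."""
    return num_tokens / frame_rate_hz

audio_s = audio_seconds_from_tokens(512)   # the default --max-tokens
rtf = 0.30                                 # value from the table above
print(audio_s)                             # 20.48
print(round(1 / rtf, 1))                   # 3.3
print(round(rtf * audio_s, 2))             # 6.14
```

At an RTF of 0.30, a full 512-token generation (20.48 seconds of audio) would take about 6.14 seconds to synthesize.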
|
|
| ### CPU-only Deployment (Tested on Windows 10)
|
|
|
| | Component | Disk Space |
|
| |---|---|
|
| | Python venv (PyTorch CPU + deps) | 654 MB |
|
| | BgTTS-38M-V2 (checkpoint + code) | 146 MB |
|
| | MioCodec (auto-downloaded, cached) | 499 MB |
|
| | WavLM base+ (auto-downloaded, cached) | 872 MB |
|
| | **Total** | **2,171 MB (~2.12 GiB)** |
|
|
|
| No NVIDIA GPU, no CUDA, no special drivers needed. Works on any x86-64 machine with Python 3.8+.
|
|
|
| ## Comparison with Other Models
|
|
|
| | Model | Parameters | Size | Languages | Voice Cloning | Open Source |
|
| |---|---|---|---|---|---|
|
| | **BgTTS-38M V2** | **38M** | **146 MB** | BG + EN | ✅ | ✅ |
|
| | Kokoro-82M | 82M | ~200 MB | Multi | ❌ | ✅ |
|
| | XTTS-v2 | ~467M | ~1.8 GB | 16 | ✅ | ✅ |
|
| | CSM-1B | 1B | ~4 GB | EN | ✅ | ✅ |
|
| | Dia-1.6B | 1.6B | ~6.4 GB | EN | ✅ | ✅ |
|
|
|
| To our knowledge, BgTTS-38M V2 is the **smallest TTS model with voice cloning** and the **only** open-source TTS model with native Bulgarian language support.
|
|
|
| ## Limitations
|
|
|
| - Best with sentences up to ~18 seconds. Longer texts are auto-split by `inference.py`.
|
| - Bulgarian quality is superior to English (about 89% of the training audio, 1,368 of 1,537 hours, is Bulgarian).
|
| - Voice cloning quality depends on reference audio clarity — use clean recordings without background noise.
|
| - No explicit prosody control (pitch, speed) — these are implicitly learned from data.
|
| - Character-level tokenizer may struggle with rare Unicode characters outside the supported set.
|
|
|
| ## License
|
|
|
| Apache 2.0
|
|
|
| ## Citation
|
|
|
| ```bibtex
|
| @misc{bgtts38mv2,
|
| title={BgTTS-38M V2: Bulgarian Text-to-Speech with Voice Cloning and Speaker Normalization},
|
| author={beleata74},
|
| year={2026},
|
| url={https://huggingface.co/beleata74/BgTTS-38M-V2}
|
| }
|
| ```
|
|
|