BgTTS-38M V2 — Bulgarian Text-to-Speech with Voice Cloning

A lightweight 38M parameter encoder-decoder TTS model for Bulgarian and English speech synthesis with zero-shot voice cloning via MioCodec.

V2 improvements over V1:

  • Speaker normalization — stable voice quality across all reference audio files
  • Larger training dataset — 1,537 hours (vs 1,172h in V1)
  • BF16 training — more stable gradients, no GradScaler needed
  • Zero dropout — better utilization of model capacity
  • 20 epochs with careful LR scheduling

Audio Samples

Samples for three voices in Bulgarian and English are included in the samples/ directory:

  • Female Voice (Bulgarian, English)
  • Male Voice 1 (Bulgarian, English)
  • Male Voice 2 (Bulgarian, English)

Key Features

  • Bilingual: Native Bulgarian + English in a single model
  • Voice cloning: Zero-shot — just provide 3-10 seconds of reference audio
  • Tiny footprint: 146 MB inference checkpoint, runs on CPU
  • Fast: RTF ~0.3 on both GPU and CPU (3.3× faster than real-time)
  • Speaker-stable: V2's normalized speaker embedding ensures consistent quality regardless of reference audio

🎙️ Voice Cloning

This model supports zero-shot voice cloning — it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.

How it Works

  1. Provide a reference audio (3-10 seconds of clear speech, WAV format, ideally 24kHz)
  2. MioCodec extracts a 128-dimensional speaker embedding (global_embedding)
  3. The embedding is L2-normalized and scaled by a learned parameter (spk_scale) before being added to the decoder
  4. The same embedding is used for MioCodec waveform reconstruction

V2 Improvement: Speaker Normalization

In V1, the speaker embedding had a norm roughly 7× larger than that of the content tokens, causing the model to over-rely on the reference audio for pronunciation quality. V2 normalizes the speaker vector to unit norm, ensuring:

  • Consistent quality across all reference voices
  • The model learns speech patterns from data, not from speaker shortcuts
  • Reference audio only affects timbre, not articulation
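The normalization step can be sketched as follows. This is an illustrative NumPy version, not the actual model.py code: the function name is hypothetical, and the Linear(128 → 384) projection is assumed to have already happened upstream.

```python
import numpy as np

def inject_speaker(decoder_states, spk_emb, spk_scale=1.0):
    """Add an L2-normalized speaker vector to every decoder position.

    Normalizing to unit norm keeps the speaker signal from dominating
    the content tokens (the V1 failure mode described above); spk_scale
    stands in for the learned scale parameter.
    """
    spk = spk_emb / (np.linalg.norm(spk_emb) + 1e-8)  # unit norm
    return decoder_states + spk_scale * spk           # broadcast over time

# Whatever the magnitude of the reference embedding, the injected vector
# always contributes a norm of exactly spk_scale:
states = np.zeros((10, 384))
loud_emb = np.full(384, 3.0)                          # deliberately large norm
out = inject_speaker(states, loud_emb, spk_scale=2.0)
```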

Model Architecture

| Component | Details |
|---|---|
| Text Encoder | 4-layer bidirectional Transformer (d=384, 6 heads, ff=1536) |
| Audio Decoder | 8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention |
| Speaker Injection | L2-normalized Linear(128 → 384) with learned scale, additive bias |
| Audio Codec | MioCodec 25Hz, 1 codebook, 12800 codes, 24kHz output |
| Total Parameters | 38.2M (Encoder: 9.6M, Decoder: 28.6M) |
| Activations | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Positional Encoding | Learned (encoder), RoPE (decoder) |
| Embeddings | Tied decoder (lm_head = token_embedding) |
| KV-Cache | Yes (for fast autoregressive inference) |
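The table maps onto a small configuration object. The values below come from the table; the class and field names are illustrative and need not match config.py.

```python
from dataclasses import dataclass

@dataclass
class BgTTSConfig:
    """Hyperparameters from the architecture table (field names are illustrative)."""
    d_model: int = 384        # shared encoder/decoder width
    n_heads: int = 6
    ffn_dim: int = 1536
    encoder_layers: int = 4   # bidirectional text encoder
    decoder_layers: int = 8   # causal audio decoder with cross-attention
    speaker_dim: int = 128    # MioCodec global_embedding size
    audio_codes: int = 12_800 # single MioCodec codebook
    frame_rate_hz: int = 25   # audio tokens per second

cfg = BgTTSConfig()
```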

Tokenizer

Character-level tokenizer supporting 146 characters:

  • Bulgarian Cyrillic (А-Я, а-я)
  • English Latin (A-Z, a-z)
  • Digits, punctuation, whitespace

Total vocabulary: 12,955 tokens (9 special + 146 text + 12,800 audio codes)
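The vocabulary arithmetic works out as sketched below. The block ordering (special, then text, then audio) is an assumption for illustration; tokenizer.py may lay the blocks out differently.

```python
# Vocabulary layout implied by the counts above.
N_SPECIAL = 9
N_TEXT = 146
N_AUDIO = 12_800

VOCAB_SIZE = N_SPECIAL + N_TEXT + N_AUDIO  # 12,955 tokens total

def audio_code_to_token_id(code: int) -> int:
    """Map a MioCodec code (0..12799) into the shared vocab,
    assuming audio codes occupy the final block."""
    assert 0 <= code < N_AUDIO
    return N_SPECIAL + N_TEXT + code
```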

Training

| Parameter | Value |
|---|---|
| Data | 728K samples, 1,537 hours total |
| – Bulgarian | 620K samples (1,368 hours) |
| – English | 108K samples (169 hours) |
| Epochs | 20 |
| LR Schedule | Cosine decay, peak 7e-5, warmup 2 epochs, min 5e-6 |
| Batch Size | 64 |
| Optimizer | AdamW (betas=0.9, 0.999), weight decay 0.01 |
| Precision | BF16 (no GradScaler) |
| Dropout | 0.0 (unnecessary: model is 38M, data is 1,537h) |
| Final Loss | 5.04 |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |

Why Zero Dropout?

With only 38M parameters and ~138M audio tokens (1,537 hours), the model has roughly 0.28 parameters per token. At that ratio the model sits firmly in the underfitting regime, so overfitting is not a practical concern, and dropout would only slow convergence without providing any regularization benefit.
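The parameters-per-token figure follows directly from the frame rate and dataset size:

```python
# Arithmetic behind the ~0.28 parameters-per-token claim.
hours = 1537
tokens_per_second = 25                            # MioCodec frame rate
audio_tokens = hours * 3600 * tokens_per_second   # 138,330,000 tokens
params = 38.2e6

ratio = params / audio_tokens                     # ~0.28 params per token
```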

Quick Start

Requirements

pip install torch torchaudio soundfile miocodec

Python API

import torch
from model import load_for_inference
from tokenizer import TTSTokenizer
from codec import CodecV6
from inference import generate

device = "cuda"  # or "cpu"

# Load model
model = load_for_inference("checkpoint_inference.pt", device=device)
tokenizer = TTSTokenizer()
codec = CodecV6(device=device)

# Get speaker embedding from reference audio
ref = codec.encode("reference_speaker.wav")
speaker_emb = ref["global_embedding"].to(device)

# Generate
codes = generate(
    model, tokenizer,
    text="Здравейте, как сте днес?",
    speaker_emb=speaker_emb,
    temperature=0.3,
    top_k=250,
    max_new_tokens=512,
    device=device,
)

# Decode to audio
if codes is not None:
    wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")

CLI

python inference.py \
  --checkpoint checkpoint_inference.pt \
  --text "Здравейте, как сте днес?" \
  --speaker-wav reference.wav \
  --output output.wav \
  --temperature 0.3

Web UI (Gradio)

python server.py
# Opens at http://localhost:7860

Parameters

| Parameter | Default | Description |
|---|---|---|
| --temperature | 0.3 | Sampling temperature (lower = stable, higher = expressive) |
| --top-k | 250 | Top-k filtering |
| --top-p | 0.95 | Nucleus sampling threshold |
| --rep-penalty | 1.1 | Repetition penalty on recent tokens |
| --max-tokens | 512 | Maximum decoder steps (~20 seconds) |

Recommended temperature: 0.3 for clean, stable output. Use 0.5-0.7 for more expressive/varied speech.
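How the four sampling knobs interact in a single decoding step can be sketched as below. This is an illustrative NumPy version, not the actual inference.py implementation.

```python
import numpy as np

def sample_next(logits, recent, temperature=0.3, top_k=250,
                top_p=0.95, rep_penalty=1.1, rng=None):
    """One decoding step: repetition penalty, temperature, top-k, then top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()
    # Penalize recently emitted tokens (push their logits toward -inf).
    for t in set(recent):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    logits = logits / max(temperature, 1e-6)      # temperature scaling
    if top_k < logits.size:                        # keep only the top-k logits
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # nucleus (top-p) filtering
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```

With top_k=1 this degenerates to greedy decoding; raising the temperature flattens the distribution before the top-k/top-p cuts, which is why higher values give more varied speech.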

⚠️ Important: Sentence Length

The encoder supports up to 256 characters (~18 seconds of audio). For longer texts, inference.py automatically splits by sentence boundaries and concatenates the audio. No manual splitting needed.
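The splitting idea is simple greedy sentence packing. A minimal sketch, assuming sentences end in ./!/? followed by whitespace; the actual splitter in inference.py may differ.

```python
import re

def split_for_tts(text: str, max_chars: int = 256) -> list[str]:
    """Split text at sentence boundaries, then pack sentences into
    chunks of at most max_chars (a single over-long sentence would
    still exceed the limit and is passed through as-is)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        candidate = f"{current} {s}".strip() if current else s
        if len(candidate) > max_chars and current:
            chunks.append(current)   # flush the full chunk
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```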

Files

checkpoint_inference.pt   # Model weights only (146 MB)
checkpoint.pt             # Full checkpoint with optimizer state (438 MB, for continued training)
config.py                 # Model configuration
model.py                  # Architecture (TTSEncoderDecoder + speaker normalization)
tokenizer.py              # Character-level tokenizer
codec.py                  # MioCodec wrapper
inference.py              # Inference pipeline with KV-cache + sentence splitting
train.py                  # Training script (BF16)
server.py                 # Gradio web UI
samples/                  # Audio samples (3 voices × 2 languages × 3 texts)

Performance

Benchmarks

| Hardware | RTF | Speed | Notes |
|---|---|---|---|
| Intel i3-9100F (CPU) | 0.30 | 3.3× real-time | Windows 10, CPU-only, no GPU |
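RTF (real-time factor) as used here is compute time divided by the duration of the audio produced:

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing / duration of audio.
    RTF < 1 means faster than real time."""
    return synthesis_seconds / audio_seconds

# e.g. producing 10 s of audio in 3 s of compute:
value = rtf(3.0, 10.0)   # 0.30
speedup = 1.0 / value    # ~3.3x real time
```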

CPU-only Deployment (Tested on Windows 10)

| Component | Disk Space |
|---|---|
| Python venv (PyTorch CPU + deps) | 654 MB |
| BgTTS-38M-V2 (checkpoint + code) | 146 MB |
| MioCodec (auto-downloaded, cached) | 499 MB |
| WavLM base+ (auto-downloaded, cached) | 872 MB |
| Total | 2.12 GB |

No NVIDIA GPU, no CUDA, no special drivers needed. Works on any x86-64 machine with Python 3.8+.

Comparison with Other Models

| Model | Parameters | Size | Languages | Voice Cloning | Open Source |
|---|---|---|---|---|---|
| BgTTS-38M V2 | 38M | 146 MB | BG + EN | ✅ | ✅ |
| Kokoro-82M | 82M | ~200 MB | Multi | ❌ | ✅ |
| XTTS-v2 | ~467M | ~1.8 GB | 16 | ✅ | ✅ |
| CSM-1B | 1B | ~4 GB | EN | ✅ | ✅ |
| Dia-1.6B | 1.6B | ~6.4 GB | EN | ✅ | ✅ |

BgTTS-38M V2 is the smallest voice-cloning TTS model we are aware of, and the only open-source TTS model with native Bulgarian language support.

Limitations

  • Best with sentences up to ~18 seconds. Longer texts are auto-split by inference.py.
  • Bulgarian quality is superior to English (82% of training data is Bulgarian).
  • Voice cloning quality depends on reference audio clarity — use clean recordings without background noise.
  • No explicit prosody control (pitch, speed) — these are implicitly learned from data.
  • Character-level tokenizer may struggle with rare Unicode characters outside the supported set.

License

Apache 2.0

Citation

@misc{bgtts38mv2,
  title={BgTTS-38M V2: Bulgarian Text-to-Speech with Voice Cloning and Speaker Normalization},
  author={beleata74},
  year={2026},
  url={https://huggingface.co/beleata74/BgTTS-38M-V2}
}