BgTTS-38M V2 — Bulgarian Text-to-Speech with Voice Cloning

A lightweight 38M parameter encoder-decoder TTS model for Bulgarian and English speech synthesis with zero-shot voice cloning via MioCodec.

V2 improvements over V1:

  • Speaker normalization — stable voice quality across all reference audio files
  • Larger training dataset — 1,537 hours (vs 1,172h in V1)
  • BF16 training — more stable gradients, no GradScaler needed
  • Zero dropout — better utilization of model capacity
  • 20 epochs with careful LR scheduling

Audio Samples

Samples for three voices in Bulgarian and English are included in the samples/ directory:

  • Female Voice (Bulgarian, English)
  • Male Voice 1 (Bulgarian, English)
  • Male Voice 2 (Bulgarian, English)

Key Features

  • Bilingual: Native Bulgarian + English in a single model
  • Voice cloning: Zero-shot — just provide 3-10 seconds of reference audio
  • Tiny footprint: 146 MB inference checkpoint, runs on CPU
  • Fast: RTF ~0.3 on both GPU and CPU (3.3× faster than real-time)
  • Speaker-stable: V2's normalized speaker embedding ensures consistent quality regardless of reference audio

🎙️ Voice Cloning

This model supports zero-shot voice cloning — it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.

How it Works

  1. Provide a reference audio (3-10 seconds of clear speech, WAV format, ideally 24kHz)
  2. MioCodec extracts a 128-dimensional speaker embedding (global_embedding)
  3. The embedding is L2-normalized and scaled by a learned parameter (spk_scale) before being added to the decoder
  4. The same embedding is used for MioCodec waveform reconstruction

V2 Improvement: Speaker Normalization

In V1, the speaker embedding had a norm roughly 7× larger than that of the content tokens, causing the model to over-rely on the reference audio for pronunciation quality. V2 normalizes the speaker vector to unit norm, ensuring:

  • Consistent quality across all reference voices
  • The model learns speech patterns from data, not from speaker shortcuts
  • Reference audio only affects timbre, not articulation
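The normalization step can be sketched as follows. This is an illustrative NumPy version, not the actual model.py code: the function name is hypothetical, and the Linear(128 → 384) projection is assumed to have already happened upstream.

```python
import numpy as np

def inject_speaker(decoder_states, spk_emb, spk_scale=1.0):
    """Add an L2-normalized speaker vector to every decoder position.

    Normalizing to unit norm keeps the speaker signal from dominating
    the content tokens (the V1 failure mode described above); spk_scale
    stands in for the learned scale parameter.
    """
    spk = spk_emb / (np.linalg.norm(spk_emb) + 1e-8)  # unit norm
    return decoder_states + spk_scale * spk           # broadcast over time

# Whatever the magnitude of the reference embedding, the injected vector
# always contributes a norm of exactly spk_scale:
states = np.zeros((10, 384))
loud_emb = np.full(384, 3.0)                          # deliberately large norm
out = inject_speaker(states, loud_emb, spk_scale=2.0)
```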

Model Architecture

| Component | Details |
|---|---|
| Text Encoder | 4-layer bidirectional Transformer (d=384, 6 heads, ff=1536) |
| Audio Decoder | 8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention |
| Speaker Injection | L2-normalized Linear(128 → 384) with learned scale, additive bias |
| Audio Codec | MioCodec 25Hz, 1 codebook, 12800 codes, 24kHz output |
| Total Parameters | 38.2M (Encoder: 9.6M, Decoder: 28.6M) |
| Activations | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Positional Encoding | Learned (encoder), RoPE (decoder) |
| Embeddings | Tied decoder (lm_head = token_embedding) |
| KV-Cache | Yes (for fast autoregressive inference) |
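The table maps onto a small configuration object. The values below come from the table; the class and field names are illustrative and need not match config.py.

```python
from dataclasses import dataclass

@dataclass
class BgTTSConfig:
    """Hyperparameters from the architecture table (field names are illustrative)."""
    d_model: int = 384        # shared encoder/decoder width
    n_heads: int = 6
    ffn_dim: int = 1536
    encoder_layers: int = 4   # bidirectional text encoder
    decoder_layers: int = 8   # causal audio decoder with cross-attention
    speaker_dim: int = 128    # MioCodec global_embedding size
    audio_codes: int = 12_800 # single MioCodec codebook
    frame_rate_hz: int = 25   # audio tokens per second

cfg = BgTTSConfig()
```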

Tokenizer

Character-level tokenizer supporting 146 characters:

  • Bulgarian Cyrillic (А-Я, а-я)
  • English Latin (A-Z, a-z)
  • Digits, punctuation, whitespace

Total vocabulary: 12,955 tokens (9 special + 146 text + 12,800 audio codes)
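The vocabulary arithmetic works out as sketched below. The block ordering (special, then text, then audio) is an assumption for illustration; tokenizer.py may lay the blocks out differently.

```python
# Vocabulary layout implied by the counts above.
N_SPECIAL = 9
N_TEXT = 146
N_AUDIO = 12_800

VOCAB_SIZE = N_SPECIAL + N_TEXT + N_AUDIO  # 12,955 tokens total

def audio_code_to_token_id(code: int) -> int:
    """Map a MioCodec code (0..12799) into the shared vocab,
    assuming audio codes occupy the final block."""
    assert 0 <= code < N_AUDIO
    return N_SPECIAL + N_TEXT + code
```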

Training

| Parameter | Value |
|---|---|
| Data | 728K samples, 1,537 hours total |
| – Bulgarian | 620K samples (1,368 hours) |
| – English | 108K samples (169 hours) |
| Epochs | 20 |
| LR Schedule | Cosine decay, peak 7e-5, warmup 2 epochs, min 5e-6 |
| Batch Size | 64 |
| Optimizer | AdamW (betas=0.9, 0.999), weight decay 0.01 |
| Precision | BF16 (no GradScaler) |
| Dropout | 0.0 (unnecessary: model is 38M, data is 1,537h) |
| Final Loss | 5.04 |
| Hardware | NVIDIA RTX 5090 (32GB VRAM) |

Why Zero Dropout?

With only 38M parameters and ~138M audio tokens (1,537 hours), the model has roughly 0.28 parameters per token. At that ratio the model sits firmly in the underfitting regime, so overfitting is not a practical concern, and dropout would only slow convergence without providing any regularization benefit.
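The parameters-per-token figure follows directly from the frame rate and dataset size:

```python
# Arithmetic behind the ~0.28 parameters-per-token claim.
hours = 1537
tokens_per_second = 25                            # MioCodec frame rate
audio_tokens = hours * 3600 * tokens_per_second   # 138,330,000 tokens
params = 38.2e6

ratio = params / audio_tokens                     # ~0.28 params per token
```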

Quick Start

Requirements

pip install torch torchaudio soundfile miocodec

Python API

import torch
from model import load_for_inference
from tokenizer import TTSTokenizer
from codec import CodecV6
from inference import generate

device = "cuda"  # or "cpu"

# Load model
model = load_for_inference("checkpoint_inference.pt", device=device)
tokenizer = TTSTokenizer()
codec = CodecV6(device=device)

# Get speaker embedding from reference audio
ref = codec.encode("reference_speaker.wav")
speaker_emb = ref["global_embedding"].to(device)

# Generate
codes = generate(
    model, tokenizer,
    text="Здравейте, как сте днес?",
    speaker_emb=speaker_emb,
    temperature=0.3,
    top_k=250,
    max_new_tokens=512,
    device=device,
)

# Decode to audio
if codes is not None:
    wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")

CLI

python inference.py \
  --checkpoint checkpoint_inference.pt \
  --text "Здравейте, как сте днес?" \
  --speaker-wav reference.wav \
  --output output.wav \
  --temperature 0.3

Web UI (Gradio)

python server.py
# Opens at http://localhost:7860

Parameters

| Parameter | Default | Description |
|---|---|---|
| --temperature | 0.3 | Sampling temperature (lower = stable, higher = expressive) |
| --top-k | 250 | Top-k filtering |
| --top-p | 0.95 | Nucleus sampling threshold |
| --rep-penalty | 1.1 | Repetition penalty on recent tokens |
| --max-tokens | 512 | Maximum decoder steps (~20 seconds) |

Recommended temperature: 0.3 for clean, stable output. Use 0.5-0.7 for more expressive/varied speech.
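How the four sampling knobs interact in a single decoding step can be sketched as below. This is an illustrative NumPy version, not the actual inference.py implementation.

```python
import numpy as np

def sample_next(logits, recent, temperature=0.3, top_k=250,
                top_p=0.95, rep_penalty=1.1, rng=None):
    """One decoding step: repetition penalty, temperature, top-k, then top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()
    # Penalize recently emitted tokens (push their logits toward -inf).
    for t in set(recent):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    logits = logits / max(temperature, 1e-6)      # temperature scaling
    if top_k < logits.size:                        # keep only the top-k logits
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # nucleus (top-p) filtering
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```

With top_k=1 this degenerates to greedy decoding; raising the temperature flattens the distribution before the top-k/top-p cuts, which is why higher values give more varied speech.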

⚠️ Important: Sentence Length

The encoder supports up to 256 characters (~18 seconds of audio). For longer texts, inference.py automatically splits by sentence boundaries and concatenates the audio. No manual splitting needed.
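The splitting idea is simple greedy sentence packing. A minimal sketch, assuming sentences end in ./!/? followed by whitespace; the actual splitter in inference.py may differ.

```python
import re

def split_for_tts(text: str, max_chars: int = 256) -> list[str]:
    """Split text at sentence boundaries, then pack sentences into
    chunks of at most max_chars (a single over-long sentence would
    still exceed the limit and is passed through as-is)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        candidate = f"{current} {s}".strip() if current else s
        if len(candidate) > max_chars and current:
            chunks.append(current)   # flush the full chunk
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```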

Files

checkpoint_inference.pt   # Model weights only (146 MB)
checkpoint.pt             # Full checkpoint with optimizer state (438 MB, for continued training)
config.py                 # Model configuration
model.py                  # Architecture (TTSEncoderDecoder + speaker normalization)
tokenizer.py              # Character-level tokenizer
codec.py                  # MioCodec wrapper
inference.py              # Inference pipeline with KV-cache + sentence splitting
train.py                  # Training script (BF16)
server.py                 # Gradio web UI
samples/                  # Audio samples (3 voices × 2 languages × 3 texts)

Performance

Benchmarks

| Hardware | RTF | Speed | Notes |
|---|---|---|---|
| Intel i3-9100F (CPU) | 0.30 | 3.3× real-time | Windows 10, CPU-only, no GPU |
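RTF (real-time factor) as used here is compute time divided by the duration of the audio produced:

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing / duration of audio.
    RTF < 1 means faster than real time."""
    return synthesis_seconds / audio_seconds

# e.g. producing 10 s of audio in 3 s of compute:
value = rtf(3.0, 10.0)   # 0.30
speedup = 1.0 / value    # ~3.3x real time
```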

CPU-only Deployment (Tested on Windows 10)

| Component | Disk Space |
|---|---|
| Python venv (PyTorch CPU + deps) | 654 MB |
| BgTTS-38M-V2 (checkpoint + code) | 146 MB |
| MioCodec (auto-downloaded, cached) | 499 MB |
| WavLM base+ (auto-downloaded, cached) | 872 MB |
| Total | 2.12 GB |

No NVIDIA GPU, no CUDA, no special drivers needed. Works on any x86-64 machine with Python 3.8+.

Comparison with Other Models

| Model | Parameters | Size | Languages | Voice Cloning | Open Source |
|---|---|---|---|---|---|
| BgTTS-38M V2 | 38M | 146 MB | BG + EN | ✅ | ✅ |
| Kokoro-82M | 82M | ~200 MB | Multi | ❌ | ✅ |
| XTTS-v2 | ~467M | ~1.8 GB | 16 | ✅ | ✅ |
| CSM-1B | 1B | ~4 GB | EN | ✅ | ✅ |
| Dia-1.6B | 1.6B | ~6.4 GB | EN | ✅ | ✅ |

BgTTS-38M V2 is the smallest voice-cloning TTS model we are aware of, and the only open-source TTS model with native Bulgarian language support.

Limitations

  • Best with sentences up to ~18 seconds. Longer texts are auto-split by inference.py.
  • Bulgarian quality is superior to English (82% of training data is Bulgarian).
  • Voice cloning quality depends on reference audio clarity — use clean recordings without background noise.
  • No explicit prosody control (pitch, speed) — these are implicitly learned from data.
  • Character-level tokenizer may struggle with rare Unicode characters outside the supported set.

License

Apache 2.0

Citation

@misc{bgtts38mv2,
  title={BgTTS-38M V2: Bulgarian Text-to-Speech with Voice Cloning and Speaker Normalization},
  author={beleata74},
  year={2026},
  url={https://huggingface.co/beleata74/BgTTS-38M-V2}
}