# BgTTS-38M — Bulgarian Text-to-Speech with Voice Cloning
A lightweight 38M parameter encoder-decoder TTS model for Bulgarian and English speech synthesis with zero-shot voice cloning via MioCodec.
## Audio Samples

- "Това е тест на българския синтез на реч."
- "This is a test of the English speech synthesis."
- "Този model е trained на български and English data." (code-switched sample)
## 🎙️ Voice Cloning
This model supports zero-shot voice cloning — it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.
### How it Works

1. Record or provide a reference audio clip (3-10 seconds of clear speech, WAV format, ideally 24 kHz)
2. MioCodec extracts a 128-dimensional speaker embedding (`global_embedding`) from the reference
3. The model uses this embedding as an additive bias in the decoder — every generated token is influenced by the speaker's voice characteristics
4. The same embedding is used during MioCodec decoding to reconstruct the final waveform
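The additive speaker bias can be sketched in PyTorch. This is a hypothetical minimal module, not the actual `model.py` code, assuming only the dimensions stated in this card (a `Linear(128 -> 384)` projection added to every decoder position):

```python
import torch
import torch.nn as nn

class SpeakerBias(nn.Module):
    # Hypothetical sketch: project the MioCodec global_embedding (128-dim)
    # into the decoder width (384) and add it to every hidden state.
    def __init__(self, emb_dim=128, d_model=384):
        super().__init__()
        self.proj = nn.Linear(emb_dim, d_model)

    def forward(self, hidden, speaker_emb):
        # hidden: (batch, seq, d_model); speaker_emb: (batch, emb_dim)
        bias = self.proj(speaker_emb)      # (batch, d_model)
        return hidden + bias.unsqueeze(1)  # broadcast over all time steps

h = torch.randn(2, 10, 384)
emb = torch.randn(2, 128)
out = SpeakerBias()(h, emb)
print(out.shape)  # torch.Size([2, 10, 384])
```

Because the bias is additive and position-independent, the voice identity influences every generated token without any per-speaker fine-tuning.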
### Tips for Best Voice Cloning
- Use clean audio without background music or noise
- 3-10 seconds is enough — longer isn't necessarily better
- The reference audio doesn't need to be in the same language as the generated text
- You can save and reuse speaker embeddings to avoid re-encoding:
```python
import torch
from codec import CodecV6

codec = CodecV6(device="cuda")

# Extract and save the speaker embedding once
ref = codec.encode("my_voice.wav")
torch.save(ref["global_embedding"], "my_voice_emb.pt")

# Later, load and reuse it
speaker_emb = torch.load("my_voice_emb.pt")
```
### CLI Voice Cloning

```bash
# Clone from a reference WAV file
python inference.py \
    --checkpoint . \
    --text "Здравейте, аз съм клониран глас." \
    --speaker-wav my_voice.wav \
    --output cloned_output.wav

# Or use a pre-saved embedding
python inference.py \
    --checkpoint . \
    --text "Здравейте, аз съм клониран глас." \
    --speaker-emb my_voice_emb.pt \
    --output cloned_output.wav
```
## ⚠️ Important: Sentence Length
This model works best with sentences of up to 8 seconds of audio (200 tokens at 25 Hz). For longer texts, split them into shorter chunks (1-2 sentences at a time) and concatenate the audio. The maximum supported length is ~19 seconds (475 tokens), but quality degrades noticeably beyond 8 seconds.
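One way to do that splitting is a small helper like the one below. This is a hypothetical sketch (not part of this repo): it splits on sentence-ending punctuation and packs 1-2 sentences per chunk under a character budget that roughly corresponds to the ~8-second sweet spot.

```python
import re

def split_sentences(text, max_chars=120):
    # Split on sentence-ending punctuation followed by whitespace,
    # then greedily pack sentences into chunks under max_chars.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks

text = "Първо изречение. Второ изречение! Трето, по-дълго изречение?"
chunks = split_sentences(text, max_chars=40)
print(len(chunks))  # 2 — the first two short sentences share a chunk
```

Each chunk can then be passed to `generate()` separately and the resulting waveforms concatenated before saving.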
## Model Architecture
| Component | Details |
|---|---|
| Text Encoder | 4-layer bidirectional Transformer (d=384, 6 heads, ff=1536) |
| Audio Decoder | 8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention |
| Speaker Injection | Linear(128 → 384), additive bias from MioCodec global_embedding |
| Audio Codec | MioCodec 25Hz, 1 codebook, 12800 codes, 24kHz output (~350 bps) |
| Total Parameters | 38.2M (Encoder: 9.6M, Decoder: 28.6M) |
| Activations | SwiGLU |
| Normalization | RMSNorm |
| Positional Encoding | Learned (encoder), RoPE (decoder) |
| Embeddings | Tied decoder (lm_head = token_embedding) |
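For readers unfamiliar with SwiGLU, the feed-forward block in the table can be sketched as below. This is a generic SwiGLU illustration using the table's dimensions (d=384, ff=1536), not the actual `model.py` implementation, which may differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # Generic SwiGLU feed-forward: silu(gate(x)) * up(x), projected back down.
    def __init__(self, d_model=384, d_ff=1536):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 5, 384)
y = SwiGLUFeedForward()(x)
print(y.shape)  # torch.Size([2, 5, 384])
```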
## Tokenizer
Character-level tokenizer supporting 146 characters:
- Bulgarian Cyrillic (А-Я, а-я)
- English Latin (A-Z, a-z)
- Digits, punctuation, whitespace
Total vocabulary: 12,955 tokens (9 special + 146 text + 12,800 audio codes)
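The vocabulary arithmetic above implies a layout like the following. The offset ordering is an assumption for illustration; the real `TTSTokenizer` may arrange its id ranges differently:

```python
# Vocabulary blocks as stated in this card
N_SPECIAL = 9        # special tokens (BOS/EOS/PAD, etc.)
N_TEXT = 146         # supported characters
N_AUDIO = 12_800     # MioCodec codebook entries

# Assumed id ranges: specials first, then text characters, then audio codes
text_offset = N_SPECIAL            # first text-character id
audio_offset = N_SPECIAL + N_TEXT  # first audio-code id
vocab_size = N_SPECIAL + N_TEXT + N_AUDIO

print(audio_offset, vocab_size)  # 155 12955
```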
## Training
- Data: 830K samples, 1,172 hours total
  - Bulgarian: 292K samples (~661 hours)
  - English: 538K samples (~511 hours)
- Schedule: 30 epochs total
  - 5 epochs cosine decay from scratch
  - 5 epochs warm restarts (4 cycles, decay=0.7)
  - 20 epochs two-phase LR (25% fast linear drop → 75% slow cosine in the productive zone)
- Best val_loss: 5.2759 at step 63,500
- Hardware: NVIDIA RTX 5090 (32GB), ~600 samples/sec
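The two-phase schedule can be sketched as a step-to-learning-rate function. The shape (25% linear drop, then 75% cosine) follows the description above, but the concrete learning-rate values here are made-up placeholders, not the ones used in training:

```python
import math

def two_phase_lr(step, total_steps, lr_max=3e-4, lr_mid=1e-4, lr_min=1e-5):
    # Phase 1 (first 25% of steps): fast linear drop lr_max -> lr_mid.
    # Phase 2 (remaining 75%): slow cosine decay lr_mid -> lr_min.
    # lr_max/lr_mid/lr_min are illustrative assumptions.
    split = int(0.25 * total_steps)
    if step < split:
        t = step / split
        return lr_max + t * (lr_mid - lr_max)
    t = (step - split) / (total_steps - split)
    return lr_min + 0.5 * (lr_mid - lr_min) * (1 + math.cos(math.pi * t))
```

The intent of such a schedule is to exit the high-LR region quickly, then spend most of the run decaying slowly through the range where the loss still improves.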
## Quick Start

### Requirements

```bash
pip install torch torchaudio soundfile miocodec
```
### Inference

```python
import torch
from model import load_for_inference
from tokenizer import TTSTokenizer
from codec import CodecV6
from inference import generate

device = "cuda"

# Load model, tokenizer, and codec
model = load_for_inference(".", device=device)
tokenizer = TTSTokenizer()
codec = CodecV6(device=device)

# Get speaker embedding from reference audio
ref = codec.encode("reference_speaker.wav")
speaker_emb = ref["global_embedding"].to(device)

# Generate audio codes
codes = generate(
    model, tokenizer,
    text="Здравейте, как сте днес?",
    speaker_emb=speaker_emb,
    temperature=0.7,
    top_k=250,
    max_new_tokens=512,
    device=device,
)

# Decode to audio
if codes is not None:
    wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")
```
### CLI

```bash
python inference.py \
    --checkpoint . \
    --text "Здравейте, как сте днес?" \
    --speaker-wav reference.wav \
    --output output.wav \
    --temperature 0.7
```
### Parameters

| Parameter | Default | Description |
|---|---|---|
| `--temperature` | 0.7 | Sampling temperature (lower = more stable, higher = more expressive) |
| `--top-k` | 250 | Top-k filtering |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--rep-penalty` | 1.1 | Repetition penalty on recent tokens |
| `--max-tokens` | 512 | Maximum decoder steps |
Recommended temperature: 0.5 for stable output, 0.7-0.8 for more natural/expressive speech.
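The interaction of temperature, top-k, and top-p can be sketched as a single sampling step. This is a generic illustration of those three filters, not the actual `inference.py` code, and it omits the repetition penalty:

```python
import torch

def sample_next(logits, temperature=0.7, top_k=250, top_p=0.95):
    # Temperature scaling: flattens (>1) or sharpens (<1) the distribution.
    logits = logits / temperature
    # Top-k: keep only the k largest logits.
    k = min(top_k, logits.size(-1))
    kth = torch.topk(logits, k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): drop the low-probability tail beyond cumulative mass p.
    sorted_logits, idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    sorted_logits[cum - probs > top_p] = float("-inf")
    logits = torch.full_like(logits, float("-inf")).scatter(-1, idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), 1)

logits = torch.tensor([[10.0, 0.0, 0.0, 0.0]])
token = sample_next(logits)
print(token.item())  # 0 — the dominant logit always survives both filters
```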
## Files

```
checkpoint.pt   # Model weights (153MB, inference-only — no optimizer state)
config.py       # All model configuration constants
model.py        # Model architecture (TTSEncoderDecoder)
tokenizer.py    # Character-level tokenizer
codec.py        # MioCodec wrapper (encode/decode)
inference.py    # Inference pipeline with KV-cache
samples/        # Audio samples
```
## Limitations
- Best with short sentences (up to ~8 seconds / 200 tokens). Split longer texts.
- Trained primarily on Bulgarian data — Bulgarian quality is better than English.
- Zero-shot voice cloning quality depends on the reference audio clarity.
- No prosody control (pitch, speed, emotion) — these are implicitly learned.
- Character-level tokenizer may struggle with rare Unicode characters outside the supported set.
## License
Apache 2.0
## Citation

If you use this model, please cite:

```bibtex
@misc{bgtts38m,
  title={BgTTS-38M: Bulgarian Text-to-Speech with MioCodec},
  author={beleata74},
  year={2026},
  url={https://huggingface.co/beleata74/BgTTS-38M}
}
```