# BgTTS-38M — Bulgarian Text-to-Speech with Voice Cloning
A lightweight 38M parameter encoder-decoder TTS model for Bulgarian and English speech synthesis with zero-shot voice cloning via MioCodec.
## Audio Samples

- "Това е тест на българския синтез на реч."
- "This is a test of the English speech synthesis."
- "Този model е trained на български and English data." (code-switched sample)
## 🎙️ Voice Cloning
This model supports zero-shot voice cloning — it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.
### How it Works

1. Record or provide a reference audio clip (3-10 seconds of clear speech, WAV format, ideally 24 kHz)
2. MioCodec extracts a 128-dimensional speaker embedding (`global_embedding`) from the reference
3. The model uses this embedding as an additive bias in the decoder — every generated token is influenced by the speaker's voice characteristics
4. The same embedding is used during MioCodec decoding to reconstruct the final waveform
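The additive speaker bias can be sketched in PyTorch. This is a hypothetical minimal module, not the actual `model.py` code, assuming only the dimensions stated in this card (a `Linear(128 -> 384)` projection added to every decoder position):

```python
import torch
import torch.nn as nn

class SpeakerBias(nn.Module):
    # Hypothetical sketch: project the MioCodec global_embedding (128-dim)
    # into the decoder width (384) and add it to every hidden state.
    def __init__(self, emb_dim=128, d_model=384):
        super().__init__()
        self.proj = nn.Linear(emb_dim, d_model)

    def forward(self, hidden, speaker_emb):
        # hidden: (batch, seq, d_model); speaker_emb: (batch, emb_dim)
        bias = self.proj(speaker_emb)      # (batch, d_model)
        return hidden + bias.unsqueeze(1)  # broadcast over all time steps

h = torch.randn(2, 10, 384)
emb = torch.randn(2, 128)
out = SpeakerBias()(h, emb)
print(out.shape)  # torch.Size([2, 10, 384])
```

Because the bias is additive and position-independent, the voice identity influences every generated token without any per-speaker fine-tuning.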
### Tips for Best Voice Cloning
- Use clean audio without background music or noise
- 3-10 seconds is enough — longer isn't necessarily better
- The reference audio doesn't need to be in the same language as the generated text
- You can save and reuse speaker embeddings to avoid re-encoding:
```python
import torch
from codec import CodecV6

codec = CodecV6(device="cuda")

# Extract and save the speaker embedding once
ref = codec.encode("my_voice.wav")
torch.save(ref["global_embedding"], "my_voice_emb.pt")

# Later, load and reuse it
speaker_emb = torch.load("my_voice_emb.pt")
```
### CLI Voice Cloning

```bash
# Clone from a reference WAV file
python inference.py \
    --checkpoint . \
    --text "Здравейте, аз съм клониран глас." \
    --speaker-wav my_voice.wav \
    --output cloned_output.wav

# Or use a pre-saved embedding
python inference.py \
    --checkpoint . \
    --text "Здравейте, аз съм клониран глас." \
    --speaker-emb my_voice_emb.pt \
    --output cloned_output.wav
```
## ⚠️ Important: Sentence Length
This model works best with sentences of up to 8 seconds of audio (200 tokens at 25 Hz). For longer texts, split them into shorter chunks (1-2 sentences at a time) and concatenate the audio. The maximum supported length is ~19 seconds (475 tokens), but quality degrades noticeably beyond 8 seconds.
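One way to do that splitting is a small helper like the one below. This is a hypothetical sketch (not part of this repo): it splits on sentence-ending punctuation and packs 1-2 sentences per chunk under a character budget that roughly corresponds to the ~8-second sweet spot.

```python
import re

def split_sentences(text, max_chars=120):
    # Split on sentence-ending punctuation followed by whitespace,
    # then greedily pack sentences into chunks under max_chars.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks

text = "Първо изречение. Второ изречение! Трето, по-дълго изречение?"
chunks = split_sentences(text, max_chars=40)
print(len(chunks))  # 2 — the first two short sentences share a chunk
```

Each chunk can then be passed to `generate()` separately and the resulting waveforms concatenated before saving.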
## Model Architecture
| Component | Details |
|---|---|
| Text Encoder | 4-layer bidirectional Transformer (d=384, 6 heads, ff=1536) |
| Audio Decoder | 8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention |
| Speaker Injection | Linear(128 → 384), additive bias from MioCodec global_embedding |
| Audio Codec | MioCodec 25Hz, 1 codebook, 12800 codes, 24kHz output (~350 bps) |
| Total Parameters | 38.2M (Encoder: 9.6M, Decoder: 28.6M) |
| Activations | SwiGLU |
| Normalization | RMSNorm |
| Positional Encoding | Learned (encoder), RoPE (decoder) |
| Embeddings | Tied decoder (lm_head = token_embedding) |
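For readers unfamiliar with SwiGLU, the feed-forward block in the table can be sketched as below. This is a generic SwiGLU illustration using the table's dimensions (d=384, ff=1536), not the actual `model.py` implementation, which may differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # Generic SwiGLU feed-forward: silu(gate(x)) * up(x), projected back down.
    def __init__(self, d_model=384, d_ff=1536):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 5, 384)
y = SwiGLUFeedForward()(x)
print(y.shape)  # torch.Size([2, 5, 384])
```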
## Tokenizer
Character-level tokenizer supporting 146 characters:
- Bulgarian Cyrillic (А-Я, а-я)
- English Latin (A-Z, a-z)
- Digits, punctuation, whitespace
Total vocabulary: 12,955 tokens (9 special + 146 text + 12,800 audio codes)
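The vocabulary arithmetic above implies a layout like the following. The offset ordering is an assumption for illustration; the real `TTSTokenizer` may arrange its id ranges differently:

```python
# Vocabulary blocks as stated in this card
N_SPECIAL = 9        # special tokens (BOS/EOS/PAD, etc.)
N_TEXT = 146         # supported characters
N_AUDIO = 12_800     # MioCodec codebook entries

# Assumed id ranges: specials first, then text characters, then audio codes
text_offset = N_SPECIAL            # first text-character id
audio_offset = N_SPECIAL + N_TEXT  # first audio-code id
vocab_size = N_SPECIAL + N_TEXT + N_AUDIO

print(audio_offset, vocab_size)  # 155 12955
```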
## Training
- Data: 830K samples, 1,172 hours total
  - Bulgarian: 292K samples (~661 hours)
  - English: 538K samples (~511 hours)
- Schedule: 30 epochs total
  - 5 epochs cosine decay from scratch
  - 5 epochs warm restarts (4 cycles, decay=0.7)
  - 20 epochs two-phase LR (25% fast linear drop → 75% slow cosine in the productive zone)
- Best val_loss: 5.2759 at step 63,500
- Hardware: NVIDIA RTX 5090 (32GB), ~600 samples/sec
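The two-phase schedule can be sketched as a step-to-learning-rate function. The shape (25% linear drop, then 75% cosine) follows the description above, but the concrete learning-rate values here are made-up placeholders, not the ones used in training:

```python
import math

def two_phase_lr(step, total_steps, lr_max=3e-4, lr_mid=1e-4, lr_min=1e-5):
    # Phase 1 (first 25% of steps): fast linear drop lr_max -> lr_mid.
    # Phase 2 (remaining 75%): slow cosine decay lr_mid -> lr_min.
    # lr_max/lr_mid/lr_min are illustrative assumptions.
    split = int(0.25 * total_steps)
    if step < split:
        t = step / split
        return lr_max + t * (lr_mid - lr_max)
    t = (step - split) / (total_steps - split)
    return lr_min + 0.5 * (lr_mid - lr_min) * (1 + math.cos(math.pi * t))
```

The intent of such a schedule is to exit the high-LR region quickly, then spend most of the run decaying slowly through the range where the loss still improves.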
## Quick Start

### Requirements

```bash
pip install torch torchaudio soundfile miocodec
```
### Inference

```python
import torch
from model import load_for_inference
from tokenizer import TTSTokenizer
from codec import CodecV6
from inference import generate

device = "cuda"

# Load model, tokenizer, and codec
model = load_for_inference(".", device=device)
tokenizer = TTSTokenizer()
codec = CodecV6(device=device)

# Get speaker embedding from reference audio
ref = codec.encode("reference_speaker.wav")
speaker_emb = ref["global_embedding"].to(device)

# Generate audio codes
codes = generate(
    model, tokenizer,
    text="Здравейте, как сте днес?",
    speaker_emb=speaker_emb,
    temperature=0.7,
    top_k=250,
    max_new_tokens=512,
    device=device,
)

# Decode to audio
if codes is not None:
    wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")
```
### CLI

```bash
python inference.py \
    --checkpoint . \
    --text "Здравейте, как сте днес?" \
    --speaker-wav reference.wav \
    --output output.wav \
    --temperature 0.7
```
### Parameters

| Parameter | Default | Description |
|---|---|---|
| `--temperature` | 0.7 | Sampling temperature (lower = more stable, higher = more expressive) |
| `--top-k` | 250 | Top-k filtering |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--rep-penalty` | 1.1 | Repetition penalty on recent tokens |
| `--max-tokens` | 512 | Maximum decoder steps |
Recommended temperature: 0.5 for stable output, 0.7-0.8 for more natural/expressive speech.
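The interaction of temperature, top-k, and top-p can be sketched as a single sampling step. This is a generic illustration of those three filters, not the actual `inference.py` code, and it omits the repetition penalty:

```python
import torch

def sample_next(logits, temperature=0.7, top_k=250, top_p=0.95):
    # Temperature scaling: flattens (>1) or sharpens (<1) the distribution.
    logits = logits / temperature
    # Top-k: keep only the k largest logits.
    k = min(top_k, logits.size(-1))
    kth = torch.topk(logits, k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p (nucleus): drop the low-probability tail beyond cumulative mass p.
    sorted_logits, idx = torch.sort(logits, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = torch.cumsum(probs, dim=-1)
    sorted_logits[cum - probs > top_p] = float("-inf")
    logits = torch.full_like(logits, float("-inf")).scatter(-1, idx, sorted_logits)
    return torch.multinomial(torch.softmax(logits, dim=-1), 1)

logits = torch.tensor([[10.0, 0.0, 0.0, 0.0]])
token = sample_next(logits)
print(token.item())  # 0 — the dominant logit always survives both filters
```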
## Files

```
checkpoint.pt   # Model weights (153MB, inference-only — no optimizer state)
config.py       # All model configuration constants
model.py        # Model architecture (TTSEncoderDecoder)
tokenizer.py    # Character-level tokenizer
codec.py        # MioCodec wrapper (encode/decode)
inference.py    # Inference pipeline with KV-cache
samples/        # Audio samples
```
## Limitations
- Best with short sentences (up to ~8 seconds / 200 tokens). Split longer texts.
- Trained primarily on Bulgarian data — Bulgarian quality is better than English.
- Zero-shot voice cloning quality depends on the reference audio clarity.
- No prosody control (pitch, speed, emotion) — these are implicitly learned.
- Character-level tokenizer may struggle with rare Unicode characters outside the supported set.
## License
Apache 2.0
## Citation

If you use this model, please cite:

```bibtex
@misc{bgtts38m,
  title={BgTTS-38M: Bulgarian Text-to-Speech with MioCodec},
  author={beleata74},
  year={2026},
  url={https://huggingface.co/beleata74/BgTTS-38M}
}
```