BgTTS-38M — Bulgarian Text-to-Speech with Voice Cloning

A lightweight 38M parameter encoder-decoder TTS model for Bulgarian and English speech synthesis with zero-shot voice cloning via MioCodec.

Audio Samples

"Това е тест на българския синтез на реч." ("This is a test of the Bulgarian speech synthesis.")

"This is a test of the English speech synthesis."

"Този model е trained на български and English data." (code-switched: "This model is trained on Bulgarian and English data.")

🎙️ Voice Cloning

This model supports zero-shot voice cloning — it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.

How it Works

  1. Record or provide a reference audio clip (3-10 seconds of clear speech, WAV format, ideally 24kHz)
  2. MioCodec extracts a 128-dimensional speaker embedding (global_embedding) from the reference
  3. The model uses this embedding as an additive bias in the decoder — every generated token is influenced by the speaker's voice characteristics
  4. The same embedding is used for MioCodec decoding to reconstruct the final waveform
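Step 3 above, the additive speaker bias, can be sketched in a few lines of NumPy. This is illustrative only: the weights below are random placeholders, while the real Linear(128 → 384) projection lives in the model checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 384)) * 0.02   # stand-in for the Linear(128 -> 384) projection
speaker_emb = rng.normal(size=(128,))    # stand-in for MioCodec's global_embedding

bias = speaker_emb @ W                   # one (384,) vector per speaker

hidden = rng.normal(size=(10, 384))      # 10 decoder positions (d=384)
hidden = hidden + bias                   # the same additive bias at every step
print(hidden.shape)                      # (10, 384)
```

Because the bias is a single vector broadcast over all positions, the speaker identity influences every generated token at essentially zero extra cost per step.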

Tips for Best Voice Cloning

  • Use clean audio without background music or noise
  • 3-10 seconds is enough — longer isn't necessarily better
  • The reference audio doesn't need to be in the same language as the generated text
  • You can save and reuse speaker embeddings to avoid re-encoding:
import torch
from codec import CodecV6

codec = CodecV6(device="cuda")

# Extract and save speaker embedding
ref = codec.encode("my_voice.wav")
torch.save(ref["global_embedding"], "my_voice_emb.pt")

# Later, load and use it
speaker_emb = torch.load("my_voice_emb.pt")

CLI Voice Cloning

# Clone from a reference WAV file
python inference.py \
  --checkpoint . \
  --text "Здравейте, аз съм клониран глас." \
  --speaker-wav my_voice.wav \
  --output cloned_output.wav

# Or use a pre-saved embedding
python inference.py \
  --checkpoint . \
  --text "Здравейте, аз съм клониран глас." \
  --speaker-emb my_voice_emb.pt \
  --output cloned_output.wav

⚠️ Important: Sentence Length

This model works best with sentences up to 8 seconds of audio (200 tokens at 25fps).

For longer texts, split them into shorter sentences (1-2 sentences at a time) and concatenate the audio. The maximum supported length is ~19 seconds (475 tokens), but quality degrades noticeably beyond 8s.
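The 8-second budget follows directly from the codec frame rate: 200 tokens / 25 tokens per second = 8 s. A minimal sketch of the splitting this implies (the regex splitter here is illustrative and not part of inference.py):

```python
import re

FPS = 25           # MioCodec produces 25 audio tokens per second
MAX_TOKENS = 200   # ~8 s, the model's comfort zone

def split_sentences(text):
    """Naive splitter on ., !, ? followed by whitespace; keeps the delimiter."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def seconds_to_tokens(seconds, fps=FPS):
    """Convert an audio duration to a decoder-token budget."""
    return int(seconds * fps)

chunks = split_sentences(
    "Това е първото изречение. Това е второто! А това третото?"
)
print(chunks)                  # three separate sentences
print(seconds_to_tokens(8))    # 200
```

Synthesize each chunk separately, then concatenate the resulting waveforms.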

Model Architecture

Component             Details
--------------------  ------------------------------------------------------------------------
Text Encoder          4-layer bidirectional Transformer (d=384, 6 heads, ff=1536)
Audio Decoder         8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention
Speaker Injection     Linear(128 → 384), additive bias from MioCodec global_embedding
Audio Codec           MioCodec 25Hz, 1 codebook, 12800 codes, 24kHz output (~350 bps)
Total Parameters      38.2M (Encoder: 9.6M, Decoder: 28.6M)
Activations           SwiGLU
Normalization         RMSNorm
Positional Encoding   Learned (encoder), RoPE (decoder)
Embeddings            Tied decoder (lm_head = token_embedding)
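The ~350 bps figure follows from the codebook size: log2(12800) ≈ 13.6 bits per token × 25 tokens/s ≈ 341 bps. The SwiGLU and RMSNorm entries can be illustrated with generic reference implementations in NumPy — a sketch of the standard formulations, not the model's exact code:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square; no mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected back down."""
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d, ff = 384, 1536   # dimensions from the table above
rng = np.random.default_rng(0)
x = rng.normal(size=(1, d))
out = swiglu(x,
             rng.normal(size=(d, ff)) * 0.02,
             rng.normal(size=(d, ff)) * 0.02,
             rng.normal(size=(ff, d)) * 0.02)
print(out.shape)   # (1, 384)
```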

Tokenizer

Character-level tokenizer supporting 146 characters:

  • Bulgarian Cyrillic (А-Я, а-я)
  • English Latin (A-Z, a-z)
  • Digits, punctuation, whitespace

Total vocabulary: 12,955 tokens (9 special + 146 text + 12,800 audio codes)
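The vocabulary arithmetic can be checked directly. The region layout below (specials first, then text characters, then audio codes) is an assumption for illustration; tokenizer.py defines the real mapping.

```python
# Assumed vocabulary layout — offsets are illustrative, not from tokenizer.py:
N_SPECIAL = 9        # pad/bos/eos and similar control tokens
N_TEXT = 146         # Cyrillic + Latin + digits, punctuation, whitespace
N_AUDIO = 12_800     # MioCodec codebook entries

vocab_size = N_SPECIAL + N_TEXT + N_AUDIO
print(vocab_size)    # 12955

def text_char_ids(text, charset):
    """Map characters into the (assumed) text region of the vocabulary."""
    lookup = {ch: N_SPECIAL + i for i, ch in enumerate(charset)}
    return [lookup[ch] for ch in text if ch in lookup]
```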

Training

  • Data: 830K samples, 1,172 hours total
    • Bulgarian: 292K samples (~661 hours)
    • English: 538K samples (~511 hours)
  • Schedule: 30 epochs total
    • 5 epochs cosine decay from scratch
    • 5 epochs warm restarts (4 cycles, decay=0.7)
    • 20 epochs two-phase LR (a fast linear drop over the first 25% of steps, then a slow cosine decay over the remaining 75%)
  • Best val_loss: 5.2759 at step 63,500
  • Hardware: NVIDIA RTX 5090 (32GB), ~600 samples/sec
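The two-phase schedule can be written as a plain function of the step count. The actual learning-rate endpoints of the run are not published here, so the lr_hi/lr_mid/lr_lo values below are placeholders:

```python
import math

def two_phase_lr(step, total, lr_hi=3e-4, lr_mid=1e-4, lr_lo=1e-5):
    """Two-phase schedule: linear drop lr_hi -> lr_mid over the first 25%
    of steps, then cosine decay lr_mid -> lr_lo over the remaining 75%.
    Rate values are placeholders, not the run's actual settings."""
    split = int(0.25 * total)
    if step < split:
        t = step / split
        return lr_hi + t * (lr_mid - lr_hi)
    t = (step - split) / (total - split)
    return lr_lo + 0.5 * (lr_mid - lr_lo) * (1.0 + math.cos(math.pi * t))

print(two_phase_lr(0, 1000))     # 0.0003 (start of the fast phase)
print(two_phase_lr(250, 1000))   # 0.0001 (handoff to the cosine phase)
```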

Quick Start

Requirements

pip install torch torchaudio soundfile miocodec

Inference

import torch
from model import load_for_inference
from tokenizer import TTSTokenizer
from codec import CodecV6
from inference import generate

device = "cuda"

# Load model
model = load_for_inference(".", device=device)
tokenizer = TTSTokenizer()
codec = CodecV6(device=device)

# Get speaker embedding from reference audio
ref = codec.encode("reference_speaker.wav")
speaker_emb = ref["global_embedding"].to(device)

# Generate
codes = generate(
    model, tokenizer,
    text="Здравейте, как сте днес?",
    speaker_emb=speaker_emb,
    temperature=0.7,
    top_k=250,
    max_new_tokens=512,
    device=device,
)

# Decode to audio
if codes is not None:
    wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")

CLI

python inference.py \
  --checkpoint . \
  --text "Здравейте, как сте днес?" \
  --speaker-wav reference.wav \
  --output output.wav \
  --temperature 0.7

Parameters

Parameter       Default  Description
--------------  -------  -----------------------------------------------------------------
--temperature   0.7      Sampling temperature (lower = more stable, higher = more expressive)
--top-k         250      Top-k filtering
--top-p         0.95     Nucleus sampling threshold
--rep-penalty   1.1      Repetition penalty on recent tokens
--max-tokens    512      Maximum decoder steps

Recommended temperature: 0.5 for stable output, 0.7-0.8 for more natural/expressive speech.
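Temperature scaling and top-k filtering can be sketched in NumPy. This is illustrative only; inference.py implements the actual sampler, which also applies top-p and the repetition penalty.

```python
import numpy as np

def sample_top_k(logits, temperature=0.7, top_k=250, rng=None):
    """Scale logits by temperature, keep only the top_k largest, sample one id."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = logits / temperature
    if top_k < logits.size:
        kth = np.sort(logits)[-top_k]          # k-th largest value
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())      # softmax over the survivors
    probs /= probs.sum()
    return rng.choice(logits.size, p=probs)

logits = np.array([2.0, 1.0, 0.5, -3.0])
token = sample_top_k(logits, top_k=2, rng=np.random.default_rng(0))
print(token)   # index 0 or 1 — everything outside the top 2 is masked out
```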

Files

checkpoint.pt     # Model weights (153MB, inference-only — no optimizer state)
config.py         # All model configuration constants
model.py          # Model architecture (TTSEncoderDecoder)
tokenizer.py      # Character-level tokenizer
codec.py          # MioCodec wrapper (encode/decode)
inference.py      # Inference pipeline with KV-cache
samples/          # Audio samples

Limitations

  • Best with short sentences (up to ~8 seconds / 200 tokens). Split longer texts.
  • Trained on more Bulgarian audio (661 of 1,172 hours) than English, so Bulgarian quality is better than English.
  • Zero-shot voice cloning quality depends on the reference audio clarity.
  • No prosody control (pitch, speed, emotion) — these are implicitly learned.
  • Character-level tokenizer may struggle with rare Unicode characters outside the supported set.

License

Apache 2.0

Citation

If you use this model, please cite:

@misc{bgtts38m,
  title={BgTTS-38M: Bulgarian Text-to-Speech with MioCodec},
  author={beleata74},
  year={2026},
  url={https://huggingface.co/beleata74/BgTTS-38M}
}