	---
	license: apache-2.0
	language:
	- bg
	- en
	pipeline_tag: text-to-speech
	tags:
	- tts
	- bulgarian
	- miocodec
	- encoder-decoder
	- voice-cloning
	- speech-synthesis
	library_name: pytorch
	---

	# BgTTS-38M V2 — Bulgarian Text-to-Speech with Voice Cloning

	A lightweight 38M parameter encoder-decoder TTS model for Bulgarian and English speech synthesis with zero-shot voice cloning via [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz).

	V2 improvements over V1:
	- Speaker normalization — stable voice quality across all reference audio files
	- Larger training dataset — 1,537 hours (vs 1,172h in V1)
	- BF16 training — more stable gradients, no GradScaler needed
	- Zero dropout — better utilization of model capacity
	- 20 epochs with careful LR scheduling

	## Audio Samples

	### Female Voice (Bulgarian)

	<audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_female_bg1.wav"></audio>

	### Female Voice (English)

	<audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_female_en1.wav"></audio>

	### Male Voice 1 (Bulgarian)

	<audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_male_bg1.wav"></audio>

	### Male Voice 1 (English)

	<audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_male_en1.wav"></audio>

	### Male Voice 2 (Bulgarian)

	<audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_male2_bg1.wav"></audio>

	### Male Voice 2 (English)

	<audio controls src="https://huggingface.co/beleata74/BgTTS-38M-V2/resolve/main/samples/sample_male2_en1.wav"></audio>

	## Key Features

	- Bilingual: Native Bulgarian + English in a single model
	- Voice cloning: Zero-shot — just provide 3-10 seconds of reference audio
	- Tiny footprint: 146 MB inference checkpoint, runs on CPU
	- Fast: RTF ~0.3 on both GPU and CPU (about 3.3× real-time speed)
	- Speaker-stable: V2's normalized speaker embedding ensures consistent quality regardless of reference audio

	## 🎙️ Voice Cloning

	This model supports zero-shot voice cloning — it can generate speech in any voice given just a short reference audio clip. No fine-tuning needed.

	### How it Works

	1. Provide a reference audio clip (3-10 seconds of clear speech, WAV format, ideally 24 kHz)
	2. MioCodec extracts a 128-dimensional speaker embedding (`global_embedding`)
	3. The embedding is L2-normalized and scaled by a learned parameter (`spk_scale`) before being added to the decoder
	4. The same embedding is used for MioCodec waveform reconstruction

	### V2 Improvement: Speaker Normalization

	In V1, the speaker embedding had 7× larger norm than content tokens, causing the model to over-rely on the reference audio for pronunciation quality. V2 normalizes the speaker vector to unit norm, ensuring:
	- Consistent quality across all reference voices
	- The model learns speech patterns from data, not from speaker shortcuts
	- Reference audio only affects timbre, not articulation
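
As a rough illustration of the normalization described above, the sketch below L2-normalizes a speaker vector and multiplies it by a scale. It uses plain Python lists rather than the model's tensors. The names `normalize_speaker` and `l2_norm`, the `eps` guard, and the example value of `spk_scale` are assumptions made for illustration and are not taken from `model.py`.

```python
import math

def normalize_speaker(embedding, spk_scale=1.0, eps=1e-8):
    """L2-normalize a speaker vector, then multiply by a scale factor.

    spk_scale stands in for the learned parameter; its value here is made up.
    """
    norm = math.sqrt(sum(x * x for x in embedding))
    return [spk_scale * x / (norm + eps) for x in embedding]

def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))

# Two hypothetical 128-d reference embeddings pointing in the same direction
# but with very different magnitudes.
quiet = [0.01 * (i % 7 - 3) for i in range(128)]
loud = [7.0 * x for x in quiet]

quiet_out = normalize_speaker(quiet, spk_scale=2.0)
loud_out = normalize_speaker(loud, spk_scale=2.0)

# After normalization both vectors have a norm of (almost exactly) spk_scale,
# so the raw magnitude of the reference embedding no longer matters.
print(round(l2_norm(quiet_out), 4), round(l2_norm(loud_out), 4))
```

Both outputs end up with the same length, which is the property the V2 change relies on: only the direction of the embedding, not its size, reaches the decoder.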

	## Model Architecture

	| Component | Details |
	|---|---|
	| Text Encoder | 4-layer bidirectional Transformer (d=384, 6 heads, ff=1536) |
	| Audio Decoder | 8-layer causal Transformer (d=384, 6 heads, ff=1536) with cross-attention |
	| Speaker Injection | L2-normalized Linear(128 → 384) with learned scale, additive bias |
	| Audio Codec | [MioCodec](https://huggingface.co/Aratako/MioCodec-25Hz-24kHz) 25Hz, 1 codebook, 12800 codes, 24kHz output |
	| Total Parameters | 38.2M (Encoder: 9.6M, Decoder: 28.6M) |
	| Activations | SwiGLU |
	| Normalization | RMSNorm (pre-norm) |
	| Positional Encoding | Learned (encoder), RoPE (decoder) |
	| Embeddings | Tied decoder (lm_head = token_embedding) |
	| KV-Cache | Yes (for fast autoregressive inference) |
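
The parameter total in the table can be sanity-checked with a back-of-envelope count. The sketch below is an estimate under stated assumptions, not a readout of `model.py`: it assumes bias-free linear layers, four d×d projections per attention block, and three d×ff matrices per SwiGLU feed-forward block. It ignores norms, positional embeddings, and the speaker projection.

```python
# Back-of-envelope parameter count from the architecture table.
# Assumptions (not confirmed by the source): no bias terms, and norms,
# positional embeddings and the speaker projection are small enough to ignore.
D_MODEL = 384
D_FF = 1536
VOCAB = 12_955          # total vocabulary from the Tokenizer section
ENC_LAYERS = 4
DEC_LAYERS = 8

attention = 4 * D_MODEL * D_MODEL   # Q, K, V and output projections
swiglu_ffn = 3 * D_MODEL * D_FF     # gate, up and down projections

encoder_blocks = ENC_LAYERS * (attention + swiglu_ffn)
# Each decoder layer has self-attention plus cross-attention.
decoder_blocks = DEC_LAYERS * (2 * attention + swiglu_ffn)
tied_embedding = VOCAB * D_MODEL    # shared by token_embedding and lm_head

total = encoder_blocks + decoder_blocks + tied_embedding
print(f"encoder blocks: {encoder_blocks / 1e6:.2f}M")
print(f"decoder blocks: {decoder_blocks / 1e6:.2f}M")
print(f"tied embedding: {tied_embedding / 1e6:.2f}M")
print(f"estimated total: {total / 1e6:.2f}M")
```

Under these assumptions the estimate comes to roughly 38.0M, within about 1% of the reported 38.2M. The small terms left out of the count would plausibly account for the remainder.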

	### Tokenizer

	Character-level tokenizer supporting 146 characters:
	- Bulgarian Cyrillic (А-Я, а-я)
	- English Latin (A-Z, a-z)
	- Digits, punctuation, whitespace

	Total vocabulary: 12,955 tokens (9 special + 146 text + 12,800 audio codes)
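
The vocabulary arithmetic above can be written out as a small ID-layout sketch. Only the three counts come from this model card. The ordering (special tokens first, then text characters, then audio codes) and the helper names are hypothetical, and the real `tokenizer.py` may arrange IDs differently.

```python
# Hypothetical ID layout: special tokens first, then text characters,
# then audio codes. Only the three counts come from the model card.
NUM_SPECIAL = 9
NUM_TEXT = 146
NUM_AUDIO = 12_800

TEXT_OFFSET = NUM_SPECIAL
AUDIO_OFFSET = NUM_SPECIAL + NUM_TEXT
VOCAB_SIZE = NUM_SPECIAL + NUM_TEXT + NUM_AUDIO

def audio_code_to_token_id(code):
    """Map a codec code index (0..12799) into the shared vocabulary."""
    if not 0 <= code < NUM_AUDIO:
        raise ValueError(f"audio code out of range: {code}")
    return AUDIO_OFFSET + code

def token_id_to_audio_code(token_id):
    """Inverse mapping; returns None for special or text tokens."""
    if AUDIO_OFFSET <= token_id < VOCAB_SIZE:
        return token_id - AUDIO_OFFSET
    return None

print(VOCAB_SIZE)                        # 12955
print(audio_code_to_token_id(0))         # 155
print(token_id_to_audio_code(12_954))    # 12799
```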

	## Training

	| Parameter | Value |
	|---|---|
	| Data | 728K samples, 1,537 hours total |
	| Bulgarian | ~620K samples (~1,368 hours) |
	| English | ~108K samples (~169 hours) |
	| Epochs | 20 |
	| LR Schedule | Cosine decay, peak 7e-5, warmup 2 epochs, min 5e-6 |
	| Batch Size | 64 |
	| Optimizer | AdamW (betas=0.9, 0.999), weight decay 0.01 |
	| Precision | BF16 (no GradScaler) |
	| Dropout | 0.0 (unnecessary: model is 38M, data is 1,537h) |
	| Final Loss | 5.04 |
	| Hardware | NVIDIA RTX 5090 (32GB VRAM) |
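
The learning-rate row of the table can be sketched as a function of the (possibly fractional) epoch. The peak, minimum, warmup length, and epoch count are taken from the table. The linear warmup starting from zero is an assumption, since the model card does not state the warmup shape, and `lr_at` is a hypothetical helper rather than code from `train.py`.

```python
import math

PEAK_LR = 7e-5
MIN_LR = 5e-6
WARMUP_EPOCHS = 2
TOTAL_EPOCHS = 20

def lr_at(epoch):
    """Learning rate at a (possibly fractional) epoch.

    Assumes linear warmup from 0 to the peak, then cosine decay to the
    minimum; the warmup shape is a guess, the model card does not state it.
    """
    if epoch < WARMUP_EPOCHS:
        return PEAK_LR * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine

# Print the rate at a few points along the schedule.
for epoch in (0, 1, 2, 11, 20):
    print(epoch, f"{lr_at(epoch):.2e}")
```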

	### Why Zero Dropout?

	With only 38M parameters and about 138M audio tokens (1,537 hours at 25 tokens per second), the model has roughly 0.28 parameters per token. At that ratio overfitting is very unlikely, and the model is more likely underfitting the data, so dropout would mostly slow convergence while adding little regularization benefit.
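
The token count and ratio quoted above follow from the dataset size and the codec frame rate. The short calculation below reproduces them, using the 38.2M parameter total from the architecture table.

```python
HOURS = 1_537
CODEC_FRAME_RATE_HZ = 25     # MioCodec emits 25 tokens per second
PARAMS = 38.2e6              # total parameters from the architecture table

audio_tokens = HOURS * 3600 * CODEC_FRAME_RATE_HZ
params_per_token = PARAMS / audio_tokens

print(f"{audio_tokens:,} audio tokens")
print(f"{params_per_token:.2f} parameters per token")
```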

	## Quick Start

	### Requirements

	```bash
	pip install torch torchaudio soundfile miocodec
	```

	### Python API

	```python
	import torch
	from model import load_for_inference
	from tokenizer import TTSTokenizer
	from codec import CodecV6
	from inference import generate

	device = "cuda" # or "cpu"

	# Load model
	model = load_for_inference("checkpoint_inference.pt", device=device)
	tokenizer = TTSTokenizer()
	codec = CodecV6(device=device)

	# Get speaker embedding from reference audio
	ref = codec.encode("reference_speaker.wav")
	speaker_emb = ref["global_embedding"].to(device)

	# Generate
	codes = generate(
	    model, tokenizer,
	    text="Здравейте, как сте днес?",
	    speaker_emb=speaker_emb,
	    temperature=0.3,
	    top_k=250,
	    max_new_tokens=512,
	    device=device,
	)

	# Decode to audio (the same speaker embedding drives waveform reconstruction)
	if codes is not None:
	    wav = codec.tokens_to_wav(codes, speaker_emb, "output.wav")
	```

	### CLI

	```bash
	python inference.py \
	    --checkpoint checkpoint_inference.pt \
	    --text "Здравейте, как сте днес?" \
	    --speaker-wav reference.wav \
	    --output output.wav \
	    --temperature 0.3
	```

	### Web UI (Gradio)

	```bash
	python server.py
	# Opens at http://localhost:7860
	```

	### Parameters

	| Parameter | Default | Description |
	|---|---|---|
	| `--temperature` | 0.3 | Sampling temperature (lower = stable, higher = expressive) |
	| `--top-k` | 250 | Top-k filtering |
	| `--top-p` | 0.95 | Nucleus sampling threshold |
	| `--rep-penalty` | 1.1 | Repetition penalty on recent tokens |
	| `--max-tokens` | 512 | Maximum decoder steps (~20 seconds) |

	Recommended temperature: 0.3 for clean, stable output. Use 0.5-0.7 for more expressive/varied speech.
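
To make the interaction of these parameters concrete, here is a minimal, dependency-free sketch of one sampling step. It is not the code in `inference.py`. The function names, the order of operations (repetition penalty, then temperature, then top-k, then top-p), and the CTRL-style penalty rule are assumptions chosen because they are common practice.

```python
import math
import random

def apply_repetition_penalty(logits, recent_tokens, penalty=1.1):
    """Make recently generated tokens less likely (CTRL-style penalty)."""
    out = list(logits)
    for tok in set(recent_tokens):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def filtered_probs(logits, temperature=0.3, top_k=250, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    scaled = [x / temperature for x in logits]
    # Keep the top_k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus cut: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for tok, p in zip(order, probs):
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {tok: p / mass for tok, p in kept}

def sample(logits, recent_tokens=(), rng=random, **kwargs):
    """Draw one token ID from the filtered distribution."""
    penalized = apply_repetition_penalty(logits, recent_tokens)
    probs = filtered_probs(penalized, **kwargs)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Toy 4-token example: compare the surviving candidates at two temperatures.
logits = [2.0, 1.0, 0.5, -1.0]
print(filtered_probs(logits, temperature=0.3))
print(filtered_probs(logits, temperature=0.7))
```

On this toy input, a temperature of 0.3 leaves only the top token inside the 0.95 nucleus, while 0.7 keeps three candidates. That is the stable versus expressive trade-off described above.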

	## ⚠️ Important: Sentence Length

	> The encoder supports up to 256 characters (~18 seconds of audio). For longer texts, `inference.py` automatically splits by sentence boundaries and concatenates the audio. No manual splitting needed.
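
For readers calling the model from their own code rather than through `inference.py`, the sketch below shows one way such splitting could work. It is an illustrative stand-in, not the actual implementation. The function name `split_text`, the punctuation set, and the word-level fallback for over-long sentences are all assumptions.

```python
import re

MAX_CHARS = 256  # encoder limit stated above

def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            # Fall back to words for a single over-long sentence, slicing
            # any word that is itself longer than max_chars.
            pieces = []
            for word in sentence.split():
                pieces.extend(
                    word[i:i + max_chars] for i in range(0, len(word), max_chars)
                )
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate   # still fits: keep growing this chunk
            else:
                chunks.append(current)
                current = piece       # start a new chunk
    if current:
        chunks.append(current)
    return chunks

long_text = "Здравейте! Как сте днес? " * 30
print([len(chunk) for chunk in split_text(long_text)])
```

Each chunk would then be synthesized separately and the resulting audio concatenated. Packing several short sentences into one chunk is a design choice of this sketch, made to reduce the number of generation calls.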

	## Files

	```
	checkpoint_inference.pt   # Model weights only (146 MB)
	checkpoint.pt             # Full checkpoint with optimizer state (438 MB, for continued training)
	config.py                 # Model configuration
	model.py                  # Architecture (TTSEncoderDecoder + speaker normalization)
	tokenizer.py              # Character-level tokenizer
	codec.py                  # MioCodec wrapper
	inference.py              # Inference pipeline with KV-cache + sentence splitting
	train.py                  # Training script (BF16)
	server.py                 # Gradio web UI
	samples/                  # Audio samples (3 voices × 2 languages × 3 texts)
	```

	## Performance

	### Benchmarks

	| Hardware | RTF | Speed | Notes |
	|---|---|---|---|
	| Intel i3-9100F (CPU) | 0.30 | 3.3× real-time | Windows 10, CPU-only, no GPU |
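
For reference, the real-time factor (RTF) is the time spent generating divided by the duration of the generated audio, and the audio duration follows from the token count at 25 tokens per second. In the sketch below, the helper names and the 6.144-second timing are illustrative values chosen to land on an RTF of 0.30. They are not measurements.

```python
CODEC_FRAME_RATE_HZ = 25  # MioCodec tokens per second of audio

def audio_seconds(num_tokens, frame_rate_hz=CODEC_FRAME_RATE_HZ):
    """Duration of audio represented by a number of codec tokens."""
    return num_tokens / frame_rate_hz

def real_time_factor(synthesis_seconds, audio_duration_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return synthesis_seconds / audio_duration_seconds

duration = audio_seconds(512)            # longest single generation
rtf = real_time_factor(6.144, duration)  # 6.144 s is an illustrative timing
print(f"{duration:.2f} s of audio, RTF {rtf:.2f}, {1 / rtf:.1f}x real-time")
```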

	### CPU-only Deployment (Tested on Windows 10)

	| Component | Disk Space |
	|---|---|
	| Python venv (PyTorch CPU + deps) | 654 MB |
	| BgTTS-38M-V2 (checkpoint + code) | 146 MB |
	| MioCodec (auto-downloaded, cached) | 499 MB |
	| WavLM base+ (auto-downloaded, cached) | 872 MB |
	| Total | 2.12 GB |

	No NVIDIA GPU, no CUDA, no special drivers needed. Works on any x86-64 machine with Python 3.8+.

	## Comparison with Other Models

	| Model | Parameters | Size | Languages | Voice Cloning | Open Source |
	|---|---|---|---|---|---|
	| BgTTS-38M V2 | 38M | 146 MB | BG + EN | ✅ | ✅ |
	| Kokoro-82M | 82M | ~200 MB | Multi | ❌ | ✅ |
	| XTTS-v2 | ~467M | ~1.8 GB | 16 | ✅ | ✅ |
	| CSM-1B | 1B | ~4 GB | EN | ✅ | ✅ |
	| Dia-1.6B | 1.6B | ~6.4 GB | EN | ✅ | ✅ |

	BgTTS-38M V2 is the smallest TTS model with voice cloning we are aware of, and the only open-source TTS model with native Bulgarian language support.

	## Limitations

	- Best with sentences up to ~18 seconds. Longer texts are auto-split by `inference.py`.
	- Bulgarian quality is superior to English (per the Training table, about 89% of training hours and 85% of samples are Bulgarian).
	- Voice cloning quality depends on reference audio clarity — use clean recordings without background noise.
	- No explicit prosody control (pitch, speed) — these are implicitly learned from data.
	- Character-level tokenizer may struggle with rare Unicode characters outside the supported set.

	## License

	Apache 2.0

	## Citation

	```bibtex
	@misc{bgtts38mv2,
	title={BgTTS-38M V2: Bulgarian Text-to-Speech with Voice Cloning and Speaker Normalization},
	author={beleata74},
	year={2026},
	url={https://huggingface.co/beleata74/BgTTS-38M-V2}
	}
	```