AIT-Syn 4L — Multilingual TTS with Voice Cloning
A multilingual text-to-speech model supporting Kazakh, Russian, English, and Uzbek with cross-lingual voice cloning. Fine-tuned from Qwen3-TTS-12Hz-1.7B-Base.
Features
- 4 languages: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
- Voice cloning: clone a voice from a short reference recording (~5–10 s)
- Two cloning modes: x-vector-only (no transcript needed) or ICL (in-context learning; uses the reference transcript for higher quality)
- 12.5 Hz codec: efficient autoregressive generation
- 24 kHz output: PCM 16-bit WAV
Quick Start
Installation
pip install qwen-tts torch soundfile
Generate Speech
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Load the fine-tuned checkpoint in bf16 on the first GPU.
model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn-4L",
    dtype="bfloat16",
    device_map="cuda:0",
)
model.model.eval()

# X-vector-only mode (no ref transcript needed)
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлем.",
    language="kazakh",
    ref_audio="ref_audio_kk.wav",
    x_vector_only_mode=True,
    non_streaming_mode=True,  # return complete audio rather than a generator
)
sf.write("output.wav", wavs[0], sr)

# ICL mode (provide ref transcript for better quality).
# ref_text is the transcript of ref_audio; the target text may be in another language.
wavs, sr = model.generate_voice_clone(
    text="Привет, это тестовое предложение.",
    language="russian",
    ref_audio="ref_audio_kk.wav",
    ref_text="Бұл анықтамалық аудио.",
    x_vector_only_mode=False,
    non_streaming_mode=True,
)
sf.write("output_icl.wav", wavs[0], sr)
API Reference
generate_voice_clone()
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str or list[str] | required | Text to synthesize |
| language | str | required | Language name: kazakh, russian, english, uzbek |
| ref_audio | str or (ndarray, sr) | required | Reference audio: file path, URL, base64, or (waveform, sample_rate) |
| ref_text | str or None | None | Transcript of ref audio (enables ICL mode) |
| x_vector_only_mode | bool | False | If True, use only x-vector speaker embedding (no ICL) |
| non_streaming_mode | bool | False | If True, return complete audio; if False, return generator |
| temperature | float | 0.9 | Sampling temperature |
| top_k | int | 50 | Top-k sampling |
| top_p | float | 1.0 | Nucleus sampling threshold |
| repetition_penalty | float | 1.05 | Repetition penalty |
Returns: (list[np.ndarray], int) — list of waveforms and sample rate (24000).
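The Features list uses two-letter codes (kk, ru, en, uz) while the table above asks for language names, so a small normalizer can help avoid mismatches. The following is a minimal sketch, not part of the package: `LANGUAGE_NAMES`, `normalize_language`, `SAMPLING_DEFAULTS`, and `sampling_kwargs` are hypothetical helper names, and the default values are copied from the table above.

```python
# Hypothetical helpers; not part of qwen-tts or this model repository.

# Two-letter codes from the Features list mapped to the names in the API table.
LANGUAGE_NAMES = {
    "kk": "kazakh",
    "ru": "russian",
    "en": "english",
    "uz": "uzbek",
}

# Sampling defaults as listed in the parameter table.
SAMPLING_DEFAULTS = {
    "temperature": 0.9,
    "top_k": 50,
    "top_p": 1.0,
    "repetition_penalty": 1.05,
}


def normalize_language(language: str) -> str:
    """Accept a code ("kk") or a name ("Kazakh") and return the lowercase name."""
    key = language.strip().lower()
    name = LANGUAGE_NAMES.get(key, key)
    if name not in LANGUAGE_NAMES.values():
        supported = ", ".join(sorted(LANGUAGE_NAMES.values()))
        raise ValueError(f"Unsupported language {language!r}; expected one of: {supported}")
    return name


def sampling_kwargs(**overrides):
    """Return the table defaults with any overrides applied, rejecting unknown keys."""
    unknown = set(overrides) - set(SAMPLING_DEFAULTS)
    if unknown:
        raise TypeError(f"Unknown sampling parameters: {sorted(unknown)}")
    return {**SAMPLING_DEFAULTS, **overrides}
```

Assuming `model` was loaded as in Quick Start, the result could be passed along as `model.generate_voice_clone(text=..., language=normalize_language("kk"), ref_audio=..., **sampling_kwargs(temperature=0.7))`.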
Voice Cloning Modes
X-vector-only (x_vector_only_mode=True)
Uses only the speaker embedding extracted from reference audio. No transcript of the reference is needed. Good for quick cloning when you don't have a transcript.
ICL Mode (x_vector_only_mode=False, provide ref_text)
In-context learning mode: the model sees both the reference audio and its transcript, producing higher-fidelity voice matching. Recommended when a transcript is available.
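The choice between the two modes comes down to whether a transcript of the reference audio is on hand. A minimal sketch of that decision follows; `cloning_mode_kwargs` is a hypothetical helper (not part of the package) that only builds keyword arguments using the parameter names from the API Reference.

```python
# Hypothetical helper; it only assembles keyword arguments and never calls the model.

def cloning_mode_kwargs(ref_audio, ref_text=None):
    """Pick ICL mode when a non-empty transcript is given, else x-vector-only mode."""
    transcript = (ref_text or "").strip()
    if transcript:
        # ICL: the model sees the reference audio together with its transcript.
        return {
            "ref_audio": ref_audio,
            "ref_text": transcript,
            "x_vector_only_mode": False,
        }
    # No usable transcript: fall back to the speaker embedding alone.
    return {"ref_audio": ref_audio, "x_vector_only_mode": True}
```

The returned dictionary can be unpacked into `generate_voice_clone()` alongside `text` and `language`.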
Serving
A FastAPI server is available for production deployment:
pip install fastapi uvicorn python-multipart soundfile
# Start server
python serve_tts.py --model nur-dev/ait-syn-4L --port 8000
# Or with uvicorn directly
CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /tts | POST | Synthesize speech (returns WAV) |
| /tts/batch | POST | Batch synthesis (returns ZIP of WAVs) |
| /health | GET | Health check |
| /languages | GET | List supported languages |
Example Request
curl -X POST http://localhost:8000/tts \
  -F "text=Сәлеметсіз бе" \
  -F "language=kk" \
  -F "ref_audio=@ref_audio_kk.wav" \
  --output output.wav
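The same request can be sent from Python. The sketch below uses only the standard library and mirrors the form fields of the curl example (`text`, `language`, `ref_audio`); `build_multipart` and `synthesize` are hypothetical names, and the server is assumed to accept exactly the fields shown above and to respond with WAV bytes.

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode("utf-8"),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"'.encode(),
        b"Content-Type: audio/wav",
        b"",
        file_bytes,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(text, language, ref_path, out_path, base_url="http://localhost:8000"):
    """POST to /tts and write the response body to out_path."""
    ref = Path(ref_path)
    body, content_type = build_multipart(
        {"text": text, "language": language}, "ref_audio", ref.name, ref.read_bytes()
    )
    request = urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        Path(out_path).write_bytes(response.read())
```

With the server running, `synthesize("Сәлеметсіз бе", "kk", "ref_audio_kk.wav", "output.wav")` would correspond to the curl call above.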
Technical Specs
| Spec | Value |
|---|---|
| Parameters | 1.7B |
| Architecture | Qwen3TTSForConditionalGeneration |
| Codec rate | 12.5 Hz (16 sub-codecs) |
| Output sample rate | 24 kHz |
| Precision | bf16 |
| Max generation length | 8192 tokens (~10 min audio) |
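The "~10 min" figure follows from the codec rate. As a rough worked check, assuming one generated token corresponds to one 12.5 Hz codec frame (an assumption; the table does not define what a token covers), the numbers in the table relate as follows:

```python
# Values taken from the Technical Specs table.
CODEC_RATE_HZ = 12.5   # codec frames per second of audio
SUB_CODECS = 16        # codes per frame
MAX_TOKENS = 8192      # max generation length
SAMPLE_RATE = 24_000   # output samples per second


def frames_to_seconds(frames: int) -> float:
    """Audio duration covered by a number of codec frames.

    Assumes one token equals one codec frame.
    """
    return frames / CODEC_RATE_HZ


max_seconds = frames_to_seconds(MAX_TOKENS)      # 8192 / 12.5 = 655.36 s
max_minutes = max_seconds / 60                   # about 10.9 min
samples_per_frame = SAMPLE_RATE / CODEC_RATE_HZ  # 1920 output samples per frame
codes_per_second = CODEC_RATE_HZ * SUB_CODECS    # 200 codes per second of audio

print(f"{max_seconds:.2f} s ({max_minutes:.1f} min), "
      f"{samples_per_frame:.0f} samples/frame, {codes_per_second:.0f} codes/s")
```

Under that assumption the limit is closer to 11 minutes than 10, which is consistent with the rounded figure in the table.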
Reference Audio
A sample Kazakh male reference audio is included as ref_audio_kk.wav (mono, 24 kHz, ~10 s).
License
CC-BY-NC-4.0