# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning
AIT-Syn is a multilingual text-to-speech model supporting Kazakh, Russian, and English, with voice cloning. It is built on the Qwen3-TTS architecture and fine-tuned from Qwen/Qwen3-TTS-12Hz-1.7B-Base.
## Supported Languages

| Language | Code |
|---|---|
| Kazakh | `kazakh` |
| Russian | `russian` |
| English | `english` |
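The model expects these full language names rather than ISO codes (see Tips below). A small convenience helper, our own sketch and not part of the qwen-tts API, can normalize ISO 639-1 codes to the expected names:

```python
# Hypothetical convenience map (not part of the qwen-tts API): the model
# expects full language names ("kazakh", "russian", "english"), not ISO codes.
ISO_TO_NAME = {"kk": "kazakh", "ru": "russian", "en": "english"}

def to_language_name(code_or_name: str) -> str:
    """Normalize an ISO 639-1 code or a full name to the expected full name."""
    key = code_or_name.strip().lower()
    return ISO_TO_NAME.get(key, key)
```

For example, `to_language_name("kk")` returns `"kazakh"`, and full names pass through unchanged.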
## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |
## Installation

```bash
pip install qwen-tts torch soundfile

# Optional: faster attention
pip install flash-attn
```
## Usage

### Voice Cloning with Transcript (Recommended)

Providing a transcript of the reference audio gives the best voice-matching quality:
```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Use FlashAttention 2 if available, otherwise fall back to eager attention
try:
    import flash_attn  # noqa: F401
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()

# Kazakh example ("Hello, this is a test sentence.")
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
### Voice Cloning without Transcript

If you only have the reference audio (no transcript):
```python
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
A Russian example:

```python
# "Good afternoon! This is a test sentence in Russian."
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
## Generation Parameters

| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.9 | Sampling temperature: lower = more stable, higher = more expressive |
| `top_k` | 50 | Top-k sampling |
| `top_p` | 1.0 | Nucleus sampling |
| `repetition_penalty` | 1.0 | Repetition penalty |
| `do_sample` | True | Sampling vs. greedy decoding |
| `non_streaming_mode` | True | Generate the full audio before returning |
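To illustrate how these knobs trade stability for expressiveness, here are two keyword-argument presets for `generate_voice_clone`. The preset names and values are our own suggestions, not shipped defaults:

```python
# Illustrative sampling presets (our own names and values, not model defaults).
STABLE = {
    "temperature": 0.5,        # lower temperature: more stable delivery
    "top_k": 20,
    "top_p": 1.0,
    "repetition_penalty": 1.0,
    "do_sample": True,
}
EXPRESSIVE = {
    "temperature": 1.1,        # higher temperature: more varied prosody
    "top_k": 50,
    "top_p": 0.95,
    "repetition_penalty": 1.0,
    "do_sample": True,
}

# Usage (with a loaded model, as in the examples above):
# wavs, sr = model.generate_voice_clone(
#     text=..., ref_audio=..., language=..., **STABLE)
```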
## Tips

- Output audio is 24 kHz mono.
- Reference audio should be clean speech, 5–15 seconds long.
- Use full language names: `"kazakh"`, `"russian"`, `"english"` (not ISO codes).
- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode.
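To check that a reference clip falls in the recommended 5–15 second window before cloning, a minimal sketch (the helper functions are our own, not part of qwen-tts; with a real file you would get the frame count and sample rate from `soundfile.info`):

```python
# Hypothetical helpers (not part of qwen-tts): validate reference-clip length.
def ref_duration_seconds(num_frames: int, sample_rate: int) -> float:
    """Clip duration in seconds, given its frame count and sample rate."""
    return num_frames / float(sample_rate)

def is_good_reference(num_frames: int, sample_rate: int,
                      lo: float = 5.0, hi: float = 15.0) -> bool:
    """True if the clip is within the recommended 5-15 s window."""
    return lo <= ref_duration_seconds(num_frames, sample_rate) <= hi

# With a real file:
#   info = soundfile.info("reference.wav")
#   print(is_good_reference(info.frames, info.samplerate))
```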
## License

This model is released under CC BY-NC 4.0 (non-commercial use only).

## Commercial Use

For commercial licensing, please contact: nurgaliqadyrbek@gmail.com