# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning
AIT-Syn is a multilingual text-to-speech model supporting Kazakh, Russian, and English, with voice cloning. It is built on the Qwen3-TTS architecture and fine-tuned from Qwen/Qwen3-TTS-12Hz-1.7B-Base.
## Supported Languages

| Language | Code |
|---|---|
| Kazakh | `kazakh` |
| Russian | `russian` |
| English | `english` |
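The model expects these full language names rather than ISO codes (see Tips below). A small convenience helper, our own sketch and not part of the qwen-tts API, can normalize ISO 639-1 codes to the expected names:

```python
# Hypothetical convenience map (not part of the qwen-tts API): the model
# expects full language names ("kazakh", "russian", "english"), not ISO codes.
ISO_TO_NAME = {"kk": "kazakh", "ru": "russian", "en": "english"}

def to_language_name(code_or_name: str) -> str:
    """Normalize an ISO 639-1 code or a full name to the expected full name."""
    key = code_or_name.strip().lower()
    return ISO_TO_NAME.get(key, key)
```

For example, `to_language_name("kk")` returns `"kazakh"`, and full names pass through unchanged.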
## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |
## Installation

```bash
pip install qwen-tts torch soundfile

# Optional: faster attention
pip install flash-attn
```
## Usage

### Voice Cloning with Transcript (Recommended)

Providing a transcript of the reference audio gives the best voice-matching quality:
```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Use FlashAttention 2 if available, otherwise fall back to eager attention
try:
    import flash_attn  # noqa: F401
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()

# Kazakh example ("Hello, this is a test sentence.")
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
### Voice Cloning without Transcript

If you only have the reference audio (no transcript):
```python
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
A Russian example:

```python
# "Good afternoon! This is a test sentence in Russian."
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
## Generation Parameters

| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.9 | Sampling temperature: lower = more stable, higher = more expressive |
| `top_k` | 50 | Top-k sampling |
| `top_p` | 1.0 | Nucleus sampling |
| `repetition_penalty` | 1.0 | Repetition penalty |
| `do_sample` | True | Sampling vs. greedy decoding |
| `non_streaming_mode` | True | Generate the full audio before returning |
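To illustrate how these knobs trade stability for expressiveness, here are two keyword-argument presets for `generate_voice_clone`. The preset names and values are our own suggestions, not shipped defaults:

```python
# Illustrative sampling presets (our own names and values, not model defaults).
STABLE = {
    "temperature": 0.5,        # lower temperature: more stable delivery
    "top_k": 20,
    "top_p": 1.0,
    "repetition_penalty": 1.0,
    "do_sample": True,
}
EXPRESSIVE = {
    "temperature": 1.1,        # higher temperature: more varied prosody
    "top_k": 50,
    "top_p": 0.95,
    "repetition_penalty": 1.0,
    "do_sample": True,
}

# Usage (with a loaded model, as in the examples above):
# wavs, sr = model.generate_voice_clone(
#     text=..., ref_audio=..., language=..., **STABLE)
```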
## Tips

- Output audio is 24 kHz mono.
- Reference audio should be clean speech, 5–15 seconds long.
- Use full language names: `"kazakh"`, `"russian"`, `"english"` (not ISO codes).
- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode.
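To check that a reference clip falls in the recommended 5–15 second window before cloning, a minimal sketch (the helper functions are our own, not part of qwen-tts; with a real file you would get the frame count and sample rate from `soundfile.info`):

```python
# Hypothetical helpers (not part of qwen-tts): validate reference-clip length.
def ref_duration_seconds(num_frames: int, sample_rate: int) -> float:
    """Clip duration in seconds, given its frame count and sample rate."""
    return num_frames / float(sample_rate)

def is_good_reference(num_frames: int, sample_rate: int,
                      lo: float = 5.0, hi: float = 15.0) -> bool:
    """True if the clip is within the recommended 5-15 s window."""
    return lo <= ref_duration_seconds(num_frames, sample_rate) <= hi

# With a real file:
#   info = soundfile.info("reference.wav")
#   print(is_good_reference(info.frames, info.samplerate))
```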
## License

This model is released under CC BY-NC 4.0 (non-commercial use only).

## Commercial Use

For commercial licensing, please contact: nurgaliqadyrbek@gmail.com