
# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning

AIT-Syn is a multilingual text-to-speech model with voice cloning support for Kazakh, Russian, and English. It is built on the Qwen3-TTS architecture and fine-tuned from `Qwen/Qwen3-TTS-12Hz-1.7B-Base`.

## Supported Languages

| Language | Code |
|----------|------|
| Kazakh   | `kazakh` |
| Russian  | `russian` |
| English  | `english` |
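
The model expects these full language names rather than ISO codes (see Tips below). A small convenience helper for normalizing input is sketched here; the mapping and the function name are our own and not part of the `qwen-tts` API:

```python
# Hypothetical convenience mapping: AIT-Syn expects full language names,
# not ISO 639-1 codes. This helper is NOT part of the qwen-tts API.
ISO_TO_NAME = {"kk": "kazakh", "ru": "russian", "en": "english"}

def normalize_language(lang: str) -> str:
    """Accept either an ISO 639-1 code or a full name; return the full name."""
    lang = lang.strip().lower()
    if lang in ISO_TO_NAME:
        return ISO_TO_NAME[lang]
    if lang in ISO_TO_NAME.values():
        return lang
    raise ValueError(f"Unsupported language: {lang!r}")
```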

## Model Details

| Property | Value |
|----------|-------|
| Base model | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |

## Installation

```shell
pip install qwen-tts torch soundfile
# Optional: faster attention
pip install flash-attn
```

## Usage

### Voice Cloning with Transcript (Recommended)

Providing the transcript of the reference audio yields the best voice-matching quality:

```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Use FlashAttention 2 if it is installed, otherwise fall back to eager attention
try:
    import flash_attn
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()

# Kazakh example
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Voice Cloning without Transcript

If you only have the reference audio (no transcript):

```python
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Russian Example

```python
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `temperature` | 0.9 | Sampling temperature; lower = more stable, higher = more expressive |
| `top_k` | 50 | Top-k sampling |
| `top_p` | 1.0 | Nucleus (top-p) sampling |
| `repetition_penalty` | 1.0 | Penalty applied to repeated tokens |
| `do_sample` | `True` | Sampling vs. greedy decoding |
| `non_streaming_mode` | `True` | Generate the full audio before returning |
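
The defaults above can be kept in one place and merged with per-call overrides before being passed to `generate_voice_clone`. The parameter names and defaults come from the table; the dict and helper below are a local convenience of ours, not part of the `qwen-tts` API:

```python
# Defaults from the table above. generation_kwargs() is a local helper
# (our own, not a qwen-tts API) for merging overrides safely.
DEFAULT_GENERATION_KWARGS = {
    "temperature": 0.9,
    "top_k": 50,
    "top_p": 1.0,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "non_streaming_mode": True,
}

def generation_kwargs(**overrides):
    """Return the default generation parameters with overrides applied."""
    unknown = set(overrides) - set(DEFAULT_GENERATION_KWARGS)
    if unknown:
        raise ValueError(f"Unknown generation parameters: {sorted(unknown)}")
    return {**DEFAULT_GENERATION_KWARGS, **overrides}
```

Usage would then look like `model.generate_voice_clone(text=..., ref_audio=..., language="kazakh", **generation_kwargs(temperature=0.7))`.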

## Tips

- Output audio is 24 kHz mono.
- Reference audio should be clean speech, 5–15 seconds long.
- Use full language names (`"kazakh"`, `"russian"`, `"english"`), not ISO codes.
- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode.
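
The reference-audio guidance above (clean mono speech, 5–15 seconds) can be sanity-checked before inference. A minimal sketch using only the standard-library `wave` module; the helper name and the pass/fail messages are our own assumptions:

```python
import wave

def check_reference_audio(path, min_sec=5.0, max_sec=15.0):
    """Return (ok, message) for a WAV file against the tips above."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
        channels = wf.getnchannels()
    if channels != 1:
        return False, f"expected mono, got {channels} channels"
    if not (min_sec <= duration <= max_sec):
        return False, f"duration {duration:.1f}s outside {min_sec}-{max_sec}s"
    return True, f"ok ({duration:.1f}s mono)"
```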

## License

This model is released under CC BY-NC 4.0 (non-commercial use only).

### Commercial Use

For commercial licensing, please contact: nurgaliqadyrbek@gmail.com
