---
license: cc-by-nc-4.0
language:
  - kk
  - ru
  - en
tags:
  - text-to-speech
  - tts
  - voice-cloning
  - qwen3-tts
  - kazakh
  - multilingual
library_name: qwen-tts
pipeline_tag: text-to-speech
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
---

# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning

AIT-Syn is a multilingual text-to-speech model with voice cloning, supporting Kazakh, Russian, and English. It is built on the Qwen3-TTS architecture and fine-tuned from Qwen/Qwen3-TTS-12Hz-1.7B-Base.

## Supported Languages

| Language | Code |
|----------|------|
| Kazakh   | `kazakh` |
| Russian  | `russian` |
| English  | `english` |
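Because the model expects full language names rather than the ISO codes used in the metadata, a small helper can translate between the two. This is a hypothetical convenience function, not part of the `qwen-tts` API; only the three code/name pairs come from this card:

```python
# Hypothetical helper: map the ISO 639-1 codes from the card metadata to the
# full language-name strings that generate_voice_clone() expects.
LANGUAGE_NAMES = {
    "kk": "kazakh",
    "ru": "russian",
    "en": "english",
}

def to_language_name(code: str) -> str:
    """Return the full language name for an ISO code, or raise if unsupported."""
    try:
        return LANGUAGE_NAMES[code.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code!r}") from None
```

For example, `to_language_name("kk")` returns `"kazakh"`, which can be passed directly as the `language` argument.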

## Model Details

| Property | Value |
|----------|-------|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |

## Installation

```bash
pip install qwen-tts torch soundfile
# Optional: faster attention
pip install flash-attn
```

## Usage

### Voice Cloning with Transcript (Recommended)

Providing the transcript of the reference audio gives the best voice-matching quality:

```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Use FlashAttention 2 when available, otherwise fall back to eager attention.
try:
    import flash_attn
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()

# Kazakh example
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,  # ICL mode: uses ref_text for better voice matching
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Voice Cloning without Transcript

If you only have the reference audio (no transcript):

```python
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,  # x-vector-only mode: no transcript required
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Russian Example

```python
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `temperature` | 0.9 | Sampling temperature; lower is more stable, higher is more expressive |
| `top_k` | 50 | Top-k sampling |
| `top_p` | 1.0 | Nucleus sampling threshold |
| `repetition_penalty` | 1.0 | Penalty for repeated tokens |
| `do_sample` | True | Sampling vs. greedy decoding |
| `non_streaming_mode` | True | Generate the full audio before returning |
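One way to work with these parameters is to group them into named presets and splat them into the generation call. The preset names and the non-default values below are illustrative suggestions; only the parameter names and defaults come from the table above:

```python
# Illustrative generation presets built from the parameters in the table above.
# The preset names and the non-default values are suggestions, not model API.
GENERATION_PRESETS = {
    # Lower temperature / smaller top_k: steadier prosody, less variation.
    "stable": {"temperature": 0.6, "top_k": 20, "top_p": 1.0,
               "repetition_penalty": 1.0, "do_sample": True},
    # The card's defaults: a balance of stability and expressiveness.
    "default": {"temperature": 0.9, "top_k": 50, "top_p": 1.0,
                "repetition_penalty": 1.0, "do_sample": True},
    # More sampling freedom: livelier but less predictable output.
    "expressive": {"temperature": 1.1, "top_k": 80, "top_p": 0.95,
                   "repetition_penalty": 1.0, "do_sample": True},
}
```

A preset can then be applied with, for example, `model.generate_voice_clone(..., **GENERATION_PRESETS["stable"])`.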

## Tips

- Output audio is 24 kHz mono.
- Reference audio should be clean speech, 5–15 seconds long.
- Use full language names ("kazakh", "russian", "english"), not ISO codes.
- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode.
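The reference-audio guidance above can be sketched as a small preprocessing step. This is a minimal sketch assuming the clip is already loaded as a NumPy array (e.g. via `soundfile.read`); the 15-second cap comes from the tips, while the function itself is a hypothetical helper:

```python
import numpy as np

MAX_REF_SECONDS = 15  # upper bound suggested in the tips above

def prepare_reference(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix a loaded reference clip to mono and cap it at 15 seconds.

    `audio` is a (samples,) or (samples, channels) float array, as returned
    by e.g. soundfile.read(); `sr` is its sample rate in Hz.
    """
    if audio.ndim == 2:            # stereo/multichannel -> average to mono
        audio = audio.mean(axis=1)
    max_samples = MAX_REF_SECONDS * sr
    return audio[:max_samples]     # drop anything past the 15-second mark
```

The result can be written back out with `soundfile.write` and passed as `ref_audio`. Checking for background noise or clipping is left to the caller.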

## License

This model is released under CC BY-NC 4.0 (non-commercial use only).

### Commercial Use

For commercial licensing, please contact nurgaliqadyrbek@gmail.com.