---
license: cc-by-nc-4.0
language:
  - kk
  - ru
  - en
tags:
  - text-to-speech
  - tts
  - voice-cloning
  - qwen3-tts
  - kazakh
  - multilingual
library_name: qwen-tts
pipeline_tag: text-to-speech
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
---

# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning

AIT-Syn is a multilingual text-to-speech model with voice cloning, supporting Kazakh, Russian, and English. It is built on the Qwen3-TTS architecture and fine-tuned from Qwen/Qwen3-TTS-12Hz-1.7B-Base.

## Supported Languages

| Language | Code |
|----------|------|
| Kazakh   | `kazakh` |
| Russian  | `russian` |
| English  | `english` |
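Because the model expects full language names rather than the ISO codes used in the metadata, a small helper can translate between the two. This is a hypothetical convenience function, not part of the `qwen-tts` API; only the three code/name pairs come from this card:

```python
# Hypothetical helper: map the ISO 639-1 codes from the card metadata to the
# full language-name strings that generate_voice_clone() expects.
LANGUAGE_NAMES = {
    "kk": "kazakh",
    "ru": "russian",
    "en": "english",
}

def to_language_name(code: str) -> str:
    """Return the full language name for an ISO code, or raise if unsupported."""
    try:
        return LANGUAGE_NAMES[code.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code!r}") from None
```

For example, `to_language_name("kk")` returns `"kazakh"`, which can be passed directly as the `language` argument.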

## Model Details

| Property | Value |
|----------|-------|
| Base model | Qwen/Qwen3-TTS-12Hz-1.7B-Base |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |

## Installation

```bash
pip install qwen-tts torch soundfile
# Optional: faster attention
pip install flash-attn
```

## Usage

### Voice Cloning with Transcript (Recommended)

Providing the transcript of the reference audio gives the best voice-matching quality:

```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Use FlashAttention 2 when available, otherwise fall back to eager attention.
try:
    import flash_attn
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()

# Kazakh example
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,  # ICL mode: uses ref_text for better voice matching
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Voice Cloning without Transcript

If you only have the reference audio (no transcript):

```python
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,  # x-vector-only mode: no transcript required
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Russian Example

```python
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `temperature` | 0.9 | Sampling temperature; lower is more stable, higher is more expressive |
| `top_k` | 50 | Top-k sampling |
| `top_p` | 1.0 | Nucleus sampling threshold |
| `repetition_penalty` | 1.0 | Penalty for repeated tokens |
| `do_sample` | True | Sampling vs. greedy decoding |
| `non_streaming_mode` | True | Generate the full audio before returning |
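One way to work with these parameters is to group them into named presets and splat them into the generation call. The preset names and the non-default values below are illustrative suggestions; only the parameter names and defaults come from the table above:

```python
# Illustrative generation presets built from the parameters in the table above.
# The preset names and the non-default values are suggestions, not model API.
GENERATION_PRESETS = {
    # Lower temperature / smaller top_k: steadier prosody, less variation.
    "stable": {"temperature": 0.6, "top_k": 20, "top_p": 1.0,
               "repetition_penalty": 1.0, "do_sample": True},
    # The card's defaults: a balance of stability and expressiveness.
    "default": {"temperature": 0.9, "top_k": 50, "top_p": 1.0,
                "repetition_penalty": 1.0, "do_sample": True},
    # More sampling freedom: livelier but less predictable output.
    "expressive": {"temperature": 1.1, "top_k": 80, "top_p": 0.95,
                   "repetition_penalty": 1.0, "do_sample": True},
}
```

A preset can then be applied with, for example, `model.generate_voice_clone(..., **GENERATION_PRESETS["stable"])`.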

## Tips

- Output audio is 24 kHz mono.
- Reference audio should be clean speech, 5–15 seconds long.
- Use full language names ("kazakh", "russian", "english"), not ISO codes.
- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode.
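The reference-audio guidance above can be sketched as a small preprocessing step. This is a minimal sketch assuming the clip is already loaded as a NumPy array (e.g. via `soundfile.read`); the 15-second cap comes from the tips, while the function itself is a hypothetical helper:

```python
import numpy as np

MAX_REF_SECONDS = 15  # upper bound suggested in the tips above

def prepare_reference(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix a loaded reference clip to mono and cap it at 15 seconds.

    `audio` is a (samples,) or (samples, channels) float array, as returned
    by e.g. soundfile.read(); `sr` is its sample rate in Hz.
    """
    if audio.ndim == 2:            # stereo/multichannel -> average to mono
        audio = audio.mean(axis=1)
    max_samples = MAX_REF_SECONDS * sr
    return audio[:max_samples]     # drop anything past the 15-second mark
```

The result can be written back out with `soundfile.write` and passed as `ref_audio`. Checking for background noise or clipping is left to the caller.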

## License

This model is released under CC BY-NC 4.0 (non-commercial use only).

### Commercial Use

For commercial licensing, please contact nurgaliqadyrbek@gmail.com.