---
license: cc-by-nc-4.0
language:
- kk
- ru
- en
tags:
- text-to-speech
- tts
- voice-cloning
- qwen3-tts
- kazakh
- multilingual
library_name: qwen-tts
pipeline_tag: text-to-speech
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
---

# AIT-Syn: Multilingual Text-to-Speech with Voice Cloning

**AIT-Syn** is a multilingual text-to-speech model that supports **Kazakh**, **Russian**, and **English** and can clone a voice from a short reference recording. It is built on the Qwen3-TTS architecture and fine-tuned from `Qwen/Qwen3-TTS-12Hz-1.7B-Base`.

## Supported Languages

| Language | Code |
|----------|------|
| Kazakh | `kazakh` |
| Russian | `russian` |
| English | `english` |

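Because the `language` argument takes full names rather than ISO codes, a small lookup can guard against passing `kk` or `ru` by mistake. The sketch below is illustrative only: `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names and are not part of `qwen-tts`.

```python
# Hypothetical helper, not part of qwen-tts: maps ISO 639-1 codes to the
# full language names that the model expects.
SUPPORTED_LANGUAGES = {"kk": "kazakh", "ru": "russian", "en": "english"}


def normalize_language(value: str) -> str:
    """Return the full language name for an ISO code or a full name."""
    key = value.strip().lower()
    if key in SUPPORTED_LANGUAGES:
        return SUPPORTED_LANGUAGES[key]
    if key in SUPPORTED_LANGUAGES.values():
        return key
    raise ValueError(f"Unsupported language: {value!r}")
```

The result can be passed directly as the `language` argument, for example `language=normalize_language("kk")`.
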
## Model Details

| Property | Value |
|----------|-------|
| Base model | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |

## Installation

```bash
pip install qwen-tts torch soundfile
# Optional: faster attention
pip install flash-attn
```

## Usage

### Voice Cloning with Transcript (Recommended)

Providing the transcript of the reference audio gives the best voice matching quality:

```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

# Use FlashAttention 2 when the optional flash-attn package is installed;
# otherwise fall back to the default eager attention.
try:
    import flash_attn  # noqa: F401
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()

# Kazakh example
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,  # ICL mode: uses the reference transcript
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Voice Cloning without Transcript

If you only have the reference audio and no transcript, set `x_vector_only_mode=True` and omit `ref_text`:

```python
# Reuses `model` and `sf` from the first example.
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

### Russian Example

```python
# Reuses `model` and `sf` from the first example.
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```

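The calls in this section differ mainly in whether a reference transcript is available. As a rough sketch, that choice can be folded into one place. `clone_kwargs` is a hypothetical helper, not part of `qwen-tts`; it only assembles the keyword arguments already shown in the examples and treats a blank transcript as missing.

```python
from typing import Any, Dict, Optional


def clone_kwargs(
    text: str,
    ref_audio: str,
    language: str,
    ref_text: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for generate_voice_clone (hypothetical helper)."""
    has_transcript = bool(ref_text and ref_text.strip())
    kwargs: Dict[str, Any] = {
        "text": text,
        "ref_audio": ref_audio,
        "language": language,
        # With a transcript, use ICL mode; without one, use x-vector-only mode.
        "x_vector_only_mode": not has_transcript,
        "non_streaming_mode": True,
    }
    if has_transcript:
        kwargs["ref_text"] = ref_text
    return kwargs
```

With a loaded model, this would be used as `wavs, sr = model.generate_voice_clone(**clone_kwargs("Hello.", "reference.wav", "english"))`.
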
## Generation Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `temperature` | 0.9 | Sampling temperature; lower is more stable, higher is more expressive |
| `top_k` | 50 | Top-k sampling: sample only from the k most likely tokens |
| `top_p` | 1.0 | Nucleus sampling threshold; 1.0 leaves it disabled |
| `repetition_penalty` | 1.0 | Repetition penalty; 1.0 applies no penalty |
| `do_sample` | `True` | Sampling (`True`) vs. greedy decoding (`False`) |
| `non_streaming_mode` | `True` | Generate the full audio before returning |

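To keep sampling settings in one place, the defaults from the table can be captured in a small function. This is a sketch only: `sampling_kwargs` is a hypothetical helper, and its range checks are conservative assumptions rather than limits documented by `qwen-tts`.

```python
from typing import Any, Dict


def sampling_kwargs(
    temperature: float = 0.9,
    top_k: int = 50,
    top_p: float = 1.0,
    repetition_penalty: float = 1.0,
    do_sample: bool = True,
) -> Dict[str, Any]:
    """Return sampling arguments using the defaults from the table above."""
    # Assumed sanity checks; the library itself may accept other values.
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if repetition_penalty <= 0:
        raise ValueError("repetition_penalty must be positive")
    return {
        "temperature": temperature,
        "top_k": top_k,
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "do_sample": do_sample,
    }
```

For a steadier voice, one might lower the temperature and pass the result through, for example `model.generate_voice_clone(..., **sampling_kwargs(temperature=0.7))`.
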
## Tips

- Output audio is 24 kHz mono
- Reference audio should be clean speech, 5–15 seconds
- Use full language names: `"kazakh"`, `"russian"`, `"english"` (not ISO codes)
- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode

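The 5–15 second guideline for reference audio can be checked before synthesis. The sketch below uses Python's standard `wave` module, so it assumes the reference is an uncompressed PCM WAV file; the function names are hypothetical and not part of `qwen-tts`.

```python
import wave


def reference_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_good_reference_length(
    path: str, min_s: float = 5.0, max_s: float = 15.0
) -> bool:
    """Check whether the clip length falls within the recommended range."""
    return min_s <= reference_duration_seconds(path) <= max_s
```

This checks length only; whether the clip is clean speech still has to be judged by listening.
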
## License

This model is released under **CC BY-NC 4.0** (non-commercial use only).

## Commercial Use

For commercial licensing, please contact: **nurgaliqadyrbek@gmail.com**