---
license: cc-by-nc-4.0
language:
- kk
- ru
- en
tags:
- text-to-speech
- tts
- voice-cloning
- qwen3-tts
- kazakh
- multilingual
library_name: qwen-tts
pipeline_tag: text-to-speech
base_model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
---
# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning
**AIT-Syn** is a multilingual text-to-speech model supporting **Kazakh**, **Russian**, and **English**, with voice cloning capability. It is built on the Qwen3-TTS architecture and fine-tuned from `Qwen/Qwen3-TTS-12Hz-1.7B-Base`.
## Supported Languages
| Language | Code |
|----------|------|
| Kazakh | `kazakh` |
| Russian | `russian` |
| English | `english` |
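The table lists ISO 639-1 codes for the card metadata, but the generation API expects the full lowercase language names (see Tips below). A small hypothetical helper to normalize either form; `LANG_NAMES` and `to_model_language` are illustrative names, not part of the library:

```python
# Map ISO 639-1 codes to the full language names the model expects.
LANG_NAMES = {"kk": "kazakh", "ru": "russian", "en": "english"}

def to_model_language(code: str) -> str:
    """Resolve an ISO code or full name to the model's language string."""
    code = code.lower()
    if code in LANG_NAMES:
        return LANG_NAMES[code]
    if code in LANG_NAMES.values():
        return code
    raise ValueError(f"Unsupported language: {code!r}")
```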
## Model Details
| Property | Value |
|----------|-------|
| Base model | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` |
| Parameters | 1.7B |
| Output sample rate | 24 kHz |
## Installation
```bash
pip install qwen-tts torch soundfile
# Optional: faster attention
pip install flash-attn
```
## Usage
### Voice Cloning with Transcript (Recommended)
Providing the transcript of the reference audio gives the best voice matching quality:
```python
import torch
import soundfile as sf
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
try:
    import flash_attn  # noqa: F401
    attn_impl = "flash_attention_2"
except ImportError:
    attn_impl = "eager"

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn",
    dtype=torch.bfloat16,
    attn_implementation=attn_impl,
    device_map="cuda:0",
)
model.model.eval()
# Kazakh example ("Hello, this is a test sentence.")
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
    language="kazakh",
    x_vector_only_mode=False,
    non_streaming_mode=True,
    temperature=0.9,
    top_k=50,
    do_sample=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
### Voice Cloning without Transcript
If you only have the reference audio (no transcript):
```python
wavs, sr = model.generate_voice_clone(
    text="Hello, this is a test sentence.",
    ref_audio="reference.wav",
    language="english",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
### Russian example
```python
# "Good afternoon! This is a test sentence in Russian."
wavs, sr = model.generate_voice_clone(
    text="Добрый день! Это тестовое предложение на русском языке.",
    ref_audio="reference.wav",
    language="russian",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
```
## Generation Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `temperature` | 0.9 | Sampling temperature — lower = more stable, higher = more expressive |
| `top_k` | 50 | Top-k sampling |
| `top_p` | 1.0 | Nucleus sampling |
| `repetition_penalty` | 1.0 | Repetition penalty |
| `do_sample` | `True` | Sampling vs greedy decoding |
| `non_streaming_mode` | `True` | Generate full audio before returning |
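As a rough sketch, the parameters above can be bundled into keyword-argument presets and unpacked into `generate_voice_clone(...)` with `**preset`. The preset names and non-default values here are illustrative starting points, not tested recommendations:

```python
# Illustrative presets following the parameter table:
# lower temperature/top_k for stability, higher for expressiveness.
STABLE = dict(temperature=0.6, top_k=20, top_p=0.9,
              repetition_penalty=1.05, do_sample=True)
EXPRESSIVE = dict(temperature=1.1, top_k=80, top_p=1.0,
                  repetition_penalty=1.0, do_sample=True)

# Usage (hypothetical):
# wavs, sr = model.generate_voice_clone(
#     text=..., ref_audio=..., language="kazakh", **STABLE)
```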
## Tips
- Output audio is 24 kHz mono
- Reference audio should be clean speech, 5–15 seconds
- Use full language names: `"kazakh"`, `"russian"`, `"english"` (not ISO codes)
- In-context-learning (ICL) mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode
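The reference-audio recommendations above can be checked programmatically. A minimal sketch, assuming the clip is already loaded as a NumPy array (e.g. via `soundfile.read`); the function name and thresholds mirror the tips and are not part of the library:

```python
import numpy as np

def check_reference(wav: np.ndarray, sr: int) -> list[str]:
    """Return warnings if a reference clip violates the tips above."""
    warnings = []
    duration = wav.shape[0] / sr
    if not 5.0 <= duration <= 15.0:
        warnings.append(f"duration {duration:.1f}s outside recommended 5-15 s")
    if wav.ndim > 1 and wav.shape[1] > 1:
        warnings.append("audio is multi-channel; downmix to mono")
    return warnings
```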
## License
This model is released under **CC BY-NC 4.0** (non-commercial use only).
## Commercial Use
For commercial licensing, please contact: **nurgaliqadyrbek@gmail.com**