---
language: [kk, ru, en, uz]
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: transformers
tags:
- tts
- voice-cloning
- multilingual
- kazakh
- uzbek
- qwen3-tts
---

# AIT-Syn 4L: Multilingual TTS with Voice Cloning

A multilingual text-to-speech model supporting **Kazakh**, **Russian**, **English**, and **Uzbek** with cross-lingual voice cloning. Fine-tuned from Qwen3-TTS-12Hz-1.7B-Base.

## Features

- **4 languages**: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
- **Voice cloning**: clone a voice from a short reference audio clip (~5–10 s)
- **Two cloning modes**: x-vector-only (no transcript needed) or ICL (in-context learning with a reference transcript, higher quality)
- **12.5 Hz codec**: a low frame rate that keeps autoregressive sequences short
- **24 kHz output**: PCM 16-bit WAV

## Quick Start

### Installation

```bash
pip install qwen-tts torch soundfile
```

### Generate Speech

```python
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
import soundfile as sf

# Load the model in bf16 on the first GPU
model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn-4L",
    dtype="bfloat16",
    device_map="cuda:0",
)
model.model.eval()

# X-vector-only mode (no ref transcript needed)
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлем.",  # Kazakh: "Hello, this is a test sentence."
    language="kazakh",
    ref_audio="ref_audio_kk.wav",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr)

# ICL mode (provide ref transcript for better quality)
# Cross-lingual: Kazakh reference voice, Russian output
wavs, sr = model.generate_voice_clone(
    text="Привет, это тестовое предложение.",  # Russian: "Hello, this is a test sentence."
    language="russian",
    ref_audio="ref_audio_kk.wav",
    ref_text="Бұл анықтамалық аудио.",  # Kazakh: "This is reference audio."
    x_vector_only_mode=False,
    non_streaming_mode=True,
)
sf.write("output_icl.wav", wavs[0], sr)
```

## API Reference

### `generate_voice_clone()`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` or `list[str]` | required | Text to synthesize |
| `language` | `str` | required | Language name: `kazakh`, `russian`, `english`, `uzbek` |
| `ref_audio` | `str` or `(ndarray, sr)` | required | Reference audio: file path, URL, base64, or `(waveform, sample_rate)` |
| `ref_text` | `str` or `None` | `None` | Transcript of the reference audio (needed for ICL mode) |
| `x_vector_only_mode` | `bool` | `False` | If `True`, use only the x-vector speaker embedding (no ICL) |
| `non_streaming_mode` | `bool` | `False` | If `True`, return complete audio; if `False`, return a generator |
| `temperature` | `float` | `0.9` | Sampling temperature |
| `top_k` | `int` | `50` | Top-k sampling |
| `top_p` | `float` | `1.0` | Nucleus sampling threshold |
| `repetition_penalty` | `float` | `1.05` | Repetition penalty |

**Returns**: `(list[np.ndarray], int)`, a list of waveforms and the sample rate (24000 Hz).

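The feature list names languages by two-letter code (`kk`, `ru`, `en`, `uz`), while the `language` parameter takes full names. A small lookup can bridge the two. The sketch below is illustrative only: `LANGUAGE_NAMES` and `resolve_language` are hypothetical helpers, not part of `qwen-tts`.

```python
# Hypothetical helper, not part of qwen-tts: maps the two-letter codes from
# the feature list to the language names that generate_voice_clone() expects.
LANGUAGE_NAMES = {
    "kk": "kazakh",
    "ru": "russian",
    "en": "english",
    "uz": "uzbek",
}


def resolve_language(value: str) -> str:
    """Return a supported language name for a code or a name (case-insensitive)."""
    key = value.strip().lower()
    if key in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[key]
    if key in LANGUAGE_NAMES.values():
        return key
    supported = ", ".join(sorted(LANGUAGE_NAMES.values()))
    raise ValueError(f"Unsupported language {value!r}; expected one of: {supported}")
```

With this in place, `language=resolve_language("kk")` passes `"kazakh"` to `generate_voice_clone()`, and an unsupported value fails early with a readable error instead of reaching the model.
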
## Voice Cloning Modes

### X-vector-only Mode (`x_vector_only_mode=True`)

Uses only the speaker embedding extracted from the reference audio. No transcript of the reference is needed, so this mode suits quick cloning when no transcript is available.

### ICL Mode (`x_vector_only_mode=False`, provide `ref_text`)

In-context learning mode: the model sees both the reference audio and its transcript, which produces higher-fidelity voice matching. Recommended when a transcript is available.

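The choice between the two modes comes down to whether a reference transcript exists. As a minimal sketch, a caller could derive the mode-related arguments from that alone. `clone_kwargs` is a hypothetical helper, not part of `qwen-tts`; it only uses the parameter names documented in the table above.

```python
from typing import Any, Dict, Optional


def clone_kwargs(ref_audio: str, ref_text: Optional[str] = None) -> Dict[str, Any]:
    """Build mode-related keyword arguments for generate_voice_clone().

    Hypothetical helper: selects ICL mode when a non-empty transcript is
    given, and falls back to x-vector-only mode otherwise.
    """
    transcript = (ref_text or "").strip()
    kwargs: Dict[str, Any] = {
        "ref_audio": ref_audio,
        # No transcript means only the speaker embedding can be used.
        "x_vector_only_mode": not transcript,
    }
    if transcript:
        kwargs["ref_text"] = transcript
    return kwargs
```

A call would then look like `model.generate_voice_clone(text=..., language="kazakh", non_streaming_mode=True, **clone_kwargs("ref_audio_kk.wav", transcript))`, where `transcript` may be `None`.
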
## Serving

A FastAPI server is available for production deployment:

```bash
pip install fastapi uvicorn python-multipart soundfile

# Start server
python serve_tts.py --model nur-dev/ait-syn-4L --port 8000

# Or with uvicorn directly
CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000
```

### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/tts` | POST | Synthesize speech (returns WAV) |
| `/tts/batch` | POST | Batch synthesis (returns ZIP of WAVs) |
| `/health` | GET | Health check |
| `/languages` | GET | List supported languages |

### Example Request

```bash
curl -X POST http://localhost:8000/tts \
  -F "text=Сәлеметсіз бе" \
  -F "language=kk" \
  -F "ref_audio=@ref_audio_kk.wav" \
  --output output.wav
```

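The same request can be sent from Python. The sketch below mirrors the `curl` call using only the standard library. It assumes the server accepts the form fields shown above (`text`, `language`, `ref_audio`) and responds with raw WAV bytes; `build_multipart` and `synthesize` are hypothetical names introduced here for illustration.

```python
import urllib.request
import uuid
from pathlib import Path
from typing import Dict, Tuple


def build_multipart(fields: Dict[str, str], file_field: str,
                    filename: str, file_bytes: bytes) -> Tuple[bytes, str]:
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        chunks.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    )
    chunks.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def synthesize(text: str, language: str, ref_path: str, out_path: str,
               url: str = "http://localhost:8000/tts") -> None:
    """POST a synthesis request and save the response body (assumed to be WAV)."""
    ref = Path(ref_path)
    body, content_type = build_multipart(
        {"text": text, "language": language},
        "ref_audio", ref.name, ref.read_bytes(),
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        Path(out_path).write_bytes(response.read())
```

With the server running, `synthesize("Сәлеметсіз бе", "kk", "ref_audio_kk.wav", "output.wav")` should be equivalent to the `curl` example.
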
## Technical Specs

| Spec | Value |
|------|-------|
| Parameters | 1.7B |
| Architecture | Qwen3TTSForConditionalGeneration |
| Codec rate | 12.5 Hz (16 sub-codecs) |
| Output sample rate | 24 kHz |
| Precision | bf16 |
| Max generation length | 8192 tokens (~10 min audio) |

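The relationship between these figures can be checked with a little arithmetic. The sketch below assumes that one generated token corresponds to one codec frame (so the 16 sub-codecs do not count as separate steps); that assumption is not stated in the table, but it is what makes the quoted duration come out at roughly ten minutes.

```python
CODEC_RATE_HZ = 12.5      # codec frames per second of audio
SAMPLE_RATE_HZ = 24_000   # output samples per second
MAX_TOKENS = 8192         # maximum generation length

# Assumption: one generated token == one codec frame.
samples_per_frame = SAMPLE_RATE_HZ / CODEC_RATE_HZ   # 1920.0 samples
max_seconds = MAX_TOKENS / CODEC_RATE_HZ             # 655.36 s
max_minutes = max_seconds / 60                       # just under 11 min

print(f"{samples_per_frame:.0f} samples per frame")
print(f"{max_seconds:.2f} s = {max_minutes:.1f} min at the maximum length")
```

Under that assumption each codec frame covers 80 ms of audio, and the 8192-token limit corresponds to about 655 s of speech.
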
## Reference Audio

A sample Kazakh male reference audio is included as `ref_audio_kk.wav` (mono, 24 kHz, ~10 s).

## License

CC-BY-NC-4.0