---
language: [kk, ru, en, uz]
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: transformers
tags:
- tts
- voice-cloning
- multilingual
- kazakh
- uzbek
- qwen3-tts
---
# AIT-Syn 4L — Multilingual TTS with Voice Cloning
A multilingual text-to-speech model supporting **Kazakh**, **Russian**, **English**, and **Uzbek** with cross-lingual voice cloning. Fine-tuned from Qwen3-TTS-12Hz-1.7B-Base.
## Features
- **4 languages**: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
- **Voice cloning**: clone a voice from a short reference clip (~5–10 s)
- **Two cloning modes**: x-vector-only (no transcript needed) or ICL (with ref transcript, higher quality)
- **12.5 Hz codec**: 12.5 codec frames per second of audio, which keeps autoregressive sequences short
- **24 kHz output**: PCM 16-bit WAV
## Quick Start
### Installation
```bash
pip install qwen-tts torch soundfile
```
### Generate Speech
```python
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
import soundfile as sf

# Load the model in bf16 on the first GPU
model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn-4L",
    dtype="bfloat16",
    device_map="cuda:0",
)
model.model.eval()

# X-vector-only mode (no ref transcript needed)
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлем.",
    language="kazakh",
    ref_audio="ref_audio_kk.wav",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr)

# ICL mode (provide ref transcript for better quality)
wavs, sr = model.generate_voice_clone(
    text="Привет, это тестовое предложение.",
    language="russian",
    ref_audio="ref_audio_kk.wav",
    ref_text="Бұл анықтамалық аудио.",
    x_vector_only_mode=False,
    non_streaming_mode=True,
)
sf.write("output_icl.wav", wavs[0], sr)
```
## API Reference
### `generate_voice_clone()`
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` or `list[str]` | required | Text to synthesize |
| `language` | `str` | required | Language name: `kazakh`, `russian`, `english`, `uzbek` |
| `ref_audio` | `str` or `(ndarray, sr)` | required | Reference audio: file path, URL, base64, or `(waveform, sample_rate)` |
| `ref_text` | `str` or `None` | `None` | Transcript of ref audio (enables ICL mode) |
| `x_vector_only_mode` | `bool` | `False` | If `True`, use only x-vector speaker embedding (no ICL) |
| `non_streaming_mode` | `bool` | `False` | If `True`, return the complete audio; if `False`, return a generator |
| `temperature` | `float` | `0.9` | Sampling temperature |
| `top_k` | `int` | `50` | Top-k sampling |
| `top_p` | `float` | `1.0` | Nucleus sampling threshold |
| `repetition_penalty` | `float` | `1.05` | Repetition penalty |
**Returns**: `(list[np.ndarray], int)` — list of waveforms and sample rate (24000).
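The front matter and feature list identify languages by two-letter code (`kk`, `ru`, `en`, `uz`), while the `language` parameter here expects the full name. A minimal sketch of a lookup helper follows. `LANGUAGE_NAMES` and `resolve_language` are hypothetical names introduced for illustration and are not part of `qwen_tts`.

```python
# Hypothetical helper, not part of qwen_tts: maps the two-letter codes used in
# this README to the language names that generate_voice_clone() expects.
LANGUAGE_NAMES = {
    "kk": "kazakh",
    "ru": "russian",
    "en": "english",
    "uz": "uzbek",
}


def resolve_language(value: str) -> str:
    """Return a supported language name for a code or a name (case-insensitive)."""
    key = value.strip().lower()
    if key in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[key]
    if key in LANGUAGE_NAMES.values():
        return key
    supported = ", ".join(sorted(LANGUAGE_NAMES.values()))
    raise ValueError(f"Unsupported language {value!r}; expected one of: {supported}")
```

With this in place, `language=resolve_language("kk")` passes `"kazakh"` to `generate_voice_clone()`, and an unsupported code fails early with a readable error rather than inside the model call.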
## Voice Cloning Modes
### X-vector-only (`x_vector_only_mode=True`)
Uses only the speaker embedding extracted from the reference audio, so no transcript of the reference is needed. Suited to quick cloning.
### ICL Mode (`x_vector_only_mode=False`, provide `ref_text`)
In-context learning mode: the model sees both the reference audio and its transcript, producing higher-fidelity voice matching. Recommended when a transcript is available.
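The choice between the two modes can be made automatically from whether a transcript is available. The sketch below builds the keyword arguments for `generate_voice_clone()` accordingly. `build_clone_kwargs` is a hypothetical helper, and treating a blank transcript as "no transcript" is an assumption of this sketch rather than documented library behavior.

```python
from typing import Any, Optional


def build_clone_kwargs(ref_audio: str, ref_text: Optional[str] = None) -> dict[str, Any]:
    """Pick ICL mode when a non-empty transcript exists, else x-vector-only mode."""
    transcript = (ref_text or "").strip()
    kwargs: dict[str, Any] = {
        "ref_audio": ref_audio,
        # No usable transcript: fall back to the speaker embedding alone.
        "x_vector_only_mode": not transcript,
        "non_streaming_mode": True,
    }
    if transcript:
        kwargs["ref_text"] = transcript
    return kwargs


# Usage, assuming `model` was loaded as in Quick Start:
# wavs, sr = model.generate_voice_clone(
#     text="Hello, this is a test sentence.",
#     language="english",
#     **build_clone_kwargs("ref_audio_kk.wav", ref_text=None),
# )
```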
## Serving
A FastAPI server is available for production deployment:
```bash
pip install fastapi uvicorn python-multipart soundfile
# Start server
python serve_tts.py --model nur-dev/ait-syn-4L --port 8000
# Or with uvicorn directly
CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000
```
### API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/tts` | POST | Synthesize speech (returns WAV) |
| `/tts/batch` | POST | Batch synthesis (returns ZIP of WAVs) |
| `/health` | GET | Health check |
| `/languages` | GET | List supported languages |
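Loading a 1.7B model takes a while, so a client may want to wait for `/health` before sending requests. The following is a minimal standard-library sketch. It assumes `/health` answers HTTP 200 once the server is ready; the response body format is not documented here, so only the status code is checked. `wait_until_healthy` is a hypothetical name.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not ready yet: connection refused, timeout, or HTTP error.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage, assuming the server from the Serving section runs on port 8000:
# if not wait_until_healthy("http://localhost:8000/health"):
#     raise RuntimeError("TTS server did not become healthy in time")
```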
### Example Request
```bash
curl -X POST http://localhost:8000/tts \
  -F "text=Сәлеметсіз бе" \
  -F "language=kk" \
  -F "ref_audio=@ref_audio_kk.wav" \
  --output output.wav
```
## Technical Specs
| Spec | Value |
|------|-------|
| Parameters | 1.7B |
| Architecture | Qwen3TTSForConditionalGeneration |
| Codec rate | 12.5 Hz (16 sub-codecs) |
| Output sample rate | 24 kHz |
| Precision | bf16 |
| Max generation length | 8192 tokens (~10 min audio) |
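The relationship between the figures in the table above can be checked with a little arithmetic. The sketch assumes that each generated token position corresponds to one codec frame (carrying all 16 sub-codecs), which is consistent with the quoted duration but is not stated explicitly in this README.

```python
CODEC_RATE_HZ = 12.5      # codec frames per second of audio
MAX_NEW_TOKENS = 8192     # maximum generation length
SAMPLE_RATE_HZ = 24_000   # output sample rate


def tokens_to_seconds(num_tokens: int, codec_rate_hz: float = CODEC_RATE_HZ) -> float:
    """Audio duration covered by `num_tokens` codec frames (assumes 1 token = 1 frame)."""
    return num_tokens / codec_rate_hz


max_seconds = tokens_to_seconds(MAX_NEW_TOKENS)
print(f"{max_seconds:.2f} s = {max_seconds / 60:.1f} min")  # 655.36 s = 10.9 min

# Each codec frame expands to this many output samples at 24 kHz.
samples_per_frame = SAMPLE_RATE_HZ / CODEC_RATE_HZ
print(samples_per_frame)  # 1920.0
```

Under that assumption the cap works out to just under 11 minutes, in line with the approximate figure in the table.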
## Reference Audio
A sample reference clip of a Kazakh male speaker is included as `ref_audio_kk.wav` (mono, 24 kHz, ~10 s).
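When supplying your own reference clip, it can help to confirm that it resembles the bundled sample before cloning. The sketch below reads the header of a PCM WAV file with Python's standard `wave` module and flags clips that are not mono or fall outside the ~5–10 s range suggested under Features. `describe_wav` and `check_reference` are hypothetical helpers, and the thresholds are illustrative rather than requirements of the model.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read channel count, sample rate, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def check_reference(path: str, min_s: float = 5.0, max_s: float = 10.0) -> list[str]:
    """Return warnings for a reference clip; an empty list means no issues found."""
    info = describe_wav(path)
    warnings = []
    if info["channels"] != 1:
        warnings.append(f"expected mono audio, found {info['channels']} channels")
    if not (min_s <= info["duration_s"] <= max_s):
        warnings.append(
            f"duration {info['duration_s']:.1f} s is outside {min_s:.0f}-{max_s:.0f} s"
        )
    return warnings


# Usage:
# print(describe_wav("ref_audio_kk.wav"))
# print(check_reference("my_reference.wav"))
```

The `wave` module handles only uncompressed PCM WAV files; other formats would need a library such as `soundfile`.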
## License
CC-BY-NC-4.0