AIT-Syn 4L — Multilingual TTS with Voice Cloning

A multilingual text-to-speech model supporting Kazakh, Russian, English, and Uzbek with cross-lingual voice cloning. Fine-tuned from Qwen3-TTS-12Hz-1.7B-Base.

Features

4 languages: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
Voice cloning: clone any voice from a short reference audio clip (~5–10 s)
Two cloning modes: x-vector-only (no transcript needed) or ICL (with ref transcript, higher quality)
12.5 Hz codec: efficient autoregressive generation
24 kHz output: PCM 16-bit WAV
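
The list above gives two-letter codes, while the Python examples in this card pass full language names. A minimal sketch of a normalizer that accepts either form; `LANGUAGE_NAMES` and `normalize_language` are hypothetical helpers for illustration, not part of `qwen_tts`.

```python
# Hypothetical helper: maps the language codes listed in this card to the
# full names used in its Python examples.
LANGUAGE_NAMES = {
    "kk": "kazakh",
    "ru": "russian",
    "en": "english",
    "uz": "uzbek",
}


def normalize_language(value: str) -> str:
    """Return the full language name for a code or a name, case-insensitively."""
    key = value.strip().lower()
    if key in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[key]
    if key in LANGUAGE_NAMES.values():
        return key
    raise ValueError(f"Unsupported language: {value!r}")
```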

Quick Start

Installation

pip install qwen-tts torch soundfile

Generate Speech

from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
import soundfile as sf

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn-4L",
    dtype="bfloat16",
    device_map="cuda:0",
)
model.model.eval()

# X-vector-only mode (no reference transcript needed)
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлем.",  # Kazakh: "Hello, this is a test sentence."
    language="kazakh",
    ref_audio="ref_audio_kk.wav",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr)

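# ICL mode (provide the reference transcript for better quality).
# Cross-lingual: Russian text spoken with the Kazakh reference voice.
wavs, sr = model.generate_voice_clone(
    text="Привет, это тестовое предложение.",  # Russian: "Hello, this is a test sentence."
    language="russian",
    ref_audio="ref_audio_kk.wav",
    ref_text="Бұл анықтамалық аудио.",  # transcript of ref_audio (Kazakh: "This is a reference audio.")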
    x_vector_only_mode=False,
    non_streaming_mode=True,
)
sf.write("output_icl.wav", wavs[0], sr)

API Reference

`generate_voice_clone()`

Parameter	Type	Default	Description
`text`	`str` or `list[str]`	required	Text to synthesize
`language`	`str`	required	Language name: `kazakh`, `russian`, `english`, `uzbek`
`ref_audio`	`str` or `(ndarray, sr)`	required	Reference audio: file path, URL, base64, or `(waveform, sample_rate)`
`ref_text`	`str` or `None`	`None`	Transcript of ref audio (enables ICL mode)
`x_vector_only_mode`	`bool`	`False`	If `True`, use only x-vector speaker embedding (no ICL)
`non_streaming_mode`	`bool`	`False`	If `True`, return the complete audio; if `False`, return a generator
`temperature`	`float`	`0.9`	Sampling temperature
`top_k`	`int`	`50`	Top-k sampling
`top_p`	`float`	`1.0`	Nucleus sampling threshold
`repetition_penalty`	`float`	`1.05`	Repetition penalty

Returns (with `non_streaming_mode=True`): `(list[np.ndarray], int)`, a list of waveforms and the sample rate (24000 Hz).
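
The Quick Start saves the returned waveforms with `soundfile`. As a sketch of what 16-bit PCM output involves, the following writes one waveform using only the standard library. It assumes each waveform is a mono sequence of floats in the range [-1.0, 1.0], which this card does not state explicitly; `float_to_pcm16` and `write_wav_pcm16` are hypothetical helper names.

```python
import struct
import wave
from typing import Iterable


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Clip samples to [-1.0, 1.0] and pack them as little-endian int16."""
    ints = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_wav_pcm16(path: str, samples: Iterable[float], sample_rate: int = 24000) -> float:
    """Write a mono 16-bit PCM WAV file and return its duration in seconds."""
    frames = float_to_pcm16(samples)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # assumption: mono output
        wf.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(frames)
    return len(frames) / 2 / sample_rate
```

With the Quick Start variables, `write_wav_pcm16("output.wav", wavs[0], sr)` would play the same role as `sf.write("output.wav", wavs[0], sr)`.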

Voice Cloning Modes

X-vector-only (`x_vector_only_mode=True`)

Uses only the speaker embedding extracted from the reference audio, so no transcript of the reference is needed. Suited to quick cloning when you don't have a transcript.

ICL Mode (`x_vector_only_mode=False`, provide `ref_text`)

In-context learning mode: the model sees both the reference audio and its transcript, producing higher-fidelity voice matching. Recommended when a transcript is available.
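
Choosing between the two modes comes down to whether a transcript is available. A minimal sketch of that decision, using the `ref_audio`, `ref_text`, and `x_vector_only_mode` parameters from the API reference; `clone_kwargs` itself is a hypothetical helper, not part of `qwen_tts`.

```python
from typing import Any, Dict, Optional


def clone_kwargs(ref_audio: str, ref_text: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for generate_voice_clone() (hypothetical helper).

    Selects ICL mode when a non-empty transcript is given,
    and x-vector-only mode otherwise.
    """
    transcript = (ref_text or "").strip()
    kwargs: Dict[str, Any] = {
        "ref_audio": ref_audio,
        "x_vector_only_mode": not transcript,
    }
    if transcript:
        kwargs["ref_text"] = transcript
    return kwargs
```

The result can be unpacked into the call, for example `model.generate_voice_clone(text=..., language="kazakh", non_streaming_mode=True, **clone_kwargs("ref_audio_kk.wav"))`.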

Serving

A FastAPI server is available for production deployment:

pip install fastapi uvicorn python-multipart soundfile

# Start server
python serve_tts.py --model nur-dev/ait-syn-4L --port 8000

# Or with uvicorn directly
CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000

API Endpoints

Endpoint	Method	Description
`/tts`	POST	Synthesize speech (returns WAV)
`/tts/batch`	POST	Batch synthesis (returns ZIP of WAVs)
`/health`	GET	Health check
`/languages`	GET	List supported languages
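
This card does not document the request format of `/tts/batch` or the file names inside the returned ZIP, so the sketch below covers only the unpacking step: given the response body as bytes, it collects every `.wav` entry. `wavs_from_zip` is a hypothetical helper name.

```python
import io
import zipfile
from typing import Dict


def wavs_from_zip(zip_bytes: bytes) -> Dict[str, bytes]:
    """Return {member name: file bytes} for every .wav entry in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.lower().endswith(".wav")
        }
```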

Example Request

curl -X POST http://localhost:8000/tts \
  -F "text=Сәлеметсіз бе" \
  -F "language=kk" \
  -F "ref_audio=@ref_audio_kk.wav" \
  --output output.wav
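
The same request can be sent from Python. This is a sketch using only the standard library, assuming the server accepts the multipart form fields shown in the curl command (`text`, `language`, `ref_audio`) and responds with WAV bytes; `encode_multipart` and `synthesize` are hypothetical helper names.

```python
import os
import urllib.request
import uuid
from typing import Dict, Tuple


def encode_multipart(
    fields: Dict[str, str], files: Dict[str, Tuple[str, bytes]]
) -> Tuple[bytes, str]:
    """Encode form fields and files as multipart/form-data.

    Returns (body, value for the Content-Type header).
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        chunks.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    for name, (filename, data) in files.items():
        header = (
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f'Content-Type: application/octet-stream\r\n\r\n'
        )
        chunks.append(header.encode("utf-8") + data + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def synthesize(base_url: str, text: str, language: str, ref_audio_path: str,
               timeout: float = 120.0) -> bytes:
    """POST to /tts with the fields from the curl example; return the response body."""
    with open(ref_audio_path, "rb") as f:
        ref_bytes = f.read()
    body, content_type = encode_multipart(
        {"text": text, "language": language},
        {"ref_audio": (os.path.basename(ref_audio_path), ref_bytes)},
    )
    request = urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

For example, `synthesize("http://localhost:8000", "Сәлеметсіз бе", "kk", "ref_audio_kk.wav")` mirrors the curl call; the returned bytes can be written to `output.wav`.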

Technical Specs

Spec	Value
Parameters	1.7B
Architecture	Qwen3TTSForConditionalGeneration
Codec rate	12.5 Hz (16 sub-codecs)
Output sample rate	24 kHz
Precision	bf16
Max generation length	8192 tokens (~10.9 min of audio at 12.5 Hz)
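
The relationship between token count and audio length follows from the codec rate. A short worked example, assuming one generated token corresponds to one 12.5 Hz codec frame and that each frame carries one code per sub-codec; both are readings of the table above rather than confirmed details.

```python
CODEC_FRAME_RATE_HZ = 12.5  # from the specs table
MAX_TOKENS = 8192           # from the specs table
SUB_CODECS = 16             # from the specs table


def tokens_to_seconds(num_tokens: int, frame_rate_hz: float = CODEC_FRAME_RATE_HZ) -> float:
    """Audio duration in seconds, assuming one token per codec frame."""
    return num_tokens / frame_rate_hz


max_seconds = tokens_to_seconds(MAX_TOKENS)
print(f"{max_seconds:.2f} s = {max_seconds / 60:.1f} min")  # 655.36 s = 10.9 min

# Assumption: 16 sub-codecs means 16 codes per frame.
print(f"{CODEC_FRAME_RATE_HZ * SUB_CODECS:.0f} codes per second of audio")  # 200 codes per second of audio
```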

Reference Audio

A sample Kazakh male reference audio is included as ref_audio_kk.wav (mono, 24 kHz, ~10 s).
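
To check that your own reference clip has similar properties (mono, known sample rate, roughly 5–10 s), a minimal sketch with the standard-library `wave` module follows. It reads PCM WAV files only, and `describe_wav` is a hypothetical helper name.

```python
import wave
from typing import Dict, Union


def describe_wav(path: str) -> Dict[str, Union[int, float]]:
    """Read channel count, sample rate, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "duration_s": wf.getnframes() / rate,
        }
```

Calling `describe_wav("ref_audio_kk.wav")` on the bundled sample should report one channel and a 24000 Hz sample rate if the file matches the description above.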

License

CC-BY-NC-4.0
