metadata
language:
  - kk
  - ru
  - en
  - uz
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
library_name: transformers
tags:
  - tts
  - voice-cloning
  - multilingual
  - kazakh
  - uzbek
  - qwen3-tts

AIT-Syn 4L — Multilingual TTS with Voice Cloning

A multilingual text-to-speech model supporting Kazakh, Russian, English, and Uzbek with cross-lingual voice cloning. Fine-tuned from Qwen3-TTS-12Hz-1.7B-Base.

Features

  • 4 languages: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
  • Voice cloning: clone a voice from a short reference recording (~5–10 s)
  • Two cloning modes: x-vector-only (no transcript needed) or ICL (uses a transcript of the reference audio, higher quality)
  • 12.5 Hz codec: efficient autoregressive generation
  • 24 kHz output: PCM 16-bit WAV

Quick Start

Installation

pip install qwen-tts torch soundfile

Generate Speech

from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
import soundfile as sf

model = Qwen3TTSModel.from_pretrained(
    "nur-dev/ait-syn-4L",
    dtype="bfloat16",
    device_map="cuda:0",
)
model.model.eval()

# X-vector-only mode (no ref transcript needed)
wavs, sr = model.generate_voice_clone(
    text="Сәлеметсіз бе, бұл сынақ сөйлем.",
    language="kazakh",
    ref_audio="ref_audio_kk.wav",
    x_vector_only_mode=True,
    non_streaming_mode=True,
)
sf.write("output.wav", wavs[0], sr)

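# ICL mode (better quality): ref_text must be the exact transcript of ref_audio.
# The target text may be in a different language from the reference (cross-lingual cloning).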
wavs, sr = model.generate_voice_clone(
    text="Привет, это тестовое предложение.",
    language="russian",
    ref_audio="ref_audio_kk.wav",
    ref_text="Бұл анықтамалық аудио.",
    x_vector_only_mode=False,
    non_streaming_mode=True,
)
sf.write("output_icl.wav", wavs[0], sr)

API Reference

generate_voice_clone()

| Parameter | Type | Default | Description |
|---|---|---|---|
| text | str or list[str] | required | Text to synthesize |
| language | str | required | Language name: kazakh, russian, english, uzbek |
| ref_audio | str or (ndarray, sr) | required | Reference audio: file path, URL, base64, or (waveform, sample_rate) |
| ref_text | str or None | None | Transcript of ref audio (enables ICL mode) |
| x_vector_only_mode | bool | False | If True, use only x-vector speaker embedding (no ICL) |
| non_streaming_mode | bool | False | If True, return complete audio; if False, return generator |
| temperature | float | 0.9 | Sampling temperature |
| top_k | int | 50 | Top-k sampling |
| top_p | float | 1.0 | Nucleus sampling threshold |
| repetition_penalty | float | 1.05 | Repetition penalty |

Returns: (list[np.ndarray], int) — list of waveforms and sample rate (24000).
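The returned waveforms can also be saved without soundfile. The sketch below uses only the Python standard library to write one waveform as the 24 kHz, 16-bit PCM WAV described under Features. `write_pcm16_wav` is a hypothetical helper, not part of qwen-tts, and it assumes each waveform is a mono sequence of floats in the range [-1.0, 1.0].

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write mono float samples (assumed to lie in [-1.0, 1.0]) as a 16-bit PCM WAV.

    Hypothetical helper for illustration; not part of qwen-tts.
    Returns the duration of the written audio in seconds.
    """
    # Clip each sample to the valid range, then scale to signed 16-bit integers.
    ints = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack("<%dh" % len(ints), *ints))
    return len(ints) / sample_rate
```

With the Quick Start variables in scope, `write_pcm16_wav("output.wav", wavs[0], sr)` would write the first waveform and return its length in seconds.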

Voice Cloning Modes

X-vector-only (x_vector_only_mode=True)

Uses only the speaker embedding extracted from the reference audio, so no transcript of the reference is needed. Use it for quick cloning when a transcript is unavailable.

ICL Mode (x_vector_only_mode=False, provide ref_text)

In-context learning mode: the model sees both the reference audio and its transcript, producing higher-fidelity voice matching. Recommended when a transcript is available.
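The choice between the two modes can be made automatically from whether a transcript is at hand. The following is a minimal sketch, not part of the library: `build_clone_kwargs` and `SUPPORTED_LANGUAGES` are hypothetical names, and the keyword arguments are the ones listed in the API reference above.

```python
SUPPORTED_LANGUAGES = ("kazakh", "russian", "english", "uzbek")


def build_clone_kwargs(text, language, ref_audio, ref_text=None):
    """Assemble keyword arguments for generate_voice_clone().

    Hypothetical helper: selects ICL mode when a reference transcript is
    given and falls back to x-vector-only mode otherwise.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    kwargs = {
        "text": text,
        "language": language,
        "ref_audio": ref_audio,
        "non_streaming_mode": True,
    }
    # A blank or whitespace-only transcript is treated as missing.
    has_transcript = bool(ref_text and ref_text.strip())
    if has_transcript:
        kwargs["ref_text"] = ref_text
    kwargs["x_vector_only_mode"] = not has_transcript
    return kwargs
```

A call would then look like `wavs, sr = model.generate_voice_clone(**build_clone_kwargs(text, "kazakh", "ref_audio_kk.wav"))`, which runs in x-vector-only mode because no transcript was passed.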

Serving

A FastAPI server is available for production deployment:

pip install fastapi uvicorn python-multipart soundfile

# Start server
python serve_tts.py --model nur-dev/ait-syn-4L --port 8000

# Or with uvicorn directly
CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /tts | POST | Synthesize speech (returns WAV) |
| /tts/batch | POST | Batch synthesis (returns ZIP of WAVs) |
| /health | GET | Health check |
| /languages | GET | List supported languages |

Example Request

curl -X POST http://localhost:8000/tts \
  -F "text=Сәлеметсіз бе" \
  -F "language=kk" \
  -F "ref_audio=@ref_audio_kk.wav" \
  --output output.wav
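The same request can be sent from Python. This is a rough standard-library sketch that mirrors the curl call: the form field names (text, language, ref_audio) and the URL come from the example above, while `encode_multipart` and `synthesize` are hypothetical helper names. It assumes the server answers with raw WAV bytes, as the endpoint table states.

```python
import urllib.request
import uuid


def encode_multipart(fields, files):
    """Encode text fields and files as a multipart/form-data body.

    fields: dict of name -> str
    files:  dict of name -> (filename, bytes)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    for name, (filename, data) in files.items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        )
        parts.append(header.encode("utf-8") + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(text, language, ref_audio_path, url="http://localhost:8000/tts"):
    """POST a synthesis request to /tts and return the response body."""
    with open(ref_audio_path, "rb") as audio_file:
        audio = audio_file.read()
    body, content_type = encode_multipart(
        {"text": text, "language": language},
        {"ref_audio": ("ref_audio.wav", audio)},
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, `open("output.wav", "wb").write(synthesize("Сәлеметсіз бе", "kk", "ref_audio_kk.wav"))` would save the result, matching the `--output` flag of the curl command.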

Technical Specs

| Spec | Value |
|---|---|
| Parameters | 1.7B |
| Architecture | Qwen3TTSForConditionalGeneration |
| Codec rate | 12.5 Hz (16 sub-codecs) |
| Output sample rate | 24 kHz |
| Precision | bf16 |
| Max generation length | 8192 tokens (~11 min of audio) |
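The maximum audio duration follows from the codec rate and the token limit in the table above. The short calculation below assumes one generated token corresponds to one codec frame (the 16 sub-codecs belong to the same frame and do not add to the count); the constant and function names are illustrative.

```python
CODEC_FRAME_RATE_HZ = 12.5   # codec rate from the spec table
MAX_NEW_TOKENS = 8192        # max generation length from the spec table


def tokens_to_seconds(num_tokens, frame_rate_hz=CODEC_FRAME_RATE_HZ):
    """Audio duration in seconds, assuming one generated token per codec frame."""
    return num_tokens / frame_rate_hz


max_seconds = tokens_to_seconds(MAX_NEW_TOKENS)
print(f"{max_seconds:.2f} s = {max_seconds / 60:.1f} min")
# prints: 655.36 s = 10.9 min
```

The same function gives a rough token budget check for shorter clips: a 10 s utterance needs about 125 tokens.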

Reference Audio

A sample reference recording of a male Kazakh speaker is included as ref_audio_kk.wav (mono, 24 kHz, ~10 s).
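Before using your own recording as a reference, it can help to confirm that it resembles the bundled sample. The sketch below reads the channel count, sample rate, and duration with the standard-library `wave` module; `describe_wav` is a hypothetical helper, and `wave` only handles uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration
```

For example, `channels, sample_rate, duration = describe_wav("my_reference.wav")` lets you compare a clip against the mono, 24 kHz, roughly 5 to 10 second profile described in this README.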

License

CC-BY-NC-4.0