	---
	language: [kk, ru, en, uz]
	license: cc-by-nc-4.0
	pipeline_tag: text-to-speech
	library_name: transformers
	tags:
	- tts
	- voice-cloning
	- multilingual
	- kazakh
	- uzbek
	- qwen3-tts
	---

	# AIT-Syn 4L — Multilingual TTS with Voice Cloning

	A multilingual text-to-speech model supporting Kazakh, Russian, English, and Uzbek with cross-lingual voice cloning. Fine-tuned from Qwen3-TTS-12Hz-1.7B-Base.

	## Features

	- 4 languages: Kazakh (kk), Russian (ru), English (en), Uzbek (uz)
	- Voice cloning: clone a speaker's voice from a short reference clip (~5–10 s)
	- Two cloning modes: x-vector-only (no transcript needed) or ICL (with ref transcript, higher quality)
	- 12.5 Hz codec: efficient autoregressive generation
	- 24 kHz output: PCM 16-bit WAV

	## Quick Start

	### Installation

	```bash
	pip install qwen-tts torch soundfile
	```

	### Generate Speech

	```python
	from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
	import soundfile as sf

	model = Qwen3TTSModel.from_pretrained(
	    "nur-dev/ait-syn-4L",
	    dtype="bfloat16",
	    device_map="cuda:0",
	)
	model.model.eval()

	# X-vector-only mode (no ref transcript needed)
	wavs, sr = model.generate_voice_clone(
	    text="Сәлеметсіз бе, бұл сынақ сөйлем.",  # Kazakh: "Hello, this is a test sentence."
	    language="kazakh",
	    ref_audio="ref_audio_kk.wav",
	    x_vector_only_mode=True,
	    non_streaming_mode=True,
	)
	sf.write("output.wav", wavs[0], sr)

	# ICL mode (provide ref transcript for better quality).
	# Cross-lingual example: Kazakh reference voice, Russian output text.
	wavs, sr = model.generate_voice_clone(
	    text="Привет, это тестовое предложение.",  # Russian: "Hi, this is a test sentence."
	    language="russian",
	    ref_audio="ref_audio_kk.wav",
	    ref_text="Бұл анықтамалық аудио.",  # transcript of the reference clip
	    x_vector_only_mode=False,
	    non_streaming_mode=True,
	)
	sf.write("output_icl.wav", wavs[0], sr)
	```

	## API Reference

	### `generate_voice_clone()`

	| Parameter | Type | Default | Description |
	|-----------|------|---------|-------------|
	| `text` | `str` or `list[str]` | required | Text to synthesize |
	| `language` | `str` | required | Language name: `kazakh`, `russian`, `english`, `uzbek` |
	| `ref_audio` | `str` or `(ndarray, sr)` | required | Reference audio: file path, URL, base64, or `(waveform, sample_rate)` |
	| `ref_text` | `str` or `None` | `None` | Transcript of ref audio (enables ICL mode) |
	| `x_vector_only_mode` | `bool` | `False` | If `True`, use only x-vector speaker embedding (no ICL) |
	| `non_streaming_mode` | `bool` | `False` | If `True`, return complete audio; if `False`, return generator |
	| `temperature` | `float` | `0.9` | Sampling temperature |
	| `top_k` | `int` | `50` | Top-k sampling |
	| `top_p` | `float` | `1.0` | Nucleus sampling threshold |
	| `repetition_penalty` | `float` | `1.05` | Repetition penalty |

	Returns: `(list[np.ndarray], int)` — list of waveforms and sample rate (24000).
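The front matter and feature list use ISO codes (`kk`, `ru`, `en`, `uz`), while `language` in the table above takes lowercase language names. The following is a minimal sketch of a normalizer that accepts either form. `LANGUAGE_NAMES` and `normalize_language` are hypothetical helpers for illustration and are not part of `qwen_tts`.

```python
# Hypothetical helper, not part of qwen_tts: maps the ISO codes listed in this
# card to the language names that generate_voice_clone() is documented to accept.
LANGUAGE_NAMES = {
    "kk": "kazakh",
    "ru": "russian",
    "en": "english",
    "uz": "uzbek",
}


def normalize_language(value: str) -> str:
    """Return the lowercase language name for an ISO code or a language name."""
    key = value.strip().lower()
    if key in LANGUAGE_NAMES:
        return LANGUAGE_NAMES[key]
    if key in LANGUAGE_NAMES.values():
        return key
    supported = ", ".join(sorted(LANGUAGE_NAMES.values()))
    raise ValueError(f"Unsupported language {value!r}; expected one of: {supported}")


print(normalize_language("kk"))       # kazakh
print(normalize_language("Russian"))  # russian
```

The result can be passed directly as `language=normalize_language(code)`.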

	## Voice Cloning Modes

	### X-vector-only (`x_vector_only_mode=True`)

	Uses only the speaker embedding extracted from the reference audio, so no transcript of the reference is needed. Suited to quick cloning when a transcript is unavailable.

	### ICL Mode (`x_vector_only_mode=False`, provide `ref_text`)

	In-context learning mode: the model sees both the reference audio and its transcript, producing higher-fidelity voice matching. Recommended when a transcript is available.
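The choice between the two modes can be made automatically from whether a transcript is at hand. The sketch below builds the mode-related keyword arguments using the parameter names from the API reference. `clone_mode_kwargs` is a hypothetical helper, and treating a blank transcript as "no transcript" is an assumption of this sketch.

```python
# Hypothetical helper: chooses between x-vector-only and ICL mode based on
# whether a non-empty reference transcript is supplied.
def clone_mode_kwargs(ref_audio, ref_text=None):
    """Build the cloning-mode arguments for generate_voice_clone()."""
    has_transcript = bool(ref_text and ref_text.strip())
    kwargs = {
        "ref_audio": ref_audio,
        # Without a transcript, fall back to the speaker embedding alone.
        "x_vector_only_mode": not has_transcript,
    }
    if has_transcript:
        kwargs["ref_text"] = ref_text.strip()
    return kwargs


print(clone_mode_kwargs("ref_audio_kk.wav"))
# {'ref_audio': 'ref_audio_kk.wav', 'x_vector_only_mode': True}
```

With a loaded model, the result would be unpacked into the call, for example `model.generate_voice_clone(text=..., language="kazakh", non_streaming_mode=True, **clone_mode_kwargs("ref_audio_kk.wav", transcript))`.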

	## Serving

	A FastAPI server is available for production deployment:

	```bash
	pip install fastapi uvicorn python-multipart soundfile

	# Start server
	python serve_tts.py --model nur-dev/ait-syn-4L --port 8000

	# Or with uvicorn directly
	CUDA_VISIBLE_DEVICES=0 TTS_MODEL_PATH=nur-dev/ait-syn-4L uvicorn serve_tts:app --host 0.0.0.0 --port 8000
	```

	### API Endpoints

	| Endpoint | Method | Description |
	|----------|--------|-------------|
	| `/tts` | POST | Synthesize speech (returns WAV) |
	| `/tts/batch` | POST | Batch synthesis (returns ZIP of WAVs) |
	| `/health` | GET | Health check |
	| `/languages` | GET | List supported languages |

	### Example Request

	```bash
	curl -X POST http://localhost:8000/tts \
	  -F "text=Сәлеметсіз бе" \
	  -F "language=kk" \
	  -F "ref_audio=@ref_audio_kk.wav" \
	  --output output.wav
	```
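The same request can be issued from Python. This is a sketch only: it mirrors the form fields of the curl example (`text`, `language`, `ref_audio`) and assumes the response body is the raw WAV, as the endpoint table indicates. The `requests` package is a third-party dependency that the install commands above do not include, and `synthesize`, `base_url`, and the timeout value are illustrative choices rather than part of the server.

```python
from pathlib import Path


def synthesize(text, language, ref_audio_path, out_path,
               base_url="http://localhost:8000", timeout=300):
    """POST a multipart form to /tts and save the returned WAV bytes."""
    import requests  # third-party: pip install requests

    with open(ref_audio_path, "rb") as ref_file:
        response = requests.post(
            f"{base_url}/tts",
            data={"text": text, "language": language},  # same fields as the curl -F flags
            files={"ref_audio": ref_file},              # uploaded reference audio
            timeout=timeout,                            # assumed value; synthesis can be slow
        )
    response.raise_for_status()  # surface HTTP errors instead of saving an error body
    out = Path(out_path)
    out.write_bytes(response.content)
    return out


# Example (requires a running server):
# synthesize("Сәлеметсіз бе", "kk", "ref_audio_kk.wav", "output.wav")
```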

	## Technical Specs

	| Spec | Value |
	|------|-------|
	| Parameters | 1.7B |
	| Architecture | Qwen3TTSForConditionalGeneration |
	| Codec rate | 12.5 Hz (16 sub-codecs) |
	| Output sample rate | 24 kHz |
	| Precision | bf16 |
	| Max generation length | 8192 tokens (~10 min audio) |
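The relationship between the codec rate and the maximum generation length can be checked with a short calculation. The sketch assumes that one generated token corresponds to one 12.5 Hz codec frame, which is consistent with the approximate duration in the table but is not stated explicitly in this card. The function names are illustrative.

```python
import math

CODEC_RATE_HZ = 12.5     # codec frames per second of audio (specs table)
MAX_NEW_TOKENS = 8192    # maximum generation length (specs table)


def frames_to_seconds(n_frames: int, rate_hz: float = CODEC_RATE_HZ) -> float:
    """Audio duration covered by n_frames codec frames."""
    return n_frames / rate_hz


def seconds_to_frames(seconds: float, rate_hz: float = CODEC_RATE_HZ) -> int:
    """Number of codec frames needed to cover the given duration."""
    return math.ceil(seconds * rate_hz)


max_seconds = frames_to_seconds(MAX_NEW_TOKENS)
print(f"{max_seconds:.2f} s = {max_seconds / 60:.1f} min")  # 655.36 s = 10.9 min
print(seconds_to_frames(10))                                # 125
```

Under that assumption, a 10 s utterance needs about 125 frames, and the 8192-token limit corresponds to just under 11 minutes of audio.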

	## Reference Audio

	A sample Kazakh male reference audio is included as `ref_audio_kk.wav` (mono, 24 kHz, ~10 s).
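To compare your own reference clip against the bundled sample, its basic properties can be read with the standard-library `wave` module. This is a sketch under assumptions: the card does not say that mono or 24 kHz input is required, so the checks below are soft hints. The duration bounds come from the ~5–10 s guidance in the feature list. `describe_wav` and `check_reference` are hypothetical helpers, and `wave` reads only uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Read channel count, sample rate, and duration from a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        duration_s = wf.getnframes() / sample_rate
    return {"channels": channels, "sample_rate": sample_rate, "duration_s": duration_s}


def check_reference(path, min_s=5.0, max_s=10.0):
    """Return soft warnings; an empty list means the clip resembles the bundled sample."""
    info = describe_wav(path)
    warnings = []
    if info["channels"] != 1:
        warnings.append(f"{info['channels']} channels (bundled sample is mono)")
    if info["sample_rate"] != 24000:
        warnings.append(f"{info['sample_rate']} Hz (bundled sample is 24 kHz)")
    if not min_s <= info["duration_s"] <= max_s:
        warnings.append(f"{info['duration_s']:.1f} s (suggested range is {min_s}-{max_s} s)")
    return warnings


# Example:
# print(check_reference("ref_audio_kk.wav"))
```

The thresholds are illustrative. A clip slightly longer than 10 s, such as the bundled sample may be, would only trigger a warning.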

	## License

	CC-BY-NC-4.0