AIT-Syn v7: multilingual TTS (KK/RU/EN)

	---
	license: cc-by-nc-4.0
	language:
	- kk
	- ru
	- en
	tags:
	- text-to-speech
	- tts
	- voice-cloning
	- qwen3-tts
	- kazakh
	- multilingual
	library_name: qwen-tts
	pipeline_tag: text-to-speech
	base_model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
	---

	# AIT-Syn — Multilingual Text-to-Speech with Voice Cloning

	AIT-Syn is a multilingual text-to-speech model that supports Kazakh, Russian, and English and can clone a voice from a short reference recording. It is built on the Qwen3-TTS architecture and fine-tuned from `Qwen/Qwen3-TTS-12Hz-1.7B-Base`.

	## Supported Languages

	| Language | `language` argument |
	|----------|---------------------|
	| Kazakh | `kazakh` |
	| Russian | `russian` |
	| English | `english` |
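
Because the model expects the full language names listed above rather than ISO codes, a small lookup can catch a stray `kk`, `ru`, or `en` before it reaches the model. This is only an illustrative sketch: `ISO_TO_NAME` and `resolve_language` are hypothetical names and are not part of `qwen-tts`.

```python
# Hypothetical helper, not part of qwen-tts: map the ISO 639-1 codes from the
# model card metadata to the full language names the model expects.
ISO_TO_NAME = {"kk": "kazakh", "ru": "russian", "en": "english"}
SUPPORTED = set(ISO_TO_NAME.values())


def resolve_language(value: str) -> str:
    """Return a supported full language name, accepting ISO codes as well."""
    key = value.strip().lower()
    name = ISO_TO_NAME.get(key, key)
    if name not in SUPPORTED:
        raise ValueError(
            f"Unsupported language {value!r}; expected one of {sorted(SUPPORTED)}"
        )
    return name
```

The returned string can then be passed as the `language` argument in the usage examples.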

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base model | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` |
	| Parameters | 1.7B |
	| Output sample rate | 24 kHz |

	## Installation

	```bash
	pip install qwen-tts torch soundfile
	# Optional: faster attention
	pip install flash-attn
	```

	## Usage

	### Voice Cloning with Transcript (Recommended)

	Providing the transcript of the reference audio gives the best voice matching quality:

	```python
	import torch
	import soundfile as sf
	from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

	# Use FlashAttention 2 when the optional flash-attn package is installed.
	try:
	    import flash_attn  # noqa: F401
	    attn_impl = "flash_attention_2"
	except ImportError:
	    attn_impl = "eager"

	# Load the fine-tuned checkpoint in bfloat16 on the first GPU.
	model = Qwen3TTSModel.from_pretrained(
	    "nur-dev/ait-syn",
	    dtype=torch.bfloat16,
	    attn_implementation=attn_impl,
	    device_map="cuda:0",
	)
	model.model.eval()

	# Kazakh example: ICL mode, conditioned on the reference audio and its transcript.
	wavs, sr = model.generate_voice_clone(
	    text="Сәлеметсіз бе, бұл сынақ сөйлемі.",
	    ref_audio="reference.wav",
	    ref_text="Transcript of the reference audio.",
	    language="kazakh",
	    x_vector_only_mode=False,
	    non_streaming_mode=True,
	    temperature=0.9,
	    top_k=50,
	    do_sample=True,
	)
	# Save the first generated waveform as 16-bit PCM.
	sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
	```

	### Voice Cloning without Transcript

	If you only have the reference audio (no transcript):

	```python
	# Reuses `model` and `sf` from the first example.
	wavs, sr = model.generate_voice_clone(
	    text="Hello, this is a test sentence.",
	    ref_audio="reference.wav",
	    language="english",
	    x_vector_only_mode=True,
	    non_streaming_mode=True,
	)
	sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
	```

	### Russian Example

	```python
	# Reuses `model` and `sf` from the first example; x-vector-only mode, no transcript.
	wavs, sr = model.generate_voice_clone(
	    text="Добрый день! Это тестовое предложение на русском языке.",
	    ref_audio="reference.wav",
	    language="russian",
	    x_vector_only_mode=True,
	    non_streaming_mode=True,
	)
	sf.write("output.wav", wavs[0], sr, subtype="PCM_16")
	```
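
The calls above differ mainly in whether `ref_text` is supplied and how `x_vector_only_mode` is set. A minimal sketch of a wrapper that picks the mode from the presence of a transcript follows; `clone_kwargs` is a hypothetical helper, and the keyword names are copied from the examples above rather than from the library's reference documentation.

```python
from typing import Any, Dict, Optional


def clone_kwargs(text: str, ref_audio: str, language: str,
                 ref_text: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for generate_voice_clone (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "text": text,
        "ref_audio": ref_audio,
        "language": language,
        "non_streaming_mode": True,
    }
    if ref_text and ref_text.strip():
        # Transcript available: ICL mode, conditioned on audio plus transcript.
        kwargs["ref_text"] = ref_text
        kwargs["x_vector_only_mode"] = False
    else:
        # No transcript: fall back to x-vector-only mode.
        kwargs["x_vector_only_mode"] = True
    return kwargs
```

With a model loaded as in the first example, the result could be unpacked as `model.generate_voice_clone(**clone_kwargs(...))`.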

	## Generation Parameters

	| Parameter | Default | Description |
	|-----------|---------|-------------|
	| `temperature` | 0.9 | Sampling temperature; lower is more stable, higher is more expressive |
	| `top_k` | 50 | Sample only from the `k` most likely tokens |
	| `top_p` | 1.0 | Nucleus sampling threshold; 1.0 disables it |
	| `repetition_penalty` | 1.0 | Penalty on repeated tokens; 1.0 disables it |
	| `do_sample` | `True` | Sample (`True`) or decode greedily (`False`) |
	| `non_streaming_mode` | `True` | Generate the full audio before returning |
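
To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a generic pure-Python illustration of how these three controls reshape a token distribution before sampling. It is a sketch of the standard technique, not the `qwen-tts` implementation, and `filter_distribution` is a hypothetical name.

```python
import math


def filter_distribution(logits, temperature=0.9, top_k=50, top_p=1.0):
    """Return {token_index: probability} that a sampler would draw from."""
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely token indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]
    kept_mass = sum(probs[i] for i in kept)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(probs[i] for i in nucleus)
    return {i: probs[i] / norm for i in nucleus}
```

Lowering `temperature` concentrates probability on the top token, which matches the "more stable" behavior described in the table, while small `top_k` or `top_p` values remove unlikely tokens entirely.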

	## Tips

	- Output audio is 24 kHz mono
	- Reference audio should be clean speech, 5–15 seconds
	- Use full language names: `"kazakh"`, `"russian"`, `"english"` (not ISO codes)
	- ICL mode (`x_vector_only_mode=False` with `ref_text`) gives better voice matching than x-vector-only mode
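
The 5–15 second guideline for reference audio can be checked before synthesis. The sketch below uses only the standard-library `wave` module, so it assumes the reference is an uncompressed PCM WAV file; `reference_duration_ok` is a hypothetical helper, not part of `qwen-tts`, and it does not assess how clean the speech is.

```python
import wave


def reference_duration_ok(path: str, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """Check that a PCM WAV file lasts between min_s and max_s seconds."""
    with wave.open(path, "rb") as wav:
        # Duration in seconds = number of frames / frames per second.
        duration = wav.getnframes() / wav.getframerate()
    return min_s <= duration <= max_s
```

For example, `reference_duration_ok("reference.wav")` returns `False` for a two-second clip, which is a cue to record a longer sample.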

	## License

	This model is released under CC BY-NC 4.0 (non-commercial use only).

	## Commercial Use

	For commercial licensing, please contact: nurgaliqadyrbek@gmail.com