AIT Piper Multilingual
Fast, offline multilingual text-to-speech for Kazakh, Russian, and English. The model runs through Piper/ONNX, contains 13 fixed voices, and supports incremental audio generation for real-time applications.
Model Characteristics
| Property | Value |
|---|---|
| Languages | Kazakh (kk), Russian (ru), English (en) |
| Architecture | Piper/VITS |
| Runtime | ONNX Runtime |
| Sample rate | 22,050 Hz |
| Audio | Mono PCM |
| Speakers | 13 fixed voices |
| Voice cloning | Not supported |
| Streaming | Supported |
| Model size | Approximately 74 MB |
| Phonemizers | eSpeak NG: kk, ru, en-us |
The bundled configuration uses the recommended synthesis preset, which trades some speed for more stable output:
length_scale = 1.12
noise_scale = 0.55
noise_w_scale = 0.45
Voices
| ID | Language | Voice |
|---|---|---|
| 0 | Kazakh | kk_F1 |
| 1 | Kazakh | kk_F2 |
| 2 | Kazakh | kk_F3 |
| 3 | Kazakh | kk_M2 |
| 4 | Kazakh | kk_emo_1263201035 |
| 5 | Kazakh | kk_emo_399172782 |
| 6 | Kazakh | kk_emo_805570882 |
| 7 | Russian | ru_ruls_13587 |
| 8 | Russian | ru_ruls_295 |
| 9 | Russian | ru_ruls_8086 |
| 10 | Russian | ru_ruls_8169 |
| 11 | Russian | ru_ruls_9014 |
| 12 | English | ljspeech_F1 |
For Russian production use, start with ru_ruls_8086 or ru_ruls_8169.
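Because the command-line and Python interfaces take a numeric speaker ID while the table lists voice names, a small lookup helps avoid mismatched language and voice pairs. The sketch below copies the table above; the names `SPEAKERS` and `resolve_speaker` are illustrative and are not part of Piper or this repository.

```python
# Speaker table copied from the Voices section: name -> (ID, language code).
SPEAKERS = {
    "kk_F1": (0, "kk"),
    "kk_F2": (1, "kk"),
    "kk_F3": (2, "kk"),
    "kk_M2": (3, "kk"),
    "kk_emo_1263201035": (4, "kk"),
    "kk_emo_399172782": (5, "kk"),
    "kk_emo_805570882": (6, "kk"),
    "ru_ruls_13587": (7, "ru"),
    "ru_ruls_295": (8, "ru"),
    "ru_ruls_8086": (9, "ru"),
    "ru_ruls_8169": (10, "ru"),
    "ru_ruls_9014": (11, "ru"),
    "ljspeech_F1": (12, "en"),
}


def resolve_speaker(name: str, language: str) -> int:
    """Return the numeric speaker ID, rejecting language/voice mismatches."""
    try:
        speaker_id, speaker_language = SPEAKERS[name]
    except KeyError:
        raise ValueError(f"unknown speaker: {name!r}") from None
    if speaker_language != language:
        raise ValueError(
            f"speaker {name!r} is a {speaker_language!r} voice, not {language!r}"
        )
    return speaker_id


print(resolve_speaker("ru_ruls_8086", "ru"))  # 9
```

The returned ID can be passed as the speaker argument in the examples that follow.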
Download And Install
pip install "piper-tts>=1.4,<2" onnxruntime huggingface_hub
huggingface-cli download nur-dev/ait-piper-multilingual \
--local-dir ./ait-piper-multilingual
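Authentication is required because this repository is private: log in first with huggingface-cli login or set the HF_TOKEN environment variable. Keep the .onnx model and its .onnx.json configuration in the same directory with the same base filename.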
Command-Line Inference
Kazakh:
echo "Бүгін Алматыда күн ашық, ауа райы жылы болады." | piper \
--model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
--language kk --speaker 3 --output-file kk.wav
Russian:
echo "Сегодня в Алматы солнечно и тепло." | piper \
--model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
--language ru --speaker 9 --output-file ru.wav
English:
echo "Real-time speech synthesis is ready." | piper \
--model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
--language en --speaker 12 --output-file en.wav
Always pass the correct language. The language selects the phonemizer; the speaker ID selects the voice.
Python Inference
import wave

from piper import PiperVoice, SynthesisConfig

model = "./ait-piper-multilingual/ait-piper-multilingual-medium.onnx"
voice = PiperVoice.load(model)  # loads the ONNX model and its .onnx.json config

# Kazakh voice kk_M2 (ID 3) with the recommended preset.
config = SynthesisConfig(
    speaker_id=3,
    language="kk",
    length_scale=1.12,
    noise_scale=0.55,
    noise_w_scale=0.45,
)

# Write the synthesized speech to a WAV file.
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize_wav(
        "Қазақ тіліндегі дыбыстау жүйесі жұмыс істеп тұр.",
        wav_file,
        syn_config=config,
    )
Load PiperVoice once at application startup and reuse it for requests.
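One way to follow this advice is a cached accessor, sketched below. The name `get_voice` is hypothetical; `PiperVoice.load` is the same call used in the example above. The import is deferred so that the module can be imported before the model is needed.

```python
from functools import lru_cache

MODEL_PATH = "./ait-piper-multilingual/ait-piper-multilingual-medium.onnx"


@lru_cache(maxsize=None)
def get_voice():
    """Load the model on first call and return the same object afterwards."""
    from piper import PiperVoice  # deferred import; requires piper-tts

    return PiperVoice.load(MODEL_PATH)
```

Calling `get_voice()` once during startup moves the load cost out of the first request. Note that `lru_cache` does not prevent two threads from both loading the model if they race on the very first call, which is another reason to warm it up at startup.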
Streaming Inference
synthesize yields signed 16-bit mono PCM chunks at 22,050 Hz:
from piper import PiperVoice, SynthesisConfig

voice = PiperVoice.load(
    "./ait-piper-multilingual/ait-piper-multilingual-medium.onnx"
)
config = SynthesisConfig(speaker_id=9, language="ru")

# send_audio is a placeholder for the application's own audio sink
# (socket, audio device, message queue); Piper does not provide it.
for chunk in voice.synthesize(
    "Потоковый синтез речи готов к работе.",
    syn_config=config,
):
    send_audio(
        chunk.audio_int16_bytes,
        sample_rate=chunk.sample_rate,
        sample_width=chunk.sample_width,
        channels=chunk.sample_channels,
    )
Chunk boundaries follow sentence boundaries. Split long input into complete sentences for lower first-audio latency.
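A minimal sentence splitter for this purpose might look like the sketch below. It is a naive regex heuristic rather than anything Piper provides, the name `split_sentences` is illustrative, and it will split incorrectly after abbreviations that end in a period.

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the closing punctuation."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


for sentence in split_sentences("Сегодня солнечно. Завтра будет дождь! Вы готовы?"):
    print(sentence)
```

Each sentence can then be synthesized in order, so playback of the first one can begin while the rest are still being generated.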
HTTP Server
The repository includes serve.py, a minimal FastAPI service
with WAV and streaming PCM endpoints.
cd ait-piper-multilingual
pip install -r requirements.txt
uvicorn serve:app --host 0.0.0.0 --port 8000 --workers 1
Generate a WAV file:
curl -s http://localhost:8000/v1/audio/speech \
-H 'Content-Type: application/json' \
-d '{
"text": "Бүгін Алматыда ауа райы жылы.",
"language": "kk",
"speaker": "kk_M2"
}' \
--output output.wav
Stream raw signed 16-bit little-endian PCM:
curl -s http://localhost:8000/v1/audio/stream \
-H 'Content-Type: application/json' \
-d '{
"text": "Real-time streaming speech is ready.",
"language": "en",
"speaker": "ljspeech_F1"
}' | ffplay -f s16le -ar 22050 -ac 1 -
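A client that saves the raw stream rather than playing it has to add a container itself, because raw PCM carries no header. The sketch below wraps collected bytes in a WAV container using only the Python standard library, assuming the format stated above (signed 16-bit little-endian, mono, 22,050 Hz). The function names are illustrative.

```python
import io
import wave

SAMPLE_RATE = 22050  # Hz, as listed in Model Characteristics
SAMPLE_WIDTH = 2     # bytes per sample (signed 16-bit)
CHANNELS = 1         # mono


def pcm_to_wav(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw s16le mono PCM bytes in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


def pcm_duration_seconds(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration of a raw s16le mono PCM buffer in seconds."""
    return len(pcm) / (SAMPLE_WIDTH * CHANNELS * sample_rate)


one_second_of_silence = b"\x00\x00" * SAMPLE_RATE
print(pcm_duration_seconds(one_second_of_silence))  # 1.0
```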
Use one server worker per loaded model instance. Scale with multiple service processes or containers when concurrent throughput is required.
Best Practices
- Pass language explicitly for every request.
- Use a speaker assigned to the selected language.
- Preserve punctuation to improve pauses and phrasing.
- Write numbers, dates, currencies, and abbreviations as words when exact pronunciation matters.
- Use proper Kazakh letters such as ә, ғ, қ, ң, ө, ұ, ү, һ, and і.
- Split long paragraphs into complete sentences and stream them in order.
- Reuse the loaded ONNX session; do not reload the model per request.
- Synthesize at 22,050 Hz and resample afterward if another rate is required.
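To illustrate the last point, the sketch below resamples signed 16-bit mono PCM after synthesis using plain linear interpolation and only the standard library. This is a deliberately simple stand-in: linear interpolation applies no anti-aliasing filter, so a dedicated resampling library is a better choice when audio quality matters. The name `resample_linear` is illustrative.

```python
import sys
from array import array


def resample_linear(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample s16le mono PCM by linear interpolation (no anti-aliasing)."""
    if src_rate == dst_rate or not pcm:
        return pcm
    samples = array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian
    last = len(samples) - 1
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate
    out = array("h")
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        value = samples[left] * (1.0 - frac) + samples[right] * frac
        out.append(int(round(value)))
    if sys.byteorder == "big":
        out.byteswap()  # back to little-endian
    return out.tobytes()


# One second at 22,050 Hz becomes one second at 16,000 Hz.
source = array("h", [1000] * 22050).tobytes()
print(len(resample_linear(source, 22050, 16000)) // 2)  # 16000
```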
Scope And Limitations
The model provides fixed speaker identities and does not clone arbitrary voices. Mixed-language text inside one request is not recommended; synthesize each language segment with its matching language and speaker. Pronunciation of unusual names, abbreviations, and ambiguous Russian word stress may require text normalization or explicit phonetic input.
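For input that mixes languages, one rough approach is to split the text into runs by script and synthesize each run separately. The sketch below is a heuristic under stated assumptions, not part of the model or of Piper: it separates Latin from Cyrillic words, and it guesses Kazakh only when a run contains letters specific to Kazakh, so Kazakh text without those letters would be labeled Russian. When the language is known from the application, pass it explicitly instead.

```python
KAZAKH_ONLY = set("әғқңөұүһі")


def script_of(word: str):
    """Return 'latin', 'cyrillic', or None for digits and punctuation."""
    for ch in word:
        if "a" <= ch.lower() <= "z":
            return "latin"
        if "\u0400" <= ch <= "\u04ff":
            return "cyrillic"
    return None


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Group consecutive words of the same script into runs."""
    runs: list[list] = []
    for word in text.split():
        script = script_of(word)
        if runs and (script is None or runs[-1][0] in (None, script)):
            if runs[-1][0] is None:
                runs[-1][0] = script  # a neutral run adopts the first real script
            runs[-1][1].append(word)
        else:
            runs.append([script, [word]])
    return [(script or "unknown", " ".join(words)) for script, words in runs]


def guess_language(script: str, segment: str) -> str:
    """Heuristic language code for a run; see the caveats above."""
    if script == "latin":
        return "en"
    if any(ch in KAZAKH_ONLY for ch in segment.lower()):
        return "kk"
    return "ru"


for script, segment in split_by_script("Сегодня релиз Piper TTS версии два"):
    print(guess_language(script, segment), segment)
```

Each run can then be synthesized with a speaker that belongs to the guessed language, and the resulting audio concatenated in order.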