AIT Piper Multilingual

Fast, offline multilingual text-to-speech for Kazakh, Russian, and English. The model runs through Piper/ONNX, contains 13 fixed voices, and supports incremental audio generation for real-time applications.

Model Characteristics

Property	Value
Languages	Kazakh (`kk`), Russian (`ru`), English (`en`)
Architecture	Piper/VITS
Runtime	ONNX Runtime
Sample rate	22,050 Hz
Audio	Mono PCM
Speakers	13 fixed voices
Voice cloning	Not supported
Streaming	Supported
Model size	Approximately 74 MB
Phonemizers	eSpeak NG: `kk`, `ru`, `en-us`

The bundled configuration uses the recommended preset, which trades a slightly slower speaking rate for more stable synthesis:

length_scale = 1.12
noise_scale = 0.55
noise_w_scale = 0.45

Voices

ID	Language	Voice
0	Kazakh	`kk_F1`
1	Kazakh	`kk_F2`
2	Kazakh	`kk_F3`
3	Kazakh	`kk_M2`
4	Kazakh	`kk_emo_1263201035`
5	Kazakh	`kk_emo_399172782`
6	Kazakh	`kk_emo_805570882`
7	Russian	`ru_ruls_13587`
8	Russian	`ru_ruls_295`
9	Russian	`ru_ruls_8086`
10	Russian	`ru_ruls_8169`
11	Russian	`ru_ruls_9014`
12	English	`ljspeech_F1`

For Russian production use, start with ru_ruls_8086 (ID 9) or ru_ruls_8169 (ID 10).
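
The voice table above can be mirrored in application code so that a voice name resolves to its speaker ID and is checked against the requested language. This is a minimal sketch: the names VOICES and resolve_voice are illustrative and are not part of the repository.

```python
# Voice table from this model card: name -> (speaker ID, language code).
VOICES = {
    "kk_F1": (0, "kk"),
    "kk_F2": (1, "kk"),
    "kk_F3": (2, "kk"),
    "kk_M2": (3, "kk"),
    "kk_emo_1263201035": (4, "kk"),
    "kk_emo_399172782": (5, "kk"),
    "kk_emo_805570882": (6, "kk"),
    "ru_ruls_13587": (7, "ru"),
    "ru_ruls_295": (8, "ru"),
    "ru_ruls_8086": (9, "ru"),
    "ru_ruls_8169": (10, "ru"),
    "ru_ruls_9014": (11, "ru"),
    "ljspeech_F1": (12, "en"),
}


def resolve_voice(name: str, language: str) -> int:
    """Return the speaker ID for `name`, checking it belongs to `language`."""
    try:
        speaker_id, voice_language = VOICES[name]
    except KeyError:
        raise ValueError(f"unknown voice: {name!r}") from None
    if voice_language != language:
        raise ValueError(
            f"voice {name!r} is a {voice_language!r} voice, not {language!r}"
        )
    return speaker_id
```

Rejecting a mismatched pair early avoids sending text through the wrong phonemizer for the chosen voice.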

Download And Install

pip install "piper-tts>=1.4,<2" onnxruntime huggingface_hub

huggingface-cli download nur-dev/ait-piper-multilingual \
  --local-dir ./ait-piper-multilingual

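Authentication is required because access to this repository is gated: accept the conditions on the model page, then log in with huggingface-cli login before downloading. Keep the .onnx and .onnx.json files together with the same base filename.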
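
The file-pairing requirement can be checked after download. The sketch below uses only the standard library; the function name missing_configs is hypothetical, and it assumes the model files sit directly inside the download directory.

```python
from pathlib import Path


def missing_configs(model_dir: str) -> list[str]:
    """Return names of .onnx files that lack a sibling .onnx.json file."""
    missing = []
    for onnx_path in sorted(Path(model_dir).glob("*.onnx")):
        # model.onnx must be accompanied by model.onnx.json
        config_path = onnx_path.with_name(onnx_path.name + ".json")
        if not config_path.is_file():
            missing.append(onnx_path.name)
    return missing
```

An empty result for ./ait-piper-multilingual means every model file has its configuration next to it.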

Command-Line Inference

Kazakh:

echo "Бүгін Алматыда күн ашық, ауа райы жылы болады." | piper \
  --model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
  --language kk --speaker 3 --output-file kk.wav

Russian:

echo "Сегодня в Алматы солнечно и тепло." | piper \
  --model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
  --language ru --speaker 9 --output-file ru.wav

English:

echo "Real-time speech synthesis is ready." | piper \
  --model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
  --language en --speaker 12 --output-file en.wav

Always pass the correct language. The language selects the phonemizer; the speaker ID selects the voice.

Python Inference

import wave

from piper import PiperVoice, SynthesisConfig

model = "./ait-piper-multilingual/ait-piper-multilingual-medium.onnx"
voice = PiperVoice.load(model)

config = SynthesisConfig(
    speaker_id=3,
    language="kk",
    length_scale=1.12,
    noise_scale=0.55,
    noise_w_scale=0.45,
)

with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize_wav(
        "Қазақ тіліндегі дыбыстау жүйесі жұмыс істеп тұр.",
        wav_file,
        syn_config=config,
    )

Load PiperVoice once at application startup and reuse it for requests.
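
One way to guarantee a single load is to cache the loader. This is a sketch rather than the repository's implementation: get_voice and the loader parameter are hypothetical names, and the default loader assumes PiperVoice.load behaves as in the example above.

```python
from functools import lru_cache


def _load_piper_voice(model_path: str):
    # Imported lazily so this module can be imported without piper installed.
    from piper import PiperVoice

    return PiperVoice.load(model_path)


@lru_cache(maxsize=None)
def get_voice(model_path: str, loader=_load_piper_voice):
    """Load the model on first use and return the cached object afterwards."""
    return loader(model_path)
```

Request handlers can then call get_voice with the same model path on every request and pay the loading cost only once per process.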

Streaming Inference

The synthesize method yields signed 16-bit mono PCM chunks at 22,050 Hz:

from piper import PiperVoice, SynthesisConfig

voice = PiperVoice.load(
    "./ait-piper-multilingual/ait-piper-multilingual-medium.onnx"
)
config = SynthesisConfig(speaker_id=9, language="ru")

for chunk in voice.synthesize(
    "Потоковый синтез речи готов к работе.",
    syn_config=config,
):
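    # send_audio is a placeholder for the application's own transport
    # (for example a WebSocket, a queue, or an audio device).
    send_audio(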
        chunk.audio_int16_bytes,
        sample_rate=chunk.sample_rate,
        sample_width=chunk.sample_width,
        channels=chunk.sample_channels,
    )

Chunk boundaries follow sentence boundaries. Split long input into complete sentences for lower first-audio latency.
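
Sentence splitting can be approximated with a regular expression. The sketch below is deliberately naive and the function name split_sentences is illustrative; it splits after terminal punctuation followed by whitespace, so abbreviations such as "Dr. Smith" would be split incorrectly.

```python
import re

# Split on whitespace that follows ., !, ? or the ellipsis character.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the closing punctuation."""
    parts = _SENTENCE_END.split(text.strip())
    return [part for part in parts if part]
```

Each returned sentence can then be passed to synthesize in order, so audio for the first sentence is available before the rest of the paragraph has been processed.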

HTTP Server

The repository includes serve.py, a minimal FastAPI service with WAV and streaming PCM endpoints.

cd ait-piper-multilingual
pip install -r requirements.txt
uvicorn serve:app --host 0.0.0.0 --port 8000 --workers 1

Generate a WAV file:

curl -s http://localhost:8000/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Бүгін Алматыда ауа райы жылы.",
    "language": "kk",
    "speaker": "kk_M2"
  }' \
  --output output.wav
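
The same request can be sent from Python with the standard library. This is a sketch under the assumption that the endpoint accepts exactly the JSON fields shown in the curl command; build_speech_request and save_speech are hypothetical helper names.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    language: str,
    speaker: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST request for the /v1/audio/speech endpoint."""
    payload = {"text": text, "language": language, "speaker": speaker}
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the WAV response body to `path`."""
    with urllib.request.urlopen(request, timeout=60) as response:
        with open(path, "wb") as out:
            out.write(response.read())
```

For example, save_speech(build_speech_request("Бүгін Алматыда ауа райы жылы.", "kk", "kk_M2"), "output.wav") mirrors the curl call when the server is running.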

Stream raw signed 16-bit little-endian PCM:

curl -s http://localhost:8000/v1/audio/stream \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Real-time streaming speech is ready.",
    "language": "en",
    "speaker": "ljspeech_F1"
  }' | ffplay -f s16le -ar 22050 -ac 1 -
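
Because the stream carries headerless PCM, a client that wants a playable file has to add the WAV header itself. The sketch below wraps received chunks using the format stated in this model card (22,050 Hz, mono, signed 16-bit); pcm_chunks_to_wav is a hypothetical helper name.

```python
import wave

SAMPLE_RATE = 22050  # Hz, from the model card
SAMPLE_WIDTH = 2     # bytes per sample (signed 16-bit)
CHANNELS = 1         # mono


def pcm_chunks_to_wav(chunks, path: str) -> float:
    """Write raw PCM chunks to a WAV file and return its duration in seconds."""
    total_bytes = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav_file.writeframes(chunk)
            total_bytes += len(chunk)
    return total_bytes / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
```

The chunks argument can be any iterable of bytes objects, such as blocks read from the HTTP response.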

Use one server worker per loaded model instance. Scale with multiple service processes or containers when concurrent throughput is required.

Best Practices

Pass language explicitly for every request.
Use a speaker assigned to the selected language.
Preserve punctuation to improve pauses and phrasing.
Write numbers, dates, currencies, and abbreviations as words when exact pronunciation matters.
Use proper Kazakh letters such as ә, ғ, қ, ң, ө, ұ, ү, һ, and і.
Split long paragraphs into complete sentences and stream them in order.
Reuse the loaded ONNX session; do not reload the model per request.
Synthesize at 22,050 Hz and resample afterward if another rate is required.
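
For the last point, resampling can be done on the PCM bytes after synthesis. The sketch below uses plain linear interpolation in pure Python to show the idea; it applies no anti-aliasing filter, so a dedicated resampling library is likely a better choice in production, especially when downsampling. The function name resample_linear is illustrative.

```python
import sys
from array import array


def resample_linear(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample signed 16-bit little-endian mono PCM by linear interpolation."""
    samples = array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # PCM is little-endian; array uses native order
    if not samples or src_rate == dst_rate:
        return pcm
    out_len = max(1, len(samples) * dst_rate // src_rate)
    out = array("h")
    last = len(samples) - 1
    for i in range(out_len):
        position = i * src_rate / dst_rate  # fractional source index
        left = min(int(position), last)
        right = min(left + 1, last)
        frac = position - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```

For example, resample_linear(pcm, 22050, 44100) doubles the number of samples while leaving the duration unchanged.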

Scope And Limitations

The model provides fixed speaker identities and does not clone arbitrary voices. Mixed-language text inside one request is not recommended; synthesize each language segment with its matching language and speaker. Pronunciation of unusual names, abbreviations, and ambiguous Russian word stress may require text normalization or explicit phonetic input.
