AIT Piper Multilingual

Fast, offline multilingual text-to-speech for Kazakh, Russian, and English. The model runs through Piper/ONNX, contains 13 fixed voices, and supports incremental audio generation for real-time applications.

Model Characteristics

| Property | Value |
| --- | --- |
| Languages | Kazakh (kk), Russian (ru), English (en) |
| Architecture | Piper/VITS |
| Runtime | ONNX Runtime |
| Sample rate | 22,050 Hz |
| Audio | Mono PCM |
| Speakers | 13 fixed voices |
| Voice cloning | Not supported |
| Streaming | Supported |
| Model size | Approximately 74 MB |
| Phonemizers | eSpeak NG: kk, ru, en-us |

The bundled configuration uses the recommended synthesis preset, which produces moderately slower and more stable speech:

length_scale = 1.12
noise_scale = 0.55
noise_w_scale = 0.45

Voices

| ID | Language | Voice |
| --- | --- | --- |
| 0 | Kazakh | kk_F1 |
| 1 | Kazakh | kk_F2 |
| 2 | Kazakh | kk_F3 |
| 3 | Kazakh | kk_M2 |
| 4 | Kazakh | kk_emo_1263201035 |
| 5 | Kazakh | kk_emo_399172782 |
| 6 | Kazakh | kk_emo_805570882 |
| 7 | Russian | ru_ruls_13587 |
| 8 | Russian | ru_ruls_295 |
| 9 | Russian | ru_ruls_8086 |
| 10 | Russian | ru_ruls_8169 |
| 11 | Russian | ru_ruls_9014 |
| 12 | English | ljspeech_F1 |

For Russian production use, start with ru_ruls_8086 or ru_ruls_8169.
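Because the speaker ID and the language are passed separately, it can help to resolve voice names through a single lookup that also rejects mismatched pairs. The sketch below copies the voice table from this card; `VOICES` and `resolve_voice` are hypothetical helper names for illustration and are not part of Piper or of this repository.

```python
# Voice table copied from this model card: name -> (speaker ID, language code).
VOICES = {
    "kk_F1": (0, "kk"),
    "kk_F2": (1, "kk"),
    "kk_F3": (2, "kk"),
    "kk_M2": (3, "kk"),
    "kk_emo_1263201035": (4, "kk"),
    "kk_emo_399172782": (5, "kk"),
    "kk_emo_805570882": (6, "kk"),
    "ru_ruls_13587": (7, "ru"),
    "ru_ruls_295": (8, "ru"),
    "ru_ruls_8086": (9, "ru"),
    "ru_ruls_8169": (10, "ru"),
    "ru_ruls_9014": (11, "ru"),
    "ljspeech_F1": (12, "en"),
}


def resolve_voice(name: str, language: str) -> int:
    """Return the speaker ID for `name`, checking that it belongs to `language`."""
    try:
        speaker_id, voice_language = VOICES[name]
    except KeyError:
        raise ValueError(f"Unknown voice: {name!r}") from None
    if voice_language != language:
        raise ValueError(
            f"Voice {name!r} is a {voice_language!r} voice, not {language!r}"
        )
    return speaker_id
```

For example, `resolve_voice("ru_ruls_8086", "ru")` returns the ID to pass as the speaker, while `resolve_voice("kk_M2", "ru")` raises a `ValueError` instead of silently producing Kazakh-voiced Russian text.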

Download And Install

pip install "piper-tts>=1.4,<2" onnxruntime huggingface_hub

huggingface-cli download nur-dev/ait-piper-multilingual \
  --local-dir ./ait-piper-multilingual

Authentication is required because access to this repository is gated: log in to Hugging Face and accept the access conditions before downloading. Keep the .onnx and .onnx.json files together with the same base filename.
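A quick way to confirm that the download kept each model next to its configuration is to scan the directory for .onnx files without a matching .onnx.json. This is a minimal sketch using only the standard library; `missing_configs` is a hypothetical helper name, and the directory path is whatever you passed to `--local-dir`.

```python
from pathlib import Path


def missing_configs(model_dir: str) -> list[str]:
    """Return the names of .onnx files that lack a sibling .onnx.json file."""
    missing = []
    for onnx_path in sorted(Path(model_dir).glob("*.onnx")):
        # The config shares the full model filename plus a ".json" suffix.
        config_path = onnx_path.with_name(onnx_path.name + ".json")
        if not config_path.is_file():
            missing.append(onnx_path.name)
    return missing
```

Calling `missing_configs("./ait-piper-multilingual")` should return an empty list when every model file has its configuration alongside it.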

Command-Line Inference

Kazakh:

echo "Бүгін Алматыда күн ашық, ауа райы жылы болады." | piper \
  --model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
  --language kk --speaker 3 --output-file kk.wav

Russian:

echo "Сегодня в Алматы солнечно и тепло." | piper \
  --model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
  --language ru --speaker 9 --output-file ru.wav

English:

echo "Real-time speech synthesis is ready." | piper \
  --model ./ait-piper-multilingual/ait-piper-multilingual-medium.onnx \
  --language en --speaker 12 --output-file en.wav

Always pass the correct language. The language selects the phonemizer; the speaker ID selects the voice.

Python Inference

import wave

from piper import PiperVoice, SynthesisConfig

model = "./ait-piper-multilingual/ait-piper-multilingual-medium.onnx"
voice = PiperVoice.load(model)

config = SynthesisConfig(
    speaker_id=3,
    language="kk",
    length_scale=1.12,
    noise_scale=0.55,
    noise_w_scale=0.45,
)

with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize_wav(
        "Қазақ тіліндегі дыбыстау жүйесі жұмыс істеп тұр.",
        wav_file,
        syn_config=config,
    )

Load PiperVoice once at application startup and reuse it for requests.
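One way to enforce this is to put the load behind a cache so that repeated calls with the same path return the same object. The sketch below is an assumption about how an application might be structured, not code from this repository; `get_voice` and its `loader` parameter are hypothetical names, and the injectable loader exists only so the caching can be exercised without Piper installed.

```python
from functools import lru_cache


def _load_voice(model_path: str):
    # Imported lazily so this module can be imported without piper installed.
    from piper import PiperVoice

    return PiperVoice.load(model_path)


@lru_cache(maxsize=None)
def get_voice(model_path: str, loader=_load_voice):
    """Load the model on the first call, then return the cached instance."""
    return loader(model_path)
```

Request handlers can then call `get_voice(model)` freely: only the first call pays the model loading cost.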

Streaming Inference

voice.synthesize yields audio chunks, each carrying signed 16-bit mono PCM at 22,050 Hz:

from piper import PiperVoice, SynthesisConfig

voice = PiperVoice.load(
    "./ait-piper-multilingual/ait-piper-multilingual-medium.onnx"
)
config = SynthesisConfig(speaker_id=9, language="ru")

# send_audio is a placeholder for your own transport
# (WebSocket, socket, audio device, and so on).
for chunk in voice.synthesize(
    "Потоковый синтез речи готов к работе.",
    syn_config=config,
):
    send_audio(
        chunk.audio_int16_bytes,
        sample_rate=chunk.sample_rate,
        sample_width=chunk.sample_width,
        channels=chunk.sample_channels,
    )

Chunk boundaries follow sentence boundaries. Split long input into complete sentences for lower first-audio latency.
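Splitting the input before synthesis can be sketched with a small regular expression. This is a deliberately naive splitter, assuming that sentences end with ., !, ?, or … followed by whitespace; abbreviations such as "Dr." or "т. д." will be split too, so real text may need a proper sentence segmenter. `split_sentences` is a hypothetical helper name.

```python
import re

# Split on whitespace that follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the closing punctuation."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]
```

Each returned sentence can then be passed to the synthesizer in order, so playback of the first sentence can start while later ones are still being generated.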

HTTP Server

The repository includes serve.py, a minimal FastAPI service with WAV and streaming PCM endpoints.

cd ait-piper-multilingual
pip install -r requirements.txt
uvicorn serve:app --host 0.0.0.0 --port 8000 --workers 1

Generate a WAV file:

curl -s http://localhost:8000/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Бүгін Алматыда ауа райы жылы.",
    "language": "kk",
    "speaker": "kk_M2"
  }' \
  --output output.wav

Stream raw signed 16-bit little-endian PCM:

curl -s http://localhost:8000/v1/audio/stream \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "Real-time streaming speech is ready.",
    "language": "en",
    "speaker": "ljspeech_F1"
  }' | ffplay -f s16le -ar 22050 -ac 1 -
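The streaming endpoint returns headerless PCM, so the bytes need a WAV header before most players and editors will open them. A minimal sketch with the standard-library wave module follows, assuming the format stated above (signed 16-bit little-endian, mono, 22,050 Hz); `pcm_to_wav` is a hypothetical helper name.

```python
import wave


def pcm_to_wav(pcm: bytes, path: str, sample_rate: int = 22050) -> None:
    """Wrap raw signed 16-bit little-endian mono PCM in a WAV container."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
```

If the curl output is saved to a file instead of being piped to ffplay, reading that file as bytes and passing it to `pcm_to_wav` produces a playable WAV.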

Use one server worker per loaded model instance. Scale with multiple service processes or containers when concurrent throughput is required.

Best Practices

  • Pass language explicitly for every request.
  • Use a speaker assigned to the selected language.
  • Preserve punctuation to improve pauses and phrasing.
  • Write numbers, dates, currencies, and abbreviations as words when exact pronunciation matters.
  • Use proper Kazakh letters such as ә, ғ, қ, ң, ө, ұ, ү, һ, and і.
  • Split long paragraphs into complete sentences and stream them in order.
  • Reuse the loaded ONNX session; do not reload the model per request.
  • Synthesize at 22,050 Hz and resample afterward if another rate is required.
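The last point can be illustrated with a dependency-free resampler. This sketch uses plain linear interpolation, which is easy to follow but applies no low-pass filter, so downsampling may introduce aliasing; for production audio a dedicated resampler (for example ffmpeg or a DSP library) is the safer choice. `resample_linear` is a hypothetical helper name, and the 16,000 Hz default target is only an example.

```python
import sys
from array import array


def resample_linear(pcm: bytes, src_rate: int = 22050, dst_rate: int = 16000) -> bytes:
    """Resample signed 16-bit little-endian mono PCM by linear interpolation."""
    samples = array("h")
    samples.frombytes(pcm)  # raises ValueError if len(pcm) is odd
    if not samples or src_rate == dst_rate:
        return pcm
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian; convert to native order

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate
    last = len(samples) - 1
    out = array("h")
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))

    if sys.byteorder == "big":
        out.byteswap()  # back to little-endian for output
    return out.tobytes()
```

One second of 22,050 Hz audio (44,100 bytes) becomes 16,000 samples (32,000 bytes) with the default arguments.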

Scope And Limitations

The model provides fixed speaker identities and does not clone arbitrary voices. Mixed-language text inside one request is not recommended; synthesize each language segment with its matching language and speaker. Pronunciation of unusual names, abbreviations, and ambiguous Russian word stress may require text normalization or explicit phonetic input.
