QORA-TTS - Native Rust Text-to-Speech Engine

Pure Rust text-to-speech synthesis engine with voice cloning. No Python runtime, no CUDA, no external dependencies. Single executable + quantized weights = portable TTS on any machine.

Overview

Property | Value
Engine | QORA-TTS (Pure Rust)
Base Model | Qwen3-TTS 1.7B
Parameters | ~1.84B (Talker) + ~150M (Code Predictor) + Decoder
Quantization | Q4 (4-bit symmetric, group_size=32)
Model Size | 1.5 GB (Q4)
Executable | 4.2 MB
Sample Rate | 24 kHz (16-bit PCM WAV)
Languages | 12 (English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, + 2 dialects)
Voice Cloning | 25 reference voices included (WAV files)
Platform | Windows x86_64 (CPU-only)

Architecture

Talker (Main Model) - 28-layer Transformer

Component | Details
Layers | 28 decoder layers
Hidden Size | 2,048
Attention Heads | 16 (Query) / 8 (KV) - Grouped Query Attention
Head Dimension | 128
MLP (Intermediate) | 6,144 (SwiGLU)
Text Vocabulary | 151,936 tokens
Codec Vocabulary | 3,072 codes
Max Context | 32,768 tokens
RoPE Theta | 1,000,000 (multimodal, interleaved)

Code Predictor - 5-layer Transformer

Component | Details
Layers | 5
Hidden Size | 1,024
Attention Heads | 16 (Query) / 8 (KV)
Intermediate | 3,072
Code Groups | 16 (generates codes 1-15 in parallel from code 0)

Speech Decoder (Vocos)

Component | Details
Architecture | 8-layer Transformer + 2x Upsampling + Vocos vocoder
Codebook | 512-dim embeddings
Output | 24 kHz 16-bit PCM WAV

Speaker Encoder (Voice Cloning)

Component | Details
Architecture | Res2Net-ECAPA-TDNN
Blocks | 3 Res2Net blocks (7 branches each)
Embedding Dim | 2,048
Input | 128-bin mel-spectrogram (24 kHz)
Attention | Squeeze-Excitation (SE) blocks

Synthesis Pipeline

Text → Tokenize → Talker (28 layers, autoregressive)
                       ↓
                  Code 0 sequence (12.5 Hz)
                       ↓
              Code Predictor (5 layers)
                       ↓
              16 code groups per timestep
                       ↓
    Reference Audio → Speaker Encoder → Voice Embedding (2048 dims)
                       ↓
              Speech Decoder (Vocos)
                       ↓
              24 kHz WAV audio

Files in Repository

model/
├── qora-tts.exe              # 4.2 MB - Inference engine with voice cloning
├── model.exe                 # 537 KB - Model compression tool
├── model_compressed.qora-tts # 1.5 GB - Q4 quantized weights
├── tokenizer.json            # 11 MB - Tokenizer (151,936 vocab)
├── config.json               # 4.4 KB - Model configuration
├── RUN.bat                   # 538 B - Quick start script
├── README.md                 # This file
└── voices/                   # 9.3 MB - Reference voices (25 WAV files)
    ├── adam.wav, anushri.wav, beth.wav
    ├── caty.wav, charles.wav, cherie.wav
    ├── david.wav, ember.wav, faith.wav
    ├── hale.wav, heisenberg.wav, hope.wav
    ├── jessica.wav, joe.wav, kea.wav
    ├── luna.wav, peter.wav, quentin.wav
    ├── riya.wav, sagar.wav, steven.wav
    ├── titan.wav, true.wav, velvety.wav
    └── vidhi.wav

Usage

Quick Start

# Easy way - use batch script
RUN.bat "Your text here" luna

Basic Usage

# Basic TTS with voice cloning
qora-tts.exe --load model_compressed.qora-tts --text "Hello, how are you today?" --ref-audio voices/luna.wav

# Choose different voice and output file
qora-tts.exe --load model_compressed.qora-tts --text "Good morning!" --ref-audio voices/adam.wav --output greeting.wav

# Use Heisenberg voice
qora-tts.exe --load model_compressed.qora-tts --text "Say my name" --ref-audio voices/heisenberg.wav --output walter.wav

# Clone your own voice (provide any 24kHz WAV file)
qora-tts.exe --load model_compressed.qora-tts --text "Custom voice" --ref-audio my_recording.wav --output custom.wav

# Control audio length (codes ≈ seconds × 12.5)
qora-tts.exe --load model_compressed.qora-tts --text "Short text" --ref-audio voices/luna.wav --max-codes 200
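
The --max-codes value follows directly from the 12.5 Hz codec rate. The sketch below shows that arithmetic; the function and constant names are illustrative and are not part of the QORA-TTS code base.

```rust
/// Codec frame rate stated in this README: 12.5 code timesteps per second.
const CODEC_HZ: f64 = 12.5;

/// Number of code timesteps needed to cover `seconds` of audio, rounded up.
fn max_codes_for_seconds(seconds: f64) -> u32 {
    (seconds * CODEC_HZ).ceil() as u32
}

/// Approximate audio duration, in seconds, produced by `codes` timesteps.
fn seconds_for_codes(codes: u32) -> f64 {
    f64::from(codes) / CODEC_HZ
}
```

For example, --max-codes 200 caps the output at roughly 16 seconds, and the default of 500 caps it at roughly 40 seconds.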

Voice Cloning

The system uses a Res2Net-ECAPA-TDNN speaker encoder to extract 2048-dimensional voice embeddings from reference audio. These embeddings control the voice characteristics in the generated speech.

25 Included Voices

Type | Voices
Female (13) | luna, anushri, beth, caty, cherie, ember, faith, hope, jessica, kea, riya, vidhi, velvety
Male (12) | adam, charles, david, hale, heisenberg, joe, peter, quentin, sagar, steven, titan, true

All reference voices are 24kHz WAV files in the voices/ folder.

Reference Audio Requirements

  • Format: 24 kHz WAV (mono or stereo)
  • Length: 3-10 seconds recommended
  • Quality: Clean speech, minimal background noise
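
Before passing your own recording to --ref-audio, it can help to confirm that the file really is a 24 kHz WAV. The following is a minimal sketch of such a check using only the standard RIFF/WAVE header layout. It is not how qora-tts.exe validates its input, and the names WavFormat, parse_wav_format and is_usable_reference are hypothetical.

```rust
/// Fields from a WAV "fmt " chunk that matter for the requirements above.
#[derive(Debug, PartialEq)]
struct WavFormat {
    channels: u16,
    sample_rate: u32,
    bits_per_sample: u16,
}

/// Parse a RIFF/WAVE header and return the "fmt " chunk fields, or None if
/// the bytes do not look like a well-formed WAV file.
fn parse_wav_format(bytes: &[u8]) -> Option<WavFormat> {
    if bytes.len() < 12 || &bytes[0..4] != b"RIFF" || &bytes[8..12] != b"WAVE" {
        return None;
    }
    let mut pos = 12;
    // Walk the chunk list: 4-byte id, 4-byte little-endian size, then payload.
    while pos + 8 <= bytes.len() {
        let id = &bytes[pos..pos + 4];
        let size = u32::from_le_bytes([
            bytes[pos + 4],
            bytes[pos + 5],
            bytes[pos + 6],
            bytes[pos + 7],
        ]) as usize;
        let body = pos + 8;
        if id == b"fmt " {
            if size < 16 || body + 16 > bytes.len() {
                return None;
            }
            let f = &bytes[body..body + 16];
            return Some(WavFormat {
                channels: u16::from_le_bytes([f[2], f[3]]),
                sample_rate: u32::from_le_bytes([f[4], f[5], f[6], f[7]]),
                bits_per_sample: u16::from_le_bytes([f[14], f[15]]),
            });
        }
        // Chunks are padded to an even number of bytes.
        pos = body + size + (size & 1);
    }
    None
}

/// True when the format matches the stated requirements: 24 kHz, mono or stereo.
fn is_usable_reference(fmt: &WavFormat) -> bool {
    fmt.sample_rate == 24_000 && (fmt.channels == 1 || fmt.channels == 2)
}
```

A caller would read the file with std::fs::read and pass the bytes to parse_wav_format. The recommended 3-10 second length is not checked here.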

CLI Arguments

Flag | Default | Description
--load <path> | - | Load from .qora-tts binary (fast, ~1-2s)
--text <text> | "Hello, how are you today?" | Text to synthesize
--ref-audio <path> | - | Reference audio for voice cloning (24 kHz WAV)
--output <path> | output.wav | Output WAV file path
--max-codes <n> | 500 | Max audio code timesteps (~n/12.5 seconds)
--speaker <name> | ryan | Fallback speaker if no ref-audio provided
--language <name> | english | Target language
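
If you drive the executable from another Rust program, the flags above can be assembled programmatically. This sketch uses only the flags documented in the table. The helper name build_args is hypothetical, and the fallback to --speaker ryan simply mirrors the documented default.

```rust
/// Build the argument list for one qora-tts.exe invocation.
/// Uses --ref-audio when a reference file is given, otherwise falls back to
/// the documented default speaker.
fn build_args(
    model: &str,
    text: &str,
    ref_audio: Option<&str>,
    output: &str,
    max_codes: u32,
) -> Vec<String> {
    let mut args: Vec<String> = vec![
        "--load".into(),
        model.into(),
        "--text".into(),
        text.into(),
        "--output".into(),
        output.into(),
        "--max-codes".into(),
        max_codes.to_string(),
    ];
    match ref_audio {
        Some(path) => {
            args.push("--ref-audio".into());
            args.push(path.into());
        }
        None => {
            args.push("--speaker".into());
            args.push("ryan".into());
        }
    }
    args
}
```

The resulting vector can be passed to std::process::Command::new("qora-tts.exe").args(...) and run with .status().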

Available Speakers

Speaker | Language | ID
ryan | English | 3061
serena | English | 3066
vivian | English | 3065
aiden | English | 3062
eric | English | 3063
uncle_fu | Chinese | 3057
ono_anna | Japanese | 3064
sohee | Korean | 3067
dylan | Beijing Dialect | 3060

Supported Languages

English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, Beijing Dialect, Sichuan Dialect

Performance Benchmarks

Test Hardware: Windows 11, CPU-only (no GPU acceleration)

Inference Speed

Phase | Time | Details
Model Load | ~1.2s | From .qora-tts binary (1.5 GB Q4)
Speaker Encoder | ~32s | Load ECAPA-TDNN weights
Voice Extraction | ~5-10s | Extract 2048-dim embedding from WAV
Prefill | ~4-18s | Process prompt tokens
Code Generation | ~0.4-0.5 codes/s | Autoregressive generation
Audio Decode | ~10-15 min | Vocos decoder (8 transformer + upsampling)
Memory | 1,511 MB | Total loaded model size

Audio Output

Metric | Value
Sample Rate | 24,000 Hz
Bit Depth | 16-bit PCM
Format | WAV
Audio Rate | 12.5 Hz codec = ~40s audio from 500 codes
Channels | Mono
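
The figures above also determine how much PCM data a run produces. The sketch below assumes the decoder emits exactly 24,000 / 12.5 = 1,920 samples per code timestep. That is implied by the two rates but is not stated explicitly in this README, and all names are illustrative.

```rust
const SAMPLE_RATE_HZ: u32 = 24_000;
const BYTES_PER_SAMPLE: u32 = 2; // 16-bit PCM
const CHANNELS: u32 = 1; // mono
/// Assumed samples per code timestep: 24,000 Hz / 12.5 Hz.
const SAMPLES_PER_CODE: u32 = 1_920;

/// Number of PCM samples produced by `codes` timesteps.
fn pcm_samples(codes: u32) -> u32 {
    codes * SAMPLES_PER_CODE
}

/// Size of the raw PCM payload in bytes (WAV header not included).
fn pcm_bytes(codes: u32) -> u32 {
    pcm_samples(codes) * BYTES_PER_SAMPLE * CHANNELS
}

/// Playback duration in seconds for a given sample count.
fn duration_secs(samples: u32) -> f64 {
    f64::from(samples) / f64::from(SAMPLE_RATE_HZ)
}
```

Under that assumption, 500 codes yield 960,000 samples, or about 1.9 MB of PCM data for 40 seconds of audio.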

Test Results

Test 1: Short Greeting (Speaker: Ryan)

Input: "Hello, how are you today?"

Metric | Value
Prompt Tokens | 17
Load Time | 22.9s
Prefill Time | 17.7s
Code Steps | 484
Code Generation | 1,197.3s (0.4 codes/s)
Audio Duration | ~38.7s
Result | WAV generated successfully

Test 2: Pangram (Speaker: Serena)

Input: "The quick brown fox jumps over the lazy dog."

Metric | Value
Prompt Tokens | 20
Load Time | 18.9s
Prefill Time | 19.2s
Code Steps | 499
Code Generation | 1,196.4s (0.4 codes/s)
Audio Duration | ~39.9s
Result | WAV generated successfully

Test 3: Science Text (Speaker: Vivian)

Input: "Quantum computing represents a fundamental shift in how we process information."

Metric | Value
Prompt Tokens | 23
Load Time | 20.8s
Prefill Time | 18.2s
Code Steps | 484
Code Generation | 1,193.6s (0.4 codes/s)
Audio Duration | ~38.7s
Result | WAV generated successfully

Test Summary

Test | Speaker | Text Length | Codes | Audio Length | Status
Greeting | Ryan | 25 chars | 484 | ~38.7s | PASS
Pangram | Serena | 44 chars | 499 | ~39.9s | PASS
Science | Vivian | 79 chars | 484 | ~38.7s | PASS

All three tests generated valid WAV audio files at 24 kHz with different speakers.
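
To put these timings in context, the code-generation throughput and the real-time factor (seconds of compute per second of audio) can be derived from the reported numbers. This is a small sketch with illustrative function names. It covers the code-generation phase only and leaves out load, prefill and audio decode time.

```rust
/// Codes generated per second of wall-clock time.
fn codes_per_second(codes: u32, generation_secs: f64) -> f64 {
    f64::from(codes) / generation_secs
}

/// Real-time factor: compute seconds spent per second of audio produced.
/// Values above 1.0 mean slower than real time.
fn real_time_factor(compute_secs: f64, audio_secs: f64) -> f64 {
    compute_secs / audio_secs
}
```

For Test 1, codes_per_second(484, 1197.3) is about 0.40, and real_time_factor(1197.3, 38.7) is about 31, meaning code generation alone takes roughly 31 seconds per second of audio on the test machine.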

Compression Tool

Use model.exe to convert safetensors to Q4 binary format:

model.exe <input_dir> <output.qora-tts>

Example:

model.exe original_model model_compressed.qora-tts

Results:

  • Size: 3.8 GB → 1.5 GB (2.5× compression)
  • Loading: 150s → 1.2s (125× faster)
  • Quality: ~95% of original (minimal loss)

Technical Details

Quantization

Q4 Format:

  • 4-bit symmetric quantization
  • Group size: 32 elements
  • Formula: value = (quantized - 8) × scale
  • Storage: 2 values per byte (4 bits each)
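
The Q4 scheme described above can be sketched as follows. This is an illustration of the stated formula, not the actual on-disk format of .qora-tts files. The f32 scale, the max/7 scale choice, the low-nibble-first packing order and the names Q4Group, quantize_group and dequantize_group are all assumptions.

```rust
const GROUP_SIZE: usize = 32;

/// One quantized group: a shared scale plus 32 4-bit codes, two per byte.
struct Q4Group {
    scale: f32,
    packed: [u8; GROUP_SIZE / 2],
}

/// Quantize 32 values to 4-bit codes in 0..=15 with a zero point of 8.
fn quantize_group(values: &[f32; GROUP_SIZE]) -> Q4Group {
    // Symmetric: map the largest magnitude to 7 steps away from the zero point.
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs > 0.0 { max_abs / 7.0 } else { 1.0 };
    let quantize = |v: f32| ((v / scale).round() as i32 + 8).clamp(0, 15) as u8;
    let mut packed = [0u8; GROUP_SIZE / 2];
    for (i, pair) in values.chunks(2).enumerate() {
        // Assumed layout: even index in the low nibble, odd index in the high nibble.
        packed[i] = quantize(pair[0]) | (quantize(pair[1]) << 4);
    }
    Q4Group { scale, packed }
}

/// Apply value = (quantized - 8) × scale to every code in the group.
fn dequantize_group(group: &Q4Group) -> [f32; GROUP_SIZE] {
    let mut out = [0.0f32; GROUP_SIZE];
    for (i, &byte) in group.packed.iter().enumerate() {
        out[2 * i] = (f32::from(byte & 0x0F) - 8.0) * group.scale;
        out[2 * i + 1] = (f32::from(byte >> 4) - 8.0) * group.scale;
    }
    out
}
```

With this layout each group of 32 weights takes 16 bytes of codes plus one scale, and the round-trip error per value is at most half a quantization step (scale / 2).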

Memory Breakdown:

  • Talker: 950 MB (Q4)
  • Code Predictor: 125 MB (Q4)
  • Speech Decoder: 434 MB (Q4)
  • Speaker Encoder: 45 MB (Q4)
  • Total: 1,554 MB

Generation Parameters

Parameter | Default | Description
Temperature | 0.9 | Sampling randomness
Top-K | 50 | Top-K sampling
Repetition Penalty | 1.05 | Prevents repetitive codes
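
The README does not describe how the engine combines these three parameters. The sketch below shows one common formulation (repetition penalty on the logits, then top-k filtering, then temperature softmax), which may differ from what QORA-TTS actually does. The function name sample_code is hypothetical, and the random number u is supplied by the caller so the example stays dependency-free.

```rust
use std::cmp::Ordering;

/// Pick the next code index from raw logits.
/// `history` holds previously emitted codes; `u` is uniform in [0, 1).
fn sample_code(
    logits: &[f32],
    history: &[usize],
    temperature: f32,
    top_k: usize,
    repetition_penalty: f32,
    u: f32,
) -> usize {
    assert!(!logits.is_empty(), "logits must not be empty");
    let mut scores = logits.to_vec();
    // Repetition penalty: make each already-emitted code less likely (once per code).
    let mut penalized = vec![false; scores.len()];
    for &id in history {
        if id < scores.len() && !penalized[id] {
            penalized[id] = true;
            scores[id] = if scores[id] > 0.0 {
                scores[id] / repetition_penalty
            } else {
                scores[id] * repetition_penalty
            };
        }
    }
    // Top-k: rank indices by descending score and keep the best k.
    let mut ranked: Vec<usize> = (0..scores.len()).collect();
    ranked.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap_or(Ordering::Equal));
    ranked.truncate(top_k.max(1));
    // Temperature softmax over the kept candidates (max subtracted for stability).
    let max = scores[ranked[0]];
    let weights: Vec<f32> = ranked
        .iter()
        .map(|&i| ((scores[i] - max) / temperature).exp())
        .collect();
    let total: f32 = weights.iter().sum();
    // Inverse-CDF sampling with the caller-provided random number.
    let mut cumulative = 0.0f32;
    for (&index, &weight) in ranked.iter().zip(weights.iter()) {
        cumulative += weight / total;
        if u < cumulative {
            return index;
        }
    }
    ranked[ranked.len() - 1]
}
```

With the defaults listed above, a call would look like sample_code(&logits, &history, 0.9, 50, 1.05, u). Lower temperatures and smaller top-k values make the output more deterministic.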

Special Tokens

Token | ID | Purpose
TTS BOS | 151,672 | Start of TTS sequence
TTS EOS | 151,673 | End of TTS sequence
Codec BOS | 2,149 | Start of codec sequence
Codec EOS | 2,150 | End of codec sequence

QORA Model Family

Engine | Model | Params | Size (Q4) | Purpose
QORA | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat
QORA-TTS | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis with voice cloning
QORA-Vision (Image) | SigLIP 2 Base | 86M | 58 MB | Image embeddings, zero-shot classification
QORA-Vision (Video) | ViViT Base | 89M | 60 MB | Video action classification

All engines are pure Rust, CPU-only, single-binary executables with no Python dependencies.

System Requirements

  • Windows 10/11 (x86_64)
  • 2GB+ RAM (4GB recommended)
  • CPU with AVX2 support
  • No GPU required
  • No Python runtime needed
  • No external dependencies

Features

  • Pure Rust implementation (no Python!)
  • Q4 quantization (about 60% smaller than the 3.8 GB original, 125× faster loading)
  • Voice cloning from any 24 kHz WAV file
  • 25 pre-loaded reference voices
  • CPU-only inference (no CUDA required)
  • Single executable + model (fully portable)
  • Multi-language support (12 languages)
  • High-quality 24 kHz output

Built with QORA - Pure Rust AI Inference
Model: Qwen3-TTS-1.7B (Apache 2.0 License)
Framework: Burn (Rust ML framework)
