QORA-TTS - Native Rust Text-to-Speech Engine

Pure Rust text-to-speech synthesis engine with voice cloning. No Python runtime, no CUDA, no external dependencies. Single executable + quantized weights = portable TTS on any machine.

Overview

Property	Value
Engine	QORA-TTS (Pure Rust)
Base Model	Qwen3-TTS 1.7B
Parameters	~1.84B (Talker) + ~150M (Code Predictor) + Decoder
Quantization	Q4 (4-bit symmetric, group_size=32)
Model Size	1.5 GB (Q4)
Executable	4.2 MB
Sample Rate	24 kHz (16-bit PCM WAV)
Languages	12 (English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, + 2 dialects)
Voice Cloning	25 reference voices included (WAV files)
Platform	Windows x86_64 (CPU-only)

Architecture

Talker (Main Model) - 28-layer Transformer

Component	Details
Layers	28 decoder layers
Hidden Size	2,048
Attention Heads	16 (Query) / 8 (KV) - Grouped Query Attention
Head Dimension	128
MLP (Intermediate)	6,144 (SwiGLU)
Text Vocabulary	151,936 tokens
Codec Vocabulary	3,072 codes
Max Context	32,768 tokens
RoPE Theta	1,000,000 (multimodal, interleaved)

Code Predictor - 5-layer Transformer

Component	Details
Layers	5
Hidden Size	1,024
Attention Heads	16 (Query) / 8 (KV)
Intermediate	3,072
Code Groups	16 (generates codes 1-15 in parallel from code 0)

Speech Decoder (Vocos)

Component	Details
Architecture	8-layer Transformer + 2x Upsampling + Vocos vocoder
Codebook	512-dim embeddings
Output	24 kHz 16-bit PCM WAV

Speaker Encoder (Voice Cloning)

Component	Details
Architecture	Res2Net-ECAPA-TDNN
Blocks	3 Res2Net blocks (7 branches each)
Embedding Dim	2,048
Input	128-bin mel-spectrogram (24 kHz)
Attention	Squeeze-Excitation (SE) blocks

Synthesis Pipeline

Text → Tokenize → Talker (28 layers, autoregressive)
                       ↓
                  Code 0 sequence (12.5 Hz)
                       ↓
              Code Predictor (5 layers)
                       ↓
              16 code groups per timestep
                       ↓
    Reference Audio → Speaker Encoder → Voice Embedding (2048 dims)
                       ↓
              Speech Decoder (Vocos)
                       ↓
              24 kHz WAV audio

Files in Repository

model/
├── qora-tts.exe              # 4.2 MB - Inference engine with voice cloning
├── model.exe                 # 537 KB - Model compression tool
├── model_compressed.qora-tts # 1.5 GB - Q4 quantized weights
├── tokenizer.json            # 11 MB - Tokenizer (151,936 vocab)
├── config.json               # 4.4 KB - Model configuration
├── RUN.bat                   # 538 B - Quick start script
├── README.md                 # This file
└── voices/                   # 9.3 MB - Reference voices (25 WAV files)
    ├── adam.wav, anushri.wav, beth.wav
    ├── caty.wav, charles.wav, cherie.wav
    ├── david.wav, ember.wav, faith.wav
    ├── hale.wav, heisenberg.wav, hope.wav
    ├── jessica.wav, joe.wav, kea.wav
    ├── luna.wav, peter.wav, quentin.wav
    ├── riya.wav, sagar.wav, steven.wav
    ├── titan.wav, true.wav, velvety.wav
    └── vidhi.wav

Usage

Quick Start

# Easy way - use batch script
RUN.bat "Your text here" luna

Basic Usage

# Basic TTS with voice cloning
qora-tts.exe --load model_compressed.qora-tts --text "Hello, how are you today?" --ref-audio voices/luna.wav

# Choose different voice and output file
qora-tts.exe --load model_compressed.qora-tts --text "Good morning!" --ref-audio voices/adam.wav --output greeting.wav

# Use Heisenberg voice
qora-tts.exe --load model_compressed.qora-tts --text "Say my name" --ref-audio voices/heisenberg.wav --output walter.wav

# Clone your own voice (provide any 24kHz WAV file)
qora-tts.exe --load model_compressed.qora-tts --text "Custom voice" --ref-audio my_recording.wav --output custom.wav

# Control audio length (codes ≈ seconds × 12.5)
qora-tts.exe --load model_compressed.qora-tts --text "Short text" --ref-audio voices/luna.wav --max-codes 200

Voice Cloning

The system uses a Res2Net-ECAPA-TDNN speaker encoder to extract 2048-dimensional voice embeddings from reference audio. These embeddings control the voice characteristics in the generated speech.

25 Included Voices

Type	Voices
Female (13)	luna, anushri, beth, caty, cherie, ember, faith, hope, jessica, kea, riya, vidhi, velvety
Male (12)	adam, charles, david, hale, heisenberg, joe, peter, quentin, sagar, steven, titan, true

All reference voices are 24kHz WAV files in the voices/ folder.

Reference Audio Requirements

Format: 24 kHz WAV (mono or stereo)
Length: 3-10 seconds recommended
Quality: Clean speech, minimal background noise

CLI Arguments

Flag	Default	Description
`--load <path>`	-	Load from .qora-tts binary (fast, ~1-2s)
`--text <text>`	"Hello, how are you today?"	Text to synthesize
`--ref-audio <path>`	-	Reference audio for voice cloning (24kHz WAV)
`--output <path>`	output.wav	Output WAV file path
`--max-codes <n>`	500	Max audio code timesteps (~n/12.5 seconds)
`--speaker <name>`	ryan	Fallback speaker if no ref-audio provided
`--language <name>`	english	Target language

Available Speakers

Speaker	Language	ID
ryan	English	3061
serena	English	3066
vivian	English	3065
aiden	English	3062
eric	English	3063
uncle_fu	Chinese	3057
ono_anna	Japanese	3064
sohee	Korean	3067
dylan	Beijing Dialect	3060

Supported Languages

English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, Beijing Dialect, Sichuan Dialect

Performance Benchmarks

Test Hardware: Windows 11, CPU-only (no GPU acceleration)

Inference Speed

Phase	Time	Details
Model Load	~1.2s	From .qora-tts binary (1.5 GB Q4)
Speaker Encoder	~32s	Load ECAPA-TDNN weights
Voice Extraction	~5-10s	Extract 2048-dim embedding from WAV
Prefill	~4-18s	Process prompt tokens
Code Generation	~0.4-0.5 codes/s	Autoregressive generation
Audio Decode	~10-15 min	Vocos decoder (8 transformer + upsampling)
Memory	1,511 MB	Total loaded model size

Audio Output

Metric	Value
Sample Rate	24,000 Hz
Bit Depth	16-bit PCM
Format	WAV
Audio Rate	12.5 Hz codec = ~40s audio from 500 codes
Channels	Mono

Test Results

Test 1: Short Greeting (Speaker: Ryan)

Input: "Hello, how are you today?"

Metric	Value
Prompt Tokens	17
Load Time	22.9s
Prefill Time	17.7s
Code Steps	484
Code Generation	1,197.3s (0.4 codes/s)
Audio Duration	~38.7s
Result	WAV generated successfully

Test 2: Pangram (Speaker: Serena)

Input: "The quick brown fox jumps over the lazy dog."

Metric	Value
Prompt Tokens	20
Load Time	18.9s
Prefill Time	19.2s
Code Steps	499
Code Generation	1,196.4s (0.4 codes/s)
Audio Duration	~39.9s
Result	WAV generated successfully

Test 3: Science Text (Speaker: Vivian)

Input: "Quantum computing represents a fundamental shift in how we process information."

Metric	Value
Prompt Tokens	23
Load Time	20.8s
Prefill Time	18.2s
Code Steps	484
Code Generation	1,193.6s (0.4 codes/s)
Audio Duration	~38.7s
Result	WAV generated successfully

Test Summary

Test	Speaker	Text Length	Codes	Audio Length	Status
Greeting	Ryan	25 chars	484	~38.7s	PASS
Pangram	Serena	44 chars	499	~39.9s	PASS
Science	Vivian	79 chars	484	~38.7s	PASS

All three tests generated valid WAV audio files at 24 kHz with different speakers.

Compression Tool

Use model.exe to convert safetensors to Q4 binary format:

model.exe <input_dir> <output.qora-tts>

Example:

model.exe original_model model_compressed.qora-tts

Results:

Size: 3.8 GB → 1.5 GB (2.5× compression)
Loading: 150s → 1.2s (125× faster)
Quality: ~95% of original (minimal loss)

Technical Details

Quantization

Q4 Format:

4-bit symmetric quantization
Group size: 32 elements
Formula: value = (quantized - 8) × scale
Storage: 2 values per byte (4 bits each)

Memory Breakdown:

Talker: 950 MB (Q4)
Code Predictor: 125 MB (Q4)
Speech Decoder: 434 MB (Q4)
Speaker Encoder: 45 MB (Q4)
Total: 1,554 MB

Generation Parameters

Parameter	Default	Description
Temperature	0.9	Sampling randomness
Top-K	50	Top-K sampling
Repetition Penalty	1.05	Prevents repetitive codes

Special Tokens

Token	ID	Purpose
TTS BOS	151,672	Start of TTS sequence
TTS EOS	151,673	End of TTS sequence
Codec BOS	2,149	Start of codec sequence
Codec EOS	2,150	End of codec sequence

QORA Model Family

Engine	Model	Params	Size (Q4)	Purpose
QORA	SmolLM3-3B	3.07B	1.68 GB	Text generation, reasoning, chat
QORA-TTS	Qwen3-TTS	1.84B	1.5 GB	Text-to-speech synthesis with voice cloning
QORA-Vision (Image)	SigLIP 2 Base	86M	58 MB	Image embeddings, zero-shot classification
QORA-Vision (Video)	ViViT Base	89M	60 MB	Video action classification

All engines are pure Rust, CPU-only, single-binary executables with no Python dependencies.

System Requirements

Windows 10/11 (x86_64)
2GB+ RAM (4GB recommended)
CPU with AVX2 support
No GPU required
No Python runtime needed
No external dependencies

Features

Pure Rust implementation (no Python!)
Q4 quantization (75% smaller, 125× faster loading)
Real-time voice cloning from any WAV file
25 pre-loaded reference voices
CPU-only inference (no CUDA required)
Single executable + model (fully portable)
Multi-language support (12 languages)
High-quality 24 kHz output

Built with QORA - Pure Rust AI Inference Model: Qwen3-TTS-1.7B (Apache 2.0 License) Framework: Burn (Rust ML framework)

Downloads last month: 44