# QORA-TTS - Native Rust Text-to-Speech Engine
Pure Rust text-to-speech synthesis engine with voice cloning. No Python runtime, no CUDA, no external dependencies. Single executable + quantized weights = portable TTS on any machine.
## Overview

| Property | Value |
|---|---|
| Engine | QORA-TTS (Pure Rust) |
| Base Model | Qwen3-TTS 1.7B |
| Parameters | ~1.84B (Talker) + ~150M (Code Predictor) + Decoder |
| Quantization | Q4 (4-bit symmetric, group_size=32) |
| Model Size | 1.5 GB (Q4) |
| Executable | 4.2 MB |
| Sample Rate | 24 kHz (16-bit PCM WAV) |
| Languages | 12 (English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, + 2 dialects) |
| Voice Cloning | 25 reference voices included (WAV files) |
| Platform | Windows x86_64 (CPU-only) |
## Architecture

### Talker (Main Model) - 28-layer Transformer

| Component | Details |
|---|---|
| Layers | 28 decoder layers |
| Hidden Size | 2,048 |
| Attention Heads | 16 (Query) / 8 (KV) - Grouped Query Attention |
| Head Dimension | 128 |
| MLP (Intermediate) | 6,144 (SwiGLU) |
| Text Vocabulary | 151,936 tokens |
| Codec Vocabulary | 3,072 codes |
| Max Context | 32,768 tokens |
| RoPE Theta | 1,000,000 (multimodal, interleaved) |
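The RoPE theta of 1,000,000 controls how quickly the rotary position frequencies decay across the 128-dim head. As an illustrative sketch (not the engine's actual code), the per-pair inverse frequencies work out to:

```rust
/// Illustrative sketch of RoPE inverse frequencies for head_dim = 128 and
/// theta = 1,000,000 (the values from the table above). Not the engine's code.
fn rope_inv_freq(head_dim: usize, theta: f64) -> Vec<f64> {
    // One frequency per pair of dimensions: theta^(-2i / head_dim)
    (0..head_dim / 2)
        .map(|i| theta.powf(-2.0 * i as f64 / head_dim as f64))
        .collect()
}

fn main() {
    let inv = rope_inv_freq(128, 1_000_000.0);
    // The first pair rotates at 1 rad/token; later pairs rotate far slower,
    // which is what keeps a 32,768-token context positionally distinct.
    println!("fastest = {}, slowest = {:e}", inv[0], inv[63]);
}
```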
### Code Predictor - 5-layer Transformer

| Component | Details |
|---|---|
| Layers | 5 |
| Hidden Size | 1,024 |
| Attention Heads | 16 (Query) / 8 (KV) |
| Intermediate | 3,072 |
| Code Groups | 16 (generates codes 1-15 in parallel from code 0) |
### Speech Decoder (Vocos)

| Component | Details |
|---|---|
| Architecture | 8-layer Transformer + 2x Upsampling + Vocos vocoder |
| Codebook | 512-dim embeddings |
| Output | 24 kHz 16-bit PCM WAV |
### Speaker Encoder (Voice Cloning)

| Component | Details |
|---|---|
| Architecture | Res2Net-ECAPA-TDNN |
| Blocks | 3 Res2Net blocks (7 branches each) |
| Embedding Dim | 2,048 |
| Input | 128-bin mel-spectrogram (24 kHz) |
| Attention | Squeeze-Excitation (SE) blocks |
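A Squeeze-Excitation block reweights channels using a global summary of each one. The following is a minimal, dependency-free sketch of that idea with illustrative weights, not the actual encoder's parameters:

```rust
/// Minimal Squeeze-Excitation sketch: squeeze each channel to its mean,
/// pass it through a bottleneck MLP, then scale the channels by a sigmoid
/// gate. Weights here are illustrative, not the real encoder's parameters.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// `channels[c]` is the time series for channel c; `w1` maps pooled
/// channels to the bottleneck, `w2` maps the bottleneck back to one
/// gate per channel.
fn se_block(channels: &[Vec<f64>], w1: &[Vec<f64>], w2: &[Vec<f64>]) -> Vec<Vec<f64>> {
    // Squeeze: global average pool per channel.
    let pooled: Vec<f64> = channels.iter()
        .map(|ch| ch.iter().sum::<f64>() / ch.len() as f64)
        .collect();
    // Excitation: bottleneck (ReLU) then per-channel sigmoid gates.
    let hidden: Vec<f64> = w1.iter()
        .map(|row| row.iter().zip(&pooled).map(|(w, p)| w * p).sum::<f64>().max(0.0))
        .collect();
    let gates: Vec<f64> = w2.iter()
        .map(|row| sigmoid(row.iter().zip(&hidden).map(|(w, h)| w * h).sum()))
        .collect();
    // Scale: reweight each channel by its gate.
    channels.iter().zip(&gates)
        .map(|(ch, g)| ch.iter().map(|x| x * g).collect())
        .collect()
}

fn main() {
    let chans = vec![vec![1.0, 1.0], vec![2.0, 2.0]];
    let w1 = vec![vec![1.0, 0.0]];        // 2 channels -> 1 hidden unit
    let w2 = vec![vec![0.5], vec![-0.5]]; // one gate row per channel
    println!("{:?}", se_block(&chans, &w1, &w2));
}
```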
## Synthesis Pipeline

```
Text → Tokenize → Talker (28 layers, autoregressive)
                      ↓
              Code 0 sequence (12.5 Hz)
                      ↓
              Code Predictor (5 layers)
                      ↓
              16 code groups per timestep
                      ↓
Reference Audio → Speaker Encoder → Voice Embedding (2048 dims)
                      ↓
              Speech Decoder (Vocos)
                      ↓
              24 kHz WAV audio
```
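Because the codec runs at 12.5 Hz, audio duration follows directly from the number of code timesteps. A small sketch of that arithmetic:

```rust
/// Codec frame rate stated in the pipeline above.
const CODEC_HZ: f64 = 12.5;

/// Audio duration implied by a number of codec timesteps.
fn audio_seconds(code_steps: u32) -> f64 {
    code_steps as f64 / CODEC_HZ
}

/// PCM samples produced at the given sample rate for that duration.
fn pcm_samples(code_steps: u32, sample_rate: u32) -> u64 {
    (audio_seconds(code_steps) * sample_rate as f64) as u64
}

fn main() {
    // 484 code steps (as in the test results) -> ~38.7 s of audio at 24 kHz.
    println!("{} s, {} samples", audio_seconds(484), pcm_samples(484, 24_000));
}
```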
## Files in Repository

```
model/
├── qora-tts.exe              # 4.2 MB  - Inference engine with voice cloning
├── model.exe                 # 537 KB  - Model compression tool
├── model_compressed.qora-tts # 1.5 GB  - Q4 quantized weights
├── tokenizer.json            # 11 MB   - Tokenizer (151,936 vocab)
├── config.json               # 4.4 KB  - Model configuration
├── RUN.bat                   # 538 B   - Quick start script
├── README.md                 # This file
└── voices/                   # 9.3 MB  - Reference voices (25 WAV files)
    ├── adam.wav, anushri.wav, beth.wav
    ├── caty.wav, charles.wav, cherie.wav
    ├── david.wav, ember.wav, faith.wav
    ├── hale.wav, heisenberg.wav, hope.wav
    ├── jessica.wav, joe.wav, kea.wav
    ├── luna.wav, peter.wav, quentin.wav
    ├── riya.wav, sagar.wav, steven.wav
    ├── titan.wav, true.wav, velvety.wav
    └── vidhi.wav
```
## Usage

### Quick Start

```
RUN.bat "Your text here" luna
```

### Basic Usage

```
qora-tts.exe --load model_compressed.qora-tts --text "Hello, how are you today?" --ref-audio voices/luna.wav

qora-tts.exe --load model_compressed.qora-tts --text "Good morning!" --ref-audio voices/adam.wav --output greeting.wav

qora-tts.exe --load model_compressed.qora-tts --text "Say my name" --ref-audio voices/heisenberg.wav --output walter.wav

qora-tts.exe --load model_compressed.qora-tts --text "Custom voice" --ref-audio my_recording.wav --output custom.wav

qora-tts.exe --load model_compressed.qora-tts --text "Short text" --ref-audio voices/luna.wav --max-codes 200
```
## Voice Cloning
The system uses a Res2Net-ECAPA-TDNN speaker encoder to extract 2048-dimensional voice embeddings from reference audio. These embeddings control the voice characteristics in the generated speech.
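One way to see what the 2048-dimensional embedding captures: two clips of the same speaker should land near each other in embedding space. The sketch below compares embeddings by cosine similarity; note this is an illustration, since the engine consumes the embedding as conditioning rather than computing an explicit similarity score:

```rust
/// Cosine similarity between two speaker embeddings (e.g. 2048-dim).
/// Illustrative only: QORA-TTS conditions generation on the embedding;
/// it does not expose a similarity metric.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

fn main() {
    let a = [1.0, 0.0, 2.0];
    let b = [2.0, 0.0, 4.0];
    // Same direction, different magnitude -> similarity 1.0.
    println!("{}", cosine_similarity(&a, &b));
}
```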
### 25 Included Voices

| Type | Voices |
|---|---|
| Female (13) | luna, anushri, beth, caty, cherie, ember, faith, hope, jessica, kea, riya, vidhi, velvety |
| Male (12) | adam, charles, david, hale, heisenberg, joe, peter, quentin, sagar, steven, titan, true |

All reference voices are 24 kHz WAV files in the voices/ folder.
### Reference Audio Requirements
- Format: 24 kHz WAV (mono or stereo)
- Length: 3-10 seconds recommended
- Quality: Clean speech, minimal background noise
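Since reference audio must be 24 kHz WAV, a quick check of the RIFF header can catch unusable inputs before synthesis. This sketch follows the canonical WAV layout; it is not QORA-TTS's own validation code:

```rust
/// Parse sample rate and channel count from a standard WAV header.
/// Byte offsets follow the canonical RIFF/WAVE layout with the "fmt "
/// chunk immediately after the 12-byte RIFF header; real files can carry
/// extra chunks first, which this sketch does not handle.
fn wav_format(header: &[u8]) -> Option<(u32, u16)> {
    if header.len() < 36 || &header[0..4] != b"RIFF" || &header[8..12] != b"WAVE" {
        return None;
    }
    let channels = u16::from_le_bytes([header[22], header[23]]);
    let sample_rate = u32::from_le_bytes([header[24], header[25], header[26], header[27]]);
    Some((sample_rate, channels))
}

fn main() {
    // Build a minimal 24 kHz mono header to demonstrate.
    let mut h = vec![0u8; 44];
    h[0..4].copy_from_slice(b"RIFF");
    h[8..12].copy_from_slice(b"WAVE");
    h[12..16].copy_from_slice(b"fmt ");
    h[22..24].copy_from_slice(&1u16.to_le_bytes());      // mono
    h[24..28].copy_from_slice(&24_000u32.to_le_bytes()); // 24 kHz
    println!("{:?}", wav_format(&h)); // Some((24000, 1))
}
```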
## CLI Arguments

| Flag | Default | Description |
|---|---|---|
| `--load <path>` | - | Load from .qora-tts binary (fast, ~1-2s) |
| `--text <text>` | "Hello, how are you today?" | Text to synthesize |
| `--ref-audio <path>` | - | Reference audio for voice cloning (24 kHz WAV) |
| `--output <path>` | output.wav | Output WAV file path |
| `--max-codes <n>` | 500 | Max audio code timesteps (~n/12.5 seconds) |
| `--speaker <name>` | ryan | Fallback speaker if no ref-audio provided |
| `--language <name>` | english | Target language |
## Available Speakers

| Speaker | Language | ID |
|---|---|---|
| ryan | English | 3061 |
| serena | English | 3066 |
| vivian | English | 3065 |
| aiden | English | 3062 |
| eric | English | 3063 |
| uncle_fu | Chinese | 3057 |
| ono_anna | Japanese | 3064 |
| sohee | Korean | 3067 |
| dylan | Beijing Dialect | 3060 |
## Supported Languages
English, Chinese, German, Italian, Portuguese, Spanish, Japanese, Korean, French, Russian, Beijing Dialect, Sichuan Dialect
## Performance Benchmarks

**Test Hardware:** Windows 11, CPU-only (no GPU acceleration)

### Inference Speed
| Phase | Time | Details |
|---|---|---|
| Model Load | ~1.2s | From .qora-tts binary (1.5 GB Q4) |
| Speaker Encoder | ~32s | Load ECAPA-TDNN weights |
| Voice Extraction | ~5-10s | Extract 2048-dim embedding from WAV |
| Prefill | ~4-18s | Process prompt tokens |
| Code Generation | ~0.4-0.5 codes/s | Autoregressive generation |
| Audio Decode | ~10-15 min | Vocos decoder (8 transformer layers + upsampling) |
| Memory | 1,511 MB | Total loaded model size |
### Audio Output

| Metric | Value |
|---|---|
| Sample Rate | 24,000 Hz |
| Bit Depth | 16-bit PCM |
| Format | WAV |
| Audio Rate | 12.5 Hz codec = ~40s audio from 500 codes |
| Channels | Mono |
## Test Results

### Test 1: Short Greeting (Speaker: Ryan)

Input: "Hello, how are you today?"

| Metric | Value |
|---|---|
| Prompt Tokens | 17 |
| Load Time | 22.9s |
| Prefill Time | 17.7s |
| Code Steps | 484 |
| Code Generation | 1,197.3s (0.4 codes/s) |
| Audio Duration | ~38.7s |
| Result | WAV generated successfully |
### Test 2: Pangram (Speaker: Serena)

Input: "The quick brown fox jumps over the lazy dog."

| Metric | Value |
|---|---|
| Prompt Tokens | 20 |
| Load Time | 18.9s |
| Prefill Time | 19.2s |
| Code Steps | 499 |
| Code Generation | 1,196.4s (0.4 codes/s) |
| Audio Duration | ~39.9s |
| Result | WAV generated successfully |
### Test 3: Science Text (Speaker: Vivian)

Input: "Quantum computing represents a fundamental shift in how we process information."

| Metric | Value |
|---|---|
| Prompt Tokens | 23 |
| Load Time | 20.8s |
| Prefill Time | 18.2s |
| Code Steps | 484 |
| Code Generation | 1,193.6s (0.4 codes/s) |
| Audio Duration | ~38.7s |
| Result | WAV generated successfully |
### Test Summary

| Test | Speaker | Text Length | Codes | Audio Length | Status |
|---|---|---|---|---|---|
| Greeting | Ryan | 25 chars | 484 | ~38.7s | PASS |
| Pangram | Serena | 44 chars | 499 | ~39.9s | PASS |
| Science | Vivian | 79 chars | 484 | ~38.7s | PASS |

All three tests generated valid WAV audio files at 24 kHz with different speakers.
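The test numbers imply a real-time factor (seconds of compute per second of audio) of roughly 30x for code generation alone on this CPU-only setup, before the decode phase is counted. A quick check of that arithmetic:

```rust
/// Real-time factor: seconds of compute per second of audio produced.
fn rtf(generation_secs: f64, audio_secs: f64) -> f64 {
    generation_secs / audio_secs
}

fn main() {
    // Test 1: 1,197.3 s of code generation for ~38.7 s of audio.
    println!("RTF ≈ {:.1}x", rtf(1197.3, 38.72)); // ~30.9x real time
}
```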
## Compression Tool

Use model.exe to convert safetensors to Q4 binary format:

```
model.exe <input_dir> <output.qora-tts>
```

Example:

```
model.exe original_model model_compressed.qora-tts
```

Results:
- Size: 3.8 GB → 1.5 GB (2.5× compression)
- Loading: 150s → 1.2s (125× faster)
- Quality: ~95% of original (minimal loss)
## Technical Details

### Quantization

Q4 Format:
- 4-bit symmetric quantization
- Group size: 32 elements
- Formula: value = (quantized - 8) × scale
- Storage: 2 values per byte (4 bits each)
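The scheme above can be sketched end to end in plain Rust: per group of 32 values, store one scale plus 4-bit codes packed two per byte, and reconstruct with value = (quantized - 8) × scale. This illustrates the stated format; it is not the model.exe implementation:

```rust
/// Symmetric 4-bit quantization of one group (group_size = 32 in QORA-TTS).
/// Returns the group scale and codes in 0..=15; dequantize with
/// value = (code - 8) * scale, as stated above. Illustrative sketch only.
fn q4_quantize_group(values: &[f32]) -> (f32, Vec<u8>) {
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 7.0 };
    let codes = values.iter()
        .map(|v| ((v / scale).round() as i32 + 8).clamp(0, 15) as u8)
        .collect();
    (scale, codes)
}

fn q4_dequantize(scale: f32, codes: &[u8]) -> Vec<f32> {
    codes.iter().map(|&q| (q as i32 - 8) as f32 * scale).collect()
}

/// Pack two 4-bit codes per byte (the "2 values per byte" storage).
fn pack_nibbles(codes: &[u8]) -> Vec<u8> {
    codes.chunks(2)
        .map(|c| (c[0] & 0x0F) | (c.get(1).copied().unwrap_or(0) << 4))
        .collect()
}

fn main() {
    let vals = vec![0.7f32, -0.7, 0.1, 0.0];
    let (scale, codes) = q4_quantize_group(&vals);
    println!("scale = {}, packed = {:?}", scale, pack_nibbles(&codes));
    println!("restored = {:?}", q4_dequantize(scale, &codes));
}
```

The maximum round-trip error per value is half a quantization step (scale / 2), which is the source of the "~95% of original" quality figure above.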
Memory Breakdown:
- Talker: 950 MB (Q4)
- Code Predictor: 125 MB (Q4)
- Speech Decoder: 434 MB (Q4)
- Speaker Encoder: 45 MB (Q4)
- Total: 1,554 MB
### Generation Parameters

| Parameter | Default | Description |
|---|---|---|
| Temperature | 0.9 | Sampling randomness |
| Top-K | 50 | Top-K sampling |
| Repetition Penalty | 1.05 | Prevents repetitive codes |
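The three knobs interact at each decoding step roughly as below. This is a generic sketch of temperature + top-k + repetition-penalty sampling; the engine's exact order of operations is not documented here and may differ:

```rust
/// Generic sampling-step sketch: penalize recently emitted codes, apply
/// temperature, keep the top-k logits, then sample from their softmax.
/// `rand01` is a caller-supplied uniform [0,1) draw so the sketch stays
/// dependency-free. Not the engine's actual implementation.
fn sample_code(
    logits: &[f32],
    recent: &[usize],
    temperature: f32, // 0.9 by default
    top_k: usize,     // 50 by default
    rep_penalty: f32, // 1.05 by default
    rand01: f32,
) -> usize {
    let mut scored: Vec<(usize, f32)> = logits.iter().enumerate().map(|(i, &l)| {
        // Repetition penalty: push recently used codes toward zero logit.
        let l = if recent.contains(&i) {
            if l > 0.0 { l / rep_penalty } else { l * rep_penalty }
        } else { l };
        (i, l / temperature)
    }).collect();
    // Keep only the top-k candidates.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(top_k.max(1));
    // Softmax over the survivors, then inverse-CDF sample.
    let max = scored[0].1;
    let weights: Vec<f32> = scored.iter().map(|(_, l)| (l - max).exp()).collect();
    let total: f32 = weights.iter().sum();
    let mut acc = 0.0;
    for ((i, _), w) in scored.iter().zip(&weights) {
        acc += w / total;
        if rand01 < acc { return *i; }
    }
    scored.last().unwrap().0
}

fn main() {
    let logits = vec![0.1, 3.0, -1.0, 0.5];
    println!("picked code {}", sample_code(&logits, &[], 0.9, 50, 1.05, 0.01));
}
```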
### Special Tokens

| Token | ID | Purpose |
|---|---|---|
| TTS BOS | 151,672 | Start of TTS sequence |
| TTS EOS | 151,673 | End of TTS sequence |
| Codec BOS | 2,149 | Start of codec sequence |
| Codec EOS | 2,150 | End of codec sequence |
## QORA Model Family

| Engine | Model | Params | Size (Q4) | Purpose |
|---|---|---|---|---|
| QORA | SmolLM3-3B | 3.07B | 1.68 GB | Text generation, reasoning, chat |
| QORA-TTS | Qwen3-TTS | 1.84B | 1.5 GB | Text-to-speech synthesis with voice cloning |
| QORA-Vision (Image) | SigLIP 2 Base | 86M | 58 MB | Image embeddings, zero-shot classification |
| QORA-Vision (Video) | ViViT Base | 89M | 60 MB | Video action classification |
All engines are pure Rust, CPU-only, single-binary executables with no Python dependencies.
## System Requirements
- Windows 10/11 (x86_64)
- 2GB+ RAM (4GB recommended)
- CPU with AVX2 support
- No GPU required
- No Python runtime needed
- No external dependencies
## Features
- Pure Rust implementation (no Python!)
- Q4 quantization (75% smaller, 125× faster loading)
- Real-time voice cloning from any WAV file
- 25 pre-loaded reference voices
- CPU-only inference (no CUDA required)
- Single executable + model (fully portable)
- Multi-language support (12 languages)
- High-quality 24 kHz output
Built with QORA - Pure Rust AI Inference
Model: Qwen3-TTS-1.7B (Apache 2.0 License)
Framework: Burn (Rust ML framework)