Orbination Whisper AI
Quantization-aware compression of whisper-large-v3-turbo to a compact 368 MB, multilingual,
CPU/GPU speech-to-text model (GGUF / whisper.cpp).
These are quantized GGUF checkpoints of a fine-tuned whisper-large-v3-turbo, produced with
Q3_K-matched quantization-aware training (QAT) so that accuracy survives 3-bit quantization.
A companion Go runtime (CPU/GPU hybrid, no PyTorch at runtime) is on GitHub.
➡️ Code, Go runtime & prebuilt binaries: https://github.com/amichail-1/Orbination-Whisper-AI
Files
| File | Size | Role |
|---|---|---|
| ggml-large-v3-turbo-q3_k.bin | 368 MB | smallest |
| ggml-large-v3-turbo-q4_k.bin | 474 MB | balanced |
| ggml-large-v3-turbo-q5_k.bin | 574 MB | best accuracy |
Results: WER on held-out FLEURS (real speech), beam search
| Model | Size | English | Spanish | French | Greek |
|---|---|---|---|---|---|
| Q3_K | 368 MB | 0.065 | 0.050 | 0.065 | 0.148 |
| Q4_K | 474 MB | 0.062 | 0.048 | 0.063 | 0.124 |
| Q5_K | 574 MB | 0.061 | 0.047 | 0.061 | 0.110 |
| FP16 (upper bound) | 1.6 GB | 0.061 | 0.046 | 0.060 | 0.108 |
High-resource languages stay essentially flat across precisions; the custom kernel's largest gains appear on quantization-sensitive content (Greek WER: 0.285 → 0.148 at equal size).
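To put a number on "essentially flat", the short sketch below computes the absolute and relative WER gap of each quantized file against the FP16 row. The values are copied from the table above; the helper name `gap_vs_fp16` is made up for this illustration.

```python
# WER values copied from the results table above (FLEURS, beam search).
WER = {
    "Q3_K": {"English": 0.065, "Spanish": 0.050, "French": 0.065, "Greek": 0.148},
    "Q4_K": {"English": 0.062, "Spanish": 0.048, "French": 0.063, "Greek": 0.124},
    "Q5_K": {"English": 0.061, "Spanish": 0.047, "French": 0.061, "Greek": 0.110},
    "FP16": {"English": 0.061, "Spanish": 0.046, "French": 0.060, "Greek": 0.108},
}


def gap_vs_fp16(model: str, language: str) -> tuple[float, float]:
    """Return (absolute, relative) WER increase of `model` over the FP16 reference."""
    reference = WER["FP16"][language]
    absolute = WER[model][language] - reference
    return absolute, absolute / reference


if __name__ == "__main__":
    for model in ("Q3_K", "Q4_K", "Q5_K"):
        for language in WER["FP16"]:
            absolute, relative = gap_vs_fp16(model, language)
            print(f"{model} {language:8s} +{absolute:.3f} WER ({relative:+.1%} vs FP16)")
```

At Q3_K the absolute gap is at most 0.005 for English, Spanish, and French, against 0.040 for Greek, which is why Q4_K or Q5_K may be the better choice for Greek-heavy workloads.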
Method (short)
whisper-large-v3-turbo has a shallow 4-layer decoder, so naive ≤3-bit quantization collapses it.
We train with the exact ggml Q3_K quantizer in the forward pass (straight-through estimator in
the backward pass) plus teacher distillation from the FP16 model. Because the training-time quantizer
matches the deployed one, the exported standard Q3_K GGUF deploys at the trained error rate with no
train/inference gap. Decoding uses beam search (size 5), which removes the repetition loops that
inflate greedy WER.
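The straight-through idea can be illustrated on a toy problem. The sketch below is not the training code behind these checkpoints. It swaps in a simplified symmetric 3-bit rounding (`fake_quant_3bit`, a hypothetical stand-in, not the ggml Q3_K format) and a one-layer linear "student" whose targets come from a fixed FP "teacher" weight vector. It only shows the mechanism: the forward pass sees quantized weights, while the gradient is applied to the latent full-precision weights as if quantization were the identity.

```python
import random


def fake_quant_3bit(block):
    """Round one block of weights to 8 levels (integers -4..3 times a shared scale).

    Simplified stand-in for illustration; it is NOT the ggml Q3_K format.
    """
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return [0.0 for _ in block]
    scale = amax / 4.0
    return [scale * max(-4, min(3, round(w / scale))) for w in block]


def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def qat_step(latent, x, y, lr):
    """One SGD step on squared error with a straight-through estimator.

    Forward: the prediction uses the quantized weights.
    Backward: d(quantized)/d(latent) is treated as 1, so the gradient computed
    at the quantized weights is applied directly to the latent FP weights.
    """
    err = predict(fake_quant_3bit(latent), x) - y
    return [w - lr * 2.0 * err * xi for w, xi in zip(latent, x)]


def quantized_loss(latent, data):
    """Mean squared error of the quantized model, i.e. what would be deployed."""
    wq = fake_quant_3bit(latent)
    return sum((predict(wq, x) - y) ** 2 for x, y in data) / len(data)


def make_data(teacher, n, rng):
    """Targets are the outputs of the full-precision teacher weights."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in teacher]
        data.append((x, predict(teacher, x)))
    return data


def train(teacher, steps=2000, lr=0.05, seed=0):
    """Return the quantized-model loss before and after training."""
    rng = random.Random(seed)
    data = make_data(teacher, 256, rng)
    latent = [0.0] * len(teacher)
    before = quantized_loss(latent, data)
    for _ in range(steps):
        x, y = rng.choice(data)
        latent = qat_step(latent, x, y, lr)
    return before, quantized_loss(latent, data)


if __name__ == "__main__":
    before, after = train([0.9, -0.45, 0.2, -0.7, 0.05, 0.6, -0.3, 0.35])
    print(f"quantized loss before: {before:.4f}, after: {after:.4f}")
```

The loss is always measured on the quantized weights, so the number being optimized is the one the deployed model will produce. This is the same reason the card gives for the absence of a train/inference gap.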
The 368 MB floor is set by the token-embedding quantization (whisper.cpp compresses the 253 MB embedding to 3-bit); use Q4_K/Q5_K to give it more bits and lower WER further.
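A rough back-of-the-envelope check of the embedding size is sketched below. The vocabulary size (51,866) and model width (1,280) are assumptions taken from the public whisper-large-v3 configuration and are not stated in this card. The bytes per 256-weight super-block are those of the ggml k-quant block layouts. Treat the output as an estimate only.

```python
# Assumed shape of the token embedding (public whisper-large-v3 config,
# not stated in this card).
VOCAB_SIZE = 51_866
D_MODEL = 1_280

# Bytes per 256-weight super-block in the ggml k-quant formats.
BLOCK_BYTES = {"Q3_K": 110, "Q4_K": 144, "Q5_K": 176}
BLOCK_WEIGHTS = 256


def kquant_bits_per_weight(fmt: str) -> float:
    """Effective bits per weight, including per-block scales."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_WEIGHTS


def embedding_mib(bytes_per_weight: float) -> float:
    """Size of the token-embedding matrix in MiB at the given storage cost."""
    return VOCAB_SIZE * D_MODEL * bytes_per_weight / 2**20


if __name__ == "__main__":
    print(f"FP32: {embedding_mib(4.0):.1f} MiB")
    print(f"FP16: {embedding_mib(2.0):.1f} MiB")
    for fmt in BLOCK_BYTES:
        bpw = kquant_bits_per_weight(fmt)
        print(f"{fmt}: {bpw:.4f} bits/weight, {embedding_mib(bpw / 8):.1f} MiB")
```

Under these assumptions the FP32 embedding comes to about 253 MiB, which is consistent with the 253 MB figure above if that figure refers to 32-bit storage.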
Usage (whisper.cpp)
```bash
# download a model
huggingface-cli download antoniosmich/Orbination-Whisper-AI ggml-large-v3-turbo-q3_k.bin --local-dir .

# run with whisper.cpp (16 kHz mono WAV)
./whisper-cli -m ggml-large-v3-turbo-q3_k.bin -bs 5 -l en audio.wav
```
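Since the command expects a 16 kHz mono WAV, it can help to validate the input before invoking the CLI. The sketch below uses only the Python standard library. The helper names are hypothetical, the 16-bit PCM check is an assumption based on whisper.cpp's usual input requirements, and the command line reuses only the flags shown in the usage snippet above.

```python
import subprocess
import wave


def check_wav_for_whisper_cpp(path: str) -> list[str]:
    """Return a list of problems; an empty list means 16 kHz, mono, 16-bit PCM."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16_000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:  # assumption: 16-bit samples are required
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return problems


def whisper_cli_command(model: str, audio: str, language: str = "en",
                        beam_size: int = 5) -> list[str]:
    """Build the same command line as in the usage snippet."""
    return ["./whisper-cli", "-m", model, "-bs", str(beam_size), "-l", language, audio]


def transcribe(model: str, audio: str, language: str = "en") -> str:
    """Validate the WAV file, then run whisper-cli and return its stdout."""
    problems = check_wav_for_whisper_cpp(audio)
    if problems:
        raise ValueError("; ".join(problems))
    result = subprocess.run(whisper_cli_command(model, audio, language),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `transcribe("ggml-large-v3-turbo-q3_k.bin", "audio.wav")` raises a `ValueError` listing the mismatches for a 44.1 kHz stereo file instead of passing it to the CLI.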
Or use the Orbination Go runtime (CPU/GPU hybrid, CLI + HTTP server) from the GitHub repo.
License & attribution
MIT © 2026 Leia Enterprise Solutions (www.leia.gr), an Orbination application (www.orbination.com).
Built on openai/whisper and ggerganov/whisper.cpp; evaluated on FLEURS.