Orpheus 3B: GGUF (ggml-quantised)
GGUF / ggml conversion of canopylabs/orpheus-3b-0.1-ft (sourced via the non-gated unsloth/orpheus-3b-0.1-ft mirror) for use with CrispStrobe/CrispASR.
Orpheus 3B is a Llama-3.2-3B-Instruct talker finetuned to emit <custom_token_N> codec tokens that the SNAC 24 kHz codec decodes back to speech. Distributed under the Llama-3.2 community license ("Built with Llama"). 8 fixed English speakers (tara, leah, jess, leo, dan, mia, zac, zoe).
Pair this with the SNAC codec at cstr/snac-24khz-GGUF: the talker outputs codec tokens but cannot render audio without it.
Files
| File | Quant | Size | Notes |
|---|---|---|---|
| orpheus-3b-base-f16.gguf | F16 | 6.2 GB | Reference quality |
| orpheus-3b-base-q8_0.gguf | Q8_0 | 3.4 GB | Recommended; ASR roundtrip word-exact vs F16 |
The talker LM is sensitive to peaked codec distributions, so we ship F16 + Q8_0 only. Sub-Q8 quants tend to break the SNAC super-frame slot pattern and produce gibberish even when the LM perplexity remains plausible.
Quick start
# 1. Build CrispASR
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target crispasr
# 2. Pull the talker + the SNAC codec
huggingface-cli download cstr/orpheus-3b-base-GGUF orpheus-3b-base-q8_0.gguf --local-dir .
huggingface-cli download cstr/snac-24khz-GGUF snac-24khz.gguf --local-dir .
# 3. Synthesise
./build/bin/crispasr --backend orpheus \
-m orpheus-3b-base-q8_0.gguf \
--codec-model snac-24khz.gguf \
--voice tara \
--temperature 0.6 \
--tts "Hello, my name is Tara." \
--tts-output hello.wav
The output is a 24 kHz mono WAV. --voice <name> picks one of the 8 baked speakers; --temperature 0.6 is the upstream engine_class.py default and is required: greedy decoding (--temperature 0) enters a 7-slot loop after a few super-frames and produces unusable audio.
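To illustrate why --temperature 0 behaves differently from --temperature 0.6, here is a minimal sketch of temperature plus top-k sampling over a list of logits. It is not the CrispASR implementation: the function name sample_top_k and the default top_k=40 are hypothetical, chosen only for illustration.

```python
import math
import random


def sample_top_k(logits, temperature=0.6, top_k=40, rng=random):
    """Pick a token index from `logits` using temperature + top-k sampling.

    temperature <= 0 degenerates to greedy argmax, the mode the text
    above warns against. top_k=40 is a hypothetical default.
    """
    if temperature <= 0:
        # Greedy: always the single highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)

    # Keep only the top_k highest-scoring candidates.
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:top_k]

    # Temperature-scaled softmax weights (shifted by the max for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    return rng.choices(top, weights=weights, k=1)[0]
```

With a non-zero temperature the choice is stochastic among the retained candidates, so repeated runs on the same prompt can differ unless the random source is seeded.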
For auto-download simply pass -m auto:
./build/bin/crispasr --backend orpheus -m auto \
--voice leo --temperature 0.6 \
--tts "Auto-download fetches both files." \
--tts-output out.wav
Quality verification
ASR roundtrip via cstr/parakeet-tdt-0.6b-v3-GGUF on F16, voice tara:
| Synthesised text | Parakeet output |
|---|---|
| "Hello, my name is Tara." | "Hello, my name is Tara." (verbatim) |
Q8_0 produces ASR-identical output on the same prompt. Validation script:
crispasr --backend orpheus -m orpheus-3b-base-q8_0.gguf \
--codec-model snac-24khz.gguf --voice tara --temperature 0.6 \
--tts "Hello, my name is Tara." --tts-output orpheus_test.wav
crispasr --backend parakeet -m parakeet-tdt-0.6b-v3-q4_k.gguf \
-f orpheus_test.wav --no-prints
# → Hello, my name is Tara.
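A word-exact roundtrip check could be automated along these lines. This is a sketch under an assumption: "word-exact" is taken to mean identical word sequences after lower-casing and stripping punctuation, which the model card does not define precisely. The helper names are hypothetical.

```python
import re


def words(text):
    """Lower-case the text and return its words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_exact(reference, transcript):
    """True when both strings contain the same words in the same order."""
    return words(reference) == words(transcript)


if __name__ == "__main__":
    prompt = "Hello, my name is Tara."
    asr_output = "Hello, my name is Tara."  # paste the Parakeet output here
    print("word-exact" if word_exact(prompt, asr_output) else "mismatch")
```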
Architecture
| Component | Details |
|---|---|
| Talker LM | Llama-3.2-3B-Instruct (28 layers, 3072 hidden, 24 heads, 8 KV heads, head_dim=128, vocab 128256 + 7×4096 codec tokens) |
| RoPE | NEOX, theta=500000 |
| Codec | hubertsiuzdak/snac_24khz (RVQ, 3 codebooks × 4096), separate GGUF |
| Sampling | temperature=0.6 + top-k by default; greedy is unstable |
| Audio | 24 kHz mono float32 PCM |
The talker emits a stream of <custom_token_N> LM tokens; every 7 emitted tokens form one "super-frame" that de-interleaves into 1 codes_0 / 2 codes_1 / 4 codes_2 entries (per orpheus_tts_pypi/orpheus_tts/decoder.py). 4 super-frames cover 16 SNAC frames (× 512-sample hop = 8192 PCM samples at 24 kHz).
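The de-interleave and the sample arithmetic above can be sketched in Python. The 1/2/4 split and the 512-sample hop come from the text; the slot order in SLOT_TO_CODEBOOK is an assumption recalled from the upstream decoder.py and should be checked against it. All names here are illustrative and are not part of the CrispASR API.

```python
SLOTS_PER_SUPERFRAME = 7
FINE_FRAMES_PER_SUPERFRAME = 4   # one codes_2 entry per SNAC frame
HOP_SAMPLES = 512
SAMPLE_RATE = 24_000

# Assumed slot order inside one super-frame (verify against decoder.py):
# slot 0 -> codes_0, slots 1 and 4 -> codes_1, slots 2, 3, 5, 6 -> codes_2.
SLOT_TO_CODEBOOK = (0, 1, 2, 2, 1, 2, 2)


def deinterleave(codes):
    """Split a flat list of per-slot codes into (codes_0, codes_1, codes_2).

    A trailing partial super-frame (fewer than 7 entries) is dropped.
    """
    usable = len(codes) - len(codes) % SLOTS_PER_SUPERFRAME
    streams = ([], [], [])
    for pos in range(usable):
        book = SLOT_TO_CODEBOOK[pos % SLOTS_PER_SUPERFRAME]
        streams[book].append(codes[pos])
    return streams


def pcm_samples(num_superframes):
    """PCM samples covered by the given number of super-frames."""
    return num_superframes * FINE_FRAMES_PER_SUPERFRAME * HOP_SAMPLES


if __name__ == "__main__":
    c0, c1, c2 = deinterleave(list(range(14)))
    print(len(c0), len(c1), len(c2))            # 2 4 8
    print(pcm_samples(4))                       # 8192
    print(pcm_samples(4) / SAMPLE_RATE)         # about 0.341 seconds
```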
Prompt format (verbatim from canopyai/Orpheus-TTS)
[audio_start=128259, BOS=128000, ...tokenize("{name}: {text}")...,
eot_id=128009, audio_eot=128260, audio_eom=128261, audio_end=128257]
The Llama-3 BOS at position 1 is critical. Without it the talker still emits well-structured super-frames but the audio is semantically garbage. The CrispASR runtime handles this for you; direct callers of orpheus_synthesize_codes need to mirror the layout.
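For direct callers, the layout above can be assembled as follows. This is a sketch: the token IDs are the ones listed in the prompt format, while build_prompt and its handling of a tokenizer-supplied BOS are assumptions for illustration. Tokenisation of "{name}: {text}" is left to the caller.

```python
AUDIO_START = 128259
BOS = 128000
EOT_ID = 128009
AUDIO_EOT = 128260
AUDIO_EOM = 128261
AUDIO_END = 128257


def build_prompt(text_ids):
    """Wrap already-tokenised "{name}: {text}" IDs in the Orpheus prompt layout.

    Assumption: if the tokenizer already prepended BOS, it is dropped here
    so that exactly one BOS ends up at position 1.
    """
    body = list(text_ids)
    if body and body[0] == BOS:
        body = body[1:]
    return [AUDIO_START, BOS, *body, EOT_ID, AUDIO_EOT, AUDIO_EOM, AUDIO_END]


if __name__ == "__main__":
    # Placeholder IDs; real ones come from the Llama-3 tokenizer.
    print(build_prompt([1, 2, 3]))
```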
Stop policy
Stop on audio_end=128257 or on >4 consecutive non-codec tokens. Don't stop on audio_pre_end=128009 or audio_end_b=128261: those overlap with Llama-3 specials in the prompt and text_N<10 reserved markers in the custom_token block; the upstream tokens_decoder filters them silently rather than terminating on them.
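The stop policy can be expressed as a small filter over the generated token stream. This is a sketch, not the runtime's code: is_codec is a caller-supplied predicate because the ID range of the codec tokens is not stated here, and it is an assumption that the two ignored IDs neither count toward nor reset the non-codec run.

```python
AUDIO_END = 128257
IGNORED = frozenset({128009, 128261})  # filtered silently, never a stop signal
MAX_NON_CODEC_RUN = 4


def collect_codec_tokens(stream, is_codec):
    """Return the codec tokens emitted before a stop condition is hit.

    Stops on AUDIO_END, or once more than MAX_NON_CODEC_RUN consecutive
    non-codec tokens have been seen.
    """
    kept = []
    run = 0
    for tok in stream:
        if tok == AUDIO_END:
            break
        if tok in IGNORED:
            continue  # assumption: does not affect the non-codec run counter
        if is_codec(tok):
            kept.append(tok)
            run = 0
        else:
            run += 1
            if run > MAX_NON_CODEC_RUN:
                break
    return kept
```

Passing the predicate in keeps the sketch independent of the tokenizer layout; a real caller would derive it from the vocabulary of the loaded checkpoint.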
Conversion
python models/convert-orpheus-to-gguf.py \
--input unsloth/orpheus-3b-0.1-ft \
--output orpheus-3b-ft-f16.gguf \
--outtype f16
build/bin/crispasr-quantize orpheus-3b-ft-f16.gguf orpheus-3b-base-q8_0.gguf q8_0
The converter sets GGUFWriter(use_temp_file=False) because the True path buffers tensor data via tempfile.SpooledTemporaryFile and collapses throughput on near-full external disks (/Volumes/backups at 100% saw multi-MB/s spooling). The direct write holds the full tensor list in RAM during emit but completes in ~30 s on the 6.6 GB f16.
Drop-in checkpoint variants
The orpheus runtime is checkpoint-agnostic: same arch, same prompt format, same SNAC codec. Future GGUF mirrors of:
- SebastianBodza/Kartoffel_Orpheus_* (German finetunes, 26 fixed speakers)
- lex-au/Orpheus-3b-German-FT-Q8_0.gguf
are checkpoint swaps. They reuse this same SNAC codec.
Attribution
- Talker base model: canopylabs/orpheus-3b-0.1-ft (Llama-3.2 community license). canopylabs / canopyai.
- Non-gated mirror used for conversion: unsloth/orpheus-3b-0.1-ft.
- Llama base: meta-llama/Llama-3.2-3B-Instruct, Llama-3.2 community license.
- SNAC codec: hubertsiuzdak/snac_24khz (MIT); see cstr/snac-24khz-GGUF.
- Reference TTS engine: canopyai/Orpheus-TTS (engine_class.py:_format_prompt, decoder.py).
- GGUF conversion + ggml runtime: CrispStrobe/CrispASR; see src/orpheus.cpp, src/orpheus_snac.cpp, models/convert-orpheus-to-gguf.py.
License
Llama-3.2 community license (inherited from the base talker). Includes the Acceptable Use Policy and the "Built with Llama" attribution requirement. Commercial use is permitted under the community license terms; review canopylabs/orpheus-3b-0.1-ft and the Llama-3.2 license before redistribution.
The SNAC codec is MIT and ships separately under cstr/snac-24khz-GGUF.