metadata
license: mit
tags:
- text-to-speech
- tts
- bark
- gguf
- crispasr
language:
- en
- de
- es
- fr
- it
- ja
- ko
- pl
- pt
- ru
- tr
- zh
Bark Small — GGUF
Suno Bark (MIT license) converted to GGUF for native C++ inference with CrispASR.
Model details
- Architecture: 3-stage hierarchical transformer (semantic → coarse → fine) + EnCodec decoder
- Parameters: ~300M total across 3 GPT-2 sub-models
- Output: 24 kHz mono PCM
- Languages: 13 languages with pre-trained speaker prompts
- German speakers: v2/de_speaker_0 through v2/de_speaker_9
- License: MIT
Quantization table
| File | Quant | Size | Quality |
|---|---|---|---|
| bark-small-f16.gguf | F16 | 809 MB | Reference |
| bark-small-q8_0.gguf | Q8_0 | 435 MB | Near-lossless |
| bark-small-q4_k.gguf | Q4_K | 235 MB | Good for real-time |
All variants pack the 3 sub-models (text/semantic, coarse acoustic, fine acoustic) + EnCodec decoder into a single GGUF file. No companion model needed.
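As a rough guide to what each quantization saves, the sizes in the table can be compared against the F16 reference. The sketch below uses only the figures listed above; the dictionary and function names are illustrative and not part of CrispASR.

```python
# Sizes in MB, copied from the quantization table above.
SIZES_MB = {
    "bark-small-f16.gguf": 809,
    "bark-small-q8_0.gguf": 435,
    "bark-small-q4_k.gguf": 235,
}


def relative_size(name, reference="bark-small-f16.gguf"):
    """Return a file's size as a fraction of the reference file's size."""
    return SIZES_MB[name] / SIZES_MB[reference]


for name in SIZES_MB:
    print(f"{name}: {relative_size(name):.0%} of F16")
```

These ratios describe disk footprint only; they say nothing about synthesis quality or speed.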
Usage with CrispASR
# Auto-download and synthesize
crispasr --backend bark -m auto --tts "Hello, how are you today?" --tts-output hello.wav
# With a specific quantization
crispasr --backend bark -m bark-small-q4_k.gguf --tts "The quick brown fox" --tts-output fox.wav
# With a German speaker prompt (when supported)
crispasr --backend bark -m bark-small-q8_0.gguf --tts "Hallo Welt" --voice v2/de_speaker_3 --tts-output hallo.wav
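To drive the same commands from a script, one option is a thin Python wrapper around the CLI. This is a minimal sketch, assuming the crispasr binary is on PATH and accepts exactly the flags shown above; the helper names bark_command and synthesize are hypothetical.

```python
import shutil
import subprocess


def bark_command(text, output, model="auto", voice=None):
    """Build the crispasr argument list for one synthesis call.

    Only flags that appear in the usage examples above are used.
    """
    cmd = ["crispasr", "--backend", "bark", "-m", model, "--tts", text]
    if voice is not None:
        cmd += ["--voice", voice]
    cmd += ["--tts-output", output]
    return cmd


def synthesize(text, output, **kwargs):
    """Run crispasr, failing early with a clear error if it is not installed."""
    if shutil.which("crispasr") is None:
        raise FileNotFoundError("crispasr not found on PATH")
    subprocess.run(bark_command(text, output, **kwargs), check=True)


# Print the command line for the German example without running it.
print(" ".join(bark_command("Hallo Welt", "hallo.wav",
                            model="bark-small-q8_0.gguf",
                            voice="v2/de_speaker_3")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the text contains spaces or punctuation.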
Conversion
Produced with:
python models/convert-bark-to-gguf.py --output bark-small-f16.gguf
crispasr-quantize bark-small-f16.gguf bark-small-q8_0.gguf q8_0
crispasr-quantize bark-small-f16.gguf bark-small-q4_k.gguf q4_k
Architecture details
Stage 1 — Semantic model
- GPT-2 (12 layers, 768-d) generating semantic tokens from text
- BERT WordPiece tokenizer (119547 vocab)
- Output: up to 768 semantic tokens
Stage 2 — Coarse acoustic model
- GPT-2 (12 layers, 1024-d) converting semantic → coarse EnCodec codes
- Alternates codebook 0/1 prediction
- Output: 2 × ~384 coarse tokens
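One way to read "alternates codebook 0/1 prediction" is that the coarse model emits a single flat stream in which even positions belong to codebook 0 and odd positions to codebook 1. The sketch below de-interleaves such a stream into two rows. The flat layout and the function name are assumptions for illustration, not taken from the CrispASR source, and any vocabulary offsets the real model applies are ignored.

```python
def deinterleave_coarse(flat_tokens, n_codebooks=2):
    """Split a flat, interleaved token stream into n_codebooks rows.

    Row k holds the tokens at positions k, k + n_codebooks, ...
    A trailing partial frame is dropped so all rows have equal length.
    """
    flat = list(flat_tokens)
    usable = len(flat) - len(flat) % n_codebooks
    return [flat[k:usable:n_codebooks] for k in range(n_codebooks)]


# Toy stream ordered as c0[0], c1[0], c0[1], c1[1], c0[2], c1[2].
stream = [10, 20, 11, 21, 12, 22]
codes = deinterleave_coarse(stream)
print(codes[0])  # codebook 0 over time
print(codes[1])  # codebook 1 over time
```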
Stage 3 — Fine acoustic model
- Non-causal GPT-2 (12 layers, 1024-d)
- Fills codebooks 2-7 from codebooks 0-1
- Output: 8 codebooks × 384 timesteps
EnCodec decoder
- 8-codebook RVQ (1024 entries each)
- SEANet CNN decoder with ELU activation
- Upsample ratios [8, 5, 4, 2] → 24 kHz
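The upsample ratios determine how many PCM samples each codebook timestep expands to: their product is the decoder's hop length. A short worked calculation follows, using the 384-timestep figure from Stage 3 as the example length; the variable and function names are illustrative.

```python
from math import prod

SAMPLE_RATE = 24_000            # Hz, from "Output: 24 kHz mono PCM"
UPSAMPLE_RATIOS = [8, 5, 4, 2]  # from the EnCodec decoder list above

hop_length = prod(UPSAMPLE_RATIOS)     # PCM samples per codebook timestep
frame_rate = SAMPLE_RATE / hop_length  # codebook timesteps per second


def duration_seconds(n_timesteps):
    """Audio duration produced by decoding n_timesteps codebook frames."""
    return n_timesteps * hop_length / SAMPLE_RATE


print(hop_length)             # 320
print(frame_rate)             # 75.0
print(duration_seconds(384))  # 5.12
```

Under these figures, 384 timesteps correspond to about 5.12 seconds of audio.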