---
license: mit
tags:
- text-to-speech
- tts
- bark
- gguf
- crispasr
language:
- en
- de
- es
- fr
- it
- ja
- ko
- pl
- pt
- ru
- tr
- zh
---

# Bark Small GGUF

[Suno Bark](https://github.com/suno-ai/bark) (MIT license) converted to GGUF for native C++ inference with [CrispASR](https://github.com/CrispStrobe/CrispASR).

## Model details

- **Architecture**: 3-stage hierarchical transformer (semantic → coarse → fine) + EnCodec decoder
- **Parameters**: ~300M total across 3 GPT-2 sub-models
- **Output**: 24 kHz mono PCM
- **Languages**: 13 languages with pre-trained speaker prompts
- **German speakers**: `v2/de_speaker_0` through `v2/de_speaker_9`
- **License**: MIT

## Quantization table

| File | Quant | Size | Quality |
|------|-------|------|---------|
| `bark-small-f16.gguf` | F16 | 809 MB | Reference |
| `bark-small-q8_0.gguf` | Q8_0 | 435 MB | Near-lossless |
| `bark-small-q4_k.gguf` | Q4_K | 235 MB | Good for real-time |

Every variant packs the 3 sub-models (text/semantic, coarse acoustic, fine acoustic) and the EnCodec decoder into a single GGUF file, so no companion model is needed.
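
To make the size trade-off in the table concrete, the sketch below computes each file's size relative to the F16 reference and a rough bits-per-weight figure. The sizes are copied from the table. The bits-per-weight estimate assumes the F16 file stores essentially every tensor at 16 bits and ignores metadata, so it is an approximation rather than a measurement; the function names are illustrative.

```python
# File sizes in MB, copied from the quantization table.
SIZES_MB = {"F16": 809, "Q8_0": 435, "Q4_K": 235}


def relative_size(quant: str, reference: str = "F16") -> float:
    """Size of a quantized file as a fraction of the reference file."""
    return SIZES_MB[quant] / SIZES_MB[reference]


def approx_bits_per_weight(quant: str, reference_bits: float = 16.0) -> float:
    """Rough bits per weight.

    ASSUMPTION: the reference file stores ~16 bits per weight and file size
    is dominated by tensor data (metadata is ignored).
    """
    return reference_bits * relative_size(quant)


if __name__ == "__main__":
    for name in SIZES_MB:
        print(f"{name}: {relative_size(name):.0%} of F16, "
              f"~{approx_bits_per_weight(name):.1f} bits/weight")
```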

## Usage with CrispASR

```bash
# Auto-download and synthesize
crispasr --backend bark -m auto --tts "Hello, how are you today?" --tts-output hello.wav

# With a specific quantization
crispasr --backend bark -m bark-small-q4_k.gguf --tts "The quick brown fox" --tts-output fox.wav

# With a German speaker prompt (when supported)
crispasr --backend bark -m bark-small-q8_0.gguf --tts "Hallo Welt" --voice v2/de_speaker_3 --tts-output hallo.wav
```
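
For scripting, the same invocations can be assembled from Python. The sketch below only builds the argument list from the flags shown in the commands above (`--backend`, `-m`, `--tts`, `--voice`, `--tts-output`). The helper names `build_tts_command` and `synthesize` are hypothetical, and running `synthesize` assumes a `crispasr` binary is on `PATH`.

```python
import subprocess
from typing import List, Optional


def build_tts_command(text: str, output: str, model: str = "auto",
                      voice: Optional[str] = None) -> List[str]:
    """Build the argv list for a Bark TTS run (hypothetical helper).

    Only flags that appear in the usage examples are used.
    """
    cmd = ["crispasr", "--backend", "bark", "-m", model, "--tts", text]
    if voice is not None:
        cmd += ["--voice", voice]
    cmd += ["--tts-output", output]
    return cmd


def synthesize(text: str, output: str, **kwargs) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tts_command(text, output, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even when crispasr is not installed.
    print(build_tts_command("Hallo Welt", "hallo.wav",
                            model="bark-small-q8_0.gguf",
                            voice="v2/de_speaker_3"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the text contains spaces or quotation marks.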

## Conversion

Produced with:

```bash
python models/convert-bark-to-gguf.py --output bark-small-f16.gguf
crispasr-quantize bark-small-f16.gguf bark-small-q8_0.gguf q8_0
crispasr-quantize bark-small-f16.gguf bark-small-q4_k.gguf q4_k
```

## Architecture details

### Stage 1: Semantic model
- GPT-2 (12 layers, 768-d) generating semantic tokens from text
- BERT WordPiece tokenizer (119547 vocab)
- Output: up to 768 semantic tokens

### Stage 2: Coarse acoustic model
- GPT-2 (12 layers, 1024-d) converting semantic → coarse EnCodec codes
- Alternates codebook 0/1 prediction
- Output: 2 × ~384 coarse tokens
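
Alternating prediction means this stage emits one flat sequence in which codebook 0 and codebook 1 entries take turns. A minimal sketch of splitting such a sequence into two rows follows. It assumes the order is codebook 0 first, then codebook 1, for each timestep, and it omits any token-id offsets a real implementation may apply; `deinterleave_coarse` is a hypothetical name.

```python
from typing import List

N_COARSE_CODEBOOKS = 2  # codebooks 0 and 1, as described in the list above


def deinterleave_coarse(flat: List[int]) -> List[List[int]]:
    """Split [c0_t0, c1_t0, c0_t1, c1_t1, ...] into [[c0, ...], [c1, ...]].

    ASSUMPTION: codebook 0 comes first within each timestep.
    """
    if len(flat) % N_COARSE_CODEBOOKS != 0:
        raise ValueError("flat sequence must contain whole timesteps")
    # Every second element, starting at offset k, belongs to codebook k.
    return [flat[k::N_COARSE_CODEBOOKS] for k in range(N_COARSE_CODEBOOKS)]


if __name__ == "__main__":
    rows = deinterleave_coarse([10, 20, 11, 21, 12, 22])
    print(rows)  # [[10, 11, 12], [20, 21, 22]]
```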

### Stage 3: Fine acoustic model
- Non-causal GPT-2 (12 layers, 1024-d)
- Fills codebooks 2-7 from codebooks 0-1
- Output: 8 codebooks × 384 timesteps
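
As a data-layout sketch, the fine stage can be pictured as working on an 8-row grid in which rows 0-1 are given and rows 2-7 start out empty. The code below only builds that grid and does not show the prediction step. The placeholder value `PAD = 1024` (one past the largest valid index of a 1024-entry codebook) and the name `init_fine_grid` are assumptions for illustration.

```python
from typing import List

N_CODEBOOKS = 8  # total codebooks, as listed above
N_COARSE = 2     # rows 0-1 come from the coarse stage
PAD = 1024       # ASSUMED placeholder meaning "not yet predicted"


def init_fine_grid(coarse: List[List[int]]) -> List[List[int]]:
    """Return an 8 x T grid: coarse rows copied, rows 2-7 filled with PAD."""
    if len(coarse) != N_COARSE or len(coarse[0]) != len(coarse[1]):
        raise ValueError("expected two coarse rows of equal length")
    timesteps = len(coarse[0])
    grid = [list(row) for row in coarse]  # copy rows 0-1
    grid += [[PAD] * timesteps for _ in range(N_CODEBOOKS - N_COARSE)]
    return grid


if __name__ == "__main__":
    grid = init_fine_grid([[10, 11, 12], [20, 21, 22]])
    print(len(grid), len(grid[0]))  # 8 3
```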

### EnCodec decoder
- 8-codebook RVQ (1024 entries each)
- SEANet CNN decoder with ELU activation
- Upsample ratios [8, 5, 4, 2] → 24 kHz
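
The upsample ratios determine how many audio samples each code frame produces, which in turn fixes the frame rate and the clip duration. The sketch below works through that arithmetic using only numbers from this card (the ratios, 24 kHz, 384 timesteps, 8 codebooks of 1024 entries); the variable and function names are illustrative.

```python
import math

SAMPLE_RATE = 24_000            # Hz, from the model details
UPSAMPLE_RATIOS = [8, 5, 4, 2]  # decoder upsample ratios listed above
N_CODEBOOKS = 8
CODEBOOK_SIZE = 1024

# Each code frame is expanded by the product of the upsample ratios.
samples_per_frame = math.prod(UPSAMPLE_RATIOS)       # 8 * 5 * 4 * 2 = 320
frames_per_second = SAMPLE_RATE / samples_per_frame  # 24000 / 320 = 75.0


def duration_seconds(timesteps: int) -> float:
    """Audio duration produced by a given number of code frames."""
    return timesteps * samples_per_frame / SAMPLE_RATE


# Each frame carries one index per codebook; log2(1024) = 10 bits per index.
bits_per_frame = N_CODEBOOKS * math.log2(CODEBOOK_SIZE)
bitrate_kbps = bits_per_frame * frames_per_second / 1000

if __name__ == "__main__":
    print(f"{samples_per_frame} samples/frame, {frames_per_second:.0f} frames/s")
    print(f"384 timesteps = {duration_seconds(384):.2f} s of audio")
    print(f"code bitrate = {bitrate_kbps:.1f} kbps")
```

Under these numbers, the 384 timesteps quoted for the fine stage correspond to a little over five seconds of audio per generated window.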

## Credits

- Original model: [Suno AI](https://github.com/suno-ai/bark) (MIT)
- GGUF conversion + C++ runtime: [CrispASR](https://github.com/CrispStrobe/CrispASR)
