---
license: mit
tags:
  - text-to-speech
  - tts
  - bark
  - gguf
  - crispasr
language:
  - en
  - de
  - es
  - fr
  - hi
  - it
  - ja
  - ko
  - pl
  - pt
  - ru
  - tr
  - zh
---

# Bark Small — GGUF

[Suno Bark](https://github.com/suno-ai/bark) (MIT license) converted to GGUF for native C++ inference with [CrispASR](https://github.com/CrispStrobe/CrispASR).

## Model details

- **Architecture**: 3-stage hierarchical transformer (semantic → coarse → fine) + EnCodec decoder
- **Parameters**: ~300M total across 3 GPT-2 sub-models
- **Output**: 24 kHz mono PCM
- **Languages**: 13 languages with pre-trained speaker prompts
- **German speakers**: `v2/de_speaker_0` through `v2/de_speaker_9`
- **License**: MIT

## Quantization table

| File | Quant | Size | Quality |
|------|-------|------|---------|
| `bark-small-f16.gguf` | F16 | 809 MB | Reference |
| `bark-small-q8_0.gguf` | Q8_0 | 435 MB | Near-lossless |
| `bark-small-q4_k.gguf` | Q4_K | 235 MB | Good for real-time |

Each variant packs the three sub-models (text/semantic, coarse acoustic, fine acoustic) and the EnCodec decoder into a single GGUF file, so no companion model file is needed.
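
To compare the variants at a glance, the sizes in the table can be expressed relative to the F16 reference. This is only a small sketch; the numbers are copied from the table, and the helper name `relative_size` is made up for illustration.

```python
# File sizes in MB, copied from the quantization table above.
SIZES_MB = {
    "bark-small-f16.gguf": 809,
    "bark-small-q8_0.gguf": 435,
    "bark-small-q4_k.gguf": 235,
}


def relative_size(name, reference="bark-small-f16.gguf"):
    """Return the size of `name` as a fraction of the reference file."""
    return SIZES_MB[name] / SIZES_MB[reference]


for name in SIZES_MB:
    print(f"{name}: {relative_size(name):.0%} of F16")
```

By this measure Q8_0 is roughly 54% and Q4_K roughly 29% of the F16 file size.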

## Usage with CrispASR

```bash
# Auto-download and synthesize
crispasr --backend bark -m auto --tts "Hello, how are you today?" --tts-output hello.wav

# With a specific quantization
crispasr --backend bark -m bark-small-q4_k.gguf --tts "The quick brown fox" --tts-output fox.wav

# With a German speaker prompt (when supported)
crispasr --backend bark -m bark-small-q8_0.gguf --tts "Hallo Welt" --voice v2/de_speaker_3 --tts-output hallo.wav
```
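
For batch runs, the same commands can be assembled from a script. The sketch below builds one command per German speaker prompt using only the flags shown above. The helper name `bark_command` and the output file names are hypothetical, and `--voice` is assumed to behave as in the last example, which may not hold for every build.

```python
import shlex


def bark_command(text, output, model="auto", voice=None):
    """Build a crispasr argument list using only the flags shown above."""
    argv = ["crispasr", "--backend", "bark", "-m", model,
            "--tts", text, "--tts-output", output]
    if voice is not None:
        # Assumption: --voice is accepted as in the German speaker example.
        argv += ["--voice", voice]
    return argv


# One command per German speaker prompt, v2/de_speaker_0 .. v2/de_speaker_9.
commands = [
    bark_command("Hallo Welt", f"hallo_{i}.wav",
                 model="bark-small-q8_0.gguf", voice=f"v2/de_speaker_{i}")
    for i in range(10)
]

for argv in commands:
    # Print a shell-safe command line; to execute instead, pass argv to
    # subprocess.run(argv, check=True).
    print(shlex.join(argv))
```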

## Conversion

The files in the quantization table were produced with:

```bash
python models/convert-bark-to-gguf.py --output bark-small-f16.gguf
crispasr-quantize bark-small-f16.gguf bark-small-q8_0.gguf q8_0
crispasr-quantize bark-small-f16.gguf bark-small-q4_k.gguf q4_k
```
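
A quick sanity check after conversion is to read the fixed-size GGUF header. The sketch below assumes the GGUF v2+ layout (4-byte magic `GGUF`, then a little-endian `uint32` version, `uint64` tensor count, and `uint64` metadata key/value count). The function names are illustrative and not part of CrispASR.

```python
import struct


def parse_gguf_header(data):
    """Parse the first 24 bytes of a GGUF file (v2+ layout, little-endian)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:24])
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


def read_gguf_header(path):
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(24))
```

For example, `read_gguf_header("bark-small-f16.gguf")` returns the version and the tensor and metadata counts, or raises `ValueError` if the file does not start with the GGUF magic.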

## Architecture details

### Stage 1 — Semantic model
- GPT-2 (12 layers, 768-d) generating semantic tokens from text
- BERT WordPiece tokenizer (119547 vocab)
- Output: up to 768 semantic tokens

### Stage 2 — Coarse acoustic model
- GPT-2 (12 layers, 1024-d) converting semantic → coarse EnCodec codes
- Alternates codebook 0/1 prediction
- Output: 2 × ~384 coarse tokens
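
The alternating prediction means the coarse stage emits one flat token stream that is later split per codebook. A minimal sketch of that split, assuming the order `[c0_t0, c1_t0, c0_t1, c1_t1, ...]` and ignoring any per-codebook token-id offsets the runtime may apply:

```python
def deinterleave_coarse(flat_tokens, n_codebooks=2):
    """Split an interleaved coarse token stream into one list per codebook."""
    if len(flat_tokens) % n_codebooks != 0:
        raise ValueError("token count must be a multiple of n_codebooks")
    # Every n_codebooks-th token, starting at offset k, belongs to codebook k.
    return [flat_tokens[k::n_codebooks] for k in range(n_codebooks)]


flat = [10, 20, 11, 21, 12, 22]
print(deinterleave_coarse(flat))  # [[10, 11, 12], [20, 21, 22]]
```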

### Stage 3 — Fine acoustic model
- Non-causal GPT-2 (12 layers, 1024-d)
- Fills codebooks 2-7 from codebooks 0-1
- Output: 8 codebooks × 384 timesteps
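
"Non-causal" here means every timestep can attend to every other timestep, unlike the causal masking typical of autoregressive GPT-2 decoding. The sketch below contrasts the two masks in plain Python; it is purely illustrative and not how CrispASR implements attention.

```python
def attention_mask(n, causal):
    """Return an n x n mask; mask[i][j] is True if position i may attend to j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


print("causal:")
for row in attention_mask(3, causal=True):
    print(row)

print("non-causal:")
for row in attention_mask(3, causal=False):
    print(row)
```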

### EnCodec decoder
- 8-codebook RVQ (1024 entries each)
- SEANet CNN decoder with ELU activation
- Upsample ratios [8, 5, 4, 2] → 24 kHz
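
The numbers above fix the codec frame rate and bitrate. A short worked calculation, using only the values listed in this section (the 384-frame example is the timestep count from Stage 3):

```python
import math

SAMPLE_RATE = 24_000           # Hz, output sample rate
UPSAMPLE_RATIOS = [8, 5, 4, 2]
N_CODEBOOKS = 8
CODEBOOK_SIZE = 1024

hop = math.prod(UPSAMPLE_RATIOS)                    # samples per code frame
frame_rate = SAMPLE_RATE / hop                      # code frames per second
bits_per_code = math.log2(CODEBOOK_SIZE)            # bits per codebook index
bitrate = frame_rate * N_CODEBOOKS * bits_per_code  # bits per second


def duration_seconds(n_frames):
    """Audio duration produced by decoding n_frames code frames."""
    return n_frames * hop / SAMPLE_RATE


print(hop, frame_rate, bitrate, duration_seconds(384))
```

This gives 320 samples per frame, 75 frames per second, 6 kbps across the 8 codebooks, and 5.12 s of audio for 384 frames.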

## Credits

- Original model: [Suno AI](https://github.com/suno-ai/bark) (MIT)
- GGUF conversion + C++ runtime: [CrispASR](https://github.com/CrispStrobe/CrispASR)