---
license: mit
tags:
- text-to-speech
- tts
- bark
- gguf
- crispasr
language:
- en
- de
- es
- fr
- it
- ja
- ko
- pl
- pt
- ru
- tr
- zh
---
# Bark Small - GGUF
[Suno Bark](https://github.com/suno-ai/bark) (MIT license) converted to GGUF for native C++ inference with [CrispASR](https://github.com/CrispStrobe/CrispASR).
## Model details
- **Architecture**: 3-stage hierarchical transformer (semantic → coarse → fine) + EnCodec decoder
- **Parameters**: ~300M total across 3 GPT-2 sub-models
- **Output**: 24 kHz mono PCM
- **Languages**: 13 languages with pre-trained speaker prompts
- **German speakers**: `v2/de_speaker_0` through `v2/de_speaker_9`
- **License**: MIT
## Quantization table
| File | Quant | Size | Quality |
|------|-------|------|---------|
| `bark-small-f16.gguf` | F16 | 809 MB | Reference |
| `bark-small-q8_0.gguf` | Q8_0 | 435 MB | Near-lossless |
| `bark-small-q4_k.gguf` | Q4_K | 235 MB | Good for real-time |
All variants pack the three sub-models (text/semantic, coarse acoustic, fine acoustic) and the EnCodec decoder into a single GGUF file, so no companion model is needed.
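To compare the variants, the file sizes in the table can be expressed relative to the F16 reference. The following is a minimal sketch: `SIZES_MB` copies the table, while `pick_variant` and its `budget_mb` parameter are hypothetical helpers for illustration and not part of CrispASR.

```python
from typing import Optional

# File sizes in MB, copied from the quantization table.
SIZES_MB = {
    "bark-small-f16.gguf": 809,
    "bark-small-q8_0.gguf": 435,
    "bark-small-q4_k.gguf": 235,
}


def relative_size(filename: str) -> float:
    """Size of a variant as a fraction of the F16 reference file."""
    return SIZES_MB[filename] / SIZES_MB["bark-small-f16.gguf"]


def pick_variant(budget_mb: float) -> Optional[str]:
    """Hypothetical helper: largest variant whose file size fits the budget."""
    fitting = {name: size for name, size in SIZES_MB.items() if size <= budget_mb}
    return max(fitting, key=fitting.get) if fitting else None


if __name__ == "__main__":
    for name in SIZES_MB:
        print(f"{name}: {relative_size(name):.0%} of F16")
    print("Fits in 500 MB:", pick_variant(500))
```

File size is only a rough proxy for runtime memory use, so treat the budget as a starting point rather than a guarantee.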
## Usage with CrispASR
```bash
# Auto-download and synthesize
crispasr --backend bark -m auto --tts "Hello, how are you today?" --tts-output hello.wav
# With a specific quantization
crispasr --backend bark -m bark-small-q4_k.gguf --tts "The quick brown fox" --tts-output fox.wav
# With a German speaker prompt (when supported)
crispasr --backend bark -m bark-small-q8_0.gguf --tts "Hallo Welt" --voice v2/de_speaker_3 --tts-output hallo.wav
```
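When calling the CLI from a script, it can help to build the argument list programmatically. The sketch below uses only the flags shown in the commands above (`--backend`, `-m`, `--tts`, `--voice`, `--tts-output`). The function name `bark_tts_command` is an assumption made for illustration, and `--voice` may not be supported in every build.

```python
import shlex
from typing import List, Optional


def bark_tts_command(text: str, output: str, model: str = "auto",
                     voice: Optional[str] = None) -> List[str]:
    """Build a crispasr argv list from the flags documented in this README."""
    argv = ["crispasr", "--backend", "bark", "-m", model, "--tts", text]
    if voice is not None:
        argv += ["--voice", voice]
    argv += ["--tts-output", output]
    return argv


if __name__ == "__main__":
    cmd = bark_tts_command("Hallo Welt", "hallo.wav",
                           model="bark-small-q8_0.gguf",
                           voice="v2/de_speaker_3")
    # Print the shell-quoted command for inspection.
    print(shlex.join(cmd))
    # To run it (requires crispasr on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with text that contains quotes or punctuation.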
## Conversion
Produced with:
```bash
python models/convert-bark-to-gguf.py --output bark-small-f16.gguf
crispasr-quantize bark-small-f16.gguf bark-small-q8_0.gguf q8_0
crispasr-quantize bark-small-f16.gguf bark-small-q4_k.gguf q4_k
```
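A quick sanity check on a converted or quantized file is to read its GGUF header. The sketch below assumes the standard GGUF layout for version 2 and later (4-byte magic `GGUF`, then a little-endian `uint32` version, `uint64` tensor count, and `uint64` metadata key/value count). `read_gguf_header` is a hypothetical helper, not part of the conversion script.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version, tensor and metadata counts."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor_count, uint64 metadata_kv_count
    version, n_tensors, n_kv = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        print(path, read_gguf_header(path))
```

This only confirms that the file starts with a plausible header; it does not validate tensor data or quantization types.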
## Architecture details
### Stage 1 - Semantic model
- GPT-2 (12 layers, 768-d) generating semantic tokens from text
- BERT WordPiece tokenizer (119547 vocab)
- Output: up to 768 semantic tokens
### Stage 2 - Coarse acoustic model
- GPT-2 (12 layers, 1024-d) converting semantic → coarse EnCodec codes
- Alternates codebook 0/1 prediction
- Output: 2 × ~384 coarse tokens
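Because the coarse model alternates between codebook 0 and codebook 1, its flat output has to be split into one row per codebook before the next stage. A minimal sketch of that de-interleaving follows. It assumes codebook 0 comes first at each timestep and leaves out any per-codebook token offsets the runtime may apply; `deinterleave_coarse` is an illustrative name.

```python
from typing import List


def deinterleave_coarse(flat: List[int], n_codebooks: int = 2) -> List[List[int]]:
    """Split [c0_t0, c1_t0, c0_t1, c1_t1, ...] into one row per codebook."""
    if len(flat) % n_codebooks != 0:
        raise ValueError("token count must be a multiple of n_codebooks")
    # Row k takes every n_codebooks-th token, starting at position k.
    return [flat[k::n_codebooks] for k in range(n_codebooks)]


if __name__ == "__main__":
    flat = [10, 20, 11, 21, 12, 22]
    print(deinterleave_coarse(flat))  # [[10, 11, 12], [20, 21, 22]]
```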
### Stage 3 - Fine acoustic model
- Non-causal GPT-2 (12 layers, 1024-d)
- Fills codebooks 2-7 from codebooks 0-1
- Output: 8 codebooks × 384 timesteps
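The fill-in step can be pictured as extending a 2 × T code matrix to 8 × T, one codebook at a time. The sketch below is illustrative only: the `predict` callable stands in for the fine model, and the placeholder id `PAD = 1024` (one past the valid code range) is an assumption rather than a documented detail of this runtime.

```python
from typing import Callable, List

N_FINE = 8      # total codebooks after stage 3
N_COARSE = 2    # codebooks 0-1 come from stage 2
PAD = 1024      # assumed placeholder id, outside the 0..1023 code range

Codes = List[List[int]]


def fill_fine(coarse: Codes, predict: Callable[[Codes, int], List[int]]) -> Codes:
    """Extend 2 x T coarse codes to 8 x T by predicting codebooks 2..7 in order."""
    steps = len(coarse[0])
    codes = [list(row) for row in coarse]
    codes += [[PAD] * steps for _ in range(N_FINE - N_COARSE)]
    for k in range(N_COARSE, N_FINE):
        # Non-causal: the predictor sees every timestep of all rows at once.
        codes[k] = predict(codes, k)
    return codes


if __name__ == "__main__":
    # Stub predictor for demonstration; a real model would run a forward pass.
    stub = lambda codes, k: [(c + k) % 1024 for c in codes[0]]
    out = fill_fine([[1, 2, 3], [4, 5, 6]], stub)
    print(len(out), "codebooks x", len(out[0]), "timesteps")
```

Rows 0 and 1 pass through unchanged, which matches the description of the fine model filling codebooks 2-7 from codebooks 0-1.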
### EnCodec decoder
- 8-codebook RVQ (1024 entries each)
- SEANet CNN decoder with ELU activation
- Upsample ratios [8, 5, 4, 2] → 24 kHz
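The upsample ratios determine how many audio samples each code timestep produces: their product is the hop length, and dividing the sample rate by it gives the code frame rate. The arithmetic is sketched below using only the figures listed in this README. The function names are illustrative.

```python
from math import prod

UPSAMPLE_RATIOS = [8, 5, 4, 2]   # from the decoder description
SAMPLE_RATE_HZ = 24_000          # 24 kHz mono output


def hop_length() -> int:
    """Audio samples produced per code timestep (product of the ratios)."""
    return prod(UPSAMPLE_RATIOS)


def frame_rate_hz() -> float:
    """Code timesteps per second of audio."""
    return SAMPLE_RATE_HZ / hop_length()


def duration_seconds(timesteps: int) -> float:
    """Audio duration decoded from a given number of code timesteps."""
    return timesteps * hop_length() / SAMPLE_RATE_HZ


if __name__ == "__main__":
    print(hop_length())           # 320
    print(frame_rate_hz())        # 75.0
    print(duration_seconds(384))  # 5.12
```

With these numbers, the 384 timesteps listed for stage 3 decode to about 5.12 seconds of audio.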
## Credits
- Original model: [Suno AI](https://github.com/suno-ai/bark) (MIT)
- GGUF conversion + C++ runtime: [CrispASR](https://github.com/CrispStrobe/CrispASR)