license: mit
tags:
  - text-to-speech
  - tts
  - bark
  - gguf
  - crispasr
language:
  - en
  - de
  - es
  - fr
  - hi
  - it
  - ja
  - ko
  - pl
  - pt
  - ru
  - tr
  - zh

Bark Small — GGUF

Suno's Bark text-to-speech model (small variant, MIT license), converted to GGUF for native C++ inference with CrispASR.

Model details

Architecture: 3-stage hierarchical transformer (semantic → coarse → fine) + EnCodec decoder
Parameters: ~300M total across 3 GPT-2 sub-models
Output: 24 kHz mono PCM
Languages: 13 languages with pre-trained speaker prompts
German speakers: v2/de_speaker_0 through v2/de_speaker_9
License: MIT

Quantization table

| File | Quant | Size | Quality |
|---|---|---|---|
| `bark-small-f16.gguf` | F16 | 809 MB | Reference |
| `bark-small-q8_0.gguf` | Q8_0 | 435 MB | Near-lossless |
| `bark-small-q4_k.gguf` | Q4_K | 235 MB | Good for real-time |

All variants pack the three sub-models (text/semantic, coarse acoustic, fine acoustic) and the EnCodec decoder into a single GGUF file, so no companion model file is needed.

Usage with CrispASR

# Auto-download and synthesize
crispasr --backend bark -m auto --tts "Hello, how are you today?" --tts-output hello.wav

# With a specific quantization
crispasr --backend bark -m bark-small-q4_k.gguf --tts "The quick brown fox" --tts-output fox.wav

# With a German speaker prompt (when supported)
crispasr --backend bark -m bark-small-q8_0.gguf --tts "Hallo Welt" --voice v2/de_speaker_3 --tts-output hallo.wav
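
To call the CLI from a script, the commands above can be assembled programmatically. The following is a minimal sketch, not part of CrispASR: `build_bark_command` is a hypothetical helper, and it uses only the flags shown in this README (`--backend`, `-m`, `--tts`, `--voice`, `--tts-output`). It assumes `crispasr` is on your `PATH`.

```python
import shlex


def build_bark_command(text, output_path, model="auto", voice=None):
    """Assemble an argv list for a crispasr Bark synthesis call.

    Hypothetical helper: only flags that appear in this README are used.
    """
    argv = ["crispasr", "--backend", "bark", "-m", model, "--tts", text]
    if voice is not None:
        # Speaker prompt, e.g. "v2/de_speaker_3"
        argv += ["--voice", voice]
    argv += ["--tts-output", output_path]
    return argv


cmd = build_bark_command(
    "Hallo Welt",
    "hallo.wav",
    model="bark-small-q8_0.gguf",
    voice="v2/de_speaker_3",
)
# Show the shell-quoted form of the command for inspection.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the input text contains quotes or other special characters.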

Conversion

Produced with:

python models/convert-bark-to-gguf.py --output bark-small-f16.gguf
crispasr-quantize bark-small-f16.gguf bark-small-q8_0.gguf q8_0
crispasr-quantize bark-small-f16.gguf bark-small-q4_k.gguf q4_k

Architecture details

Stage 1 — Semantic model

GPT-2 (12 layers, 768-d) generating semantic tokens from text
BERT WordPiece tokenizer (119547 vocab)
Output: up to 768 semantic tokens

Stage 2 — Coarse acoustic model

GPT-2 (12 layers, 1024-d) converting semantic → coarse EnCodec codes
Alternates codebook 0/1 prediction
Output: 2 × ~384 coarse tokens
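
Because the coarse model alternates between codebook 0 and codebook 1, its output is a single interleaved token stream that has to be split into two rows before the fine stage. A minimal sketch of that split, assuming the stream is ordered `[cb0_t0, cb1_t0, cb0_t1, cb1_t1, ...]` and that any per-codebook vocabulary offsets have already been removed (`deinterleave_coarse` is an illustrative name, not a CrispASR function):

```python
def deinterleave_coarse(flat_tokens, n_codebooks=2):
    """Split an interleaved coarse token stream into one list per codebook.

    Assumes tokens alternate codebook 0, codebook 1, codebook 0, ...
    """
    if len(flat_tokens) % n_codebooks != 0:
        raise ValueError("stream length must be a multiple of n_codebooks")
    # Every n_codebooks-th token, starting at offset k, belongs to codebook k.
    return [flat_tokens[k::n_codebooks] for k in range(n_codebooks)]


# Toy stream of 3 timesteps: values 1x come from codebook 0, 2x from codebook 1.
rows = deinterleave_coarse([10, 20, 11, 21, 12, 22])
print(rows)  # [[10, 11, 12], [20, 21, 22]]
```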

Stage 3 — Fine acoustic model

Non-causal GPT-2 (12 layers, 1024-d)
Fills codebooks 2-7 from codebooks 0-1
Output: 8 codebooks × 384 timesteps

EnCodec decoder

8-codebook RVQ (1024 entries each)
SEANet CNN decoder with ELU activation
Upsample ratios [8, 5, 4, 2] → 24 kHz
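
The upsample ratios determine how many audio samples each code timestep produces. A short worked calculation, assuming the decoder's total upsampling factor is simply the product of the listed ratios with no further resampling:

```python
import math

SAMPLE_RATE_HZ = 24_000
UPSAMPLE_RATIOS = [8, 5, 4, 2]

# Samples generated per code timestep (assumed: product of the ratios).
hop = math.prod(UPSAMPLE_RATIOS)       # 8 * 5 * 4 * 2 = 320
frame_rate_hz = SAMPLE_RATE_HZ / hop   # 24000 / 320 = 75.0


def samples_for(timesteps):
    """Number of PCM samples decoded from a given number of code timesteps."""
    return timesteps * hop


n_samples = samples_for(384)               # 384 * 320 = 122880
duration_s = n_samples / SAMPLE_RATE_HZ    # 5.12 seconds
print(hop, frame_rate_hz, n_samples, duration_s)
```

Under that assumption, the 384 timesteps quoted for the fine stage correspond to about 5.12 seconds of audio at 75 code frames per second.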

Credits

Original model: Suno AI (MIT)
GGUF conversion + C++ runtime: CrispASR