	---
	license: mit
	tags:
	- text-to-speech
	- tts
	- bark
	- gguf
	- crispasr
	language:
	- en
	- de
	- es
	- fr
	- hi
	- it
	- ja
	- ko
	- pl
	- pt
	- ru
	- tr
	- zh
	---

	# Bark Small — GGUF

	[Suno Bark](https://github.com/suno-ai/bark) (MIT license) converted to GGUF for native C++ inference with [CrispASR](https://github.com/CrispStrobe/CrispASR).

	## Model details

	- Architecture: 3-stage hierarchical transformer (semantic → coarse → fine) + EnCodec decoder
	- Parameters: ~300M total across 3 GPT-2 sub-models
	- Output: 24 kHz mono PCM
	- Languages: 13 languages with pre-trained speaker prompts
	- German speakers: `v2/de_speaker_0` through `v2/de_speaker_9`
	- License: MIT

	## Quantization table

	| File | Quant | Size | Quality |
	|------|-------|------|---------|
	| `bark-small-f16.gguf` | F16 | 809 MB | Reference |
	| `bark-small-q8_0.gguf` | Q8_0 | 435 MB | Near-lossless |
	| `bark-small-q4_k.gguf` | Q4_K | 235 MB | Good for real-time |

	Each variant packs the three sub-models (text/semantic, coarse acoustic, fine acoustic) and the EnCodec decoder into a single GGUF file, so no companion model file is needed.
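
To put the table's sizes in perspective, the sketch below compares each file with the F16 reference and with the nominal storage cost of its ggml format. The sizes are copied from the table; the bits-per-weight figures (8.5 for Q8_0, 4.5 for Q4_K) are the nominal ggml block costs, and the helper names are hypothetical. The observed ratios come out slightly above nominal, presumably because some tensors and the metadata stay unquantized; that is an assumption, not something this card states.

```python
# File sizes in MB, copied from the quantization table.
SIZES_MB = {"F16": 809, "Q8_0": 435, "Q4_K": 235}

# Nominal ggml storage cost in bits per weight, block scales included.
NOMINAL_BITS = {"F16": 16.0, "Q8_0": 8.5, "Q4_K": 4.5}


def size_ratio(quant, reference="F16"):
    """Observed file size relative to the reference file."""
    return SIZES_MB[quant] / SIZES_MB[reference]


def nominal_ratio(quant, reference="F16"):
    """Expected ratio if every weight used the nominal format cost."""
    return NOMINAL_BITS[quant] / NOMINAL_BITS[reference]


if __name__ == "__main__":
    for quant in SIZES_MB:
        print(f"{quant:5s} observed {size_ratio(quant):.3f}  "
              f"nominal {nominal_ratio(quant):.3f}")
```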

	## Usage with CrispASR

	```bash
	# Auto-download and synthesize
	crispasr --backend bark -m auto --tts "Hello, how are you today?" --tts-output hello.wav

	# With a specific quantization
	crispasr --backend bark -m bark-small-q4_k.gguf --tts "The quick brown fox" --tts-output fox.wav

	# With a German speaker prompt (when supported)
	crispasr --backend bark -m bark-small-q8_0.gguf --tts "Hallo Welt" --voice v2/de_speaker_3 --tts-output hallo.wav
	```
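
To audition all ten German speaker prompts, the command line can be generated programmatically. The following is a minimal sketch that only reuses the flags shown above (`--backend`, `-m`, `--tts`, `--voice`, `--tts-output`); the function names and output file names are hypothetical, and `--voice` is subject to the same "when supported" caveat as the example above.

```python
import shlex


def bark_command(text, voice, output, model="bark-small-q8_0.gguf"):
    """Build the argv list for one synthesis call, using only flags from this card."""
    return [
        "crispasr", "--backend", "bark", "-m", model,
        "--tts", text, "--voice", voice, "--tts-output", output,
    ]


def german_speaker_commands(text):
    """One command per speaker prompt, v2/de_speaker_0 through v2/de_speaker_9."""
    return [
        bark_command(text, f"v2/de_speaker_{i}", f"de_speaker_{i}.wav")
        for i in range(10)
    ]


if __name__ == "__main__":
    for cmd in german_speaker_commands("Hallo Welt"):
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```

Building an argv list rather than a shell string avoids quoting problems when the text contains spaces or punctuation.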

	## Conversion

	The files in this repository were produced with:
	```bash
	python models/convert-bark-to-gguf.py --output bark-small-f16.gguf
	crispasr-quantize bark-small-f16.gguf bark-small-q8_0.gguf q8_0
	crispasr-quantize bark-small-f16.gguf bark-small-q4_k.gguf q4_k
	```
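
After conversion or download, a quick sanity check is to read the fixed-size GGUF header. The sketch below assumes the GGUF v2-or-later little-endian layout (4-byte magic `GGUF`, `uint32` version, `uint64` tensor count, `uint64` metadata key/value count); the function name is hypothetical and the code is not part of CrispASR.

```python
import struct
import sys


def read_gguf_header(path):
    """Return version, tensor count and metadata count from a GGUF file header."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with a GGUF header")
    # Assumed layout (GGUF v2+): uint32 version, uint64 tensors, uint64 KV pairs.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    print(read_gguf_header(sys.argv[1]))
```

Saved as a script, it takes the model path as its only argument, for example `bark-small-f16.gguf`.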

	## Architecture details

	### Stage 1 — Semantic model
	- GPT-2 (12 layers, 768-d) generating semantic tokens from text
	- BERT WordPiece tokenizer (119547 vocab)
	- Output: up to 768 semantic tokens
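
For readers unfamiliar with WordPiece, the following sketch shows the greedy longest-match-first splitting that BERT-style tokenizers apply to a single word. The vocabulary is a toy stand-in, not the real 119547-entry vocabulary, and the function is a hypothetical illustration rather than CrispASR's tokenizer code.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest match.

    Pieces after the first carry the '##' continuation prefix. If any
    position cannot be matched, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


if __name__ == "__main__":
    toy_vocab = {"hello", "hel", "##lo", "##s", "wor", "##ld"}  # toy vocabulary
    for w in ["hellos", "world", "xyz"]:
        print(w, wordpiece_tokenize(w, toy_vocab))
```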

	### Stage 2 — Coarse acoustic model
	- GPT-2 (12 layers, 1024-d) converting semantic → coarse EnCodec codes
	- Alternates codebook 0/1 prediction
	- Output: 2 × ~384 coarse tokens
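
The alternating prediction means the coarse stage emits one flat token stream that has to be split back into two codebook rows. A minimal sketch, assuming codebook 0 occupies the even positions and codebook 1 the odd positions, and ignoring any vocabulary offsets the runtime may apply (the function name is hypothetical):

```python
def deinterleave_coarse(flat_tokens, n_codebooks=2):
    """Split an interleaved token stream into one list per codebook."""
    if len(flat_tokens) % n_codebooks != 0:
        raise ValueError("token count must be a multiple of n_codebooks")
    # Codebook k holds every n_codebooks-th token, starting at offset k.
    return [list(flat_tokens[k::n_codebooks]) for k in range(n_codebooks)]


if __name__ == "__main__":
    flat = [10, 20, 11, 21, 12, 22]  # c0[0], c1[0], c0[1], c1[1], ...
    print(deinterleave_coarse(flat))
```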

	### Stage 3 — Fine acoustic model
	- Non-causal GPT-2 (12 layers, 1024-d)
	- Fills codebooks 2-7 from codebooks 0-1
	- Output: 8 codebooks × 384 timesteps

	### EnCodec decoder
	- 8-codebook RVQ (1024 entries each)
	- SEANet CNN decoder with ELU activation
	- Upsample ratios [8, 5, 4, 2] → 24 kHz
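
Two pieces of arithmetic follow from the figures above. The product of the upsample ratios gives the number of audio samples per code timestep, and residual vector quantization reconstructs each latent frame by summing one entry from every codebook. The sketch below illustrates both in plain Python; the function names and the tiny codebooks in the test data are hypothetical, and the real decoder then runs the summed latents through the SEANet CNN.

```python
from math import prod

SAMPLE_RATE = 24_000            # output rate stated in this card
UPSAMPLE_RATIOS = [8, 5, 4, 2]  # decoder ratios stated in this card


def hop_length(ratios=UPSAMPLE_RATIOS):
    """Audio samples produced per code timestep."""
    return prod(ratios)


def duration_seconds(n_timesteps, sample_rate=SAMPLE_RATE):
    """Audio duration covered by n_timesteps code frames."""
    return n_timesteps * hop_length() / sample_rate


def rvq_dequantize(codes, codebooks):
    """Sum the selected entry of every codebook at each timestep.

    codes[k][t] is the index chosen from codebook k at timestep t;
    codebooks[k][i] is the embedding vector for entry i of codebook k.
    """
    n_steps = len(codes[0])
    dim = len(codebooks[0][0])
    latents = []
    for t in range(n_steps):
        vec = [0.0] * dim
        for k, book in enumerate(codebooks):
            entry = book[codes[k][t]]
            vec = [a + b for a, b in zip(vec, entry)]
        latents.append(vec)
    return latents


if __name__ == "__main__":
    print(hop_length(), SAMPLE_RATE / hop_length(), duration_seconds(384))
```

With these numbers, one timestep covers 320 samples, the code frame rate is 75 Hz, and the 384 timesteps listed for the fine stage correspond to about 5.12 seconds of audio.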

	## Credits

	- Original model: [Suno AI](https://github.com/suno-ai/bark) (MIT)
	- GGUF conversion + C++ runtime: [CrispASR](https://github.com/CrispStrobe/CrispASR)