KlonAudio
Open-source text-to-speech for European languages with voice cloning
About This Model
KlonAudio is a fork of kugelaudio/kugelaudio-0-open with restored voice cloning capabilities.
What's Different from the Original
The original KugelAudio model removed voice cloning functionality to reduce VRAM usage. This fork restores the full dual-encoder architecture (acoustic + semantic tokenizers) that enables:
- Voice cloning from audio samples (5-10 seconds)
- Three pre-encoded German voices (radio, angry, old_lady)
- Ready-to-use voice samples
- Complete configuration files (preprocessor_config.json, tokenizer_config.json)
All credit for the base model training goes to the KugelAudio team (Kajo Kratzenstein, Carlos Menke). This fork simply re-enables features that existed in the original architecture.
Base model: kugelaudio/kugelaudio-0-open
Installation & Download
Important: This model is 18GB. We strongly recommend pre-downloading it with one of the methods below, which is faster and more reliable. Without a pre-download, the first from_pretrained() call may be very slow or time out.
Prerequisites
# Install git-xet for faster cloning (https://hf.co/docs/hub/git-xet)
brew install git-xet
git xet install
# Install HuggingFace CLI
curl -LsSf https://hf.co/cli/install.sh | bash
Option 1: Clone the Full Repository
# Clone with all model files (18GB download)
git clone https://huggingface.co/Roland-JAAI/klonaudio
Option 2: Clone Without Large Files (Recommended)
# Clone without downloading large files initially - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Roland-JAAI/klonaudio
# Then download the model files using HF CLI (faster and more reliable)
hf auth login --token <your-token>
hf download Roland-JAAI/klonaudio
Get your token: Create a free HuggingFace account and generate a token at https://huggingface.co/settings/tokens (read access is sufficient).
Option 3: Download Only Model Files
# Authenticate
hf auth login --token <your-token>
# Download just the model files to HuggingFace cache
hf download Roland-JAAI/klonaudio
After downloading, from_pretrained() loads the cached files directly instead of downloading them again.
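To confirm that a pre-download has populated the cache before calling from_pretrained(), one option is to look for the repository folder on disk. The sketch below assumes the standard Hugging Face Hub cache layout (a `models--<org>--<name>` folder under `~/.cache/huggingface/hub`, overridable through `HF_HUB_CACHE` or `HF_HOME`) and ignores `XDG_CACHE_HOME`. The helper names `hub_cache_root`, `repo_cache_dir`, and `is_repo_cached` are hypothetical and not part of any library.

```python
import os
from pathlib import Path


def hub_cache_root(env=None):
    """Return the Hub cache directory, honoring HF_HUB_CACHE and HF_HOME."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


def repo_cache_dir(repo_id, cache_root=None):
    """Map a model repo id to its cache folder (models--<org>--<name>)."""
    root = hub_cache_root() if cache_root is None else Path(cache_root)
    return root / ("models--" + repo_id.replace("/", "--"))


def is_repo_cached(repo_id, cache_root=None):
    """True if at least one snapshot folder for the repo exists in the cache."""
    snapshots = repo_cache_dir(repo_id, cache_root) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


if __name__ == "__main__":
    repo = "Roland-JAAI/klonaudio"
    print(repo_cache_dir(repo))
    print("cached" if is_repo_cached(repo) else "not cached yet")
```

A snapshot folder only shows that a download was started; it does not prove that every file finished downloading, so treat a positive result as a hint rather than a guarantee.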
Quick Start
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
# Device selection (CUDA > MPS > CPU)
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
    dtype = torch.float32  # MPS doesn't support bfloat16 well
else:
    device = "cpu"
    dtype = torch.float32
print(f"Using device: {device}")
# Load model and processor (uses cached files if you pre-downloaded)
model = AutoModelForCausalLM.from_pretrained(
    "Roland-JAAI/klonaudio",
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device)
model.eval()
processor = AutoProcessor.from_pretrained(
    "Roland-JAAI/klonaudio",
    trust_remote_code=True
)
# See available pre-encoded voices
print(processor.get_available_voices())  # ["radio", "angry", "old_lady"]
# Generate speech with a named voice
inputs = processor(
    text="Guten Abend. Hier sind die Nachrichten.",
    voice="radio",
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0, max_new_tokens=2048)
# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")
Voice Cloning
Clone any voice from a reference audio file:
# Clone from audio file (requires encoders - don't call strip_encoders())
inputs = processor(
    text="Your text here",
    voice_prompt="path/to/reference_audio.wav",
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0, max_new_tokens=2048)
processor.save_audio(outputs.speech_outputs[0], "cloned_output.wav")
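Since cloning is described as working from 5-10 second samples, it can help to check the length of a reference clip before passing it as voice_prompt. The following is a minimal sketch using only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The function names `wav_duration_seconds` and `check_reference_length` are hypothetical helpers, not part of the KlonAudio processor.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_length(path, min_s=5.0, max_s=10.0):
    """Return (ok, duration) for a clip against the 5-10 second range."""
    duration = wav_duration_seconds(path)
    return min_s <= duration <= max_s, duration
```

A caller might run `ok, seconds = check_reference_length("path/to/reference_audio.wav")` and warn or trim the clip when `ok` is false. Whether clips outside this range still produce usable results is not stated here, so the check is advisory.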
Pre-encoded Voices
This model includes three pre-encoded German voices:
| Voice | Description | Best For |
|---|---|---|
| radio | Professional radio announcer | Default/professional content |
| angry | Angry, frustrated speech | Emotional dialogue |
| old_lady | Gentle elderly female | Storytelling/warm content |
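Because a mistyped voice name would otherwise only surface once the processor is called, a small guard can catch it earlier. This is a sketch only: `KNOWN_VOICES` hard-codes the three names from the table above, and `validate_voice` is a hypothetical helper. In practice the list could instead come from processor.get_available_voices().

```python
import difflib

# Voice names taken from the table above; hard-coded here for illustration.
KNOWN_VOICES = ["radio", "angry", "old_lady"]


def validate_voice(name, available=KNOWN_VOICES):
    """Return `name` if it is a known voice, else raise with a suggestion."""
    if name in available:
        return name
    # Suggest the closest known name, if any is reasonably similar.
    hint = difflib.get_close_matches(name, available, n=1)
    suggestion = f" Did you mean '{hint[0]}'?" if hint else ""
    raise ValueError(
        f"Unknown voice '{name}'. Available: {', '.join(available)}.{suggestion}"
    )
```

The validated name can then be passed as the `voice` argument, for example `voice=validate_voice("radio")`.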
Supported Languages
23 European languages with varying quality based on training data representation:
German | English | Spanish | French | Italian | Portuguese | Dutch | Polish | Russian | Ukrainian | Czech | Romanian | Hungarian | Swedish | Danish | Finnish | Norwegian | Greek | Bulgarian | Slovak | Croatian | Serbian | Turkish
Note: Quality varies by language. German, Spanish, French, and English are the best represented in the ~200,000 hours of training data (YODAS2 dataset).
Model Details
- Base Model: kugelaudio/kugelaudio-0-open
- Architecture: AR + Diffusion hybrid (based on Microsoft VibeVoice)
- Parameters: 7B
- Model Size: ~18GB
- Training Data: ~200,000 hours from YODAS2 dataset
- Training Hardware: 8x NVIDIA H100 GPUs
- Training Duration: 5 days
- License: MIT
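As a rough guide to what these figures imply, the arithmetic below estimates memory for the weights alone at the two dtypes used in Quick Start, plus total training compute. It is a back-of-the-envelope sketch. It counts only the stated 7B parameters and ignores activations and anything outside that count, so it will not match the ~18GB download size exactly. It also assumes all 8 GPUs ran continuously for the full 5 days.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


PARAMS = 7e9  # "Parameters: 7B" from the list above

bf16_gb = weight_memory_gb(PARAMS, 2)  # bfloat16: 2 bytes per parameter
fp32_gb = weight_memory_gb(PARAMS, 4)  # float32: 4 bytes per parameter
gpu_hours = 8 * 5 * 24                 # 8 GPUs x 5 days x 24 hours

print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")           # ~14 GB
print(f"float32 weights:  ~{fp32_gb:.0f} GB")           # ~28 GB
print(f"training compute: {gpu_hours} H100 GPU-hours")  # 960
```

The doubling from bfloat16 to float32 is worth keeping in mind on MPS and CPU, where the Quick Start code falls back to float32.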
Differences from Base Model
This fork differs from kugelaudio/kugelaudio-0-open in the following ways:
- Voice Cloning Restored: Re-enabled acoustic and semantic encoders
- New Voice Files: Added three German pre-encoded voices (radio, angry, old_lady)
- Configuration Files: Added missing preprocessor_config.json and tokenizer_config.json
- Voice Samples: Included sample audio for each voice
- Documentation: Updated examples and documentation for voice cloning
The model weights themselves are identical to the base model. We only added the voice files and configurations that were missing.
Citation
@misc{klonaudio2026,
title={KlonAudio: Open-Source TTS with Voice Cloning for European Languages},
author={Roland Becker},
year={2026},
url={https://github.com/RolandJAAI/klonaudio}
}
@misc{kugelaudio2025,
title={KugelAudio: Open-Source Text-to-Speech Model},
author={Kajo Kratzenstein and Carlos Menke},
year={2025},
url={https://github.com/Kugelaudio/kugelaudio-open}
}
Acknowledgments
- KugelAudio Team (Kajo Kratzenstein, Carlos Menke): For training the excellent base model and open-sourcing it under MIT license
- Microsoft VibeVoice: For the original architecture with dual encoders
- YODAS2 Dataset: For providing multilingual training data
- HuggingFace: For model hosting and the transformers library
Links
- Base Model
- GitHub Repository
- Documentation
- Report Issues
- JUST ADD AI GmbH