KlonAudio

Open-source text-to-speech for European languages with voice cloning

About This Model

KlonAudio is a fork of kugelaudio/kugelaudio-0-open with restored voice cloning capabilities.

What's Different from the Original

The original KugelAudio model removed voice cloning functionality to reduce VRAM usage. This fork restores the full dual-encoder architecture (acoustic + semantic tokenizers) that enables:

✨ Voice cloning from audio samples (5-10 seconds)
🎭 Three pre-encoded German voices (radio, angry, old_lady)
📦 Ready-to-use voice samples
⚙️ Complete configuration files (preprocessor_config.json, tokenizer_config.json)

All credit for the base model training goes to the KugelAudio team (Kajo Kratzenstein, Carlos Menke). This fork simply re-enables features that existed in the original architecture.

Base model: kugelaudio/kugelaudio-0-open

Installation & Download

⚠️ Important: This model is about 18GB. We highly recommend pre-downloading it with one of the methods below, which is faster and more reliable. Without pre-downloading, the first from_pretrained() call may be very slow or time out.
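
Because the download is this large, it can help to confirm there is enough free disk space before starting. The sketch below uses only the Python standard library. The function name `has_free_space` and the 20% safety margin are illustrative assumptions, not part of KlonAudio or the HuggingFace tooling.

```python
import shutil

MODEL_SIZE_GB = 18  # approximate download size stated in this README


def has_free_space(path: str = ".", required_gb: float = MODEL_SIZE_GB, margin: float = 1.2) -> bool:
    """Return True if the filesystem containing `path` has room for the download.

    `margin` is an arbitrary safety factor (an assumption, not a documented requirement).
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * margin * 1e9


free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free space: {free_gb:.1f} GB, enough for the model: {has_free_space('.')}")
```

If the files will land on a different disk than the current directory (for example in the HuggingFace cache), pass that directory as `path` instead.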

Prerequisites

# Install git-xet for faster cloning (Homebrew shown here; see https://hf.co/docs/hub/git-xet for other platforms)
brew install git-xet
git xet install

# Install HuggingFace CLI
curl -LsSf https://hf.co/cli/install.sh | bash

Option 1: Clone the Full Repository

# Clone with all model files (18GB download)
git clone https://huggingface.co/Roland-JAAI/klonaudio

Option 2: Clone Without Large Files (Recommended)

# Clone without downloading large files initially - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Roland-JAAI/klonaudio

# Then download the model files using HF CLI (faster and more reliable)
hf auth login --token <your-token>
hf download Roland-JAAI/klonaudio

Get your token: Create a free HuggingFace account and generate a token at https://huggingface.co/settings/tokens (read access is sufficient).

Option 3: Download Only Model Files

# Authenticate
hf auth login --token <your-token>

# Download just the model files to HuggingFace cache
hf download Roland-JAAI/klonaudio

After downloading, from_pretrained() loads the files from the local cache instead of downloading them again.
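
To see whether the pre-download has already happened, one option is to look for the repository folder in the local Hub cache. The sketch below is standard-library only and assumes the usual huggingface_hub cache layout (`models--<org>--<name>/snapshots/<revision>/` under `HF_HUB_CACHE`, or `HF_HOME/hub`, or `~/.cache/huggingface/hub`). It ignores other overrides such as `XDG_CACHE_HOME`, and the helper names are hypothetical.

```python
import os
from pathlib import Path


def hub_cache_dir() -> Path:
    """Resolve the Hub cache directory from the environment, else the default location."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME") or (Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def is_cached(repo_id: str) -> bool:
    """Return True if at least one snapshot folder for `repo_id` exists locally."""
    repo_dir = hub_cache_dir() / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


print("klonaudio cached:", is_cached("Roland-JAAI/klonaudio"))
```

A snapshot folder only shows that a download was started; it does not prove every file finished downloading, so treat a `True` result as a hint rather than a guarantee.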

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Device selection (CUDA > MPS > CPU)
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
    dtype = torch.float32  # MPS doesn't support bfloat16 well
else:
    device = "cpu"
    dtype = torch.float32

print(f"Using device: {device}")

# Load model and processor (uses cached files if you pre-downloaded)
model = AutoModelForCausalLM.from_pretrained(
    "Roland-JAAI/klonaudio",
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device)
model.eval()

processor = AutoProcessor.from_pretrained(
    "Roland-JAAI/klonaudio",
    trust_remote_code=True
)

# See available pre-encoded voices
print(processor.get_available_voices())  # ["radio", "angry", "old_lady"]

# Generate speech with a named voice
inputs = processor(
    text="Guten Abend. Hier sind die Nachrichten.",
    voice="radio",
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0, max_new_tokens=2048)

# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")
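
The step that moves the processor output to the target device is repeated in every example, so it can be factored into a small helper. This is a sketch with a hypothetical name, `move_to_device`. It swaps the `isinstance(v, torch.Tensor)` check used above for duck typing (anything with a callable `.to()` is moved), which keeps the helper free of a torch import but is slightly more permissive.

```python
from typing import Any, Dict


def move_to_device(batch: Dict[str, Any], device: str) -> Dict[str, Any]:
    """Move every value that has a `.to()` method to `device`; pass the rest through."""
    return {
        key: value.to(device) if callable(getattr(value, "to", None)) else value
        for key, value in batch.items()
    }
```

With this in place, the line after the processor call becomes `inputs = move_to_device(inputs, device)`.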

Voice Cloning

Clone any voice from a reference audio file:

# Clone from audio file (requires encoders - don't call strip_encoders())
inputs = processor(
    text="Your text here",
    voice_prompt="path/to/reference_audio.wav",
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0, max_new_tokens=2048)

processor.save_audio(outputs.speech_outputs[0], "cloned_output.wav")
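
Since cloning works from samples of roughly 5-10 seconds, it may be worth checking the length of a reference clip before passing it as `voice_prompt`. The following sketch uses Python's standard `wave` module, so it only handles PCM WAV files. The helper names are hypothetical, and treating 5 and 10 seconds as hard limits is an assumption.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference(path: str, min_s: float = 5.0, max_s: float = 10.0) -> bool:
    """Check that the clip length falls inside the suggested 5-10 second range."""
    return min_s <= wav_duration_seconds(path) <= max_s


# Demo: write 6 seconds of 16-bit mono silence and check it.
# The 24 kHz sample rate is chosen arbitrarily for the demo.
with wave.open("reference_demo.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24000)
    wav.writeframes(b"\x00\x00" * 24000 * 6)

print(is_good_reference("reference_demo.wav"))  # True
```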

Pre-encoded Voices

This model includes three pre-encoded German voices:

Voice	Description	Best For
`radio`	Professional radio announcer	Default/professional content
`angry`	Angry, frustrated speech	Emotional dialogue
`old_lady`	Gentle elderly female	Storytelling/warm content
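
When voice names come from user input, a small lookup can normalize them before they reach the processor. This is an illustrative sketch: the dictionary mirrors the table above, while `resolve_voice` and its fallback to `radio` are assumptions rather than library behavior.

```python
PRE_ENCODED_VOICES = {
    "radio": "Professional radio announcer",
    "angry": "Angry, frustrated speech",
    "old_lady": "Gentle elderly female",
}


def resolve_voice(name: str, default: str = "radio") -> str:
    """Return `name` if it is a known pre-encoded voice, else fall back to `default`."""
    normalized = name.strip().lower().replace(" ", "_")
    return normalized if normalized in PRE_ENCODED_VOICES else default


print(resolve_voice("Old Lady"))  # old_lady
print(resolve_voice("robot"))     # radio
```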

Supported Languages

KlonAudio supports 23 European languages. Quality varies with how well each language is represented in the training data.

Note: German, Spanish, French, and English have the best coverage in the ~200,000 hours of training data (YODAS2 dataset).

Model Details

Base Model: kugelaudio/kugelaudio-0-open
Architecture: AR + Diffusion hybrid (based on Microsoft VibeVoice)
Parameters: 7B
Model Size: ~18GB
Training Data: ~200,000 hours from YODAS2 dataset
Training Hardware: 8x NVIDIA H100 GPUs
Training Duration: 5 days
License: MIT
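
The dtype chosen in Quick Start has a large effect on memory. As back-of-envelope arithmetic, the sketch below multiplies the nominal 7B parameter count by the bytes per parameter of each dtype. It covers weights only (no activations or other runtime overhead) and does not try to account for the ~18GB checkpoint size listed above.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: ~{weight_memory_gb(7e9, dtype):.0f} GB for weights alone")
# bfloat16: ~14 GB for weights alone
# float32: ~28 GB for weights alone
```

By this estimate, the float32 path used on MPS and CPU needs about twice the memory of the bfloat16 path used on CUDA.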

Differences from Base Model

This fork differs from kugelaudio/kugelaudio-0-open in the following ways:

Voice Cloning Restored: Re-enabled acoustic and semantic encoders
New Voice Files: Added three German pre-encoded voices (radio, angry, old_lady)
Configuration Files: Added missing preprocessor_config.json and tokenizer_config.json
Voice Samples: Included sample audio for each voice
Documentation: Updated examples and documentation for voice cloning

The model weights themselves are identical to the base model. We only added the voice files and configurations that were missing.

Citation

@misc{klonaudio2026,
  title={KlonAudio: Open-Source TTS with Voice Cloning for European Languages},
  author={Roland Becker},
  year={2026},
  url={https://github.com/RolandJAAI/klonaudio}
}

@misc{kugelaudio2025,
  title={KugelAudio: Open-Source Text-to-Speech Model},
  author={Kajo Kratzenstein and Carlos Menke},
  year={2025},
  url={https://github.com/Kugelaudio/kugelaudio-open}
}

Acknowledgments

KugelAudio Team (Kajo Kratzenstein, Carlos Menke): For training the excellent base model and open-sourcing it under the MIT license
Microsoft VibeVoice: For the original architecture with dual encoders
YODAS2 Dataset: For providing multilingual training data
HuggingFace: For model hosting and the transformers library
