KlonAudio

Open-source text-to-speech for European languages with voice cloning

About This Model

KlonAudio is a fork of kugelaudio/kugelaudio-0-open with restored voice cloning capabilities.

What's Different from the Original

The original KugelAudio model removed voice cloning functionality to reduce VRAM usage. This fork restores the full dual-encoder architecture (acoustic + semantic tokenizers) that enables:

  • ✨ Voice cloning from audio samples (5-10 seconds)
  • 🎭 Three pre-encoded German voices (radio, angry, old_lady)
  • 📦 Ready-to-use voice samples
  • ⚙️ Complete configuration files (preprocessor_config.json, tokenizer_config.json)

All credit for the base model training goes to the KugelAudio team (Kajo Kratzenstein, Carlos Menke). This fork simply re-enables features that existed in the original architecture.

Base model: kugelaudio/kugelaudio-0-open

Installation & Download

⚠️ Important: This model is 18GB. We strongly recommend pre-downloading it with one of the methods below, which is faster and more reliable. Without pre-downloading, the first from_pretrained() call may be very slow or time out.

Prerequisites

# Install git-xet for faster cloning (https://hf.co/docs/hub/git-xet)
brew install git-xet
git xet install

# Install HuggingFace CLI
curl -LsSf https://hf.co/cli/install.sh | bash

Option 1: Clone the Full Repository

# Clone with all model files (18GB download)
git clone https://huggingface.co/Roland-JAAI/klonaudio

Option 2: Clone Without Large Files (Recommended)

# Clone without downloading large files initially - just their pointers
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Roland-JAAI/klonaudio

# Then download the model files using HF CLI (faster and more reliable)
hf auth login --token <your-token>
hf download Roland-JAAI/klonaudio

Get your token: Create a free HuggingFace account and generate a token at https://huggingface.co/settings/tokens (read access is sufficient).

Option 3: Download Only Model Files

# Authenticate
hf auth login --token <your-token>

# Download just the model files to HuggingFace cache
hf download Roland-JAAI/klonaudio

After downloading, from_pretrained() loads the files from the local cache instead of fetching them again.
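The pre-download step can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the cache lives on the same filesystem as your home directory (the default location). `has_room_for` and `predownload` are hypothetical helper names, not part of this repository, and the 18 GB figure is the size stated above. The sketch checks free disk space first and then calls `snapshot_download`, which writes to the Hugging Face cache.

```python
import shutil
from pathlib import Path

MODEL_SIZE_GB = 18  # approximate download size stated in this README


def has_room_for(path: str, required_gb: float = MODEL_SIZE_GB, margin_gb: float = 2.0) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb + margin_gb


def predownload(repo_id: str = "Roland-JAAI/klonaudio", cache_root: str = str(Path.home())) -> str:
    """Download the repository snapshot into the cache and return its local path.

    Assumption: `cache_root` is on the same filesystem as the Hugging Face cache.
    """
    if not has_room_for(cache_root):
        raise RuntimeError(f"Not enough free space for a ~{MODEL_SIZE_GB} GB download")
    # Imported lazily so the disk check works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id)
```

Calling `predownload()` once before the first from_pretrained() call has the same effect as `hf download`; the 2 GB margin is an arbitrary safety buffer and can be adjusted.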

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Device selection (CUDA > MPS > CPU)
if torch.cuda.is_available():
    device = "cuda"
    dtype = torch.bfloat16
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
    dtype = torch.float32  # MPS doesn't support bfloat16 well
else:
    device = "cpu"
    dtype = torch.float32

print(f"Using device: {device}")

# Load model and processor (uses cached files if you pre-downloaded)
model = AutoModelForCausalLM.from_pretrained(
    "Roland-JAAI/klonaudio",
    trust_remote_code=True,
    torch_dtype=dtype,
).to(device)
model.eval()

processor = AutoProcessor.from_pretrained(
    "Roland-JAAI/klonaudio",
    trust_remote_code=True
)

# See available pre-encoded voices
print(processor.get_available_voices())  # ["radio", "angry", "old_lady"]

# Generate speech with a named voice
inputs = processor(
    text="Guten Abend. Hier sind die Nachrichten.",
    voice="radio",
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0, max_new_tokens=2048)

# Save audio
processor.save_audio(outputs.speech_outputs[0], "output.wav")

Voice Cloning

Clone any voice from a reference audio file:

# Clone from audio file (requires encoders - don't call strip_encoders())
inputs = processor(
    text="Your text here",
    voice_prompt="path/to/reference_audio.wav",
    return_tensors="pt"
)
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, cfg_scale=3.0, max_new_tokens=2048)

processor.save_audio(outputs.speech_outputs[0], "cloned_output.wav")
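Since cloning is described as working from 5-10 second samples, it may help to check a reference clip before passing it as voice_prompt. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only. `wav_duration_seconds` and `is_suitable_reference` are hypothetical helpers, and whether the processor itself enforces this range is not documented here.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_suitable_reference(path: str, min_s: float = 5.0, max_s: float = 10.0) -> bool:
    """Check that a reference clip falls inside the 5-10 second range."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

A clip outside the range is not necessarily unusable; the check only flags samples that differ from the recommended length.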

Pre-encoded Voices

This model includes three pre-encoded German voices:

Voice | Description | Best For
radio | Professional radio announcer | Default/professional content
angry | Angry, frustrated speech | Emotional dialogue
old_lady | Gentle elderly female | Storytelling/warm content
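When the voice name comes from user input, a small guard can avoid passing an unknown name to the processor. This is a sketch under assumptions: `resolve_voice` is a hypothetical helper, and falling back to "radio" follows the "Default" use listed for that voice rather than any documented behavior of the processor.

```python
def resolve_voice(requested: str, available: list[str], default: str = "radio") -> str:
    """Return `requested` if it is a known voice, otherwise fall back to `default`."""
    normalized = requested.strip().lower()
    if normalized in available:
        return normalized
    if default not in available:
        raise ValueError(f"Neither {requested!r} nor default {default!r} is available")
    return default


# Intended use with the processor from the Quick Start:
# voice = resolve_voice(user_choice, processor.get_available_voices())
```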

Supported Languages

23 European languages with varying quality based on training data representation:

🇩🇪 German | 🇬🇧 English | 🇪🇸 Spanish | 🇫🇷 French | 🇮🇹 Italian | 🇵🇹 Portuguese | 🇳🇱 Dutch | 🇵🇱 Polish | 🇷🇺 Russian | 🇺🇦 Ukrainian | 🇨🇿 Czech | 🇷🇴 Romanian | 🇭🇺 Hungarian | 🇸🇪 Swedish | 🇩🇰 Danish | 🇫🇮 Finnish | 🇳🇴 Norwegian | 🇬🇷 Greek | 🇧🇬 Bulgarian | 🇸🇰 Slovak | 🇭🇷 Croatian | 🇷🇸 Serbian | 🇹🇷 Turkish

Note: German, Spanish, French, and English have the best coverage in the ~200,000 hours of training data (YODAS2 dataset), so expect the highest quality in those languages.

Model Details

  • Base Model: kugelaudio/kugelaudio-0-open
  • Architecture: AR + Diffusion hybrid (based on Microsoft VibeVoice)
  • Parameters: 7B (the repository's Safetensors metadata reports 9B parameters in BF16)
  • Model Size: ~18GB
  • Training Data: ~200,000 hours from YODAS2 dataset
  • Training Hardware: 8x NVIDIA H100 GPUs
  • Training Duration: 5 days
  • License: MIT
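The model size has a practical consequence for the Quick Start, which loads in float32 on MPS and CPU. As a rough back-of-the-envelope sketch, assuming the ~18GB checkpoint stores its weights in bfloat16 (2 bytes per parameter), the weight footprint scales with the bytes per parameter of the load dtype. `weight_footprint_gb` is a hypothetical helper, and the estimate covers weights only, not activations or caches.

```python
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}


def weight_footprint_gb(checkpoint_gb: float, stored_dtype: str, load_dtype: str) -> float:
    """Scale the on-disk checkpoint size by the ratio of bytes per parameter."""
    return checkpoint_gb * BYTES_PER_PARAM[load_dtype] / BYTES_PER_PARAM[stored_dtype]


for dtype in ("bfloat16", "float32"):
    print(dtype, weight_footprint_gb(18, "bfloat16", dtype), "GB")
```

Under that assumption, loading in float32 needs about twice the memory of the bfloat16 checkpoint for the weights alone.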

Differences from Base Model

This fork differs from kugelaudio/kugelaudio-0-open in the following ways:

  1. Voice Cloning Restored: Re-enabled acoustic and semantic encoders
  2. New Voice Files: Added three German pre-encoded voices (radio, angry, old_lady)
  3. Configuration Files: Added missing preprocessor_config.json and tokenizer_config.json
  4. Voice Samples: Included sample audio for each voice
  5. Documentation: Updated examples and documentation for voice cloning

The model weights themselves are identical to the base model. We only added the voice files and configurations that were missing.

Citation

@misc{klonaudio2026,
  title={KlonAudio: Open-Source TTS with Voice Cloning for European Languages},
  author={Roland Becker},
  year={2026},
  url={https://github.com/RolandJAAI/klonaudio}
}

@misc{kugelaudio2025,
  title={KugelAudio: Open-Source Text-to-Speech Model},
  author={Kajo Kratzenstein and Carlos Menke},
  year={2025},
  url={https://github.com/Kugelaudio/kugelaudio-open}
}

Acknowledgments

  • KugelAudio Team (Kajo Kratzenstein, Carlos Menke): For training the excellent base model and open-sourcing it under MIT license
  • Microsoft VibeVoice: For the original architecture with dual encoders
  • YODAS2 Dataset: For providing multilingual training data
  • HuggingFace: For model hosting and the transformers library
