kajode Cursor committed on
Commit
22d6ed9
·
0 Parent(s):

KugelAudio-0-Open: TTS for European languages

Remove voice cloning, use pre-encoded voices instead.
Voices are stored as .pt files in voices/ folder.

Co-authored-by: Cursor <cursoragent@cursor.com>

.gitattributes ADDED
@@ -0,0 +1,43 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ samples/255_Durchsage.wav filter=lfs diff=lfs merge=lfs -text
37
+ samples/260_Lisa.wav filter=lfs diff=lfs merge=lfs -text
38
+ samples/270_Friedrich_Sänger.wav filter=lfs diff=lfs merge=lfs -text
39
+ samples/281_Suffi_Thomas.wav filter=lfs diff=lfs merge=lfs -text
40
+ samples/258_Lukas_der_Flüsterer.wav filter=lfs diff=lfs merge=lfs -text
41
+ samples/277_Radio_Lars.wav filter=lfs diff=lfs merge=lfs -text
42
+ samples/261_Sauerer_Felix.wav filter=lfs diff=lfs merge=lfs -text
43
+ samples/266_Petra_die_Vorleserin.wav filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,375 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - de
5
+ - fr
6
+ - es
7
+ - it
8
+ - pt
9
+ - nl
10
+ - pl
11
+ - ru
12
+ - uk
13
+ - cs
14
+ - ro
15
+ - hu
16
+ - sv
17
+ - da
18
+ - fi
19
+ - no
20
+ - el
21
+ - bg
22
+ - sk
23
+ - hr
24
+ - sr
25
+ - tr
26
+ license: mit
27
+ tags:
28
+ - text-to-speech
29
+ - tts
30
+ - speech-synthesis
31
+ - audio-generation
32
+ - european-languages
33
+ - diffusion
34
+ - autoregressive
35
+ pipeline_tag: text-to-speech
36
+ inference: false
37
+ model-index:
38
+ - name: kugelaudio-0-open
39
+   results:
40
+   - task:
41
+       type: text-to-speech
42
+     dataset:
43
+       type: custom
44
+       name: YODAS2
45
+     metrics:
46
+     - type: win-rate
47
+       value: 78.0
48
+       name: Human Preference vs ElevenLabs
49
+ ---
50
+
51
+ # 🎙️ KugelAudio-0-Open
52
+
53
+ **Open-source text-to-speech for European languages**
54
+ 7B parameter model powered by an AR + Diffusion architecture
55
+
56
+ <p align="center">
57
+ <a href="https://github.com/Kugelaudio/kugelaudio-open"><img src="https://img.shields.io/badge/GitHub-Source_Code-black" alt="GitHub Source Code"></a>
58
+ <a href="https://kugelaudio.com"><img src="https://img.shields.io/badge/🌐-Website-blue" alt="KugelAudio Website"></a>
59
+ </p>
60
+
61
+ <table align="center" style="border-collapse: collapse; border: none;">
62
+ <tr style="border: none;">
63
+ <td style="border: none; padding: 0 20px;">
64
+ <a href="https://kugelaudio.com">
65
+ <img src="https://www.kugelaudio.com/logos/Logo%20Short.svg" alt="KugelAudio" style="height: 60px; width: auto;">
66
+ </a>
67
+ </td>
68
+ <td style="border: none; padding: 0 20px;">
69
+ <a href="https://hpi.de/ki-servicezentrum/">
70
+ <img src="https://docs.sc.hpi.de/attachments/aisc/aisc-logo.png" alt="KI-Servicezentrum Berlin-Brandenburg" style="height: 60px; width: auto;">
71
+ </a>
72
+ </td>
73
+ <td style="border: none; padding: 0 20px;">
74
+ <a href="https://www.bmftr.bund.de">
75
+ <img src="https://hpi.de/fileadmin/_processed_/a/3/csm_BMFTR_de_Web_RGB_gef_durch_cd1f5345bd.jpg" alt="Gefördert durch BMFTR" style="height: 60px; width: auto;">
76
+ </a>
77
+ </td>
78
+ </tr>
79
+ </table>
80
+
81
+ License: MIT Python 3.10+ Hosted API
82
+
83
+ KugelAudio KI-Servicezentrum Berlin-Brandenburg Gefördert durch BMFTR
84
+
85
+ ---
86
+
87
+ ## Motivation
88
+
89
+ **Open-source text-to-speech models for European languages are significantly lagging behind.** While English TTS has seen remarkable progress, speakers of German, French, Spanish, Polish, and dozens of other European languages have been underserved by the open-source community.
90
+
91
+ KugelAudio aims to change this. Building on the excellent foundation laid by the [VibeVoice team at Microsoft](https://github.com/microsoft/VibeVoice), we've trained a model specifically focused on European language coverage, using approximately **200,000 hours** of highly pre-processed and enhanced speech data from the [YODAS2 dataset](https://huggingface.co/datasets/espnet/yodas).
92
+
93
+ ## 🏆 Benchmark Results: Outperforming ElevenLabs
94
+
95
+ **KugelAudio ranked first in our human preference test**, ahead of commercial systems including ElevenLabs. The result suggests that open-source models can now rival, and on this German-language benchmark surpass, leading commercial TTS systems.
96
+
97
+ ### Human Preference Benchmark (A/B Testing)
98
+
99
+ We conducted extensive A/B testing with **339 human evaluations** to compare KugelAudio against leading TTS models. Participants listened to a reference voice sample, then compared outputs from two models and selected which sounded more human and closer to the original voice.
100
+
101
+ ### German Language Evaluation
102
+
103
+ The evaluation specifically focused on **German language samples** with diverse emotional expressions and speaking styles:
104
+
105
+ * **Neutral Speech**: Standard conversational tones
106
+ * **Shouting**: High-intensity, elevated volume speech
107
+ * **Singing**: Melodic and rhythmic speech patterns
108
+ * **Drunken Voice**: Slurred and irregular speech characteristics
109
+
110
+ These diverse test cases demonstrate the model's capability to handle a wide range of speaking styles beyond standard narration.
111
+
112
+ ### OpenSkill Ranking Results
113
+
114
+ | Rank | Model | Score | Record | Win Rate |
115
+ |------|-------|-------|--------|----------|
116
+ | 🥇 1 | **KugelAudio** | **26** | 71W / 20L / 23T | **78.0%** |
117
+ | 🥈 2 | ElevenLabs Multi v2 | 25 | 56W / 34L / 22T | 62.2% |
118
+ | 🥉 3 | ElevenLabs v3 | 21 | 64W / 34L / 16T | 65.3% |
119
+ | 4 | Cartesia | 21 | 55W / 38L / 19T | 59.1% |
120
+ | 5 | VibeVoice | 10 | 30W / 74L / 8T | 28.8% |
121
+ | 6 | CosyVoice v3 | 9 | 15W / 91L / 8T | 14.2% |
122
+
123
+ _Based on 339 evaluations using a Bayesian skill-rating system (OpenSkill). Win rate is wins divided by decisive comparisons (wins + losses); ties are excluded._
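The win rates in the table are consistent with being computed over decisive comparisons only, that is wins divided by wins plus losses, with ties left out. A minimal Python sketch that recomputes the column from the Record values and checks that the records add up to 339 evaluations. It assumes each evaluation is a pairwise comparison that is recorded once for each of the two models involved:

```python
# (wins, losses, ties) per model, copied from the ranking table.
records = {
    "KugelAudio": (71, 20, 23),
    "ElevenLabs Multi v2": (56, 34, 22),
    "ElevenLabs v3": (64, 34, 16),
    "Cartesia": (55, 38, 19),
    "VibeVoice": (30, 74, 8),
    "CosyVoice v3": (15, 91, 8),
}


def win_rate(wins: int, losses: int) -> float:
    """Percentage of decisive comparisons won (ties excluded)."""
    return 100.0 * wins / (wins + losses)


win_rates = {
    name: round(win_rate(wins, losses), 1)
    for name, (wins, losses, _ties) in records.items()
}

# Assumption: every evaluation appears in the record of both compared models,
# so summing all records counts each evaluation twice.
total_evaluations = sum(w + l + t for w, l, t in records.values()) // 2

for name, rate in win_rates.items():
    print(f"{name}: {rate}%")
print(f"Total evaluations: {total_evaluations}")
```

Note that the ranking follows the OpenSkill score, not the win rate, which is why ElevenLabs Multi v2 can rank above ElevenLabs v3 despite a lower win rate.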
124
+
125
+ ## Audio Samples
126
+
127
+ Listen to KugelAudio voices across different speaking styles:
128
+
129
+ ### German Voice Samples
130
+
131
+ | Sample | Description | Audio Player |
132
+ |--------|-------------|--------------|
133
+ | **Whispering** | Soft whispering voice | <audio controls><source src="https://huggingface.co/kugelaudio/kugelaudio-0-open/resolve/main/samples/258_Lukas_der_Flüsterer.wav" type="audio/wav"></audio> |
134
+ | **Female Narrator** | Professional female reader voice | <audio controls><source src="https://huggingface.co/kugelaudio/kugelaudio-0-open/resolve/main/samples/266_Petra_die_Vorleserin.wav" type="audio/wav"></audio> |
135
+ | **Angry Voice** | Irritated and frustrated speech | <audio controls><source src="https://huggingface.co/kugelaudio/kugelaudio-0-open/resolve/main/samples/261_Sauerer_Felix.wav" type="audio/wav"></audio> |
136
+ | **Radio Announcer** | Professional radio broadcast voice | <audio controls><source src="https://huggingface.co/kugelaudio/kugelaudio-0-open/resolve/main/samples/277_Radio_Lars.wav" type="audio/wav"></audio> |
137
+
138
+ *All samples are generated using pre-encoded voice embeddings.*
139
+
140
+ ### Training Details
141
+
142
+ - **Base Model**: [Microsoft VibeVoice](https://github.com/microsoft/VibeVoice)
143
+ - **Training Data**: ~200,000 hours from [YODAS2](https://huggingface.co/datasets/espnet/yodas)
144
+ - **Hardware**: 8x NVIDIA H100 GPUs
145
+ - **Training Duration**: 5 days
146
+
147
+ ### Supported Languages
148
+
149
+ This model supports the following European languages:
150
+
151
+ | Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
152
+ |----------|------|------|----------|------|------|----------|------|------|
153
+ | English | en | 🇺🇸 | German | de | 🇩🇪 | French | fr | 🇫🇷 |
154
+ | Spanish | es | 🇪🇸 | Italian | it | 🇮🇹 | Portuguese | pt | 🇵🇹 |
155
+ | Dutch | nl | 🇳🇱 | Polish | pl | 🇵🇱 | Russian | ru | 🇷🇺 |
156
+ | Ukrainian | uk | 🇺🇦 | Czech | cs | 🇨🇿 | Romanian | ro | 🇷🇴 |
157
+ | Hungarian | hu | 🇭🇺 | Swedish | sv | 🇸🇪 | Danish | da | 🇩🇰 |
158
+ | Finnish | fi | 🇫🇮 | Norwegian | no | 🇳🇴 | Greek | el | 🇬🇷 |
159
+ | Bulgarian | bg | 🇧🇬 | Slovak | sk | 🇸🇰 | Croatian | hr | 🇭🇷 |
160
+ | Serbian | sr | 🇷🇸 | Turkish | tr | 🇹🇷 | | | |
161
+
162
+ > **📊 Language Coverage Disclaimer**: Quality varies significantly by language. Of the ~200,000 hours of YODAS2 training data, Spanish, French, English, and German have the strongest representation. Other languages may show reduced quality, prosody, or vocabulary coverage, depending on how much data was available for them.
163
+
164
+ ### Model Specifications
165
+
166
+ | Property | Value |
167
+ | --------------------- | --------------------------------------------------------------------------- |
168
+ | **Parameters** | 7B |
169
+ | **Architecture** | AR + Diffusion (Qwen2.5-7B backbone) |
170
+ | **Base Model** | [Microsoft VibeVoice](https://github.com/microsoft/VibeVoice) |
171
+ | **Audio Sample Rate** | 24kHz |
172
+ | **Audio Format** | Mono, float32 |
173
+ | **VRAM Required** | \~19GB |
174
+ | **Training Hardware** | 8x NVIDIA H100 |
175
+ | **Training Duration** | 5 days |
176
+ | **Training Data** | \~200,000 hours from [YODAS2](https://huggingface.co/datasets/espnet/yodas) |
177
+
178
+ ## Quick Start
179
+
180
+ ### Installation
181
+
182
+ ```bash
183
+ # Install with pip
184
+ pip install kugelaudio-open
185
+
186
+ # Or with uv (recommended)
187
+ uv pip install kugelaudio-open
188
+ ```
189
+
190
+ ### Basic Usage
191
+
192
+ ```python
193
+ from kugelaudio_open import (
194
+     KugelAudioForConditionalGenerationInference,
195
+     KugelAudioProcessor,
196
+ )
197
+ import torch
198
+
199
+ # Load model
200
+ device = "cuda" if torch.cuda.is_available() else "cpu"
201
+ model = KugelAudioForConditionalGenerationInference.from_pretrained(
202
+     "kugelaudio/kugelaudio-0-open",
203
+     torch_dtype=torch.bfloat16,
204
+ ).to(device)
205
+ model.eval()
206
+
207
+ processor = KugelAudioProcessor.from_pretrained("kugelaudio/kugelaudio-0-open")
208
+
209
+ # Strip encoder weights to save VRAM (only decoders needed for inference)
210
+ model.model.strip_encoders()
211
+
212
+ # See available voices
213
+ print(processor.get_available_voices()) # ["default", "warm", "clear"]
214
+
215
+ # Generate speech with a specific voice
216
+ inputs = processor(text="Hallo Welt! Das ist KugelAudio.", voice="default", return_tensors="pt")
217
+ inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
218
+
219
+ with torch.no_grad():
220
+     outputs = model.generate(**inputs, cfg_scale=3.0)
221
+
222
+ # Save audio
223
+ processor.save_audio(outputs.speech_outputs[0], "output.wav")
224
+ ```
225
+
226
+ ### Voices
227
+
228
+ KugelAudio provides pre-encoded voices that can be selected by name. The voices are stored as `.pt` files in the `voices/` folder and are automatically downloaded when needed.
229
+
230
+ ```python
231
+ # List available voices
232
+ voices = processor.get_available_voices()
233
+ print(voices) # ["default", "warm", "clear"]
234
+
235
+ # Generate with a specific voice
236
+ inputs = processor(text="Hallo, das ist eine warme Stimme!", voice="warm", return_tensors="pt")
237
+ inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
238
+
239
+ with torch.no_grad():
240
+     outputs = model.generate(**inputs, cfg_scale=3.0)
241
+
242
+ processor.save_audio(outputs.speech_outputs[0], "warm_voice_output.wav")
243
+ ```
244
+
245
+ > **Note:** Voice cloning from raw audio is not supported in this open-source release. Only the pre-encoded voices listed in `voices/voices.json` are available.
246
+
247
+ ### Generation Parameters
248
+
249
+ | Parameter | Default | Description |
250
+ | ---------------- | ------- | -------------------------------------------------------------------------- |
251
+ | cfg\_scale | 3.0 | Classifier-free guidance scale (1.0-10.0). Higher = more adherence to text |
252
+ | max\_new\_tokens | 2048 | Maximum number of tokens to generate |
253
+ | do\_sample | False | Whether to use sampling (vs greedy decoding) |
254
+ | temperature | 1.0 | Sampling temperature (if do_sample=True) |
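A small sketch of how these parameters could be collected and validated before generation. `GenerationSettings` is a hypothetical helper, not part of `kugelaudio_open`; the defaults and the `cfg_scale` range are taken from the table, and it is an assumption that `model.generate` accepts all four names as keyword arguments (the README only shows `cfg_scale` being passed):

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerationSettings:
    """Hypothetical container for the generation parameters in the table."""

    cfg_scale: float = 3.0
    max_new_tokens: int = 2048
    do_sample: bool = False
    temperature: float = 1.0

    def __post_init__(self) -> None:
        # Range taken from the table: classifier-free guidance scale 1.0-10.0.
        if not 1.0 <= self.cfg_scale <= 10.0:
            raise ValueError(f"cfg_scale must be in [1.0, 10.0], got {self.cfg_scale}")
        if self.max_new_tokens <= 0:
            raise ValueError("max_new_tokens must be positive")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")

    def to_kwargs(self) -> dict:
        """Return keyword arguments, dropping temperature for greedy decoding."""
        kwargs = asdict(self)
        if not self.do_sample:
            kwargs.pop("temperature")  # only relevant when do_sample=True
        return kwargs


settings = GenerationSettings(cfg_scale=4.5)
generate_kwargs = settings.to_kwargs()
print(generate_kwargs)
```

Under that assumption, the result would be passed as `model.generate(**inputs, **generate_kwargs)`.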
255
+
256
+ ## Architecture
257
+
258
+ KugelAudio uses a hybrid **Autoregressive + Diffusion** architecture based on Microsoft's VibeVoice:
259
+
260
+ ```
261
+ Text Input → Qwen2.5-7B Backbone → Diffusion Head → Acoustic Decoder → Audio Output
262
+
263
+ Pre-encoded Voice Embedding
264
+ ```
265
+
266
+ 1. **Text Encoder**: Qwen2.5-7B language model encodes input text
267
+ 2. **Diffusion Head**: Predicts speech latents using denoising diffusion (20 steps)
268
+ 3. **Acoustic Decoder**: Hierarchical convolutional decoder converts latents to 24kHz audio
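For a sense of scale, the `decoder_ratios` in this commit's `config.json` are `[8, 5, 5, 4, 2, 2]`. Assuming these are the per-stage upsampling factors of the acoustic decoder (an interpretation the README does not confirm), their product gives the number of audio samples produced per latent frame, and dividing the 24 kHz sample rate by it gives the latent frame rate:

```python
import math

SAMPLE_RATE_HZ = 24_000             # "Audio Sample Rate" from the model specifications
DECODER_RATIOS = [8, 5, 5, 4, 2, 2]  # "decoder_ratios" from config.json

# Assumption: each ratio is an upsampling factor, so the factors multiply.
samples_per_latent = math.prod(DECODER_RATIOS)
latent_frames_per_second = SAMPLE_RATE_HZ / samples_per_latent

print(f"{samples_per_latent} samples per latent frame")
print(f"{latent_frames_per_second} latent frames per second")
```

Under that assumption, each latent frame covers 3,200 audio samples, which corresponds to 7.5 latent frames per second of speech.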
269
+
270
+ ## Audio Watermarking
271
+
272
+ All audio generated by this model is automatically watermarked using Facebook's AudioSeal. The watermark is:
273
+
274
+ * **Imperceptible**: No audible difference in audio quality
275
+ * **Robust**: Survives compression, resampling, and editing
276
+ * **Detectable**: Can verify if audio was generated by KugelAudio
277
+
278
+ ### Verify Watermark
279
+
280
+ ```python
281
+ from kugelaudio_open.watermark import AudioWatermark
282
+
283
+ watermark = AudioWatermark()
284
+ result = watermark.detect(audio, sample_rate=24000)  # `audio`: the waveform to check (24 kHz mono)
285
+
286
+ print(f"Watermark detected: {result.detected}")
287
+ print(f"Confidence: {result.confidence:.1%}")
288
+ ```
289
+
290
+ ## Intended Use
291
+
292
+ ### ✅ Appropriate Uses
293
+
294
+ * **Accessibility**: Text-to-speech for visually impaired users
295
+ * **Content Creation**: Podcasts, videos, audiobooks, e-learning
296
+ * **Voice Assistants**: Chatbots and virtual assistants
297
+ * **Language Learning**: Pronunciation practice and language education
298
+ * **Creative Projects**: With proper consent and attribution
299
+
300
+ ### ❌ Prohibited Uses
301
+
302
+ * Creating deepfakes or misleading content
303
+ * Impersonating individuals without explicit consent
304
+ * Fraud, deception, or scams
305
+ * Harassment or abuse
306
+ * Any illegal activities
307
+
308
+ ## Limitations
309
+
310
+ * **VRAM Requirements**: Requires \~19GB VRAM for inference (less with `strip_encoders()`)
311
+ * **Speed**: Approximately 1.0x real-time on modern GPUs
312
+ * **Language Quality Variation**: Quality may vary across languages based on training data distribution
313
+
314
+ ## Hosted API
315
+
316
+ For production use without managing infrastructure, use our hosted API at kugelaudio.com:
317
+
318
+ * ⚡ **Ultra-low latency**: <100ms end-to-end
319
+ * 🌍 **Global edge deployment**
320
+ * 🔧 **Zero setup required**
321
+ * 📈 **Auto-scaling**
322
+
323
+ ```python
324
+ from kugelaudio import KugelAudio
325
+
326
+ client = KugelAudio(api_key="your_api_key")
327
+ audio = client.tts.generate(text="Hello from KugelAudio!", model="kugel-1-turbo")
328
+ audio.save("output.wav")
329
+ ```
330
+
331
+ ## Acknowledgments
332
+
333
+ This model would not have been possible without the contributions of many individuals and organizations:
334
+
335
+ * **Microsoft VibeVoice Team**: For the excellent foundation architecture that this model builds upon
336
+ * **YODAS2 Dataset**: For providing the large-scale multilingual speech data
337
+ * **Qwen Team**: For the powerful language model backbone
338
+ * **Facebook AudioSeal**: For the audio watermarking technology
339
+
340
+ ### Special Thanks
341
+
342
+ * **Carlos Menke**: For his invaluable efforts in gathering the first datasets and extensive work benchmarking the model
343
+ * **AI Service Center Berlin-Brandenburg (KI-Servicezentrum)**: For providing the GPU resources (8x H100) that made training this model possible
344
+
345
+ ## Citation
346
+
347
+ ```bibtex
348
+ @software{kugelaudio2026,
349
+ title = {KugelAudio: Open-Source Text-to-Speech for European Languages},
350
+ author = {Kratzenstein, Kajo and Menke, Carlos},
351
+ year = {2026},
352
+ institution = {Hasso-Plattner-Institut},
353
+ url = {https://huggingface.co/kugelaudio/kugelaudio-0-open}
354
+ }
355
+ ```
356
+
357
+ ## License
358
+
359
+ This model is released under the MIT License.
360
+
361
+ ## Author
362
+
363
+ **Kajo Kratzenstein**
364
+ 📧 [kajo@kugelaudio.com](mailto:kajo@kugelaudio.com)
365
+ 🌐 [kugelaudio.com](https://kugelaudio.com)
366
+
367
+ **Carlos Menke**
368
+
369
+ ---
370
+
371
+ **Funding Notice**
372
+
373
+ Das zugrunde liegende Vorhaben wurde mit Mitteln des Bundesministeriums für Forschung, Technologie und Raumfahrt unter dem Förderkennzeichen »KI-Servicezentrum Berlin-Brandenburg« 16IS22092 gefördert.
374
+
375
+ _This project was funded by the German Federal Ministry of Research, Technology and Space under the funding code "AI Service Center Berlin-Brandenburg" 16IS22092._
config.json ADDED
@@ -0,0 +1,122 @@
1
+ {
2
+ "acostic_vae_dim": 64,
3
+ "acoustic_tokenizer_config": {
4
+ "causal": true,
5
+ "channels": 1,
6
+ "conv_bias": true,
7
+ "conv_norm": "none",
8
+ "corpus_normalize": 0.0,
9
+ "decoder_depths": null,
10
+ "decoder_n_filters": 32,
11
+ "decoder_ratios": [
12
+ 8,
13
+ 5,
14
+ 5,
15
+ 4,
16
+ 2,
17
+ 2
18
+ ],
19
+ "disable_last_norm": true,
20
+ "encoder_depths": "3-3-3-3-3-3-8",
21
+ "encoder_n_filters": 32,
22
+ "encoder_ratios": [
23
+ 8,
24
+ 5,
25
+ 5,
26
+ 4,
27
+ 2,
28
+ 2
29
+ ],
30
+ "fix_std": 0.5,
31
+ "layer_scale_init_value": 1e-06,
32
+ "layernorm": "RMSNorm",
33
+ "layernorm_elementwise_affine": true,
34
+ "layernorm_eps": 1e-05,
35
+ "mixer_layer": "depthwise_conv",
36
+ "model_type": "kugelaudio_acoustic_tokenizer",
37
+ "pad_mode": "constant",
38
+ "std_dist_type": "gaussian",
39
+ "torch_dtype": "bfloat16",
40
+ "vae_dim": 64,
41
+ "weight_init_value": 0.01
42
+ },
43
+ "acoustic_vae_dim": 64,
44
+ "architectures": [
45
+ "KugelAudioForConditionalGeneration"
46
+ ],
47
+ "decoder_config": {
48
+ "attention_dropout": 0.0,
49
+ "hidden_act": "silu",
50
+ "hidden_size": 3584,
51
+ "initializer_range": 0.02,
52
+ "intermediate_size": 18944,
53
+ "max_position_embeddings": 32768,
54
+ "max_window_layers": 28,
55
+ "model_type": "qwen2",
56
+ "num_attention_heads": 28,
57
+ "num_hidden_layers": 28,
58
+ "num_key_value_heads": 4,
59
+ "rms_norm_eps": 1e-06,
60
+ "rope_scaling": null,
61
+ "rope_theta": 1000000.0,
62
+ "sliding_window": null,
63
+ "torch_dtype": "bfloat16",
64
+ "use_cache": true,
65
+ "use_mrope": false,
66
+ "use_sliding_window": false,
67
+ "vocab_size": 152064
68
+ },
69
+ "diffusion_head_config": {
70
+ "ddpm_algorithm_type": "sde-dpmsolver++",
71
+ "ddpm_batch_mul": 4,
72
+ "ddpm_beta_schedule": "cosine",
73
+ "ddpm_num_inference_steps": 20,
74
+ "ddpm_num_steps": 1000,
75
+ "diffusion_type": "ddpm",
76
+ "head_ffn_ratio": 3.0,
77
+ "head_layers": 4,
78
+ "hidden_size": 3584,
79
+ "latent_size": 64,
80
+ "model_type": "kugelaudio_diffusion_head",
81
+ "prediction_type": "v_prediction",
82
+ "rms_norm_eps": 1e-05,
83
+ "speech_vae_dim": 64,
84
+ "torch_dtype": "bfloat16"
85
+ },
86
+ "model_type": "kugelaudio",
87
+ "semantic_tokenizer_config": {
88
+ "causal": true,
89
+ "channels": 1,
90
+ "conv_bias": true,
91
+ "conv_norm": "none",
92
+ "corpus_normalize": 0.0,
93
+ "disable_last_norm": true,
94
+ "encoder_depths": "3-3-3-3-3-3-8",
95
+ "encoder_n_filters": 32,
96
+ "encoder_ratios": [
97
+ 8,
98
+ 5,
99
+ 5,
100
+ 4,
101
+ 2,
102
+ 2
103
+ ],
104
+ "fix_std": 0,
105
+ "layer_scale_init_value": 1e-06,
106
+ "layernorm": "RMSNorm",
107
+ "layernorm_elementwise_affine": true,
108
+ "layernorm_eps": 1e-05,
109
+ "mixer_layer": "depthwise_conv",
110
+ "model_type": "kugelaudio_semantic_tokenizer",
111
+ "pad_mode": "constant",
112
+ "std_dist_type": "none",
113
+ "torch_dtype": "bfloat16",
114
+ "vae_dim": 128,
115
+ "weight_init_value": 0.01
116
+ },
117
+ "semantic_vae_dim": 128,
118
+ "tie_word_embeddings": false,
119
+ "torch_dtype": "bfloat16",
120
+ "transformers_version": "4.52.0.dev0",
121
+ "ddpm_inference_steps": 20
122
+ }
generation_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "transformers_version": "4.52.0.dev0"
4
+ }
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b4209041f36d8de076ceaad966b026e951d2e58337466f653a1bbf3142c8ab10
3
+ size 4877662532
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d236dd36482846352f570d21685661d7683bade9926e0a4d855f43ba1c4ea148
3
+ size 4932752840
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6e5feac1d14fd375e1e0b1ceeab31cd130517280f66dc630aac6f5b42c6795b9
3
+ size 4982901128
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3bbbeec7dacf727e2b086f3ef7796b17bcf00c8c25834a27df2031a1cf3774ea
3
+ size 3893553730
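The `.safetensors` entries above are Git LFS pointer files rather than the weights themselves: each holds the spec version, a SHA-256 object ID, and the object size in bytes. A small sketch that parses such a pointer and sums the four shard sizes listed in this commit (the helper name `parse_lfs_pointer` is illustrative):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its fields."""
    # Each line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:3bbbeec7dacf727e2b086f3ef7796b17bcf00c8c25834a27df2031a1cf3774ea\n"
    "size 3893553730\n"
)

# Sizes in bytes of model-00001 through model-00004, copied from the pointers.
shard_sizes = [4877662532, 4932752840, 4982901128, 3893553730]
total_bytes = sum(shard_sizes)

print(pointer["hash_algo"], pointer["size"])
print(f"{total_bytes / 1e9:.2f} GB across {len(shard_sizes)} shards")
```

The four shards add up to 18,686,870,230 bytes, roughly 18.7 GB of weights to download.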
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
samples/258_Lukas_der_Flüsterer.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:75309e892ff871fce2ee1d4f971ce1c168e4b65290b2498e8869561da3bf69cf
3
+ size 320044
samples/261_Sauerer_Felix.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f5b098fca73349f848e69a02fb13a647853caaf4cec4c5756a7ce4e0e99c3fc7
3
+ size 185644
samples/266_Petra_die_Vorleserin.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b24761c0a67b3164bc088291b524203287c67b51e777b3df9e86433e76c2dc45
3
+ size 256044
samples/277_Radio_Lars.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e1844e12b490ee1aa37117c9e5139164f51d97e85b0ae60e81a8c10990924f4e
3
+ size 313644
voices/voices.json ADDED
@@ -0,0 +1,17 @@
1
+ {
2
+ "default": {
3
+ "file": "default.pt",
4
+ "description": "Default neutral voice",
5
+ "language": "en"
6
+ },
7
+ "warm": {
8
+ "file": "warm.pt",
9
+ "description": "Warm, friendly voice",
10
+ "language": "en"
11
+ },
12
+ "clear": {
13
+ "file": "clear.pt",
14
+ "description": "Clear, professional voice",
15
+ "language": "en"
16
+ }
17
+ }
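A minimal sketch of reading this manifest directly, for example to list the voice names or resolve the `.pt` file for a voice. It uses only the standard library and the JSON shown above; the `voices/` directory prefix follows the commit message, and `voice_path` is a hypothetical helper rather than part of the `kugelaudio_open` API:

```python
import json
from pathlib import PurePosixPath

# Contents of voices/voices.json as added in this commit.
VOICES_JSON = """
{
  "default": {"file": "default.pt", "description": "Default neutral voice", "language": "en"},
  "warm": {"file": "warm.pt", "description": "Warm, friendly voice", "language": "en"},
  "clear": {"file": "clear.pt", "description": "Clear, professional voice", "language": "en"}
}
"""

voices = json.loads(VOICES_JSON)


def voice_path(name: str, voices_dir: str = "voices") -> str:
    """Return the repository-relative path of a voice's pre-encoded .pt file."""
    if name not in voices:
        raise KeyError(f"Unknown voice {name!r}; available: {sorted(voices)}")
    return str(PurePosixPath(voices_dir) / voices[name]["file"])


available_voices = list(voices)
print(available_voices)
print(voice_path("warm"))
```

In normal use, `processor.get_available_voices()` from the README is the documented way to list voices; the sketch only illustrates the manifest layout.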