
Fine-tuned Spark-TTS 0.5B: German Emotional Speech Synthesis

This repository contains a fine-tuned implementation of the Spark-TTS (0.5B) model, specialized for German speech synthesis with advanced support for emotional cues and non-verbal audio tokens.

The model was fine-tuned using LoRA (Low-Rank Adaptation) on the curated Vishalshendge3198/Dataset_eleven_v3 German dataset containing high-quality audio with diverse emotional expressions.

This is a 4-bit merged standalone model; no separate adapters are needed.


🚀 Key Highlights

  • 57.14% Loss Improvement: Reduced test loss from 10.0074 (Base) to 4.2891 (Fine-tuned)
  • Emotional Support: Handles stylistic tags like [happy], [angry], [thoughtful], and more
  • Non-Verbal Tokens: Accurately synthesizes non-speech sounds like [sighs], [laughter], [yawn], [growl]
  • Architecture: Spark-TTS 0.5B (Qwen2-based), merged 4-bit standalone
  • Efficient Training: Only 3.22 GB peak GPU memory during fine-tuning

🎭 Supported Tags

Use square brackets [tag] in your prompts for fine-grained emotional/paralinguistic control:

| Category | Tags |
|---|---|
| Emotions | [happy], [angry], [sad], [thoughtful], [neutral], [sleepy], [whisper], [worried], [annoyed], [surprised], [fearful], [contemptuous], [disgusted] |
| Paralinguistic | [sighs], [laughter], [cry], [growl], [sob], [cheer], [breath], [pause], [grit], [yawn], [mumble], [sniffles], [exhales sharply], [inhales deeply], [chuckles], [tremble], [sigh] |
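As an illustrative sketch (the helper below is hypothetical, not part of the repository), an emotion tag is simply prepended to the text, and paralinguistic tags are embedded inline as plain bracketed strings:

```python
def tag_text(text, emotion=None):
    """Hypothetical helper: prefix the text with an [emotion] tag if given.
    Paralinguistic tags like [laughter] are written inline in the text itself."""
    return f"[{emotion}] {text}" if emotion else text

line = tag_text("Das ist ja wunderbar! [laughter] Endlich klappt es.", emotion="happy")
print(line)
# [happy] Das ist ja wunderbar! [laughter] Endlich klappt es.
```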

πŸ‹οΈ Training Details

| Parameter | Value |
|---|---|
| Base Model | SparkAudio/Spark-TTS-0.5B |
| Dataset | Vishalshendge3198/Dataset_eleven_v3 |
| Train / Val / Test Split | 1926 / 241 / 241 samples |
| Learning Rate | 5e-4 |
| LoRA Rank (R) | 64 |
| LoRA Alpha | 64 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Batch Size | 1 (grad accumulation: 2, effective: 2) |
| Precision | 4-bit (bitsandbytes / unsloth) |
| Framework | Unsloth 2026.1 + HuggingFace Transformers |
| Training Time | ~17.5 minutes |
| Peak GPU Memory | 3.22 GB |
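The effective batch size and optimizer step count implied by the table follow from simple arithmetic (a sketch; it assumes every training sample is used each epoch with no dropping):

```python
import math

train_samples = 1926
batch_size = 1
grad_accum = 2
epochs = 3

effective_batch = batch_size * grad_accum                     # 2, as listed in the table
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 963
total_steps = steps_per_epoch * epochs                        # 2889

print(effective_batch, steps_per_epoch, total_steps)
# 2 963 2889
```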

🔊 Inference Example

This is a standalone merged model. Load it directly without needing separate adapters.

```python
import sys
sys.path.append("Spark-TTS")  # Clone from https://github.com/SparkAudio/Spark-TTS

import re

import torch
import soundfile as sf
from unsloth import FastLanguageModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_NAME = "Vishalshendge3198/spark_tts_finetune_4bit"
SPARK_TTS_MODEL_DIR = "Spark-TTS-0.5B"  # Local directory of Spark-TTS weights
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the merged 4-bit model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Load the BiCodec audio tokenizer
audio_tokenizer = BiCodecTokenizer(SPARK_TTS_MODEL_DIR, device=DEVICE)

# Build the prompt in the Spark-TTS format
text = "[happy] Das ist ja wunderbar, endlich klappt es!"
prompt = "".join([
    "<|task_tts|>",
    "<|start_content|>",
    text,
    "<|end_content|>",
    "<|start_global_token|>",
])

inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)

# Generate audio tokens
generated_ids = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens
generated_ids_trimmed = generated_ids[:, inputs.input_ids.shape[1]:]
predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

# Extract semantic tokens
semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
pred_semantic_ids = torch.tensor([int(t) for t in semantic_matches]).long().unsqueeze(0).to(DEVICE)

# Extract global tokens (Spark-TTS expects 32); pad with zeros if fewer were produced
global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
global_ids = [int(t) for t in global_matches][:32]
global_ids += [0] * (32 - len(global_ids))
pred_global_ids = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to(DEVICE)

# Detokenize to a waveform and save it
waveform = audio_tokenizer.detokenize(pred_global_ids.squeeze(0), pred_semantic_ids)
sf.write("output.wav", waveform, audio_tokenizer.config.get("sample_rate", 16000))
print("✅ Audio saved to output.wav")
```
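The regex-extraction and padding step in the middle of the example can be exercised in isolation on a synthetic decoded string (the string below is fabricated for illustration; real model outputs contain many more tokens):

```python
import re

# Fabricated fragment of a decoded Spark-TTS output
decoded = (
    "<|start_global_token|><|bicodec_global_12|><|bicodec_global_7|>"
    "<|bicodec_semantic_101|><|bicodec_semantic_5|>"
)

semantic_ids = [int(t) for t in re.findall(r"<\|bicodec_semantic_(\d+)\|>", decoded)]
global_ids = [int(t) for t in re.findall(r"<\|bicodec_global_(\d+)\|>", decoded)][:32]
global_ids += [0] * (32 - len(global_ids))  # pad to the 32 global tokens Spark-TTS expects

print(semantic_ids)     # [101, 5]
print(len(global_ids))  # 32
print(global_ids[:2])   # [12, 7]
```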

📊 Performance

| Metric | Base Model (0.5B) | Fine-tuned (German) | Improvement |
|---|---|---|---|
| Validation Loss | ~10.0074 (estimate) | 4.3125 | 56.9% |
| Test Loss | ~10.0074 (estimate) | 4.2891 | 57.14% |
| German Emotional Prosody | Basic | Advanced | High |
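The improvement percentages follow directly from the reported losses, rounded to the precision shown above:

```python
base = 10.0074     # base-model loss (reported estimate)
test_ft = 4.2891   # fine-tuned test loss
val_ft = 4.3125    # fine-tuned validation loss

test_improvement = (base - test_ft) / base * 100
val_improvement = (base - val_ft) / base * 100

print(f"test: {test_improvement:.2f}%")  # test: 57.14%
print(f"val:  {val_improvement:.1f}%")   # val:  56.9%
```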

📜 Credits

Developed by Vishal Shendge as part of a German TTS fine-tuning research project using the Spark-TTS architecture by SparkAudio.
Special thanks to the Unsloth team for providing the efficient fine-tuning framework.
