
Fine-tuned Spark-TTS 0.5B: German Emotional Speech Synthesis

This repository contains a fine-tuned implementation of the Spark-TTS (0.5B) model, specialized for German speech synthesis with advanced support for emotional cues and non-verbal audio tokens.

The model was fine-tuned using LoRA (Low-Rank Adaptation) on the curated Vishalshendge3198/Dataset_eleven_v3 German dataset containing high-quality audio with diverse emotional expressions.

This is a 4-bit merged standalone model; no separate adapters are needed.


🚀 Key Highlights

  • 57.14% Loss Improvement: Reduced test loss from 10.0074 (Base) to 4.2891 (Fine-tuned)
  • Emotional Support: Handles stylistic tags like [happy], [angry], [thoughtful], and more
  • Non-Verbal Tokens: Accurately synthesizes non-speech sounds like [sighs], [laughter], [yawn], [growl]
  • Architecture: Spark-TTS 0.5B (Qwen2-based), merged 4-bit standalone
  • Efficient Training: Only 3.22 GB peak GPU memory during fine-tuning

🎭 Supported Tags

Use square brackets [tag] in your prompts for fine-grained emotional/paralinguistic control:

| Category | Tags |
|---|---|
| Emotions | [happy], [angry], [sad], [thoughtful], [neutral], [sleepy], [whisper], [worried], [annoyed], [surprised], [fearful], [contemptuous], [disgusted] |
| Paralinguistic | [sighs], [laughter], [cry], [growl], [sob], [cheer], [breath], [pause], [grit], [yawn], [mumble], [sniffles], [exhales sharply], [inhales deeply], [chuckles], [tremble], [sigh] |
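As an illustrative sketch (the helper below is hypothetical, not part of the repository), an emotion tag is simply prepended to the text, and paralinguistic tags are embedded inline as plain bracketed strings:

```python
def tag_text(text, emotion=None):
    """Hypothetical helper: prefix the text with an [emotion] tag if given.
    Paralinguistic tags like [laughter] are written inline in the text itself."""
    return f"[{emotion}] {text}" if emotion else text

line = tag_text("Das ist ja wunderbar! [laughter] Endlich klappt es.", emotion="happy")
print(line)
# [happy] Das ist ja wunderbar! [laughter] Endlich klappt es.
```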

πŸ‹οΈ Training Details

| Parameter | Value |
|---|---|
| Base Model | SparkAudio/Spark-TTS-0.5B |
| Dataset | Vishalshendge3198/Dataset_eleven_v3 |
| Train / Val / Test Split | 1926 / 241 / 241 samples |
| Learning Rate | 5e-4 |
| LoRA Rank (R) | 64 |
| LoRA Alpha | 64 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Batch Size | 1 (grad accumulation: 2, effective: 2) |
| Precision | 4-bit (bitsandbytes / unsloth) |
| Framework | Unsloth 2026.1 + HuggingFace Transformers |
| Training Time | ~17.5 minutes |
| Peak GPU Memory | 3.22 GB |
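The effective batch size and optimizer step count implied by the table follow from simple arithmetic (a sketch; it assumes every training sample is used each epoch with no dropping):

```python
import math

train_samples = 1926
batch_size = 1
grad_accum = 2
epochs = 3

effective_batch = batch_size * grad_accum                     # 2, as listed in the table
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 963
total_steps = steps_per_epoch * epochs                        # 2889

print(effective_batch, steps_per_epoch, total_steps)
# 2 963 2889
```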

🔊 Inference Example

This is a standalone merged model. Load it directly without needing separate adapters.

```python
import sys
sys.path.append("Spark-TTS")  # Clone from https://github.com/SparkAudio/Spark-TTS

import re

import torch
import soundfile as sf
from unsloth import FastLanguageModel
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_NAME = "Vishalshendge3198/spark_tts_finetune_4bit"
SPARK_TTS_MODEL_DIR = "Spark-TTS-0.5B"  # Local directory of Spark-TTS weights
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the merged 4-bit model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

# Load the BiCodec audio tokenizer
audio_tokenizer = BiCodecTokenizer(SPARK_TTS_MODEL_DIR, device=DEVICE)

# Build the prompt in the Spark-TTS format
text = "[happy] Das ist ja wunderbar, endlich klappt es!"
prompt = "".join([
    "<|task_tts|>",
    "<|start_content|>",
    text,
    "<|end_content|>",
    "<|start_global_token|>",
])

inputs = tokenizer([prompt], return_tensors="pt").to(DEVICE)

# Generate audio tokens
generated_ids = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens
generated_ids_trimmed = generated_ids[:, inputs.input_ids.shape[1]:]
predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

# Extract semantic tokens
semantic_matches = re.findall(r"<\|bicodec_semantic_(\d+)\|>", predicts_text)
pred_semantic_ids = torch.tensor([int(t) for t in semantic_matches]).long().unsqueeze(0).to(DEVICE)

# Extract global tokens (Spark-TTS expects 32); pad with zeros if fewer were produced
global_matches = re.findall(r"<\|bicodec_global_(\d+)\|>", predicts_text)
global_ids = [int(t) for t in global_matches][:32]
global_ids += [0] * (32 - len(global_ids))
pred_global_ids = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to(DEVICE)

# Detokenize to a waveform and save it
waveform = audio_tokenizer.detokenize(pred_global_ids.squeeze(0), pred_semantic_ids)
sf.write("output.wav", waveform, audio_tokenizer.config.get("sample_rate", 16000))
print("✅ Audio saved to output.wav")
```
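The regex-extraction and padding step in the middle of the example can be exercised in isolation on a synthetic decoded string (the string below is fabricated for illustration; real model outputs contain many more tokens):

```python
import re

# Fabricated fragment of a decoded Spark-TTS output
decoded = (
    "<|start_global_token|><|bicodec_global_12|><|bicodec_global_7|>"
    "<|bicodec_semantic_101|><|bicodec_semantic_5|>"
)

semantic_ids = [int(t) for t in re.findall(r"<\|bicodec_semantic_(\d+)\|>", decoded)]
global_ids = [int(t) for t in re.findall(r"<\|bicodec_global_(\d+)\|>", decoded)][:32]
global_ids += [0] * (32 - len(global_ids))  # pad to the 32 global tokens Spark-TTS expects

print(semantic_ids)     # [101, 5]
print(len(global_ids))  # 32
print(global_ids[:2])   # [12, 7]
```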

📊 Performance

| Metric | Base Model (0.5B) | Fine-tuned (German) | Improvement |
|---|---|---|---|
| Validation Loss | ~10.0074 (estimate) | 4.3125 | 56.9% |
| Test Loss | ~10.0074 (estimate) | 4.2891 | 57.14% |
| German Emotional Prosody | Basic | Advanced | High |
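The improvement percentages follow directly from the reported losses, rounded to the precision shown above:

```python
base = 10.0074     # base-model loss (reported estimate)
test_ft = 4.2891   # fine-tuned test loss
val_ft = 4.3125    # fine-tuned validation loss

test_improvement = (base - test_ft) / base * 100
val_improvement = (base - val_ft) / base * 100

print(f"test: {test_improvement:.2f}%")  # test: 57.14%
print(f"val:  {val_improvement:.1f}%")   # val:  56.9%
```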

📜 Credits

Developed by Vishal Shendge as part of a German TTS fine-tuning research project using the Spark-TTS architecture by SparkAudio.
Special thanks to the Unsloth team for providing the efficient fine-tuning framework.
