---
language:
  - en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
  - MrDragonFox/Elise
tags:
  - tts
  - text-to-speech
  - spark-tts
  - voice-cloning
  - emotion-tags
  - unsloth
  - trl
  - sft
  - featherlabs
  - audio
  - amd-mi300x
library_name: transformers
pipeline_tag: text-to-speech
---

# 🔊 Finatts Enhanced

**High-fidelity voice cloning · a fine-tuned Spark-TTS (v2)**

Text-to-Speech · Voice Cloning · Emotion Tags · Portable Voice Profile


Built by Featherlabs · Operated by Owlkun


## ✨ What is Finatts Enhanced?

Finatts Enhanced is an improved 507M-parameter text-to-speech model built on Spark-TTS-0.5B, fine-tuned for high-fidelity single-speaker voice cloning with emotion tag support.

Compared to the original Finatts, this version features 3× the training, a more stable learning rate, and a portable voice profile (`elise_voice.safetensors`), so no reference audio is needed at inference time.

### Improvements over v1

| Setting | Finatts v1 | Finatts Enhanced |
|---|---|---|
| Epochs | 2 | 6 |
| Learning rate | 1e-4 | 5e-5 |
| Warmup steps | 20 | 50 |
| Weight decay | 0.001 | 0.01 |
| Emotion tags | ❌ | ✅ |
| Voice profile | ❌ | ✅ `elise_voice.safetensors` |
| Final loss | 5.827 | 5.806 |

## 🎯 Built For

| Capability | Description |
|---|---|
| 🎙️ Voice Cloning | Clone Elise's voice with no reference audio required |
| 🎭 Emotion Tags | `<laughs>` `<giggles>` `<whispers>` `<sighs>` `<chuckles>` `<long pause>` |
| 📝 Text-to-Speech | Convert text to natural, expressive speech |
| 📦 Portable Profile | Load `elise_voice.safetensors` and deploy anywhere |

๐Ÿ‹๏ธ Training Details

PropertyValue
Base modelSparkAudio/Spark-TTS-0.5B
LLM backboneQwen2-0.5B (507M params)
DatasetMrDragonFox/Elise (1,195 samples, ~3h)
Training typeFull Supervised Fine-Tuning (SFT)
Epochs6
Batch size8 (effective 16 with grad accum)
Learning rate5e-5
Warmup steps50
Weight decay0.01
Context length4,096 tokens
PrecisionBF16
OptimizerAdamW (torch fused)
LR schedulerCosine
FrameworkUnsloth + TRL (SFTTrainer)
HardwareAMD MI300X (192GB HBM3)

## 📊 Training Metrics

| Metric | Value |
|---|---|
| Final loss | 5.806 |
| Training time | 144 s (2.4 min) |
| Peak VRAM | 22.5 GB (11.7% of 192 GB) |
| Trainable params | 506,634,112 (100%) |
| Total steps | 450 |
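The step count follows directly from the dataset size and batch settings listed above; a quick sanity check in plain Python (all figures taken from the training tables):

```python
import math

# Figures from the training tables above
num_samples = 1195      # MrDragonFox/Elise sample count
epochs = 6
effective_batch = 16    # batch size 8 with gradient accumulation

steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 75 450
```

This matches the 450 total steps reported in the metrics table.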

### Training Loss Curve

The model converges from ~6.9 to ~5.8 over 450 steps, with 3× the training steps of v1:

| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 1 | 6.90 | 150 | 5.79 | 300 | 5.74 |
| 50 | 5.82 | 200 | 5.76 | 400 | 5.77 |
| 100 | 5.77 | 250 | 5.73 | 450 | 5.81 |

## 🚀 Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile safetensors

# Clone Spark-TTS for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```

### Inference with the Elise Voice Profile

```python
import json
import re
import sys

import soundfile as sf
import torch
from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_ID = "Featherlabs/Finatts-enhanced"

# Load the fine-tuned LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load the BiCodec audio tokenizer
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Load the Elise voice profile (global token IDs; no reference audio needed)
profile_path = hf_hub_download(MODEL_ID, "elise_voice_profile.json")
with open(profile_path) as f:
    profile = json.load(f)
elise_global_ids = profile["global_token_ids"]
elise_global_token_str = profile["global_token_str"]


@torch.inference_mode()
def generate_speech(text, temperature=0.8, top_k=40, top_p=0.92):
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>",
        elise_global_token_str,  # Elise's voice is injected here
        "<|end_global_token|>",
        "<|start_semantic_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    generated = model.generate(
        **inputs, max_new_tokens=2048,
        do_sample=True, temperature=temperature,
        top_k=top_k, top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )
    out = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
    )[0]
    # Extract BiCodec semantic token IDs from the decoded text
    sem = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", out)]
    if not sem:
        return None
    pred_sem = torch.tensor(sem, dtype=torch.long).unsqueeze(0).to("cuda")
    pred_global = torch.tensor(elise_global_ids, dtype=torch.long).unsqueeze(0).to("cuda")
    audio_tokenizer.model.to("cuda")
    return audio_tokenizer.detokenize(pred_global, pred_sem).squeeze().cpu().numpy()


# Try emotion tags!
texts = [
    "Hey there! My name is Elise, nice to meet you.",
    "<laughs> Oh my gosh, I can't believe that actually worked!",
    "<whispers> Come closer... I have a secret to tell you.",
    "<sighs> Some days just feel heavier than others.",
]
for i, text in enumerate(texts):
    wav = generate_speech(text)
    if wav is not None:
        sf.write(f"output_{i+1}.wav", wav, 16000)
        print(f"✅ output_{i+1}.wav")
```
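The decode step in `generate_speech` recovers BiCodec semantic token IDs from the generated text with a regex. A small self-contained illustration of that parsing (the decoded string here is made up for demonstration):

```python
import re

# A made-up fragment of decoded model output containing BiCodec token markers
decoded = (
    "<|bicodec_semantic_12|><|bicodec_semantic_857|>"
    "<|bicodec_semantic_3|><|end_semantic_token|>"
)

# Same pattern as in generate_speech(): pull out the integer IDs in order
sem_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", decoded)]
print(sem_ids)  # [12, 857, 3]
```

These IDs are what gets stacked into `pred_sem` and handed to the BiCodec decoder.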

## 🎭 Emotion Tags

The Elise dataset includes inline emotion tags captured from real speech. Place them anywhere in your text:

| Tag | Effect |
|---|---|
| `<laughs>` | Lighter, brighter intonation |
| `<giggles>` | Playful uptick in pitch |
| `<whispers>` | Softer, breathier delivery |
| `<sighs>` | Drawn-out, melancholic tone |
| `<chuckles>` | Gentle amusement |
| `<long pause>` | Extended pause in speech |

**Note:** Tags produce intonation variation rather than literal acoustic sounds (e.g., actual giggling audio). For acoustic emotion effects, see Orpheus-TTS.
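Because tags are plain text, they can sit anywhere in the input string before prompt construction. A minimal sketch reusing the prompt layout from the inference script above (`ELISE_GLOBAL_TOKENS` is a placeholder standing in for the profile's `global_token_str`):

```python
# Placeholder for the real value loaded from elise_voice_profile.json
ELISE_GLOBAL_TOKENS = "<|bicodec_global_0|>"


def build_prompt(text: str) -> str:
    """Assemble a Spark-TTS prompt; layout mirrors generate_speech() above."""
    return (
        "<|task_tts|>"
        "<|start_content|>" + text + "<|end_content|>"
        "<|start_global_token|>" + ELISE_GLOBAL_TOKENS + "<|end_global_token|>"
        "<|start_semantic_token|>"
    )


# A tag at the start, and another mid-sentence for a local effect
print(build_prompt("<whispers> Don't tell anyone... <giggles> okay?"))
```

The tags travel through tokenization like any other text; the fine-tuned model learned their intonation effects from the dataset.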


๐Ÿ—๏ธ Architecture

Text + Emotion Tags
        โ†“
  [LLM: Qwen2-0.5B]
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”
  Global tokens   Semantic tokens
  (speaker ID)    (content + prosody)
       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         [BiCodec Decoder]
                โ†“
           Waveform 16kHz
Component Details
LLM Qwen2-0.5B (507M params)
BiCodec Neural audio codec โ€” global + semantic tokenization
Wav2Vec2 wav2vec2-large-xlsr-53 โ€” feature extraction
Sample rate 16kHz
Voice profile elise_voice.safetensors โ€” 1024-dim d-vector

## 📦 Repository Files

| File | Description |
|---|---|
| `model.safetensors` | Fine-tuned LLM weights (966 MB, 16-bit merged) |
| `elise_voice.safetensors` | Elise speaker d-vector (1024-dim, average of 10 clips) |
| `tokenizer.json` | Tokenizer including BiCodec special tokens |
| `config.json` | Model configuration |

For inference you also need:

| File | Source |
|---|---|
| BiCodec model | unsloth/Spark-TTS-0.5B |
| Spark-TTS code | SparkAudio/Spark-TTS |
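The inference script above also fetches `elise_voice_profile.json`. Judging from the keys it reads, the file has roughly this shape; the values below are illustrative stand-ins, not the real profile:

```python
import json

# Illustrative stand-in for elise_voice_profile.json (the real file ships
# with the repo). Key names match what the inference script reads.
profile = {
    "global_token_ids": [415, 2981, 77],  # dummy BiCodec global token IDs
    "global_token_str": (
        "<|bicodec_global_415|><|bicodec_global_2981|><|bicodec_global_77|>"
    ),
}

# Round-trip through JSON, as the script does when loading the profile
loaded = json.loads(json.dumps(profile))
assert loaded["global_token_ids"] == [415, 2981, 77]
print(loaded["global_token_str"])
```

Because the profile is a small JSON file of token IDs, it can be copied between machines without shipping any reference audio.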

โš ๏ธ Limitations

  • English only โ€” only tested with English text inputs
  • Single speaker โ€” optimized for Elise; base model multi-speaker may be degraded
  • 16kHz output โ€” use audiosr for upsampling to 44.1kHz
  • Emotion intensity โ€” tags produce subtle intonation changes, not acoustic emotion sounds
  • ROCm-trained โ€” tested on AMD MI300X; CUDA users may need minor env adjustments

## 🔮 What's Next

- 🔊 **Super-resolution**: integrate audiosr for 44.1 kHz HD output
- 🗣️ **Multi-speaker**: train on multiple voices
- 📈 **Larger dataset**: more hours of Elise audio for stronger emotion control
- 🎭 **Acoustic emotions**: explore Orpheus-style explicit emotion tokens

## 📜 License

Apache 2.0, consistent with Spark-TTS-0.5B.


Built with ❤️ by Featherlabs · Operated by Owlkun