---
language:
  - en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
  - MrDragonFox/Elise
tags:
  - tts
  - text-to-speech
  - spark-tts
  - voice-cloning
  - emotion-tags
  - unsloth
  - trl
  - sft
  - featherlabs
  - audio
  - amd-mi300x
library_name: transformers
pipeline_tag: text-to-speech
---

# 🔊 Finatts Enhanced

**High-fidelity voice cloning · a fine-tuned Spark-TTS (v2)**

Text-to-Speech · Voice Cloning · Emotion Tags · Portable Voice Profile


Built by Featherlabs · Operated by Owlkun


## ✨ What is Finatts Enhanced?

Finatts Enhanced is an improved 507M-parameter text-to-speech model built on Spark-TTS-0.5B, fine-tuned for high-fidelity single-speaker voice cloning with emotion tag support.

Compared to the original Finatts, this version features 3× the training, a more stable learning rate, and a portable voice profile (`elise_voice.safetensors`), so no reference audio is needed at inference time.

### Improvements over v1

| Setting | Finatts v1 | Finatts Enhanced |
|---|---|---|
| Epochs | 2 | 6 |
| Learning rate | 1e-4 | 5e-5 |
| Warmup steps | 20 | 50 |
| Weight decay | 0.001 | 0.01 |
| Emotion tags | ❌ | ✅ |
| Voice profile | ❌ | ✅ `elise_voice.safetensors` |
| Final loss | 5.827 | 5.806 |

## 🎯 Built For

| Capability | Description |
|---|---|
| 🎙️ Voice Cloning | Clone Elise's voice with no reference audio required |
| 🎭 Emotion Tags | `<laughs>` `<giggles>` `<whispers>` `<sighs>` `<chuckles>` `<long pause>` |
| 📝 Text-to-Speech | Convert text to natural, expressive speech |
| 📦 Portable Profile | Load `elise_voice.safetensors` and deploy anywhere |

๐Ÿ‹๏ธ Training Details

PropertyValue
Base modelSparkAudio/Spark-TTS-0.5B
LLM backboneQwen2-0.5B (507M params)
DatasetMrDragonFox/Elise (1,195 samples, ~3h)
Training typeFull Supervised Fine-Tuning (SFT)
Epochs6
Batch size8 (effective 16 with grad accum)
Learning rate5e-5
Warmup steps50
Weight decay0.01
Context length4,096 tokens
PrecisionBF16
OptimizerAdamW (torch fused)
LR schedulerCosine
FrameworkUnsloth + TRL (SFTTrainer)
HardwareAMD MI300X (192GB HBM3)

## 📊 Training Metrics

| Metric | Value |
|---|---|
| Final loss | 5.806 |
| Training time | 144 s (2.4 min) |
| Peak VRAM | 22.5 GB (11.7% of 192 GB) |
| Trainable params | 506,634,112 (100%) |
| Total steps | 450 |
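The step count follows directly from the dataset size and batch settings listed above; a quick sanity check in plain Python (all figures taken from the training tables):

```python
import math

# Figures from the training tables above
num_samples = 1195      # MrDragonFox/Elise sample count
epochs = 6
effective_batch = 16    # batch size 8 with gradient accumulation

steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 75 450
```

This matches the 450 total steps reported in the metrics table.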

### Training Loss Curve

The model converges from ~6.9 to ~5.8 over 450 steps, with 3× the training steps of v1:

| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 1 | 6.90 | 150 | 5.79 | 300 | 5.74 |
| 50 | 5.82 | 200 | 5.76 | 400 | 5.77 |
| 100 | 5.77 | 250 | 5.73 | 450 | 5.81 |

## 🚀 Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile safetensors

# Clone Spark-TTS for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```

### Inference with the Elise Voice Profile

```python
import json
import re
import sys

import soundfile as sf
import torch
from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_ID = "Featherlabs/Finatts-enhanced"

# Load the fine-tuned LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load the BiCodec audio tokenizer
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Load the Elise voice profile (global token IDs; no reference audio needed)
profile_path = hf_hub_download(MODEL_ID, "elise_voice_profile.json")
with open(profile_path) as f:
    profile = json.load(f)
elise_global_ids = profile["global_token_ids"]
elise_global_token_str = profile["global_token_str"]


@torch.inference_mode()
def generate_speech(text, temperature=0.8, top_k=40, top_p=0.92):
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>",
        elise_global_token_str,  # Elise's voice is injected here
        "<|end_global_token|>",
        "<|start_semantic_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    generated = model.generate(
        **inputs, max_new_tokens=2048,
        do_sample=True, temperature=temperature,
        top_k=top_k, top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )
    out = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
    )[0]
    # Extract BiCodec semantic token IDs from the decoded text
    sem = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", out)]
    if not sem:
        return None
    pred_sem = torch.tensor(sem, dtype=torch.long).unsqueeze(0).to("cuda")
    pred_global = torch.tensor(elise_global_ids, dtype=torch.long).unsqueeze(0).to("cuda")
    audio_tokenizer.model.to("cuda")
    return audio_tokenizer.detokenize(pred_global, pred_sem).squeeze().cpu().numpy()


# Try emotion tags!
texts = [
    "Hey there! My name is Elise, nice to meet you.",
    "<laughs> Oh my gosh, I can't believe that actually worked!",
    "<whispers> Come closer... I have a secret to tell you.",
    "<sighs> Some days just feel heavier than others.",
]
for i, text in enumerate(texts):
    wav = generate_speech(text)
    if wav is not None:
        sf.write(f"output_{i+1}.wav", wav, 16000)
        print(f"✅ output_{i+1}.wav")
```
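The decode step in `generate_speech` recovers BiCodec semantic token IDs from the generated text with a regex. A small self-contained illustration of that parsing (the decoded string here is made up for demonstration):

```python
import re

# A made-up fragment of decoded model output containing BiCodec token markers
decoded = (
    "<|bicodec_semantic_12|><|bicodec_semantic_857|>"
    "<|bicodec_semantic_3|><|end_semantic_token|>"
)

# Same pattern as in generate_speech(): pull out the integer IDs in order
sem_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", decoded)]
print(sem_ids)  # [12, 857, 3]
```

These IDs are what gets stacked into `pred_sem` and handed to the BiCodec decoder.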

## 🎭 Emotion Tags

The Elise dataset includes inline emotion tags captured from real speech. Place them anywhere in your text:

| Tag | Effect |
|---|---|
| `<laughs>` | Lighter, brighter intonation |
| `<giggles>` | Playful uptick in pitch |
| `<whispers>` | Softer, breathier delivery |
| `<sighs>` | Drawn-out, melancholic tone |
| `<chuckles>` | Gentle amusement |
| `<long pause>` | Extended pause in speech |

**Note:** Tags produce intonation variation rather than literal acoustic sounds (e.g., actual giggling audio). For acoustic emotion effects, see Orpheus-TTS.
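Because tags are plain text, they can sit anywhere in the input string before prompt construction. A minimal sketch reusing the prompt layout from the inference script above (`ELISE_GLOBAL_TOKENS` is a placeholder standing in for the profile's `global_token_str`):

```python
# Placeholder for the real value loaded from elise_voice_profile.json
ELISE_GLOBAL_TOKENS = "<|bicodec_global_0|>"


def build_prompt(text: str) -> str:
    """Assemble a Spark-TTS prompt; layout mirrors generate_speech() above."""
    return (
        "<|task_tts|>"
        "<|start_content|>" + text + "<|end_content|>"
        "<|start_global_token|>" + ELISE_GLOBAL_TOKENS + "<|end_global_token|>"
        "<|start_semantic_token|>"
    )


# A tag at the start, and another mid-sentence for a local effect
print(build_prompt("<whispers> Don't tell anyone... <giggles> okay?"))
```

The tags travel through tokenization like any other text; the fine-tuned model learned their intonation effects from the dataset.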


๐Ÿ—๏ธ Architecture

Text + Emotion Tags
        โ†“
  [LLM: Qwen2-0.5B]
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”
  Global tokens   Semantic tokens
  (speaker ID)    (content + prosody)
       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
         [BiCodec Decoder]
                โ†“
           Waveform 16kHz
Component Details
LLM Qwen2-0.5B (507M params)
BiCodec Neural audio codec โ€” global + semantic tokenization
Wav2Vec2 wav2vec2-large-xlsr-53 โ€” feature extraction
Sample rate 16kHz
Voice profile elise_voice.safetensors โ€” 1024-dim d-vector

## 📦 Repository Files

| File | Description |
|---|---|
| `model.safetensors` | Fine-tuned LLM weights (966 MB, 16-bit merged) |
| `elise_voice.safetensors` | Elise speaker d-vector (1024-dim, average of 10 clips) |
| `tokenizer.json` | Tokenizer including BiCodec special tokens |
| `config.json` | Model configuration |

For inference you also need:

| File | Source |
|---|---|
| BiCodec model | unsloth/Spark-TTS-0.5B |
| Spark-TTS code | SparkAudio/Spark-TTS |
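The inference script above also fetches `elise_voice_profile.json`. Judging from the keys it reads, the file has roughly this shape; the values below are illustrative stand-ins, not the real profile:

```python
import json

# Illustrative stand-in for elise_voice_profile.json (the real file ships
# with the repo). Key names match what the inference script reads.
profile = {
    "global_token_ids": [415, 2981, 77],  # dummy BiCodec global token IDs
    "global_token_str": (
        "<|bicodec_global_415|><|bicodec_global_2981|><|bicodec_global_77|>"
    ),
}

# Round-trip through JSON, as the script does when loading the profile
loaded = json.loads(json.dumps(profile))
assert loaded["global_token_ids"] == [415, 2981, 77]
print(loaded["global_token_str"])
```

Because the profile is a small JSON file of token IDs, it can be copied between machines without shipping any reference audio.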

โš ๏ธ Limitations

  • English only โ€” only tested with English text inputs
  • Single speaker โ€” optimized for Elise; base model multi-speaker may be degraded
  • 16kHz output โ€” use audiosr for upsampling to 44.1kHz
  • Emotion intensity โ€” tags produce subtle intonation changes, not acoustic emotion sounds
  • ROCm-trained โ€” tested on AMD MI300X; CUDA users may need minor env adjustments

## 🔮 What's Next

- 🔊 **Super-resolution**: integrate audiosr for 44.1 kHz HD output
- 🗣️ **Multi-speaker**: train on multiple voices
- 📈 **Larger dataset**: more hours of Elise audio for stronger emotion control
- 🎭 **Acoustic emotions**: explore Orpheus-style explicit emotion tokens

## 📜 License

Apache 2.0, consistent with Spark-TTS-0.5B.


Built with ❤️ by Featherlabs · Operated by Owlkun