# 🔊 Finatts Enhanced

High-fidelity voice cloning — fine-tuned Spark-TTS v2

Text-to-Speech · Voice Cloning · Emotion Tags · Portable Voice Profile

Built by Featherlabs · Operated by Owlkun
## ✨ What is Finatts Enhanced?
Finatts Enhanced is an improved 507M-parameter text-to-speech model built on Spark-TTS-0.5B, fine-tuned for high-fidelity single-speaker voice cloning with emotion tag support.
Compared to the original Finatts, this version was trained for 3× as many epochs, uses a more stable learning rate, and ships a portable voice profile (elise_voice.safetensors) — no reference audio is needed at inference time.
### Improvements over v1
| Setting | Finatts v1 | Finatts Enhanced |
|---|---|---|
| Epochs | 2 | 6 |
| Learning rate | 1e-4 | 5e-5 |
| Warmup steps | 20 | 50 |
| Weight decay | 0.001 | 0.01 |
| Emotion tags | ❌ | ✅ |
| Voice profile | ❌ | ✅ elise_voice.safetensors |
| Final loss | 5.827 | 5.806 |
## 🎯 Built For
| Capability | Description |
|---|---|
| 🎙️ Voice Cloning | Clone Elise's voice — no reference audio required |
| 🎭 Emotion Tags | `<laughs>` `<giggles>` `<whispers>` `<sighs>` `<chuckles>` `<long pause>` |
| 📝 Text-to-Speech | Convert text to natural, expressive speech |
| 📦 Portable Profile | Load elise_voice.safetensors — deploy anywhere |
## 🏋️ Training Details
| Property | Value |
|---|---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| LLM backbone | Qwen2-0.5B (507M params) |
| Dataset | MrDragonFox/Elise (1,195 samples, ~3h) |
| Training type | Full Supervised Fine-Tuning (SFT) |
| Epochs | 6 |
| Batch size | 8 (effective 16 with grad accum) |
| Learning rate | 5e-5 |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Context length | 4,096 tokens |
| Precision | BF16 |
| Optimizer | AdamW (torch fused) |
| LR scheduler | Cosine |
| Framework | Unsloth + TRL (SFTTrainer) |
| Hardware | AMD MI300X (192GB HBM3) |
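The hyperparameters above can be collected into keyword arguments in the style of TRL's `SFTConfig`. The exact training script is not published, so treat this as a sketch of the configuration rather than the authoritative recipe; in particular, `gradient_accumulation_steps=2` is inferred from the stated effective batch size of 16.

```python
# Training hyperparameters from the table above, expressed as keyword
# arguments in the style of TRL's SFTConfig. This is a reconstruction:
# the released card lists the values, not the script.
sft_kwargs = dict(
    num_train_epochs=6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # inferred: 8 x 2 = effective batch of 16
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    max_seq_length=4096,
    bf16=True,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
)

effective_batch = (sft_kwargs["per_device_train_batch_size"]
                   * sft_kwargs["gradient_accumulation_steps"])
print(effective_batch)  # 16
```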
## 📊 Training Metrics
| Metric | Value |
|---|---|
| Final loss | 5.806 |
| Training time | 144s (2.4 min) |
| Peak VRAM | 22.5 GB (11.7% of 192GB) |
| Trainable params | 506,634,112 (100%) |
| Total steps | 450 |
### Training Loss Curve

The model converges from ~6.9 → ~5.8 over 450 steps — 3× the training steps of v1:
| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 1 | 6.90 | 150 | 5.79 | 300 | 5.74 |
| 50 | 5.82 | 200 | 5.76 | 400 | 5.77 |
| 100 | 5.77 | 250 | 5.73 | 450 | 5.81 |
## 🚀 Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile safetensors

# Clone Spark-TTS for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```
### Inference with Elise Voice Profile

```python
import json
import re
import sys

import soundfile as sf
import torch
from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_ID = "Featherlabs/Finatts-enhanced"

# Load LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load BiCodec
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Load Elise voice profile (global token IDs — no reference audio needed)
profile_path = hf_hub_download(MODEL_ID, "elise_voice_profile.json")
with open(profile_path) as f:
    profile = json.load(f)
elise_global_ids = profile["global_token_ids"]
elise_global_token_str = profile["global_token_str"]

@torch.inference_mode()
def generate_speech(text, temperature=0.8, top_k=40, top_p=0.92):
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>",
        elise_global_token_str,  # Elise's voice injected here
        "<|end_global_token|>",
        "<|start_semantic_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    generated = model.generate(
        **inputs, max_new_tokens=2048,
        do_sample=True, temperature=temperature,
        top_k=top_k, top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )
    out = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
    )[0]
    sem = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", out)]
    if not sem:
        return None
    pred_sem = torch.tensor(sem, dtype=torch.long).unsqueeze(0).to("cuda")
    pred_global = torch.tensor(elise_global_ids, dtype=torch.long).unsqueeze(0).to("cuda")
    audio_tokenizer.model.to("cuda")
    return audio_tokenizer.detokenize(pred_global, pred_sem).squeeze().cpu().numpy()

# Try emotion tags!
texts = [
    "Hey there! My name is Elise, nice to meet you.",
    "<laughs> Oh my gosh, I can't believe that actually worked!",
    "<whispers> Come closer... I have a secret to tell you.",
    "<sighs> Some days just feel heavier than others.",
]
for i, text in enumerate(texts):
    wav = generate_speech(text)
    if wav is not None:
        sf.write(f"output_{i+1}.wav", wav, 16000)
        print(f"✅ output_{i+1}.wav")
```
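One detail worth calling out from the script above: the LLM emits BiCodec semantic tokens as text like `<|bicodec_semantic_123|>`, and the integer IDs are recovered with a regex before detokenization. In isolation, with a hypothetical output string:

```python
import re

# The LLM's raw output contains semantic tokens as text; the same regex
# used in generate_speech() above pulls out the integer token IDs.
sample_output = "<|bicodec_semantic_12|><|bicodec_semantic_345|><|end_semantic_token|>"
sem_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", sample_output)]
print(sem_ids)  # [12, 345]
```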
## 🎭 Emotion Tags
The Elise dataset includes inline emotion tags captured from real speech. Place them anywhere in your text:
| Tag | Effect |
|---|---|
| `<laughs>` | Lighter, brighter intonation |
| `<giggles>` | Playful, uptick in pitch |
| `<whispers>` | Softer, breathier delivery |
| `<sighs>` | Drawn-out, melancholic tone |
| `<chuckles>` | Gentle amusement |
| `<long pause>` | Extended pause in speech |
Note: Tags produce intonation variation rather than literal acoustic sounds (e.g., actual giggling audio). For acoustic emotion effects, see Orpheus-TTS.
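If you build prompts programmatically, a small guard against unsupported tags can save silent mispronunciations. The helper below is illustrative (not part of the released package); it checks angle-bracket tags in a string against the set listed above:

```python
import re

# Tags this model card lists as supported. The helper name and the
# validation approach are illustrative, not part of the released model.
SUPPORTED_TAGS = {"laughs", "giggles", "whispers", "sighs", "chuckles", "long pause"}

def check_emotion_tags(text: str) -> list[str]:
    """Return any <...> tags in `text` that are NOT in the supported set."""
    tags = re.findall(r"<([^<>]+)>", text)
    return [t for t in tags if t not in SUPPORTED_TAGS]

print(check_emotion_tags("<laughs> That worked!"))  # []
print(check_emotion_tags("<screams> No way!"))      # ['screams']
```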
## 🏗️ Architecture

```
   Text + Emotion Tags
            ↓
    [LLM: Qwen2-0.5B]
      ┌─────┴──────┐
Global tokens   Semantic tokens
 (speaker ID)   (content + prosody)
      └───────┬──────┘
      [BiCodec Decoder]
            ↓
      Waveform 16kHz
```
| Component | Details |
|---|---|
| LLM | Qwen2-0.5B (507M params) |
| BiCodec | Neural audio codec — global + semantic tokenization |
| Wav2Vec2 | wav2vec2-large-xlsr-53 — feature extraction |
| Sample rate | 16kHz |
| Voice profile | elise_voice.safetensors — 1024-dim d-vector |
## 📦 Repository Files
| File | Description |
|---|---|
| `model.safetensors` | Fine-tuned LLM weights (966MB, 16-bit merged) |
| `elise_voice.safetensors` | Elise speaker d-vector (1024-dim, avg of 10 clips) |
| `tokenizer.json` | Tokenizer including BiCodec special tokens |
| `config.json` | Model configuration |
For inference you also need:
| File | Source |
|---|---|
| BiCodec model | unsloth/Spark-TTS-0.5B |
| Spark-TTS code | SparkAudio/Spark-TTS |
## ⚠️ Limitations
- English only — tested only with English text inputs
- Single speaker — optimized for Elise; the base model's multi-speaker capability may be degraded
- 16kHz output — use audiosr for upsampling to 44.1kHz
- Emotion intensity — tags produce subtle intonation changes, not acoustic emotion sounds
- ROCm-trained — tested on AMD MI300X; CUDA users may need minor env adjustments
## 🔮 What's Next
- 🔊 Super-resolution — integrate audiosr for 44.1kHz HD output
- 🗣️ Multi-speaker — train on multiple voices
- 📈 Larger dataset — more hours of Elise audio for stronger emotion control
- 🎭 Acoustic emotions — explore Orpheus-style explicit emotion tokens
## 📜 License
Apache 2.0 — consistent with Spark-TTS-0.5B.
Built with ❤️ by Featherlabs
Operated by Owlkun