# Finatts

High-fidelity voice cloning with fine-tuned Spark-TTS

Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec

Built by Featherlabs · Operated by Owlkun
## ✨ What is Finatts?

Finatts is a 507M-parameter text-to-speech model fine-tuned for high-fidelity single-speaker voice cloning. It is built on top of Spark-TTS-0.5B and trained on the Elise dataset, a curated collection of 1,195 voice samples (~3 hours) with rich emotional range.
Spark-TTS uses a novel BiCodec architecture that decomposes speech into:
- **Global tokens** – speaker identity, timbre, and style
- **Semantic tokens** – linguistic content and prosody
This separation enables zero-shot voice cloning and controllable speech synthesis.
## 🎯 Built For
| Capability | Description |
|---|---|
| 🎙️ Voice Cloning | Clone a specific voice from reference audio samples |
| 🎭 Emotion Synthesis | Generate speech with varied emotional tones |
| 📝 Text-to-Speech | Convert text to natural, expressive speech |
| 🔊 High-Fidelity Audio | 16kHz output with BiCodec tokenization |
## 🏋️ Training Details

| Property | Value |
|---|---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| LLM backbone | Qwen2-0.5B (507M params) |
| Dataset | MrDragonFox/Elise (1,195 samples, ~3h) |
| Training type | Full Supervised Fine-Tuning (SFT) |
| Epochs | 2 |
| Batch size | 8 (effective 16 with grad accum) |
| Learning rate | 1e-4 |
| Warmup steps | 20 |
| Context length | 4,096 tokens |
| Precision | BF16 |
| Optimizer | AdamW (torch fused) |
| LR scheduler | Cosine |
| Framework | Unsloth + TRL (SFTTrainer) |
| Hardware | AMD MI300X (192GB HBM3) |
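As a sanity check, the schedule in the table is internally consistent: 1,195 samples at an effective batch of 16 gives 75 optimizer steps per epoch, and 2 epochs gives the 150 total steps reported below. A minimal sketch in plain Python (the gradient-accumulation factor of 2 is inferred from "8 (effective 16 with grad accum)"):

```python
import math

# Values from the training tables
num_samples = 1195       # MrDragonFox/Elise
per_device_batch = 8
grad_accum = 2           # inferred: effective batch = 8 * 2 = 16
epochs = 2

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(num_samples / effective_batch)  # 75
total_steps = steps_per_epoch * epochs                      # 150

print(effective_batch, steps_per_epoch, total_steps)  # 16 75 150
```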
## 📊 Training Metrics
| Metric | Value |
|---|---|
| Final loss | 5.827 |
| Training time | 83s (1.4 min) |
| Peak VRAM | 18.8 GB (9.8% of 192GB) |
| Trainable params | 506,634,112 (100%) |
| Total steps | 150 |
### Training Loss Curve

The model converges from ~6.9 to ~5.8 over 150 steps:
| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
| 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
| 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
| 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
| 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |
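A quick way to read the table: nearly all of the improvement happens in the first ~30 steps, after which the loss plateaus around 5.6–5.8. A small sketch summarizing the curve from the values above:

```python
# (step, loss) pairs taken from the table above
curve = [
    (1, 6.90), (10, 6.85), (20, 6.34), (30, 5.90), (40, 5.92),
    (50, 5.70), (60, 5.62), (70, 5.76), (80, 5.71), (90, 5.79),
    (100, 5.72), (110, 5.77), (120, 5.72), (130, 5.79), (150, 5.83),
]

first, last = curve[0][1], curve[-1][1]
drop_pct = 100 * (first - last) / first
plateau = [loss for step, loss in curve if step >= 50]

print(f"total drop: {drop_pct:.1f}%")   # total drop: 15.5%
print(f"plateau range: {min(plateau)}-{max(plateau)}")
```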
## 🚀 Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile

# Clone Spark-TTS for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```
### Inference

```python
import re
import sys

import soundfile as sf
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# The BiCodec tokenizer lives in the Spark-TTS repo cloned above
sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Load the fine-tuned LLM
model_id = "Featherlabs/Finatts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load BiCodec for audio detokenization
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Generate speech
text = "Hey there, my name is Elise! Nice to meet you."
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
generated = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
)

# Extract the audio token IDs from the generated text
output_text = tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

# Detokenize to a waveform and save
pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to("cuda")
pred_global = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to("cuda")
wav = audio_tokenizer.detokenize(pred_global, pred_semantic)
sf.write("output.wav", wav.squeeze().cpu().numpy(), 16000)
print("✅ Saved output.wav")
```
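The regex extraction step above can be exercised without a GPU. A self-contained sketch (the output string here is made up for illustration, and the surrounding special-token names are assumptions; real generations contain far more semantic tokens):

```python
import re

# Hypothetical decoded output -- real generations are much longer
output_text = (
    "<|bicodec_global_12|><|bicodec_global_305|>"
    "<|bicodec_semantic_7|><|bicodec_semantic_1984|><|bicodec_semantic_42|>"
)

# Same patterns as the inference snippet above
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

print(global_ids)    # [12, 305]
print(semantic_ids)  # [7, 1984, 42]
```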
## 🏗️ Architecture

Spark-TTS uses a unique approach that separates speech into two token streams:

```
Text Input → [LLM Backbone] → Global Tokens   (speaker identity)
                            → Semantic Tokens (content + prosody)
                                     ↓
                          [BiCodec Decoder] → Waveform
```
| Component | Details |
|---|---|
| LLM | Qwen2-0.5B (507M params) – generates audio token sequences |
| BiCodec | Neural audio codec with global + semantic tokenization |
| Wav2Vec2 | wav2vec2-large-xlsr-53 – feature extraction for tokenization |
| Sample rate | 16kHz |
| Token types | `bicodec_global_*` (speaker) + `bicodec_semantic_*` (content) |
## 📦 Model Files
The repository contains the fine-tuned LLM weights. For inference, you also need:
| File | Source |
|---|---|
| LLM weights | This repo (Featherlabs/Finatts) |
| BiCodec model | unsloth/Spark-TTS-0.5B |
| Wav2Vec2 features | Included in Spark-TTS-0.5B |
| Spark-TTS code | SparkAudio/Spark-TTS |
## ⚠️ Known Issues

- **Detokenization error** – an `AxisSizeError` in `einx` can occur during inference when the generated global token count doesn't match the expected quantizer codebook dimensions (`q [c] d, b n q -> q b n d`). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format. A workaround is being investigated.
- **Single speaker** – fine-tuned on a single voice (Elise); multi-speaker capabilities from the base model may be degraded.
- **English only** – only tested with English text inputs.
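Until a proper fix lands, a defensive check before calling `detokenize` can at least fail with a readable message instead of an `einx` shape error. A hedged sketch in plain Python (`EXPECTED_GLOBAL_TOKENS` is a placeholder; the real value depends on BiCodec's quantizer configuration):

```python
EXPECTED_GLOBAL_TOKENS = 32  # placeholder: check your BiCodec quantizer config

def check_global_tokens(global_ids, expected=EXPECTED_GLOBAL_TOKENS):
    """Raise a clear error when the generated global token count cannot
    match BiCodec's expected codebook dimensions."""
    if len(global_ids) != expected:
        raise ValueError(
            f"got {len(global_ids)} global tokens, expected {expected}; "
            "detokenization would fail with an einx AxisSizeError"
        )
    return global_ids

check_global_tokens(list(range(32)))  # passes silently
```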
## ⚠️ Limitations

- **Single-speaker model** – optimized for the Elise voice character
- **16kHz output** – not yet upsampled to 24kHz/48kHz
- **Requires the Spark-TTS codebase** – the BiCodec tokenizer is needed for both training and inference
- **ROCm-specific** – trained on AMD MI300X; CUDA users may need minor adjustments
- **Short training** – only 2 epochs / 150 steps; additional training may improve quality
## 🔮 What's Next

- 🔧 **Fix inference** – resolve the `einx` `AxisSizeError` in detokenization
- 🎭 **Emotion tags** – add explicit emotion control (`[happy]`, `[sad]`, `[surprised]`)
- 📈 **Extended training** – more epochs with larger and more diverse datasets
- 🔊 **Super-resolution** – upsample to 24kHz/48kHz for higher fidelity
- 🗣️ **Multi-speaker** – train on multiple voices for speaker-switchable TTS
## 📄 License

Apache 2.0, consistent with Spark-TTS-0.5B.

Built with ❤️ by Featherlabs · Operated by Owlkun