🔊 Finatts

High-fidelity voice cloning with a fine-tuned Spark-TTS

Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec


Built by Featherlabs · Operated by Owlkun


✨ What is Finatts?

Finatts is a 507M-parameter text-to-speech model fine-tuned for high-fidelity single-speaker voice cloning. It builds on Spark-TTS-0.5B and was trained on the Elise dataset, a curated collection of 1,195 voice samples (~3 hours) with rich emotional range.

Spark-TTS uses a novel BiCodec architecture that decomposes speech into:

  • Global tokens: speaker identity, timbre, and style
  • Semantic tokens: linguistic content and prosody

This separation enables zero-shot voice cloning and controllable speech synthesis.
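The practical upshot of this split can be sketched in a few lines of plain Python: keep the global (speaker) stream from a reference utterance and pair it with freshly generated semantic tokens for new content. The `SpeechTokens` type and `clone_voice` helper below are hypothetical illustrations of the idea, not part of the Spark-TTS API.

```python
from dataclasses import dataclass

@dataclass
class SpeechTokens:
    """Hypothetical container mirroring BiCodec's two token streams."""
    global_tokens: list[int]    # speaker identity, timbre, style
    semantic_tokens: list[int]  # linguistic content and prosody

def clone_voice(reference: SpeechTokens, new_semantic: list[int]) -> SpeechTokens:
    """Zero-shot cloning, conceptually: reuse the reference speaker's
    global tokens and swap in semantic tokens for the new utterance."""
    return SpeechTokens(
        global_tokens=reference.global_tokens,  # identity is preserved
        semantic_tokens=new_semantic,           # content is replaced
    )

reference = SpeechTokens(global_tokens=[7, 42, 99], semantic_tokens=[1, 2, 3])
cloned = clone_voice(reference, new_semantic=[4, 5, 6])
```

Because identity lives entirely in the global stream, no gradient updates are needed to speak new text in the reference voice.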

🎯 Built For

| Capability | Description |
|---|---|
| 🎙️ Voice Cloning | Clone a specific voice from reference audio samples |
| 🎭 Emotion Synthesis | Generate speech with varied emotional tones |
| 📝 Text-to-Speech | Convert text to natural, expressive speech |
| 🔊 High-Fidelity Audio | 16 kHz output with BiCodec tokenization |

๐Ÿ‹๏ธ Training Details

| Property | Value |
|---|---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| LLM backbone | Qwen2-0.5B (507M params) |
| Dataset | MrDragonFox/Elise (1,195 samples, ~3h) |
| Training type | Full supervised fine-tuning (SFT) |
| Epochs | 2 |
| Batch size | 8 (effective 16 with gradient accumulation) |
| Learning rate | 1e-4 |
| Warmup steps | 20 |
| Context length | 4,096 tokens |
| Precision | BF16 |
| Optimizer | AdamW (torch fused) |
| LR scheduler | Cosine |
| Framework | Unsloth + TRL (SFTTrainer) |
| Hardware | AMD MI300X (192 GB HBM3) |
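The hyperparameters above map roughly onto a TRL `SFTConfig` as follows. This is a reconstruction from the table, not the original training script, and argument names (e.g. `max_seq_length`) vary slightly across TRL versions.

```python
from trl import SFTConfig

# Reconstructed from the training-details table; not the original script.
config = SFTConfig(
    per_device_train_batch_size=8,   # effective 16 with accumulation
    gradient_accumulation_steps=2,
    num_train_epochs=2,
    learning_rate=1e-4,
    warmup_steps=20,
    max_seq_length=4096,             # context length
    bf16=True,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
    output_dir="finatts-sft",        # hypothetical output path
)
```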

📊 Training Metrics

| Metric | Value |
|---|---|
| Final loss | 5.827 |
| Training time | 83 s (~1.4 min) |
| Peak VRAM | 18.8 GB (9.8% of 192 GB) |
| Trainable params | 506,634,112 (100%) |
| Total steps | 150 |

Training Loss Curve

The model shows healthy convergence from ~6.9 → ~5.8 over 150 steps:

| Step | Loss | Step | Loss | Step | Loss |
|---|---|---|---|---|---|
| 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
| 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
| 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
| 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
| 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |
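Reading the numbers off the table, the curve bottoms out around step 60 and then plateaus in the 5.6–5.8 range; a few lines of Python confirm this:

```python
# (step, loss) pairs transcribed from the table above
history = [
    (1, 6.90), (10, 6.85), (20, 6.34), (30, 5.90), (40, 5.92),
    (50, 5.70), (60, 5.62), (70, 5.76), (80, 5.71), (90, 5.79),
    (100, 5.72), (110, 5.77), (120, 5.72), (130, 5.79), (150, 5.83),
]

best_step, best_loss = min(history, key=lambda p: p[1])
final_step, final_loss = history[-1]
print(f"lowest loss {best_loss} at step {best_step}; final loss {final_loss}")
# → lowest loss 5.62 at step 60; final loss 5.83
```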

🚀 Quick Start

Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile

# Clone Spark-TTS for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```

Inference

```python
import torch
import re
import sys
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import snapshot_download

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Load model
model_id = "Featherlabs/Finatts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load BiCodec for audio detokenization
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Generate speech
text = "Hey there, my name is Elise! Nice to meet you."
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"

inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
generated = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
)

# Decode tokens
output_text = tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

# Convert to audio
pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to("cuda")
pred_global = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to("cuda")

wav = audio_tokenizer.detokenize(pred_global, pred_semantic)
sf.write("output.wav", wav.squeeze().cpu().numpy(), 16000)
print("✅ Saved output.wav")
```

๐Ÿ—๏ธ Architecture

Spark-TTS uses a unique approach that separates speech into two token streams:

```
Text Input → [LLM Backbone] → Global Tokens (speaker identity)
                            → Semantic Tokens (content + prosody)
                                    ↓
                            [BiCodec Decoder] → Waveform
```
| Component | Details |
|---|---|
| LLM | Qwen2-0.5B (507M params); generates audio token sequences |
| BiCodec | Neural audio codec with global + semantic tokenization |
| Wav2Vec2 | wav2vec2-large-xlsr-53; feature extraction for tokenization |
| Sample rate | 16 kHz |
| Token types | bicodec_global_* (speaker) + bicodec_semantic_* (content) |
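This token naming convention is what the inference script's regexes rely on: each generated audio token is a special token whose name embeds its integer id. A minimal round-trip, using made-up ids, looks like:

```python
import re

# A made-up slice of generated output text, using the documented token names
output_text = (
    "<|bicodec_global_12|><|bicodec_global_847|>"
    "<|bicodec_semantic_3|><|bicodec_semantic_991|><|bicodec_semantic_15|>"
)

global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
print(global_ids, semantic_ids)
# → [12, 847] [3, 991, 15]
```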

📦 Model Files

The repository contains the fine-tuned LLM weights. For inference, you also need:

| File | Source |
|---|---|
| LLM weights | This repo (Featherlabs/Finatts) |
| BiCodec model | unsloth/Spark-TTS-0.5B |
| Wav2Vec2 features | Included in Spark-TTS-0.5B |
| Spark-TTS code | SparkAudio/Spark-TTS |

โš ๏ธ Known Issues

  • Detokenization error: an AxisSizeError in einx can occur during inference when the number of generated global tokens doesn't match the quantizer codebook dimensions the decoder expects (q [c] d, b n q -> q b n d). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format; a workaround is being investigated.
  • Single speaker: fine-tuned on a single voice (Elise), so the base model's multi-speaker capabilities may be degraded.
  • English only: tested only with English text inputs.
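Until the root cause is fixed, one stopgap is to validate the global token count before calling `detokenize`. Everything below is an untested sketch: the expected count of 32 is a guess for illustration, not a documented value, and zero-padding is a blunt heuristic; check your BiCodec checkpoint's actual quantizer configuration first.

```python
def normalize_global_tokens(ids: list[int], expected: int) -> list[int]:
    """Crude workaround sketch: truncate or zero-pad the generated global
    token ids to the length the BiCodec quantizer expects."""
    if len(ids) > expected:
        return ids[:expected]          # drop surplus tokens
    return ids + [0] * (expected - len(ids))  # zero-padding is a blunt guess

EXPECTED_GLOBAL = 32  # hypothetical; inspect your checkpoint's config
short = normalize_global_tokens([5, 9], EXPECTED_GLOBAL)
long = normalize_global_tokens(list(range(40)), EXPECTED_GLOBAL)
```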

โš ๏ธ Limitations

  • Single speaker model: optimized for the Elise voice character
  • 16 kHz output: not yet upsampled to 24 kHz/48 kHz
  • Requires Spark-TTS codebase: the BiCodec tokenizer is needed for both training and inference
  • ROCm-specific: trained on AMD MI300X; CUDA users may need minor adjustments
  • Short training: only 2 epochs / 150 steps; additional training may improve quality

🔮 What's Next

  • ๐Ÿ› Fix inference โ€” resolve the einx AxisSizeError in detokenization
  • ๐ŸŽญ Emotion tags โ€” add explicit emotion control ([happy], [sad], [surprised])
  • ๐Ÿ“ˆ Extended training โ€” more epochs with larger/diverse datasets
  • ๐Ÿ”Š Super-resolution โ€” upsample to 24kHz/48kHz for higher fidelity
  • ๐Ÿ—ฃ๏ธ Multi-speaker โ€” train on multiple voices for speaker-switchable TTS

📜 License

Apache 2.0, consistent with Spark-TTS-0.5B.


Built with ❤️ by Featherlabs

Operated by Owlkun
