---
language:
- en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
- MrDragonFox/Elise
tags:
- tts
- text-to-speech
- spark-tts
- voice-cloning
- unsloth
- trl
- sft
- featherlabs
- audio
library_name: transformers
pipeline_tag: text-to-speech
---
<div align="center">
# 🔊 Finatts
### *High-fidelity voice cloning with fine-tuned Spark-TTS*
**Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec**
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)
[![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise)
[![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts)
*Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun*
</div>
---
## ✨ What is Finatts?
Finatts is a **507M-parameter text-to-speech model** fine-tuned for **high-fidelity single-speaker voice cloning**. It is built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) and trained on the [Elise](https://huggingface.co/datasets/MrDragonFox/Elise) dataset, a curated collection of ~1,200 voice samples (~3 hours) with a wide emotional range.
Spark-TTS uses a novel **BiCodec** architecture that decomposes speech into:
- **Global tokens**: speaker identity, timbre, and style
- **Semantic tokens**: linguistic content and prosody
This separation enables zero-shot voice cloning and controllable speech synthesis.
### 🎯 Built For
| Capability | Description |
|:---:|---|
| 🎙️ **Voice Cloning** | Clone a specific voice from reference audio samples |
| 🎭 **Emotion Synthesis** | Generate speech with varied emotional tones |
| 📝 **Text-to-Speech** | Convert text to natural, expressive speech |
| 🔊 **High-Fidelity Audio** | 16 kHz output with BiCodec tokenization |
---
## 🏋️ Training Details
<table>
<tr><td><b>Property</b></td><td><b>Value</b></td></tr>
<tr><td>Base model</td><td><a href="https://huggingface.co/SparkAudio/Spark-TTS-0.5B">SparkAudio/Spark-TTS-0.5B</a></td></tr>
<tr><td>LLM backbone</td><td>Qwen2-0.5B (507M params)</td></tr>
<tr><td>Dataset</td><td><a href="https://huggingface.co/datasets/MrDragonFox/Elise">MrDragonFox/Elise</a> (1,195 samples, ~3h)</td></tr>
<tr><td>Training type</td><td>Full Supervised Fine-Tuning (SFT)</td></tr>
<tr><td>Epochs</td><td>2</td></tr>
<tr><td>Batch size</td><td>8 (effective 16 with grad accum)</td></tr>
<tr><td>Learning rate</td><td>1e-4</td></tr>
<tr><td>Warmup steps</td><td>20</td></tr>
<tr><td>Context length</td><td>4,096 tokens</td></tr>
<tr><td>Precision</td><td>BF16</td></tr>
<tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr>
<tr><td>LR scheduler</td><td>Cosine</td></tr>
<tr><td>Framework</td><td>Unsloth + TRL (SFTTrainer)</td></tr>
<tr><td>Hardware</td><td>AMD MI300X (192GB HBM3)</td></tr>
</table>
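
The table above maps fairly directly onto trainer arguments. As a minimal sketch, assuming the run used Hugging Face `TrainingArguments`-style names (the original training script is not shown here, so the dictionary name `sft_kwargs` and the exact keys are assumptions), the settings could be collected like this. The gradient accumulation value of 2 is inferred from the 8 to 16 batch sizes on a single device, and the context length is left out because its argument name depends on the TRL version.

```python
# Hypothetical reconstruction of the hyperparameters listed above.
# Key names follow Hugging Face TrainingArguments conventions; the actual
# training script may differ.
sft_kwargs = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,  # inferred: 8 x 2 = effective batch of 16
    "num_train_epochs": 2,
    "learning_rate": 1e-4,
    "warmup_steps": 20,
    "lr_scheduler_type": "cosine",
    "optim": "adamw_torch_fused",
    "bf16": True,
}

effective_batch = (
    sft_kwargs["per_device_train_batch_size"]
    * sft_kwargs["gradient_accumulation_steps"]
)
print(f"Effective batch size: {effective_batch}")
```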
### 📊 Training Metrics
| Metric | Value |
|:---|:---:|
| **Final loss** | 5.827 |
| **Training time** | 83s (1.4 min) |
| **Peak VRAM** | 18.8 GB (9.8% of 192GB) |
| **Trainable params** | 506,634,112 (100%) |
| **Total steps** | 150 |
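
The step count and VRAM share are consistent with the dataset size and batch settings in the training table. A quick check, assuming a single device and that the final partial batch in each epoch is kept rather than dropped:

```python
import math

num_samples = 1195       # Elise samples, from the training table
per_device_batch = 8
effective_batch = 16
epochs = 2

# Gradient accumulation implied by the two batch sizes
grad_accum = effective_batch // per_device_batch
# Micro-batches per epoch, keeping the last partial batch
micro_batches = math.ceil(num_samples / per_device_batch)
# Optimizer steps per epoch and in total
steps_per_epoch = math.ceil(micro_batches / grad_accum)
total_steps = steps_per_epoch * epochs

# Peak VRAM as a share of the MI300X's 192 GB
vram_pct = 100 * 18.8 / 192

print(total_steps, round(vram_pct, 1))
```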
### Training Loss Curve
Loss falls from **~6.9 to ~5.7** within the first 50 steps, then plateaus between roughly 5.6 and 5.8 for the remaining 100 steps:
| Step | Loss | Step | Loss | Step | Loss |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
| 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
| 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
| 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
| 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |
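
To read the table numerically, the logged values can be summarized in a few lines. This sketch uses only the numbers above; the variable names are illustrative.

```python
# (step, loss) pairs copied from the table above
losses = [
    (1, 6.90), (10, 6.85), (20, 6.34), (30, 5.90), (40, 5.92),
    (50, 5.70), (60, 5.62), (70, 5.76), (80, 5.71), (90, 5.79),
    (100, 5.72), (110, 5.77), (120, 5.72), (130, 5.79), (150, 5.83),
]

# Lowest logged loss and the step at which it occurred
best_step, best_loss = min(losses, key=lambda pair: pair[1])

# Overall drop between the first and last logged steps
total_drop = losses[0][1] - losses[-1][1]

# Mean over the logged points from step 50 onward
late = [loss for step, loss in losses if step >= 50]
late_mean = sum(late) / len(late)

print(f"lowest logged loss: {best_loss} at step {best_step}")
print(f"drop from first to last logged step: {total_drop:.2f}")
print(f"mean loss from step 50 onward: {late_mean:.2f}")
```

Most of the improvement happens before step 50; after that the logged loss stays close to its mean.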
---
## 🚀 Quick Start
### Prerequisites
```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile
# Clone Spark-TTS for BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```
### Inference
```python
import re
import sys

import numpy as np
import soundfile as sf
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Make the cloned Spark-TTS repo importable (see Prerequisites)
sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# ROCm builds of PyTorch also expose the GPU as "cuda"
device = "cuda"

# Load the fine-tuned LLM
model_id = "Featherlabs/Finatts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load BiCodec for audio detokenization
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", device)

# Build the prompt and generate audio tokens
text = "Hey there, my name is Elise! Nice to meet you."
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
generated = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
)

# Decode only the newly generated tokens, keeping the special audio tokens
new_tokens = generated[0][inputs.input_ids.shape[1]:]
output_text = tokenizer.decode(new_tokens, skip_special_tokens=False)
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

# Convert token IDs to audio; this call can raise the einx AxisSizeError
# described under Known Issues
pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to(device)
pred_global = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to(device)
wav = audio_tokenizer.detokenize(pred_global, pred_semantic)

# Accept either a tensor or a NumPy array from detokenize
if isinstance(wav, torch.Tensor):
    wav = wav.detach().cpu().float().numpy()
sf.write("output.wav", np.squeeze(wav), 16000)
print("✅ Saved output.wav")
```
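
The prompt format used above can be wrapped in a small helper when generating several utterances. This is a sketch only: the function name `build_tts_prompt` is hypothetical, and stripping surrounding whitespace is an added assumption. The control tokens are the ones from the inference snippet.

```python
def build_tts_prompt(text: str) -> str:
    """Wrap text in the control tokens used by the inference snippet."""
    content = text.strip()  # assumption: surrounding whitespace is not meaningful
    return (
        f"<|task_tts|><|start_content|>{content}<|end_content|>"
        "<|start_global_token|>"
    )


prompt = build_tts_prompt("  Hey there, my name is Elise! Nice to meet you.  ")
print(prompt)
```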
---
## 🏗️ Architecture
Spark-TTS uses a unique approach that separates speech into two token streams:
```
Text Input → [LLM Backbone] → Global Tokens   (speaker identity)
                            → Semantic Tokens (content + prosody)
                                    ↓
                            [BiCodec Decoder] → Waveform
```
| Component | Details |
|:---|:---|
| **LLM** | Qwen2-0.5B (507M params), generates audio token sequences |
| **BiCodec** | Neural audio codec with global + semantic tokenization |
| **Wav2Vec2** | `wav2vec2-large-xlsr-53`, feature extraction for tokenization |
| **Sample rate** | 16 kHz |
| **Token types** | `bicodec_global_*` (speaker) + `bicodec_semantic_*` (content) |
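
Because both token types are emitted as plain text, they can be separated with regular expressions, as the inference snippet does. Below is a self-contained sketch on a made-up string; the sample string and the function name `split_bicodec_tokens` are illustrative and not real model output.

```python
import re


def split_bicodec_tokens(decoded: str) -> tuple[list[int], list[int]]:
    """Extract global and semantic token IDs from decoded model output."""
    global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", decoded)]
    semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", decoded)]
    return global_ids, semantic_ids


# Made-up example string in the same naming scheme as the token types above
sample = (
    "<|bicodec_global_12|><|bicodec_global_7|>"
    "<|bicodec_semantic_301|><|bicodec_semantic_42|>"
)
global_ids, semantic_ids = split_bicodec_tokens(sample)
print(global_ids, semantic_ids)
```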
---
## 📦 Model Files
The repository contains the fine-tuned LLM weights. For inference, you also need:
| File | Source |
|:---|:---|
| LLM weights | This repo (`Featherlabs/Finatts`) |
| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
| Wav2Vec2 features | Included in Spark-TTS-0.5B |
| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |
---
## ⚠️ Known Issues
- **Detokenization error**: An `AxisSizeError` in `einx` can occur during inference when the generated global token count doesn't match the expected quantizer codebook dimensions (`q [c] d, b n q -> q b n d`). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format. A workaround is being investigated.
- **Single speaker**: Fine-tuned on a single voice (Elise); multi-speaker capabilities from the base model may be degraded.
- **English only**: Only tested with English text inputs.
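
While the detokenization issue is open, it may help to inspect token counts before calling `detokenize`. The sketch below is a diagnostic only, not a fix. The default of 32 expected global tokens is an assumption based on the Spark-TTS design and should be checked against the BiCodec config you actually load; `check_token_counts` is a hypothetical helper.

```python
def check_token_counts(global_ids, semantic_ids, expected_global=32):
    """Return a list of problems; an empty list means the counts look plausible.

    expected_global=32 is an assumption, not a value confirmed by this repo.
    """
    problems = []
    if len(global_ids) != expected_global:
        problems.append(
            f"got {len(global_ids)} global tokens, expected {expected_global}"
        )
    if not semantic_ids:
        problems.append("no semantic tokens were generated")
    return problems


ok = check_token_counts(list(range(32)), [1, 2, 3])
bad = check_token_counts(list(range(30)), [])
print(ok)
print(bad)
```

If the list is non-empty, regenerating with a different sampling seed or logging the decoded text is usually more informative than passing the tensors on to BiCodec.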
---
## ⚠️ Limitations
- **Single speaker model**: optimized for the Elise voice character
- **16 kHz output**: not yet upsampled to 24 kHz/48 kHz
- **Requires Spark-TTS codebase**: BiCodec tokenizer is needed for both training and inference
- **ROCm-specific**: trained on AMD MI300X; CUDA users may need minor adjustments
- **Short training**: only 2 epochs / 150 steps; additional training may improve quality
---
## 🔮 What's Next
- 🐛 **Fix inference**: resolve the `einx` `AxisSizeError` in detokenization
- 🎭 **Emotion tags**: add explicit emotion control (`[happy]`, `[sad]`, `[surprised]`)
- 📈 **Extended training**: more epochs with larger, more diverse datasets
- 🔊 **Super-resolution**: upsample to 24 kHz/48 kHz for higher fidelity
- 🗣️ **Multi-speaker**: train on multiple voices for speaker-switchable TTS
---
## 📜 License
Apache 2.0, consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).
---
<div align="center">
**Built with ❤️ by [Featherlabs](https://huggingface.co/Featherlabs)**
*Operated by Owlkun*
</div>