---
language:
- en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
- MrDragonFox/Elise
tags:
- tts
- text-to-speech
- spark-tts
- voice-cloning
- emotion-tags
- unsloth
- trl
- sft
- featherlabs
- audio
- amd-mi300x
library_name: transformers
pipeline_tag: text-to-speech
---
<div align="center">
# ๐Ÿ”Š Finatts Enhanced
### *High-fidelity voice cloning — Finatts v2, a fine-tuned Spark-TTS*
**Text-to-Speech ยท Voice Cloning ยท Emotion Tags ยท Portable Voice Profile**
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)
[![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise)
[![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts-enhanced)
*Built by [Featherlabs](https://huggingface.co/Featherlabs) ยท Operated by Owlkun*
</div>
---
## โœจ What is Finatts Enhanced?
Finatts Enhanced is an improved **507M-parameter text-to-speech model** built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B), fine-tuned for **high-fidelity single-speaker voice cloning** with emotion tag support.
Compared to the original [Finatts](https://huggingface.co/Featherlabs/Finatts), this version features **3ร— the training**, a more stable learning rate, and a portable voice profile (`elise_voice.safetensors`) โ€” no reference audio needed at inference time.
### Improvements over v1
| Setting | Finatts v1 | Finatts Enhanced |
|:---|:---:|:---:|
| Epochs | 2 | **6** |
| Learning rate | 1e-4 | **5e-5** |
| Warmup steps | 20 | **50** |
| Weight decay | 0.001 | **0.01** |
| Emotion tags | โŒ | **โœ…** |
| Voice profile | โŒ | **โœ… `elise_voice.safetensors`** |
| Final loss | 5.827 | **5.806** |
### ๐ŸŽฏ Built For
| Capability | Description |
|:---:|---|
| ๐ŸŽ™๏ธ **Voice Cloning** | Clone Elise's voice โ€” no reference audio required |
| ๐ŸŽญ **Emotion Tags** | `<laughs>` `<giggles>` `<whispers>` `<sighs>` `<chuckles>` `<long pause>` |
| ๐Ÿ“ **Text-to-Speech** | Convert text to natural, expressive speech |
| ๐Ÿ“ฆ **Portable Profile** | Load `elise_voice.safetensors` โ€” deploy anywhere |
---
## ๐Ÿ‹๏ธ Training Details
<table>
<tr><td><b>Property</b></td><td><b>Value</b></td></tr>
<tr><td>Base model</td><td><a href="https://huggingface.co/SparkAudio/Spark-TTS-0.5B">SparkAudio/Spark-TTS-0.5B</a></td></tr>
<tr><td>LLM backbone</td><td>Qwen2-0.5B (507M params)</td></tr>
<tr><td>Dataset</td><td><a href="https://huggingface.co/datasets/MrDragonFox/Elise">MrDragonFox/Elise</a> (1,195 samples, ~3h)</td></tr>
<tr><td>Training type</td><td>Full Supervised Fine-Tuning (SFT)</td></tr>
<tr><td>Epochs</td><td>6</td></tr>
<tr><td>Batch size</td><td>8 (effective 16 with grad accum)</td></tr>
<tr><td>Learning rate</td><td>5e-5</td></tr>
<tr><td>Warmup steps</td><td>50</td></tr>
<tr><td>Weight decay</td><td>0.01</td></tr>
<tr><td>Context length</td><td>4,096 tokens</td></tr>
<tr><td>Precision</td><td>BF16</td></tr>
<tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr>
<tr><td>LR scheduler</td><td>Cosine</td></tr>
<tr><td>Framework</td><td>Unsloth + TRL (SFTTrainer)</td></tr>
<tr><td>Hardware</td><td>AMD MI300X (192GB HBM3)</td></tr>
</table>
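For reference, the hyperparameters in the table above map onto TRL's `SFTConfig` / `transformers.TrainingArguments` keyword arguments roughly as follows. This is an illustrative sketch, not the exact script used for training, and argument names (e.g. the context-length key) vary slightly across TRL versions:

```python
# Training hyperparameters from the table above, as keyword arguments in the
# style of trl.SFTConfig / transformers.TrainingArguments (illustrative only).
sft_kwargs = dict(
    num_train_epochs=6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    bf16=True,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
    max_length=4096,                 # 4,096-token context window
)

# Effective batch size = per-device batch x gradient accumulation steps
effective_batch = (
    sft_kwargs["per_device_train_batch_size"]
    * sft_kwargs["gradient_accumulation_steps"]
)
```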
### ๐Ÿ“Š Training Metrics
| Metric | Value |
|:---|:---:|
| **Final loss** | 5.806 |
| **Training time** | 144s (2.4 min) |
| **Peak VRAM** | 22.5 GB (11.7% of 192GB) |
| **Trainable params** | 506,634,112 (100%) |
| **Total steps** | 450 |
### Training Loss Curve
Loss falls from **~6.9** to **~5.8** over 450 steps — three times the training steps of v1:
| Step | Loss | Step | Loss | Step | Loss |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 6.90 | 150 | 5.79 | 300 | 5.74 |
| 50 | 5.82 | 200 | 5.76 | 400 | 5.77 |
| 100 | 5.77 | 250 | 5.73 | 450 | 5.81 |
---
## ๐Ÿš€ Quick Start
### Prerequisites
```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile safetensors
# Clone Spark-TTS for BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```
### Inference with Elise Voice Profile
```python
import torch, re, sys
import json
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import snapshot_download, hf_hub_download

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_ID = "Featherlabs/Finatts-enhanced"

# Load LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load BiCodec
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Load Elise voice profile (global token IDs — no reference audio needed)
profile_path = hf_hub_download(MODEL_ID, "elise_voice_profile.json")
with open(profile_path) as f:
    profile = json.load(f)
elise_global_ids = profile["global_token_ids"]
elise_global_token_str = profile["global_token_str"]

@torch.inference_mode()
def generate_speech(text, temperature=0.8, top_k=40, top_p=0.92):
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>",
        elise_global_token_str,  # Elise's voice injected here
        "<|end_global_token|>",
        "<|start_semantic_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    generated = model.generate(
        **inputs, max_new_tokens=2048,
        do_sample=True, temperature=temperature,
        top_k=top_k, top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )
    out = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
    )[0]
    sem = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", out)]
    if not sem:
        return None
    pred_sem = torch.tensor(sem, dtype=torch.long).unsqueeze(0).to("cuda")
    pred_global = torch.tensor(elise_global_ids, dtype=torch.long).unsqueeze(0).to("cuda")
    audio_tokenizer.model.to("cuda")
    return audio_tokenizer.detokenize(pred_global, pred_sem).squeeze().cpu().numpy()

# Try emotion tags!
texts = [
    "Hey there! My name is Elise, nice to meet you.",
    "<laughs> Oh my gosh, I can't believe that actually worked!",
    "<whispers> Come closer... I have a secret to tell you.",
    "<sighs> Some days just feel heavier than others.",
]
for i, text in enumerate(texts):
    wav = generate_speech(text)
    if wav is not None:
        sf.write(f"output_{i+1}.wav", wav, 16000)
        print(f"✅ output_{i+1}.wav")
```
---
## ๐ŸŽญ Emotion Tags
The Elise dataset includes inline emotion tags captured from real speech. Place them anywhere in your text:
| Tag | Effect |
|:---|:---|
| `<laughs>` | Lighter, brighter intonation |
| `<giggles>` | Playful, uptick in pitch |
| `<whispers>` | Softer, breathier delivery |
| `<sighs>` | Drawn-out, melancholic tone |
| `<chuckles>` | Gentle amusement |
| `<long pause>` | Extended pause in speech |
**Note:** Tags produce **intonation variation** rather than literal acoustic sounds (e.g., actual giggling audio). For acoustic emotion effects, see [Orpheus-TTS](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).
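Because the tags are plain inline strings, a quick pre-check can catch typos before spending a generation on them. `check_tags` below is a hypothetical helper, not part of the model or this repo:

```python
import re

# Tags present in the Elise training data (see the table above).
SUPPORTED_TAGS = {"laughs", "giggles", "whispers", "sighs", "chuckles", "long pause"}

def check_tags(text):
    """Return any <...> tags in `text` that the model was not trained on."""
    found = re.findall(r"<([^<>]+)>", text)
    return [tag for tag in found if tag not in SUPPORTED_TAGS]

check_tags("<whispers> Come closer...")  # -> []
check_tags("<screams> No!")              # -> ["screams"]
```

Unknown tags are not an error; the model simply has no learned behavior for them, so they tend to be read out or ignored.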
---
## ๐Ÿ—๏ธ Architecture
```
Text + Emotion Tags
โ†“
[LLM: Qwen2-0.5B]
โ”Œโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”
Global tokens Semantic tokens
(speaker ID) (content + prosody)
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
[BiCodec Decoder]
โ†“
Waveform 16kHz
```
| Component | Details |
|:---|:---|
| **LLM** | Qwen2-0.5B (507M params) |
| **BiCodec** | Neural audio codec โ€” global + semantic tokenization |
| **Wav2Vec2** | `wav2vec2-large-xlsr-53` โ€” feature extraction |
| **Sample rate** | 16kHz |
| **Voice profile** | `elise_voice.safetensors` โ€” 1024-dim d-vector |
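The two token streams can be pulled apart from the LLM's decoded output with the same regex style used in the quick start. A minimal sketch, assuming `bicodec_global_` mirrors the `bicodec_semantic_` token naming:

```python
import re

def split_streams(decoded):
    """Separate global (speaker identity) and semantic (content + prosody)
    token IDs from decoded LLM output."""
    global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", decoded)]
    semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", decoded)]
    return global_ids, semantic_ids

out = "<|bicodec_global_12|><|bicodec_semantic_7|><|bicodec_semantic_8|>"
split_streams(out)  # -> ([12], [7, 8])
```

Both ID lists are then handed to the BiCodec decoder, which reconstructs the 16kHz waveform.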
---
## ๐Ÿ“ฆ Repository Files
| File | Description |
|:---|:---|
| `model.safetensors` | Fine-tuned LLM weights (966MB, 16-bit merged) |
| `elise_voice.safetensors` | Elise speaker d-vector (1024-dim, avg of 10 clips) |
| `tokenizer.json` | Tokenizer including BiCodec special tokens |
| `config.json` | Model configuration |
For inference you also need:
| File | Source |
|:---|:---|
| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |
---
## โš ๏ธ Limitations
- **English only** — tested only with English text inputs
- **Single speaker** — optimized for Elise; the base model's multi-speaker capability may be degraded
- **16kHz output** โ€” use [audiosr](https://github.com/haoheliu/versatile_audio_super_resolution) for upsampling to 44.1kHz
- **Emotion intensity** โ€” tags produce subtle intonation changes, not acoustic emotion sounds
- **ROCm-trained** โ€” tested on AMD MI300X; CUDA users may need minor env adjustments
---
## ๐Ÿ”ฎ What's Next
- ๐Ÿ”Š **Super-resolution** โ€” integrate audiosr for 44.1kHz HD output
- ๐Ÿ—ฃ๏ธ **Multi-speaker** โ€” train on multiple voices
- ๐Ÿ“ˆ **Larger dataset** โ€” more hours of Elise audio for stronger emotion control
- ๐ŸŽญ **Acoustic emotions** โ€” explore Orpheus-style explicit emotion tokens
---
## ๐Ÿ“œ License
Apache 2.0 โ€” consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).
---
<div align="center">
**Built with โค๏ธ by [Featherlabs](https://huggingface.co/Featherlabs)**
*Operated by Owlkun*
</div>