	---
	language:
	- en
	license: apache-2.0
	base_model: SparkAudio/Spark-TTS-0.5B
	datasets:
	- MrDragonFox/Elise
	tags:
	- tts
	- text-to-speech
	- spark-tts
	- voice-cloning
	- unsloth
	- trl
	- sft
	- featherlabs
	- audio
	library_name: transformers
	pipeline_tag: text-to-speech
	---

	<div align="center">

	# 🔊 Finatts

	### High-fidelity voice cloning — fine-tuned Spark-TTS

	Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec

	[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)
	[![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise)
	[![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts)

	Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun

	</div>

	---

	## ✨ What is Finatts?

	Finatts is a 507M-parameter text-to-speech model fine-tuned for high-fidelity single-speaker voice cloning. It is built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) and trained on the [Elise](https://huggingface.co/datasets/MrDragonFox/Elise) dataset, a curated collection of about 1,200 voice samples (roughly 3 hours) with a wide emotional range.

	Spark-TTS uses a novel BiCodec architecture that decomposes speech into:
	- Global tokens — speaker identity, timbre, and style
	- Semantic tokens — linguistic content and prosody

	This separation enables zero-shot voice cloning and controllable speech synthesis.

	### 🎯 Built For

	| Capability | Description |
	|:---:|---|
	| 🎙️ Voice Cloning | Clone a specific voice from reference audio samples |
	| 🎭 Emotion Synthesis | Generate speech with varied emotional tones |
	| 📝 Text-to-Speech | Convert text to natural, expressive speech |
	| 🔊 High-Fidelity Audio | 16 kHz output with BiCodec tokenization |

	---

	## 🏋️ Training Details

	<table>
	<tr><td><b>Property</b></td><td><b>Value</b></td></tr>
	<tr><td>Base model</td><td><a href="https://huggingface.co/SparkAudio/Spark-TTS-0.5B">SparkAudio/Spark-TTS-0.5B</a></td></tr>
	<tr><td>LLM backbone</td><td>Qwen2-0.5B (507M params)</td></tr>
	<tr><td>Dataset</td><td><a href="https://huggingface.co/datasets/MrDragonFox/Elise">MrDragonFox/Elise</a> (1,195 samples, ~3h)</td></tr>
	<tr><td>Training type</td><td>Full Supervised Fine-Tuning (SFT)</td></tr>
	<tr><td>Epochs</td><td>2</td></tr>
	<tr><td>Batch size</td><td>8 (effective 16 with grad accum)</td></tr>
	<tr><td>Learning rate</td><td>1e-4</td></tr>
	<tr><td>Warmup steps</td><td>20</td></tr>
	<tr><td>Context length</td><td>4,096 tokens</td></tr>
	<tr><td>Precision</td><td>BF16</td></tr>
	<tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr>
	<tr><td>LR scheduler</td><td>Cosine</td></tr>
	<tr><td>Framework</td><td>Unsloth + TRL (SFTTrainer)</td></tr>
	<tr><td>Hardware</td><td>AMD MI300X (192GB HBM3)</td></tr>
	</table>

	### 📊 Training Metrics

	| Metric | Value |
	|:---|:---:|
	| Final loss | 5.827 |
	| Training time | 83 s (1.4 min) |
	| Peak VRAM | 18.8 GB (9.8% of 192 GB) |
	| Trainable params | 506,634,112 (100%) |
	| Total steps | 150 |
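
The step count follows from the dataset size and batch settings listed above. The sketch below reproduces that arithmetic. It assumes a gradient accumulation factor of 2 (inferred from the batch size of 8 and the effective batch size of 16), one optimizer step per effective batch, and that the last partial batch is kept.

```python
import math

# Values taken from the Training Details and Training Metrics tables.
num_samples = 1195
effective_batch = 16   # per-device batch of 8 x assumed gradient accumulation of 2
epochs = 2
train_seconds = 83

# Assumption: one optimizer step per effective batch, partial last batch kept.
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs
seconds_per_step = train_seconds / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"seconds/step:    {seconds_per_step:.2f}")
```

Under these assumptions the result matches the 150 total steps reported in the table.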

	### Training Loss Curve

	Loss drops from 6.90 at step 1 to about 5.7 by step 50, then stays between 5.62 and 5.83 through step 150:

	| Step | Loss | Step | Loss | Step | Loss |
	|:---:|:---:|:---:|:---:|:---:|:---:|
	| 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
	| 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
	| 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
	| 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
	| 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |
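
The logged values can be summarized in a few lines of Python. This is a small sketch using only the numbers in the table above: it finds the lowest logged loss, averages the loss from step 50 onward, and converts the final loss to a perplexity. The perplexity conversion assumes the reported loss is a mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

# (step, loss) pairs copied from the loss table above.
logged = [
    (1, 6.90), (10, 6.85), (20, 6.34), (30, 5.90), (40, 5.92),
    (50, 5.70), (60, 5.62), (70, 5.76), (80, 5.71), (90, 5.79),
    (100, 5.72), (110, 5.77), (120, 5.72), (130, 5.79), (150, 5.83),
]

# Lowest logged loss and the step where it occurred.
best_step, best_loss = min(logged, key=lambda pair: pair[1])

# Mean loss over the plateau region (step 50 onward).
late_losses = [loss for step, loss in logged if step >= 50]
late_mean = sum(late_losses) / len(late_losses)

# Assumption: loss is mean token-level cross-entropy in nats.
final_perplexity = math.exp(5.827)

print(f"lowest logged loss: {best_loss:.2f} at step {best_step}")
print(f"mean loss from step 50: {late_mean:.3f}")
print(f"final perplexity (assumed nats): {final_perplexity:.0f}")
```

The lowest logged value occurs well before the final step, which is consistent with the loss having plateaued rather than still falling at the end of training.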

	---

	## 🚀 Quick Start

	### Prerequisites

	```bash
	pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
	pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
	pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile

	# Clone Spark-TTS for BiCodec tokenizer
	git clone https://github.com/SparkAudio/Spark-TTS
	```
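
Before running inference it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the Finatts repository: `missing_packages` is a hypothetical helper, and the import names in `REQUIRED` are assumed to match the pip package names listed above.

```python
import importlib.util

# Import names assumed to match the pip packages from the install commands.
REQUIRED = ["unsloth", "transformers", "trl", "omegaconf", "einx", "datasets", "soundfile"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

The check only looks up each package without importing it, so it is quick and has no side effects. It does not verify version constraints.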

	### Inference

	```python
	import torch
	import re
	import sys
	import numpy as np
	import soundfile as sf
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from huggingface_hub import snapshot_download

	sys.path.append("Spark-TTS")
	from sparktts.models.audio_tokenizer import BiCodecTokenizer

	# Load model
	model_id = "Featherlabs/Finatts"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# Load BiCodec for audio detokenization
	snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
	audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

	# Generate speech
	text = "Hey there, my name is Elise! Nice to meet you."
	prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"

	inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
	generated = model.generate(
	    **inputs,
	    max_new_tokens=2048,
	    do_sample=True,
	    temperature=0.8,
	    top_k=50,
	    top_p=1.0,
	)

	# Decode tokens
	output_text = tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
	semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
	global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

	# Convert to audio
	pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to("cuda")
	pred_global = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to("cuda")

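	wav = audio_tokenizer.detokenize(pred_global, pred_semantic)
	# Assumption: depending on the Spark-TTS checkout, detokenize may return
	# a torch tensor or a NumPy array, so normalize to NumPy before saving.
	if torch.is_tensor(wav):
	    wav = wav.detach().cpu().numpy()
	sf.write("output.wav", np.squeeze(wav), 16000)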
	print("✅ Saved output.wav")
	```

	---

	## 🏗️ Architecture

	Spark-TTS uses a unique approach that separates speech into two token streams:

	```
	Text Input → [LLM Backbone] → Global Tokens (speaker identity)
	                            → Semantic Tokens (content + prosody)
	                                          ↓
	                              [BiCodec Decoder] → Waveform
	```

	| Component | Details |
	|:---|:---|
	| LLM | Qwen2-0.5B (507M params), generates audio token sequences |
	| BiCodec | Neural audio codec with global + semantic tokenization |
	| Wav2Vec2 | `wav2vec2-large-xlsr-53`, feature extraction for tokenization |
	| Sample rate | 16 kHz |
	| Token types | `bicodec_global_` (speaker) + `bicodec_semantic_` (content) |
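
The two token streams can be separated from the decoded LLM output with plain regular expressions, as the inference snippet does. The sketch below wraps that step in two small helpers. `build_prompt` and `split_token_streams` are hypothetical names, and the `sample` string is made up for illustration, not real model output.

```python
import re


def build_prompt(text: str) -> str:
    """Wrap input text in the TTS prompt format used in the inference snippet."""
    return (
        f"<|task_tts|><|start_content|>{text}<|end_content|>"
        "<|start_global_token|>"
    )


def split_token_streams(output_text: str):
    """Extract global (speaker) and semantic (content) token IDs from decoded text."""
    global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]
    semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
    return global_ids, semantic_ids


# Illustrative string only; real output contains many more tokens.
sample = (
    "<|bicodec_global_12|><|bicodec_global_7|>"
    "<|bicodec_semantic_301|><|bicodec_semantic_42|>"
)
global_ids, semantic_ids = split_token_streams(sample)
print(global_ids, semantic_ids)
```

Keeping the two lists separate mirrors the architecture: the global IDs go to the decoder as speaker information and the semantic IDs as content.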

	---

	## 📦 Model Files

	This repository contains the fine-tuned LLM weights. A full inference setup combines them with the other components listed here:

	| File | Source |
	|:---|:---|
	| LLM weights | This repo (`Featherlabs/Finatts`) |
	| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
	| Wav2Vec2 features | Included in Spark-TTS-0.5B |
	| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |

	---

	## ⚠️ Known Issues

	- Detokenization error — An `AxisSizeError` in `einx` can occur during inference when the generated global token count doesn't match the expected quantizer codebook dimensions (`q [c] d, b n q -> q b n d`). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format. A workaround is being investigated.
	- Single speaker — Fine-tuned on a single voice (Elise); multi-speaker capabilities from the base model may be degraded.
	- English only — Only tested with English text inputs.
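
Until the detokenization error is resolved, one way to surface the problem earlier is to check token counts before calling the BiCodec decoder. This is a minimal sketch under stated assumptions: `check_token_counts` is a hypothetical helper, and the expected count of 32 global tokens is an assumption about the BiCodec configuration that should be verified against the Spark-TTS config. It reports a mismatch; it does not fix it.

```python
# Assumption: BiCodec expects a fixed number of global tokens per utterance.
# Verify this value against the Spark-TTS / BiCodec configuration before use.
EXPECTED_GLOBAL_TOKENS = 32


def check_token_counts(global_ids, semantic_ids, expected_global=EXPECTED_GLOBAL_TOKENS):
    """Return a list of human-readable problems; an empty list means the counts look fine."""
    problems = []
    if len(global_ids) != expected_global:
        problems.append(
            f"expected {expected_global} global tokens, got {len(global_ids)}"
        )
    if not semantic_ids:
        problems.append("no semantic tokens were generated")
    return problems


# Example with a deliberately short global sequence.
issues = check_token_counts([5, 9, 13], [101, 102, 103])
for issue in issues:
    print("Token check failed:", issue)
```

Running this check on the parsed IDs before `detokenize` gives a readable message instead of a shape error from inside `einx`, and makes it possible to retry generation when the counts are off.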

	---

	## ⚠️ Limitations

	- Single speaker model — optimized for the Elise voice character
	- 16kHz output — not yet upsampled to 24kHz/48kHz
	- Requires Spark-TTS codebase — BiCodec tokenizer is needed for both training and inference
	- ROCm-specific — trained on AMD MI300X; CUDA users may need minor adjustments
	- Short training — only 2 epochs / 150 steps; additional training may improve quality

	---

	## 🔮 What's Next

	- 🐛 Fix inference — resolve the `einx` AxisSizeError in detokenization
	- 🎭 Emotion tags — add explicit emotion control (`[happy]`, `[sad]`, `[surprised]`)
	- 📈 Extended training — more epochs with larger/diverse datasets
	- 🔊 Super-resolution — upsample to 24kHz/48kHz for higher fidelity
	- 🗣️ Multi-speaker — train on multiple voices for speaker-switchable TTS

	---

	## 📜 License

	Apache 2.0 — consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).

	---

	<div align="center">

	Built with ❤️ by [Featherlabs](https://huggingface.co/Featherlabs)

	Operated by Owlkun

	</div>