---
language:
- en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
- MrDragonFox/Elise
tags:
- tts
- text-to-speech
- spark-tts
- voice-cloning
- emotion-tags
- unsloth
- trl
- sft
- featherlabs
- audio
- amd-mi300x
library_name: transformers
pipeline_tag: text-to-speech
---
<div align="center">
# ๐Ÿ”Š Finatts Enhanced
### *High-fidelity voice cloning — Finatts v2, a fine-tuned Spark-TTS*
**Text-to-Speech ยท Voice Cloning ยท Emotion Tags ยท Portable Voice Profile**
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)
[![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise)
[![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts-enhanced)
*Built by [Featherlabs](https://huggingface.co/Featherlabs) ยท Operated by Owlkun*
</div>
---
## โœจ What is Finatts Enhanced?
Finatts Enhanced is an improved **507M-parameter text-to-speech model** built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B), fine-tuned for **high-fidelity single-speaker voice cloning** with emotion tag support.
Compared to the original [Finatts](https://huggingface.co/Featherlabs/Finatts), this version features **3ร— the training**, a more stable learning rate, and a portable voice profile (`elise_voice.safetensors`) โ€” no reference audio needed at inference time.
### Improvements over v1
| Setting | Finatts v1 | Finatts Enhanced |
|:---|:---:|:---:|
| Epochs | 2 | **6** |
| Learning rate | 1e-4 | **5e-5** |
| Warmup steps | 20 | **50** |
| Weight decay | 0.001 | **0.01** |
| Emotion tags | โŒ | **โœ…** |
| Voice profile | โŒ | **โœ… `elise_voice.safetensors`** |
| Final loss | 5.827 | **5.806** |
### ๐ŸŽฏ Built For
| Capability | Description |
|:---:|---|
| ๐ŸŽ™๏ธ **Voice Cloning** | Clone Elise's voice โ€” no reference audio required |
| ๐ŸŽญ **Emotion Tags** | `<laughs>` `<giggles>` `<whispers>` `<sighs>` `<chuckles>` `<long pause>` |
| ๐Ÿ“ **Text-to-Speech** | Convert text to natural, expressive speech |
| ๐Ÿ“ฆ **Portable Profile** | Load `elise_voice.safetensors` โ€” deploy anywhere |
---
## ๐Ÿ‹๏ธ Training Details
<table>
<tr><td><b>Property</b></td><td><b>Value</b></td></tr>
<tr><td>Base model</td><td><a href="https://huggingface.co/SparkAudio/Spark-TTS-0.5B">SparkAudio/Spark-TTS-0.5B</a></td></tr>
<tr><td>LLM backbone</td><td>Qwen2-0.5B (507M params)</td></tr>
<tr><td>Dataset</td><td><a href="https://huggingface.co/datasets/MrDragonFox/Elise">MrDragonFox/Elise</a> (1,195 samples, ~3h)</td></tr>
<tr><td>Training type</td><td>Full Supervised Fine-Tuning (SFT)</td></tr>
<tr><td>Epochs</td><td>6</td></tr>
<tr><td>Batch size</td><td>8 (effective 16 with grad accum)</td></tr>
<tr><td>Learning rate</td><td>5e-5</td></tr>
<tr><td>Warmup steps</td><td>50</td></tr>
<tr><td>Weight decay</td><td>0.01</td></tr>
<tr><td>Context length</td><td>4,096 tokens</td></tr>
<tr><td>Precision</td><td>BF16</td></tr>
<tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr>
<tr><td>LR scheduler</td><td>Cosine</td></tr>
<tr><td>Framework</td><td>Unsloth + TRL (SFTTrainer)</td></tr>
<tr><td>Hardware</td><td>AMD MI300X (192GB HBM3)</td></tr>
</table>
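For reference, the hyperparameters in the table above map onto TRL's `SFTConfig` / `transformers.TrainingArguments` keyword arguments roughly as follows. This is an illustrative sketch, not the exact script used for training, and argument names (e.g. the context-length key) vary slightly across TRL versions:

```python
# Training hyperparameters from the table above, as keyword arguments in the
# style of trl.SFTConfig / transformers.TrainingArguments (illustrative only).
sft_kwargs = dict(
    num_train_epochs=6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    bf16=True,
    optim="adamw_torch_fused",
    lr_scheduler_type="cosine",
    max_length=4096,                 # 4,096-token context window
)

# Effective batch size = per-device batch x gradient accumulation steps
effective_batch = (
    sft_kwargs["per_device_train_batch_size"]
    * sft_kwargs["gradient_accumulation_steps"]
)
```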
### ๐Ÿ“Š Training Metrics
| Metric | Value |
|:---|:---:|
| **Final loss** | 5.806 |
| **Training time** | 144s (2.4 min) |
| **Peak VRAM** | 22.5 GB (11.7% of 192GB) |
| **Trainable params** | 506,634,112 (100%) |
| **Total steps** | 450 |
### Training Loss Curve
Loss falls from **~6.9** to **~5.8** over 450 steps — three times the training steps of v1:
| Step | Loss | Step | Loss | Step | Loss |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 6.90 | 150 | 5.79 | 300 | 5.74 |
| 50 | 5.82 | 200 | 5.76 | 400 | 5.77 |
| 100 | 5.77 | 250 | 5.73 | 450 | 5.81 |
---
## ๐Ÿš€ Quick Start
### Prerequisites
```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile safetensors
# Clone Spark-TTS for BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```
### Inference with Elise Voice Profile
```python
import torch, re, sys
import json
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import snapshot_download, hf_hub_download

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_ID = "Featherlabs/Finatts-enhanced"

# Load LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load BiCodec
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Load Elise voice profile (global token IDs — no reference audio needed)
profile_path = hf_hub_download(MODEL_ID, "elise_voice_profile.json")
with open(profile_path) as f:
    profile = json.load(f)
elise_global_ids = profile["global_token_ids"]
elise_global_token_str = profile["global_token_str"]

@torch.inference_mode()
def generate_speech(text, temperature=0.8, top_k=40, top_p=0.92):
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>",
        elise_global_token_str,  # Elise's voice injected here
        "<|end_global_token|>",
        "<|start_semantic_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    generated = model.generate(
        **inputs, max_new_tokens=2048,
        do_sample=True, temperature=temperature,
        top_k=top_k, top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )
    out = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
    )[0]
    sem = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", out)]
    if not sem:
        return None
    pred_sem = torch.tensor(sem, dtype=torch.long).unsqueeze(0).to("cuda")
    pred_global = torch.tensor(elise_global_ids, dtype=torch.long).unsqueeze(0).to("cuda")
    audio_tokenizer.model.to("cuda")
    return audio_tokenizer.detokenize(pred_global, pred_sem).squeeze().cpu().numpy()

# Try emotion tags!
texts = [
    "Hey there! My name is Elise, nice to meet you.",
    "<laughs> Oh my gosh, I can't believe that actually worked!",
    "<whispers> Come closer... I have a secret to tell you.",
    "<sighs> Some days just feel heavier than others.",
]
for i, text in enumerate(texts):
    wav = generate_speech(text)
    if wav is not None:
        sf.write(f"output_{i+1}.wav", wav, 16000)
        print(f"✅ output_{i+1}.wav")
```
---
## ๐ŸŽญ Emotion Tags
The Elise dataset includes inline emotion tags captured from real speech. Place them anywhere in your text:
| Tag | Effect |
|:---|:---|
| `<laughs>` | Lighter, brighter intonation |
| `<giggles>` | Playful, uptick in pitch |
| `<whispers>` | Softer, breathier delivery |
| `<sighs>` | Drawn-out, melancholic tone |
| `<chuckles>` | Gentle amusement |
| `<long pause>` | Extended pause in speech |
**Note:** Tags produce **intonation variation** rather than literal acoustic sounds (e.g., actual giggling audio). For acoustic emotion effects, see [Orpheus-TTS](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).
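Because the tags are plain inline strings, a quick pre-check can catch typos before spending a generation on them. `check_tags` below is a hypothetical helper, not part of the model or this repo:

```python
import re

# Tags present in the Elise training data (see the table above).
SUPPORTED_TAGS = {"laughs", "giggles", "whispers", "sighs", "chuckles", "long pause"}

def check_tags(text):
    """Return any <...> tags in `text` that the model was not trained on."""
    found = re.findall(r"<([^<>]+)>", text)
    return [tag for tag in found if tag not in SUPPORTED_TAGS]

check_tags("<whispers> Come closer...")  # -> []
check_tags("<screams> No!")              # -> ["screams"]
```

Unknown tags are not an error; the model simply has no learned behavior for them, so they tend to be read out or ignored.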
---
## ๐Ÿ—๏ธ Architecture
```
Text + Emotion Tags
โ†“
[LLM: Qwen2-0.5B]
โ”Œโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”
Global tokens Semantic tokens
(speaker ID) (content + prosody)
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
[BiCodec Decoder]
โ†“
Waveform 16kHz
```
| Component | Details |
|:---|:---|
| **LLM** | Qwen2-0.5B (507M params) |
| **BiCodec** | Neural audio codec โ€” global + semantic tokenization |
| **Wav2Vec2** | `wav2vec2-large-xlsr-53` โ€” feature extraction |
| **Sample rate** | 16kHz |
| **Voice profile** | `elise_voice.safetensors` โ€” 1024-dim d-vector |
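The two token streams can be pulled apart from the LLM's decoded output with the same regex style used in the quick start. A minimal sketch, assuming `bicodec_global_` mirrors the `bicodec_semantic_` token naming:

```python
import re

def split_streams(decoded):
    """Separate global (speaker identity) and semantic (content + prosody)
    token IDs from decoded LLM output."""
    global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", decoded)]
    semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", decoded)]
    return global_ids, semantic_ids

out = "<|bicodec_global_12|><|bicodec_semantic_7|><|bicodec_semantic_8|>"
split_streams(out)  # -> ([12], [7, 8])
```

Both ID lists are then handed to the BiCodec decoder, which reconstructs the 16kHz waveform.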
---
## ๐Ÿ“ฆ Repository Files
| File | Description |
|:---|:---|
| `model.safetensors` | Fine-tuned LLM weights (966MB, 16-bit merged) |
| `elise_voice.safetensors` | Elise speaker d-vector (1024-dim, avg of 10 clips) |
| `tokenizer.json` | Tokenizer including BiCodec special tokens |
| `config.json` | Model configuration |
For inference you also need:
| File | Source |
|:---|:---|
| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |
---
## โš ๏ธ Limitations
- **English only** — tested only with English text inputs
- **Single speaker** — optimized for Elise; the base model's multi-speaker capability may be degraded
- **16kHz output** โ€” use [audiosr](https://github.com/haoheliu/versatile_audio_super_resolution) for upsampling to 44.1kHz
- **Emotion intensity** โ€” tags produce subtle intonation changes, not acoustic emotion sounds
- **ROCm-trained** โ€” tested on AMD MI300X; CUDA users may need minor env adjustments
---
## ๐Ÿ”ฎ What's Next
- ๐Ÿ”Š **Super-resolution** โ€” integrate audiosr for 44.1kHz HD output
- ๐Ÿ—ฃ๏ธ **Multi-speaker** โ€” train on multiple voices
- ๐Ÿ“ˆ **Larger dataset** โ€” more hours of Elise audio for stronger emotion control
- ๐ŸŽญ **Acoustic emotions** โ€” explore Orpheus-style explicit emotion tokens
---
## ๐Ÿ“œ License
Apache 2.0 โ€” consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).
---
<div align="center">
**Built with โค๏ธ by [Featherlabs](https://huggingface.co/Featherlabs)**
*Operated by Owlkun*
</div>