---
language:
- en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
- MrDragonFox/Elise
tags:
- tts
- text-to-speech
- spark-tts
- voice-cloning
- unsloth
- trl
- sft
- featherlabs
- audio
library_name: transformers
pipeline_tag: text-to-speech
---
# 🔊 Finatts

### *High-fidelity voice cloning: fine-tuned Spark-TTS*

**Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec**

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) [![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise) [![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts)

*Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun*
---

## ✨ What is Finatts?

Finatts is a **507M-parameter text-to-speech model** fine-tuned for **high-fidelity single-speaker voice cloning**. It is built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) and trained on the [Elise](https://huggingface.co/datasets/MrDragonFox/Elise) dataset, a curated collection of ~1,200 voice samples (~3 hours) with rich emotional range.

Spark-TTS uses a novel **BiCodec** architecture that decomposes speech into:

- **Global tokens**: speaker identity, timbre, and style
- **Semantic tokens**: linguistic content and prosody

This separation enables zero-shot voice cloning and controllable speech synthesis.

### 🎯 Built For

| Capability | Description |
|:---:|---|
| 🎙️ **Voice Cloning** | Clone a specific voice from reference audio samples |
| 🎭 **Emotion Synthesis** | Generate speech with varied emotional tones |
| 📝 **Text-to-Speech** | Convert text to natural, expressive speech |
| 🔊 **High-Fidelity Audio** | 16kHz output with BiCodec tokenization |

---

## 🏋️ Training Details
| Property | Value |
|:---|:---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| LLM backbone | Qwen2-0.5B (507M params) |
| Dataset | MrDragonFox/Elise (1,195 samples, ~3h) |
| Training type | Full supervised fine-tuning (SFT) |
| Epochs | 2 |
| Batch size | 8 (effective 16 with gradient accumulation) |
| Learning rate | 1e-4 |
| Warmup steps | 20 |
| Context length | 4,096 tokens |
| Precision | BF16 |
| Optimizer | AdamW (torch fused) |
| LR scheduler | Cosine |
| Framework | Unsloth + TRL (`SFTTrainer`) |
| Hardware | AMD MI300X (192GB HBM3) |
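As a quick consistency check (a sketch derived from the table above, not taken from the training logs): 1,195 samples over 2 epochs at an effective batch size of 16 works out to the reported 150 optimizer steps.

```python
import math

# Derive the total optimizer steps from the hyperparameters above:
# 1,195 samples x 2 epochs / effective batch size 16 (8 per device x 2 grad accum).
samples, epochs, effective_batch = 1195, 2, 16
total_steps = math.ceil(samples * epochs / effective_batch)
print(total_steps)  # 150
```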
### 📊 Training Metrics

| Metric | Value |
|:---|:---:|
| **Final loss** | 5.827 |
| **Training time** | 83s (1.4 min) |
| **Peak VRAM** | 18.8 GB (9.8% of 192GB) |
| **Trainable params** | 506,634,112 (100%) |
| **Total steps** | 150 |

### Training Loss Curve

The model shows healthy convergence from **~7.0 → ~5.8** over 150 steps:

| Step | Loss | Step | Loss | Step | Loss |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
| 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
| 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
| 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
| 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |

---

## 🚀 Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile

# Clone Spark-TTS for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```

### Inference

```python
import re
import sys

import soundfile as sf
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Load the fine-tuned LLM
model_id = "Featherlabs/Finatts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load BiCodec for audio detokenization
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Generate speech
text = "Hey there, my name is Elise! Nice to meet you."
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

generated = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
)

# Decode tokens
output_text = tokenizer.decode(
    generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=False
)
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

# Convert to audio
pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to("cuda")
pred_global = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to("cuda")
wav = audio_tokenizer.detokenize(pred_global, pred_semantic)
sf.write("output.wav", wav.squeeze().cpu().numpy(), 16000)
print("✅ Saved output.wav")
```

---

## 🏗️ Architecture

Spark-TTS uses a unique approach that separates speech into two token streams:

```
Text Input → [LLM Backbone] → Global Tokens (speaker identity)
                            → Semantic Tokens (content + prosody)
                                        ↓
                              [BiCodec Decoder] → Waveform
```

| Component | Details |
|:---|:---|
| **LLM** | Qwen2-0.5B (507M params), generates audio token sequences |
| **BiCodec** | Neural audio codec with global + semantic tokenization |
| **Wav2Vec2** | `wav2vec2-large-xlsr-53`, feature extraction for tokenization |
| **Sample rate** | 16kHz |
| **Token types** | `bicodec_global_*` (speaker) + `bicodec_semantic_*` (content) |

---

## 📦 Model Files

The repository contains the fine-tuned LLM weights.
For inference, you also need:

| File | Source |
|:---|:---|
| LLM weights | This repo (`Featherlabs/Finatts`) |
| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
| Wav2Vec2 features | Included in Spark-TTS-0.5B |
| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |

---

## ⚠️ Known Issues

- **Detokenization error**: An `AxisSizeError` in `einx` can occur during inference when the generated global token count doesn't match the expected quantizer codebook dimensions (`q [c] d, b n q -> q b n d`). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format. A workaround is being investigated.
- **Single speaker**: Fine-tuned on a single voice (Elise); multi-speaker capabilities from the base model may be degraded.
- **English only**: Only tested with English text inputs.

---

## ⚠️ Limitations

- **Single-speaker model**: optimized for the Elise voice character
- **16kHz output**: not yet upsampled to 24kHz/48kHz
- **Requires the Spark-TTS codebase**: the BiCodec tokenizer is needed for both training and inference
- **ROCm-specific**: trained on AMD MI300X; CUDA users may need minor adjustments
- **Short training**: only 2 epochs / 150 steps; additional training may improve quality

---

## 🔮 What's Next

- 🐛 **Fix inference**: resolve the `einx` `AxisSizeError` in detokenization
- 🎭 **Emotion tags**: add explicit emotion control (`[happy]`, `[sad]`, `[surprised]`)
- 📈 **Extended training**: more epochs with larger/more diverse datasets
- 🔊 **Super-resolution**: upsample to 24kHz/48kHz for higher fidelity
- 🗣️ **Multi-speaker**: train on multiple voices for speaker-switchable TTS

---

## 📜 License

Apache 2.0, consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).

---
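As a standalone illustration of the token format the inference script parses, the snippet below runs the same regex extraction on a hypothetical output string (the token values here are made up for illustration) and bails out early when generation produced no usable tokens, rather than hitting a shape error inside BiCodec/`einx` as described under Known Issues:

```python
import re

# Hypothetical generated text (real outputs are far longer).
output_text = (
    "<|bicodec_global_12|><|bicodec_global_345|>"
    "<|bicodec_semantic_7|><|bicodec_semantic_890|><|bicodec_semantic_42|>"
)

# Same extraction as the inference script.
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

# Guard: refuse to detokenize empty token streams.
if not global_ids or not semantic_ids:
    raise ValueError("generation produced no usable BiCodec tokens")

print(len(global_ids), len(semantic_ids))  # 2 3
```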
**Built with ❤️ by [Featherlabs](https://huggingface.co/Featherlabs)**

*Operated by Owlkun*