---
language:
- en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
- MrDragonFox/Elise
tags:
- tts
- text-to-speech
- spark-tts
- voice-cloning
- unsloth
- trl
- sft
- featherlabs
- audio
library_name: transformers
pipeline_tag: text-to-speech
---
# 🔊 Finatts

### *High-fidelity voice cloning: fine-tuned Spark-TTS*

**Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec**

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) [![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise) [![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts)

*Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun*
---

## ✨ What is Finatts?

Finatts is a **507M-parameter text-to-speech model** fine-tuned for **high-fidelity single-speaker voice cloning**. It is built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) and trained on the [Elise](https://huggingface.co/datasets/MrDragonFox/Elise) dataset, a curated collection of ~1,200 voice samples (~3 hours) with rich emotional range.

Spark-TTS uses a novel **BiCodec** architecture that decomposes speech into:

- **Global tokens**: speaker identity, timbre, and style
- **Semantic tokens**: linguistic content and prosody

This separation enables zero-shot voice cloning and controllable speech synthesis.

### 🎯 Built For

| Capability | Description |
|:---:|---|
| 🎙️ **Voice Cloning** | Clone a specific voice from reference audio samples |
| 🎭 **Emotion Synthesis** | Generate speech with varied emotional tones |
| 📝 **Text-to-Speech** | Convert text to natural, expressive speech |
| 🔊 **High-Fidelity Audio** | 16kHz output with BiCodec tokenization |

---

## 🏋️ Training Details
| Property | Value |
|:---|:---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| LLM backbone | Qwen2-0.5B (507M params) |
| Dataset | MrDragonFox/Elise (1,195 samples, ~3h) |
| Training type | Full supervised fine-tuning (SFT) |
| Epochs | 2 |
| Batch size | 8 (effective 16 with gradient accumulation) |
| Learning rate | 1e-4 |
| Warmup steps | 20 |
| Context length | 4,096 tokens |
| Precision | BF16 |
| Optimizer | AdamW (torch fused) |
| LR scheduler | Cosine |
| Framework | Unsloth + TRL (`SFTTrainer`) |
| Hardware | AMD MI300X (192GB HBM3) |
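As a quick consistency check (a sketch derived from the table above, not taken from the training logs): 1,195 samples over 2 epochs at an effective batch size of 16 works out to the reported 150 optimizer steps.

```python
import math

# Derive the total optimizer steps from the hyperparameters above:
# 1,195 samples x 2 epochs / effective batch size 16 (8 per device x 2 grad accum).
samples, epochs, effective_batch = 1195, 2, 16
total_steps = math.ceil(samples * epochs / effective_batch)
print(total_steps)  # 150
```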
### 📊 Training Metrics

| Metric | Value |
|:---|:---:|
| **Final loss** | 5.827 |
| **Training time** | 83s (1.4 min) |
| **Peak VRAM** | 18.8 GB (9.8% of 192GB) |
| **Trainable params** | 506,634,112 (100%) |
| **Total steps** | 150 |

### Training Loss Curve

The model shows healthy convergence from **~7.0 → ~5.8** over 150 steps:

| Step | Loss | Step | Loss | Step | Loss |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
| 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
| 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
| 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
| 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |

---

## 🚀 Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile

# Clone Spark-TTS for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```

### Inference

```python
import re
import sys

import soundfile as sf
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

# Load the fine-tuned LLM
model_id = "Featherlabs/Finatts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load BiCodec for audio detokenization
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Generate speech
text = "Hey there, my name is Elise! Nice to meet you."
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

generated = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
)

# Decode tokens
output_text = tokenizer.decode(
    generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=False
)
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

# Convert to audio
pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to("cuda")
pred_global = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to("cuda")
wav = audio_tokenizer.detokenize(pred_global, pred_semantic)
sf.write("output.wav", wav.squeeze().cpu().numpy(), 16000)
print("✅ Saved output.wav")
```

---

## 🏗️ Architecture

Spark-TTS uses a unique approach that separates speech into two token streams:

```
Text Input → [LLM Backbone] → Global Tokens (speaker identity)
                            → Semantic Tokens (content + prosody)
                                        ↓
                              [BiCodec Decoder] → Waveform
```

| Component | Details |
|:---|:---|
| **LLM** | Qwen2-0.5B (507M params), generates audio token sequences |
| **BiCodec** | Neural audio codec with global + semantic tokenization |
| **Wav2Vec2** | `wav2vec2-large-xlsr-53`, feature extraction for tokenization |
| **Sample rate** | 16kHz |
| **Token types** | `bicodec_global_*` (speaker) + `bicodec_semantic_*` (content) |

---

## 📦 Model Files

The repository contains the fine-tuned LLM weights.
For inference, you also need:

| File | Source |
|:---|:---|
| LLM weights | This repo (`Featherlabs/Finatts`) |
| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
| Wav2Vec2 features | Included in Spark-TTS-0.5B |
| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |

---

## ⚠️ Known Issues

- **Detokenization error**: An `AxisSizeError` in `einx` can occur during inference when the generated global token count doesn't match the expected quantizer codebook dimensions (`q [c] d, b n q -> q b n d`). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format. A workaround is being investigated.
- **Single speaker**: Fine-tuned on a single voice (Elise); multi-speaker capabilities from the base model may be degraded.
- **English only**: Only tested with English text inputs.

---

## ⚠️ Limitations

- **Single-speaker model**: optimized for the Elise voice character
- **16kHz output**: not yet upsampled to 24kHz/48kHz
- **Requires the Spark-TTS codebase**: the BiCodec tokenizer is needed for both training and inference
- **ROCm-specific**: trained on AMD MI300X; CUDA users may need minor adjustments
- **Short training**: only 2 epochs / 150 steps; additional training may improve quality

---

## 🔮 What's Next

- 🐛 **Fix inference**: resolve the `einx` `AxisSizeError` in detokenization
- 🎭 **Emotion tags**: add explicit emotion control (`[happy]`, `[sad]`, `[surprised]`)
- 📈 **Extended training**: more epochs with larger/more diverse datasets
- 🔊 **Super-resolution**: upsample to 24kHz/48kHz for higher fidelity
- 🗣️ **Multi-speaker**: train on multiple voices for speaker-switchable TTS

---

## 📜 License

Apache 2.0, consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).

---
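As a standalone illustration of the token format the inference script parses, the snippet below runs the same regex extraction on a hypothetical output string (the token values here are made up for illustration) and bails out early when generation produced no usable tokens, rather than hitting a shape error inside BiCodec/`einx` as described under Known Issues:

```python
import re

# Hypothetical generated text (real outputs are far longer).
output_text = (
    "<|bicodec_global_12|><|bicodec_global_345|>"
    "<|bicodec_semantic_7|><|bicodec_semantic_890|><|bicodec_semantic_42|>"
)

# Same extraction as the inference script.
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

# Guard: refuse to detokenize empty token streams.
if not global_ids or not semantic_ids:
    raise ValueError("generation produced no usable BiCodec tokens")

print(len(global_ids), len(semantic_ids))  # 2 3
```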
**Built with ❤️ by [Featherlabs](https://huggingface.co/Featherlabs)**

*Operated by Owlkun*