---
language:
- en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
- MrDragonFox/Elise
tags:
- tts
- text-to-speech
- spark-tts
- voice-cloning
- unsloth
- trl
- sft
- featherlabs
- audio
library_name: transformers
pipeline_tag: text-to-speech
---

<div align="center">

# Finatts

### *High-fidelity voice cloning with a fine-tuned Spark-TTS*

**Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec**

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Base model: Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)
[Dataset: Elise](https://huggingface.co/datasets/MrDragonFox/Elise)
[Model: Featherlabs/Finatts](https://huggingface.co/Featherlabs/Finatts)

*Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun*

</div>

---

## What is Finatts?

Finatts is a **507M-parameter text-to-speech model** fine-tuned for **high-fidelity single-speaker voice cloning**. It is built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) and trained on the [Elise](https://huggingface.co/datasets/MrDragonFox/Elise) dataset, a curated collection of ~1,200 voice samples (~3 hours) with rich emotional range.

Spark-TTS uses a **BiCodec** architecture that decomposes speech into two kinds of tokens:
- **Global tokens**: speaker identity, timbre, and style
- **Semantic tokens**: linguistic content and prosody

This separation enables zero-shot voice cloning and controllable speech synthesis.

### Built For

| Capability | Description |
|:---:|---|
| **Voice Cloning** | Clone a specific voice from reference audio samples |
| **Emotion Synthesis** | Generate speech with varied emotional tones |
| **Text-to-Speech** | Convert text to natural, expressive speech |
| **High-Fidelity Audio** | 16 kHz output with BiCodec tokenization |

---

## Training Details

<table>
<tr><td><b>Property</b></td><td><b>Value</b></td></tr>
<tr><td>Base model</td><td><a href="https://huggingface.co/SparkAudio/Spark-TTS-0.5B">SparkAudio/Spark-TTS-0.5B</a></td></tr>
<tr><td>LLM backbone</td><td>Qwen2-0.5B (507M params)</td></tr>
<tr><td>Dataset</td><td><a href="https://huggingface.co/datasets/MrDragonFox/Elise">MrDragonFox/Elise</a> (1,195 samples, ~3h)</td></tr>
<tr><td>Training type</td><td>Full Supervised Fine-Tuning (SFT)</td></tr>
<tr><td>Epochs</td><td>2</td></tr>
<tr><td>Batch size</td><td>8 (effective 16 with gradient accumulation)</td></tr>
<tr><td>Learning rate</td><td>1e-4</td></tr>
<tr><td>Warmup steps</td><td>20</td></tr>
<tr><td>Context length</td><td>4,096 tokens</td></tr>
<tr><td>Precision</td><td>BF16</td></tr>
<tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr>
<tr><td>LR scheduler</td><td>Cosine</td></tr>
<tr><td>Framework</td><td>Unsloth + TRL (SFTTrainer)</td></tr>
<tr><td>Hardware</td><td>AMD MI300X (192GB HBM3)</td></tr>
</table>

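
The numbers in the table are mutually consistent, which the short sketch below checks: 1,195 samples at an effective batch size of 16 give 75 optimizer steps per epoch, or 150 steps over 2 epochs. The `training_args` dictionary is only an illustration of how the table could map onto `transformers.TrainingArguments` field names; it is an assumption, not the authors' actual training script, and it presumes a single device.

```python
import math

# Values copied from the Training Details table.
num_samples = 1195
per_device_batch_size = 8
effective_batch_size = 16
num_epochs = 2

# Assumption: a single device, so accumulation = effective / per-device.
grad_accum_steps = effective_batch_size // per_device_batch_size

# One optimizer step consumes one effective batch; the final partial batch
# of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_samples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

# Hypothetical mapping onto transformers.TrainingArguments field names.
training_args = {
    "per_device_train_batch_size": per_device_batch_size,
    "gradient_accumulation_steps": grad_accum_steps,
    "num_train_epochs": num_epochs,
    "learning_rate": 1e-4,
    "warmup_steps": 20,
    "lr_scheduler_type": "cosine",
    "optim": "adamw_torch_fused",
    "bf16": True,
}

print(grad_accum_steps, steps_per_epoch, total_steps)  # 2 75 150
```
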
### Training Metrics

| Metric | Value |
|:---|:---:|
| **Final loss** | 5.827 |
| **Training time** | 83s (1.4 min) |
| **Peak VRAM** | 18.8 GB (9.8% of 192GB) |
| **Trainable params** | 506,634,112 (100%) |
| **Total steps** | 150 |

### Training Loss Curve

The loss falls from ~6.9 to ~5.7 in the first 50 steps, then stays between roughly 5.6 and 5.8 through step 150:

| Step | Loss | Step | Loss | Step | Loss |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
| 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
| 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
| 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
| 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |

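
As a rough way to read the table, the sketch below recomputes a few summary figures from the logged values: the lowest logged loss, the relative drop from the first to the last entry, and how much the loss moves after step 50. It uses only the numbers listed above (step 140 is not in the table), and the choice of step 50 as the start of the plateau is an assumption made for illustration.

```python
# (step, loss) pairs copied from the loss table; step 140 is not listed there.
loss_log = [
    (1, 6.90), (10, 6.85), (20, 6.34), (30, 5.90), (40, 5.92),
    (50, 5.70), (60, 5.62), (70, 5.76), (80, 5.71), (90, 5.79),
    (100, 5.72), (110, 5.77), (120, 5.72), (130, 5.79), (150, 5.83),
]

first_loss = loss_log[0][1]
final_loss = loss_log[-1][1]

# Lowest logged loss and the step at which it occurred.
best_step, best_loss = min(loss_log, key=lambda pair: pair[1])

# Fraction of the initial loss removed by the end of training.
relative_drop = (first_loss - final_loss) / first_loss

# Assumption: treat everything from step 50 onward as the plateau.
plateau = [loss for step, loss in loss_log if step >= 50]
plateau_spread = max(plateau) - min(plateau)

print(f"best loss {best_loss:.2f} at step {best_step}")
print(f"relative drop {relative_drop:.1%}")
print(f"plateau spread after step 50: {plateau_spread:.2f}")
```
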
---

## Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile

# Clone Spark-TTS for BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```

### Inference

```python
import re
import sys

import soundfile as sf
import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Make the cloned Spark-TTS repo importable (see Prerequisites)
sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

device = "cuda"  # ROCm builds of PyTorch also expose the GPU as "cuda"

# Load the fine-tuned LLM
model_id = "Featherlabs/Finatts"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load BiCodec for audio detokenization
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", device)

# Generate speech
text = "Hey there, my name is Elise! Nice to meet you."
prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
generated = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=1.0,
)

# Decode only the newly generated tokens, keeping the special audio tokens
new_tokens = generated[0][inputs.input_ids.shape[1]:]
output_text = tokenizer.decode(new_tokens, skip_special_tokens=False)
semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]

# Convert to audio: semantic is (1, N_semantic), global is (1, 1, N_global)
pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to(device)
pred_global = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to(device)

# Detokenization can fail with an einx AxisSizeError (see Known Issues)
wav = audio_tokenizer.detokenize(pred_global, pred_semantic)
if torch.is_tensor(wav):  # accept either a tensor or a NumPy array
    wav = wav.float().cpu().numpy()
sf.write("output.wav", wav.squeeze(), 16000)
print("Saved output.wav")
```

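
The prompt string in the inference example is easy to mistype, so it can help to build it in one place. The helper below is a minimal sketch: `build_tts_prompt` is a hypothetical name, the control tokens are copied from the example above, and stripping whitespace and rejecting empty input are choices made here rather than requirements of the model.

```python
def build_tts_prompt(text: str) -> str:
    """Wrap text in the control tokens used by the inference example.

    Hypothetical helper: the token layout is copied from the example prompt,
    the whitespace stripping and empty-input check are added assumptions.
    """
    text = text.strip()
    if not text:
        raise ValueError("text must not be empty")
    return f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"


prompt = build_tts_prompt("Hey there, my name is Elise! Nice to meet you.")
print(prompt)
```
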
---

## Architecture

Spark-TTS separates speech into two token streams:

```
Text Input → [LLM Backbone] → Global Tokens (speaker identity)
                            → Semantic Tokens (content + prosody)
                                      ↓
                              [BiCodec Decoder] → Waveform
```

| Component | Details |
|:---|:---|
| **LLM** | Qwen2-0.5B (507M params), generates audio token sequences |
| **BiCodec** | Neural audio codec with global + semantic tokenization |
| **Wav2Vec2** | `wav2vec2-large-xlsr-53`, feature extraction for tokenization |
| **Sample rate** | 16 kHz |
| **Token types** | `bicodec_global_*` (speaker) + `bicodec_semantic_*` (content) |

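
To make the two token streams concrete, the sketch below splits a decoded string into global and semantic IDs using the same regular expressions as the inference example. The `split_token_streams` name and the sample string are illustrative assumptions; in particular, the `<|…|>` wrapping around each token is assumed here and the regexes do not depend on it.

```python
import re

# Same patterns as in the inference example.
GLOBAL_RE = re.compile(r"bicodec_global_(\d+)")
SEMANTIC_RE = re.compile(r"bicodec_semantic_(\d+)")


def split_token_streams(decoded: str) -> tuple[list[int], list[int]]:
    """Return (global_ids, semantic_ids) found in a decoded model output."""
    global_ids = [int(m) for m in GLOBAL_RE.findall(decoded)]
    semantic_ids = [int(m) for m in SEMANTIC_RE.findall(decoded)]
    return global_ids, semantic_ids


# Illustrative output only; real generations are far longer.
sample = (
    "<|bicodec_global_12|><|bicodec_global_7|>"
    "<|bicodec_semantic_301|><|bicodec_semantic_5|><|bicodec_semantic_88|>"
)
global_ids, semantic_ids = split_token_streams(sample)
print(global_ids, semantic_ids)  # [12, 7] [301, 5, 88]
```
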
---

## Model Files

This repository contains the fine-tuned LLM weights. Inference also needs the BiCodec model and the Spark-TTS code:

| File | Source |
|:---|:---|
| LLM weights | This repo (`Featherlabs/Finatts`) |
| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
| Wav2Vec2 features | Included in Spark-TTS-0.5B |
| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |

---

## Known Issues

- **Detokenization error**: An `AxisSizeError` in `einx` can occur during inference when the number of generated global tokens doesn't match the quantizer codebook dimensions that BiCodec expects (`q [c] d, b n q -> q b n d`). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format. A workaround is being investigated.
- **Single speaker**: Fine-tuned on a single voice (Elise); multi-speaker capabilities from the base model may be degraded.
- **English only**: Only tested with English text inputs.

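
Until the detokenization issue is resolved, one option is to check the parsed token lists before handing them to BiCodec, so a mismatch surfaces as a readable message instead of an `einx` error. The sketch below is only a guard and does not fix the underlying problem. `check_token_counts` is a hypothetical helper, and the expected number of global tokens is deliberately left as a parameter: the value 32 in the usage lines is an assumption that should be confirmed against the BiCodec configuration.

```python
def check_token_counts(global_ids, semantic_ids, expected_global):
    """Return a list of problems; an empty list means the counts look usable.

    Hypothetical guard: expected_global must come from the BiCodec config,
    it is not defined anywhere in this model card.
    """
    problems = []
    if len(global_ids) != expected_global:
        problems.append(
            f"expected {expected_global} global tokens, got {len(global_ids)}"
        )
    if not semantic_ids:
        problems.append("no semantic tokens were generated")
    return problems


# Usage with made-up IDs; 32 is an assumed value, not taken from this card.
problems = check_token_counts(list(range(30)), [5, 9, 2], expected_global=32)
for problem in problems:
    print("cannot detokenize:", problem)
```

If the list is non-empty, regenerating with a different sampling seed before calling `detokenize` is one possible fallback.
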
---

## Limitations

- **Single-speaker model**: optimized for the Elise voice character
- **16 kHz output**: not yet upsampled to 24 kHz/48 kHz
- **Requires the Spark-TTS codebase**: the BiCodec tokenizer is needed for both training and inference
- **ROCm-specific**: trained on AMD MI300X; CUDA users may need minor adjustments
- **Short training**: only 2 epochs / 150 steps; additional training may improve quality

---

## What's Next

- **Fix inference**: resolve the `einx` `AxisSizeError` in detokenization
- **Emotion tags**: add explicit emotion control (`[happy]`, `[sad]`, `[surprised]`)
- **Extended training**: more epochs with larger, more diverse datasets
- **Super-resolution**: upsample to 24 kHz/48 kHz for higher fidelity
- **Multi-speaker**: train on multiple voices for speaker-switchable TTS

---

## License

Apache 2.0, consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).

---

<div align="center">

**Built with ❤️ by [Featherlabs](https://huggingface.co/Featherlabs)**

*Operated by Owlkun*

</div>