---
language:
- en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
- MrDragonFox/Elise
tags:
- tts
- text-to-speech
- spark-tts
- voice-cloning
- emotion-tags
- unsloth
- trl
- sft
- featherlabs
- audio
- amd-mi300x
library_name: transformers
pipeline_tag: text-to-speech
---

<div align="center">

# 🚀 Finatts Enhanced

### *High-fidelity voice cloning – a fine-tuned Spark-TTS, v2*

**Text-to-Speech · Voice Cloning · Emotion Tags · Portable Voice Profile**

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Base model: Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)
[Dataset: Elise](https://huggingface.co/datasets/MrDragonFox/Elise)
[Model: Finatts-enhanced](https://huggingface.co/Featherlabs/Finatts-enhanced)

*Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun*

</div>

---

## ✨ What is Finatts Enhanced?

Finatts Enhanced is an improved **507M-parameter text-to-speech model** built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B), fine-tuned for **high-fidelity single-speaker voice cloning** with emotion tag support.

Compared to the original [Finatts](https://huggingface.co/Featherlabs/Finatts), this version was trained **3× as long**, uses a lower and more stable learning rate, and ships a portable voice profile (`elise_voice.safetensors`), so no reference audio is needed at inference time.

### Improvements over v1

| Setting | Finatts v1 | Finatts Enhanced |
|:---|:---:|:---:|
| Epochs | 2 | **6** |
| Learning rate | 1e-4 | **5e-5** |
| Warmup steps | 20 | **50** |
| Weight decay | 0.001 | **0.01** |
| Emotion tags | ❌ | **✅** |
| Voice profile | ❌ | **✅ `elise_voice.safetensors`** |
| Final loss | 5.827 | **5.806** |

### 🎯 Built For

| Capability | Description |
|:---:|---|
| 🎙️ **Voice Cloning** | Clone Elise's voice with no reference audio required |
| 🎭 **Emotion Tags** | `<laughs>` `<giggles>` `<whispers>` `<sighs>` `<chuckles>` `<long pause>` |
| 🔊 **Text-to-Speech** | Convert text to natural, expressive speech |
| 📦 **Portable Profile** | Load `elise_voice.safetensors` and deploy anywhere |

---

## 🏋️ Training Details

<table>
<tr><td><b>Property</b></td><td><b>Value</b></td></tr>
<tr><td>Base model</td><td><a href="https://huggingface.co/SparkAudio/Spark-TTS-0.5B">SparkAudio/Spark-TTS-0.5B</a></td></tr>
<tr><td>LLM backbone</td><td>Qwen2-0.5B (507M params)</td></tr>
<tr><td>Dataset</td><td><a href="https://huggingface.co/datasets/MrDragonFox/Elise">MrDragonFox/Elise</a> (1,195 samples, ~3 h)</td></tr>
<tr><td>Training type</td><td>Full supervised fine-tuning (SFT)</td></tr>
<tr><td>Epochs</td><td>6</td></tr>
<tr><td>Batch size</td><td>8 (effective 16 with gradient accumulation)</td></tr>
<tr><td>Learning rate</td><td>5e-5</td></tr>
<tr><td>Warmup steps</td><td>50</td></tr>
<tr><td>Weight decay</td><td>0.01</td></tr>
<tr><td>Context length</td><td>4,096 tokens</td></tr>
<tr><td>Precision</td><td>BF16</td></tr>
<tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr>
<tr><td>LR scheduler</td><td>Cosine</td></tr>
<tr><td>Framework</td><td>Unsloth + TRL (SFTTrainer)</td></tr>
<tr><td>Hardware</td><td>AMD MI300X (192 GB HBM3)</td></tr>
</table>

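These hyperparameters map directly onto a TRL `SFTConfig`. A minimal, illustrative sketch of how the run might have been configured (the values are simply copied from the table; dataset preparation, the Unsloth model wrapper, and the trainer call itself are omitted, and `output_dir` is a placeholder):

```python
from trl import SFTConfig

# Values from the Training Details table; argument names follow TRL's
# SFTConfig / transformers TrainingArguments.
config = SFTConfig(
    output_dir="finatts-enhanced",   # placeholder
    num_train_epochs=6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # 8 x 2 = effective batch size 16
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    bf16=True,
    max_length=4096,                 # context length
)
```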
### 📊 Training Metrics

| Metric | Value |
|:---|:---:|
| **Final loss** | 5.806 |
| **Training time** | 144 s (2.4 min) |
| **Peak VRAM** | 22.5 GB (11.7% of 192 GB) |
| **Trainable params** | 506,634,112 (100%) |
| **Total steps** | 450 |

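As a quick sanity check, the reported step count follows from the dataset size, effective batch size, and epoch count quoted in the tables:

```python
import math

# Figures from the tables: 1,195 samples, batch size 8 with
# gradient accumulation 2 (effective 16), 6 epochs.
num_samples = 1195
effective_batch = 8 * 2
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * 6

print(steps_per_epoch, total_steps)  # 75 450
```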
### Training Loss Curve

The model converges from **~6.9 → ~5.8** over 450 steps, three times as many optimization steps as v1:

| Step | Loss | Step | Loss | Step | Loss |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 6.90 | 150 | 5.79 | 300 | 5.74 |
| 50 | 5.82 | 200 | 5.76 | 400 | 5.77 |
| 100 | 5.77 | 250 | 5.73 | 450 | 5.81 |

---

## 🚀 Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile safetensors

# Clone Spark-TTS for the BiCodec tokenizer code
git clone https://github.com/SparkAudio/Spark-TTS
```

### Inference with the Elise Voice Profile

```python
import json
import re
import sys

import soundfile as sf
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from huggingface_hub import snapshot_download, hf_hub_download

sys.path.append("Spark-TTS")  # cloned in the Prerequisites step
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_ID = "Featherlabs/Finatts-enhanced"

# Load the fine-tuned LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load BiCodec
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Load the Elise voice profile (global token IDs, so no reference audio is needed)
profile_path = hf_hub_download(MODEL_ID, "elise_voice_profile.json")
with open(profile_path) as f:
    profile = json.load(f)
elise_global_ids = profile["global_token_ids"]
elise_global_token_str = profile["global_token_str"]


@torch.inference_mode()
def generate_speech(text, temperature=0.8, top_k=40, top_p=0.92):
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>",
        elise_global_token_str,  # Elise's voice is injected here
        "<|end_global_token|>",
        "<|start_semantic_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    generated = model.generate(
        **inputs, max_new_tokens=2048,
        do_sample=True, temperature=temperature,
        top_k=top_k, top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens, then extract the semantic token IDs
    out = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
    )[0]
    sem = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", out)]
    if not sem:
        return None
    pred_sem = torch.tensor(sem, dtype=torch.long).unsqueeze(0).to("cuda")
    pred_global = torch.tensor(elise_global_ids, dtype=torch.long).unsqueeze(0).to("cuda")
    audio_tokenizer.model.to("cuda")
    return audio_tokenizer.detokenize(pred_global, pred_sem).squeeze().cpu().numpy()


# Try emotion tags!
texts = [
    "Hey there! My name is Elise, nice to meet you.",
    "<laughs> Oh my gosh, I can't believe that actually worked!",
    "<whispers> Come closer... I have a secret to tell you.",
    "<sighs> Some days just feel heavier than others.",
]
for i, text in enumerate(texts):
    wav = generate_speech(text)
    if wav is not None:
        sf.write(f"output_{i+1}.wav", wav, 16000)
        print(f"✅ output_{i+1}.wav")
```

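The decode step in `generate_speech` recovers semantic token IDs from the generated text with a regular expression. A minimal, model-free illustration of that extraction (the example string is hypothetical output, assuming Spark-TTS's `<|bicodec_semantic_N|>` token spelling):

```python
import re

# Hypothetical generation output: three semantic tokens, then a stop token
# (token names are illustrative).
out = (
    "<|bicodec_semantic_12|><|bicodec_semantic_345|>"
    "<|bicodec_semantic_6|><|end_semantic_token|>"
)

# Same regex as in generate_speech above
sem = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", out)]
print(sem)  # [12, 345, 6]
```

These IDs are what `BiCodecTokenizer.detokenize` consumes, alongside the speaker's global token IDs.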
---

## 🎭 Emotion Tags

The Elise dataset includes inline emotion tags captured from real speech. Place them anywhere in your text:

| Tag | Effect |
|:---|:---|
| `<laughs>` | Lighter, brighter intonation |
| `<giggles>` | Playful uptick in pitch |
| `<whispers>` | Softer, breathier delivery |
| `<sighs>` | Drawn-out, melancholic tone |
| `<chuckles>` | Gentle amusement |
| `<long pause>` | Extended pause in speech |

**Note:** Tags produce **intonation variation** rather than literal acoustic sounds (e.g., actual giggling audio). For acoustic emotion effects, see [Orpheus-TTS](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).

---

## 🏗️ Architecture

```
 Text + Emotion Tags
          ↓
  [LLM: Qwen2-0.5B]
    ┌──────┴──────┐
Global tokens   Semantic tokens
(speaker ID)    (content + prosody)
    └──────┬──────┘
  [BiCodec Decoder]
          ↓
   Waveform, 16 kHz
```

| Component | Details |
|:---|:---|
| **LLM** | Qwen2-0.5B (507M params) |
| **BiCodec** | Neural audio codec for global + semantic tokenization |
| **Wav2Vec2** | `wav2vec2-large-xlsr-53` for feature extraction |
| **Sample rate** | 16 kHz |
| **Voice profile** | `elise_voice.safetensors`, a 1024-dim d-vector |

---

## 📦 Repository Files

| File | Description |
|:---|:---|
| `model.safetensors` | Fine-tuned LLM weights (966 MB, 16-bit merged) |
| `elise_voice.safetensors` | Elise speaker d-vector (1024-dim, averaged over 10 clips) |
| `elise_voice_profile.json` | Elise global token IDs, loaded in the Quick Start above |
| `tokenizer.json` | Tokenizer including the BiCodec special tokens |
| `config.json` | Model configuration |

For inference you also need:

| File | Source |
|:---|:---|
| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |

---

## ⚠️ Limitations

- **English only** – tested only with English text inputs
- **Single speaker** – optimized for Elise; the base model's multi-speaker ability may be degraded
- **16 kHz output** – use [audiosr](https://github.com/haoheliu/versatile_audio_super_resolution) to upsample to 44.1 kHz
- **Emotion intensity** – tags produce subtle intonation changes, not acoustic emotion sounds
- **ROCm-trained** – tested on AMD MI300X; CUDA users may need minor environment adjustments

---

## 🔮 What's Next

- 🔊 **Super-resolution** – integrate audiosr for 44.1 kHz HD output
- 🗣️ **Multi-speaker** – train on multiple voices
- 📚 **Larger dataset** – more hours of Elise audio for stronger emotion control
- 🎭 **Acoustic emotions** – explore Orpheus-style explicit emotion tokens

---

## 📄 License

Apache 2.0, consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).

---

<div align="center">

**Built with ❤️ by [Featherlabs](https://huggingface.co/Featherlabs)**

*Operated by Owlkun*

</div>