---
language:
- en
license: apache-2.0
base_model: SparkAudio/Spark-TTS-0.5B
datasets:
- MrDragonFox/Elise
tags:
- tts
- text-to-speech
- spark-tts
- voice-cloning
- emotion-tags
- unsloth
- trl
- sft
- featherlabs
- audio
- amd-mi300x
library_name: transformers
pipeline_tag: text-to-speech
---
# 🔊 Finatts Enhanced

### *High-fidelity voice cloning: a fine-tuned Spark-TTS v2*

**Text-to-Speech · Voice Cloning · Emotion Tags · Portable Voice Profile**

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) [![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise) [![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts-enhanced)

*Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun*
---

## ✨ What is Finatts Enhanced?

Finatts Enhanced is an improved **507M-parameter text-to-speech model** built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) and fine-tuned for **high-fidelity single-speaker voice cloning** with emotion tag support.

Compared to the original [Finatts](https://huggingface.co/Featherlabs/Finatts), this version features **3× the training**, a more stable learning rate, and a portable voice profile (`elise_voice.safetensors`), so no reference audio is needed at inference time.

### Improvements over v1

| Setting | Finatts v1 | Finatts Enhanced |
|:---|:---:|:---:|
| Epochs | 2 | **6** |
| Learning rate | 1e-4 | **5e-5** |
| Warmup steps | 20 | **50** |
| Weight decay | 0.001 | **0.01** |
| Emotion tags | ❌ | **✅** |
| Voice profile | ❌ | **✅ `elise_voice.safetensors`** |
| Final loss | 5.827 | **5.806** |

### 🎯 Built For

| Capability | Description |
|:---:|---|
| 🎙️ **Voice Cloning** | Clone Elise's voice with no reference audio required |
| 🎭 **Emotion Tags** | `<laughs>` `<giggles>` `<whispers>` `<sighs>` `<chuckles>` `<long pause>` |
| 📝 **Text-to-Speech** | Convert text to natural, expressive speech |
| 📦 **Portable Profile** | Load `elise_voice.safetensors` and deploy anywhere |

---

## 🏋️ Training Details
| Property | Value |
|:---|:---|
| Base model | SparkAudio/Spark-TTS-0.5B |
| LLM backbone | Qwen2-0.5B (507M params) |
| Dataset | MrDragonFox/Elise (1,195 samples, ~3h) |
| Training type | Full Supervised Fine-Tuning (SFT) |
| Epochs | 6 |
| Batch size | 8 (effective 16 with grad accum) |
| Learning rate | 5e-5 |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Context length | 4,096 tokens |
| Precision | BF16 |
| Optimizer | AdamW (torch fused) |
| LR scheduler | Cosine |
| Framework | Unsloth + TRL (SFTTrainer) |
| Hardware | AMD MI300X (192GB HBM3) |
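The reported total of 450 optimizer steps follows directly from this configuration. A small sketch of the arithmetic (the `total_training_steps` helper is hypothetical, written just for illustration):

```python
import math

# 1,195 samples at an effective batch size of 16 (batch 8 with 2x grad accum)
# give ceil(1195 / 16) = 75 optimizer steps per epoch; 6 epochs give 450 steps.
def total_training_steps(num_samples: int, effective_batch: int, epochs: int) -> int:
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs

print(total_training_steps(1195, 16, 6))  # 450
```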
### 📊 Training Metrics

| Metric | Value |
|:---|:---:|
| **Final loss** | 5.806 |
| **Training time** | 144s (2.4 min) |
| **Peak VRAM** | 22.5 GB (11.7% of 192GB) |
| **Trainable params** | 506,634,112 (100%) |
| **Total steps** | 450 |

### Training Loss Curve

Loss falls from **~6.9 to ~5.8** over 450 steps, 3× as many steps as v1:

| Step | Loss | Step | Loss | Step | Loss |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 6.90 | 150 | 5.79 | 300 | 5.74 |
| 50 | 5.82 | 200 | 5.76 | 400 | 5.77 |
| 100 | 5.77 | 250 | 5.73 | 450 | 5.81 |

---

## 🚀 Quick Start

### Prerequisites

```bash
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile safetensors

# Clone Spark-TTS for the BiCodec tokenizer
git clone https://github.com/SparkAudio/Spark-TTS
```

### Inference with Elise Voice Profile

```python
import json
import re
import sys

import soundfile as sf
import torch
from huggingface_hub import hf_hub_download, snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

sys.path.append("Spark-TTS")
from sparktts.models.audio_tokenizer import BiCodecTokenizer

MODEL_ID = "Featherlabs/Finatts-enhanced"

# Load the fine-tuned LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Load the BiCodec audio tokenizer
snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")

# Load the Elise voice profile (global token IDs - no reference audio needed)
profile_path = hf_hub_download(MODEL_ID, "elise_voice_profile.json")
with open(profile_path) as f:
    profile = json.load(f)
elise_global_ids = profile["global_token_ids"]
elise_global_token_str = profile["global_token_str"]


@torch.inference_mode()
def generate_speech(text, temperature=0.8, top_k=40, top_p=0.92):
    prompt = "".join([
        "<|task_tts|>",
        "<|start_content|>", text, "<|end_content|>",
        "<|start_global_token|>",
        elise_global_token_str,  # Elise's voice injected here
        "<|end_global_token|>",
        "<|start_semantic_token|>",
    ])
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    generated = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=tokenizer.eos_token_id,
    )
    out = tokenizer.batch_decode(
        generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
    )[0]
    sem = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", out)]
    if not sem:
        return None
    pred_sem = torch.tensor(sem, dtype=torch.long).unsqueeze(0).to("cuda")
    pred_global = torch.tensor(elise_global_ids, dtype=torch.long).unsqueeze(0).to("cuda")
    audio_tokenizer.model.to("cuda")
    return audio_tokenizer.detokenize(pred_global, pred_sem).squeeze().cpu().numpy()


# Try emotion tags!
texts = [
    "Hey there! My name is Elise, nice to meet you.",
    "<laughs> Oh my gosh, I can't believe that actually worked!",
    "<whispers> Come closer... I have a secret to tell you.",
    "<sighs> Some days just feel heavier than others.",
]
for i, text in enumerate(texts):
    wav = generate_speech(text)
    if wav is not None:
        sf.write(f"output_{i+1}.wav", wav, 16000)
        print(f"✅ output_{i+1}.wav")
```

---

## 🎭 Emotion Tags

The Elise dataset includes inline emotion tags captured from real speech. Place them anywhere in your text:

| Tag | Effect |
|:---|:---|
| `<laughs>` | Lighter, brighter intonation |
| `<giggles>` | Playful, uptick in pitch |
| `<whispers>` | Softer, breathier delivery |
| `<sighs>` | Drawn-out, melancholic tone |
| `<chuckles>` | Gentle amusement |
| `<long pause>` | Extended pause in speech |

**Note:** Tags produce **intonation variation** rather than literal acoustic sounds (e.g., actual giggling audio). For acoustic emotion effects, see [Orpheus-TTS](https://huggingface.co/canopylabs/orpheus-3b-0.1-ft).
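Generation is capped at `max_new_tokens=2048` semantic tokens, so very long inputs can be cut off mid-sentence. A minimal pre-processing sketch (the `chunk_text` helper is hypothetical, not part of this repo) that splits long text into sentence-sized chunks while leaving inline emotion tags attached to their sentence:

```python
import re

# Hypothetical helper: split long input into sentence-sized chunks so each
# generate_speech() call stays well under the semantic-token budget. Splits
# only after ., !, or ?, so inline tags like <whispers> stay with their text.
def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("<whispers> Come closer... I have a secret. <sighs> Some days are heavy.", max_chars=40))
```

Each chunk can then be synthesized separately and the resulting waveforms concatenated before writing a single file.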
--- ## ๐Ÿ—๏ธ Architecture ``` Text + Emotion Tags โ†“ [LLM: Qwen2-0.5B] โ”Œโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ” Global tokens Semantic tokens (speaker ID) (content + prosody) โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ [BiCodec Decoder] โ†“ Waveform 16kHz ``` | Component | Details | |:---|:---| | **LLM** | Qwen2-0.5B (507M params) | | **BiCodec** | Neural audio codec โ€” global + semantic tokenization | | **Wav2Vec2** | `wav2vec2-large-xlsr-53` โ€” feature extraction | | **Sample rate** | 16kHz | | **Voice profile** | `elise_voice.safetensors` โ€” 1024-dim d-vector | --- ## ๐Ÿ“ฆ Repository Files | File | Description | |:---|:---| | `model.safetensors` | Fine-tuned LLM weights (966MB, 16-bit merged) | | `elise_voice.safetensors` | Elise speaker d-vector (1024-dim, avg of 10 clips) | | `tokenizer.json` | Tokenizer including BiCodec special tokens | | `config.json` | Model configuration | For inference you also need: | File | Source | |:---|:---| | BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) | | Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) | --- ## โš ๏ธ Limitations - **English only** โ€” only tested with English text inputs - **Single speaker** โ€” optimized for Elise; base model multi-speaker may be degraded - **16kHz output** โ€” use [audiosr](https://github.com/haoheliu/versatile_audio_super_resolution) for upsampling to 44.1kHz - **Emotion intensity** โ€” tags produce subtle intonation changes, not acoustic emotion sounds - **ROCm-trained** โ€” tested on AMD MI300X; CUDA users may need minor env adjustments --- ## ๐Ÿ”ฎ What's Next - ๐Ÿ”Š **Super-resolution** โ€” integrate audiosr for 44.1kHz HD output - ๐Ÿ—ฃ๏ธ **Multi-speaker** โ€” train on multiple voices - ๐Ÿ“ˆ **Larger dataset** โ€” more hours of Elise audio for stronger emotion control - ๐ŸŽญ **Acoustic emotions** โ€” explore Orpheus-style explicit emotion tokens --- ## ๐Ÿ“œ License Apache 2.0 โ€” consistent 
with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B). ---
**Built with โค๏ธ by [Featherlabs](https://huggingface.co/Featherlabs)** *Operated by Owlkun*