@@ -1,200 +1,250 @@
 ---
-library_name: transformers
 tags:
-- unsloth
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+language:
+  - en
+license: apache-2.0
+base_model: SparkAudio/Spark-TTS-0.5B
+datasets:
+  - MrDragonFox/Elise
 tags:
+  - tts
+  - text-to-speech
+  - spark-tts
+  - voice-cloning
+  - unsloth
+  - trl
+  - sft
+  - featherlabs
+  - audio
+library_name: transformers
+pipeline_tag: text-to-speech
 ---
+<div align="center">
+# 🔊 Finatts
+### *High-fidelity voice cloning — fine-tuned Spark-TTS*
+**Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec**
+[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)
+[![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise)
+[![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts)
+*Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun*
+</div>
+---
+## ✨ What is Finatts?
+Finatts is a **507M-parameter text-to-speech model** fine-tuned for **high-fidelity single-speaker voice cloning**. It is built on [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) and trained on the [Elise](https://huggingface.co/datasets/MrDragonFox/Elise) dataset, a curated collection of 1,195 voice samples (~3 hours) with rich emotional range.
+Spark-TTS uses a novel **BiCodec** architecture that decomposes speech into:
+- **Global tokens** — speaker identity, timbre, and style
+- **Semantic tokens** — linguistic content and prosody
+This separation enables zero-shot voice cloning and controllable speech synthesis.
+### 🎯 Built For
+| Capability | Description |
+|:---:|---|
+| 🎙️ **Voice Cloning** | Clone a specific voice from reference audio samples |
+| 🎭 **Emotion Synthesis** | Generate speech with varied emotional tones |
+| 📝 **Text-to-Speech** | Convert text to natural, expressive speech |
+| 🔊 **High-Fidelity Audio** | 16kHz output with BiCodec tokenization |
+---
+## 🏋️ Training Details
+<table>
+<tr><td><b>Property</b></td><td><b>Value</b></td></tr>
+<tr><td>Base model</td><td><a href="https://huggingface.co/SparkAudio/Spark-TTS-0.5B">SparkAudio/Spark-TTS-0.5B</a></td></tr>
+<tr><td>LLM backbone</td><td>Qwen2-0.5B (507M params)</td></tr>
+<tr><td>Dataset</td><td><a href="https://huggingface.co/datasets/MrDragonFox/Elise">MrDragonFox/Elise</a> (1,195 samples, ~3h)</td></tr>
+<tr><td>Training type</td><td>Full Supervised Fine-Tuning (SFT)</td></tr>
+<tr><td>Epochs</td><td>2</td></tr>
+<tr><td>Batch size</td><td>8 (effective 16 with grad accum)</td></tr>
+<tr><td>Learning rate</td><td>1e-4</td></tr>
+<tr><td>Warmup steps</td><td>20</td></tr>
+<tr><td>Context length</td><td>4,096 tokens</td></tr>
+<tr><td>Precision</td><td>BF16</td></tr>
+<tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr>
+<tr><td>LR scheduler</td><td>Cosine</td></tr>
+<tr><td>Framework</td><td>Unsloth + TRL (SFTTrainer)</td></tr>
+<tr><td>Hardware</td><td>AMD MI300X (192GB HBM3)</td></tr>
+</table>
+### 📊 Training Metrics
+| Metric | Value |
+|:---|:---:|
+| **Final loss** | 5.827 |
+| **Training time** | 83s (1.4 min) |
+| **Peak VRAM** | 18.8 GB (9.8% of 192GB) |
+| **Trainable params** | 506,634,112 (100%) |
+| **Total steps** | 150 |
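The step count and VRAM share can be cross-checked from the figures in the tables above. This is a minimal sketch, assuming one optimizer step consumes one effective batch and that a trailing partial batch still counts as a step (a common trainer default, not confirmed by this card). The variable names are illustrative.

```python
import math

num_samples = 1195      # dataset size from the training table
effective_batch = 16    # batch size 8, effective 16 with gradient accumulation
epochs = 2

# Assumption: the last, partially filled batch of an epoch still yields a step.
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * epochs

# Peak VRAM as a share of the 192 GB available on the MI300X.
vram_percent = round(18.8 / 192 * 100, 1)

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"peak VRAM share: {vram_percent}%")
```

Under these assumptions the result matches the 150 total steps and 9.8% VRAM share reported above.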
+### Training Loss Curve
+The loss drops from **6.90** at step 1 to about **5.7** by step 50, then plateaus between roughly 5.6 and 5.8 through step 150:
+| Step | Loss | Step | Loss | Step | Loss |
+|:---:|:---:|:---:|:---:|:---:|:---:|
+| 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
+| 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
+| 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
+| 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
+| 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |
+---
+## 🚀 Quick Start
+### Prerequisites
+```bash
+pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
+pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
+pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile
+# Clone Spark-TTS for BiCodec tokenizer
+git clone https://github.com/SparkAudio/Spark-TTS
+```
+### Inference
+```python
+import torch
+import re
+import sys
+import numpy as np
+import soundfile as sf
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from huggingface_hub import snapshot_download
+sys.path.append("Spark-TTS")
+from sparktts.models.audio_tokenizer import BiCodecTokenizer
+# Load model
+model_id = "Featherlabs/Finatts"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,
+    device_map="auto"
+)
+# Load BiCodec for audio detokenization
+snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
+audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", "cuda")
+# Generate speech
+text = "Hey there, my name is Elise! Nice to meet you."
+prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
+inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
+generated = model.generate(
+    **inputs,
+    max_new_tokens=2048,
+    do_sample=True,
+    temperature=0.8,
+    top_k=50,
+    top_p=1.0,
+)
+# Decode tokens
+output_text = tokenizer.decode(generated[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
+semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
+global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]
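+# Convert to audio. The upstream Spark-TTS inference code appears to pass global
+# tokens as (batch, n_global) and lets detokenize() add the quantizer axis, so a
+# second unsqueeze here may be what triggers the einx shape error.
+pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to("cuda")
+pred_global = torch.tensor(global_ids).long().unsqueeze(0).to("cuda")
+wav = audio_tokenizer.detokenize(pred_global, pred_semantic)
+# detokenize() may already return a NumPy array, so handle both cases.
+if isinstance(wav, torch.Tensor):
+    wav = wav.detach().cpu().numpy()
+sf.write("output.wav", np.asarray(wav).squeeze(), 16000)
+print("✅ Saved output.wav")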
+```
+---
+## 🏗️ Architecture
+Spark-TTS represents speech as two token streams, both generated by the LLM and then decoded to a waveform by BiCodec:
+```
+Text Input → [LLM Backbone] → Global Tokens (speaker identity)
+                              → Semantic Tokens (content + prosody)
+                                    ↓
+                            [BiCodec Decoder] → Waveform
+```
+| Component | Details |
+|:---|:---|
+| **LLM** | Qwen2-0.5B (507M params) — generates audio token sequences |
+| **BiCodec** | Neural audio codec with global + semantic tokenization |
+| **Wav2Vec2** | `wav2vec2-large-xlsr-53` — feature extraction for tokenization |
+| **Sample rate** | 16kHz |
+| **Token types** | `bicodec_global_*` (speaker) + `bicodec_semantic_*` (content) |
+---
+## 📦 Model Files
+The repository contains the fine-tuned LLM weights. For inference, you also need:
+| File | Source |
+|:---|:---|
+| LLM weights | This repo (`Featherlabs/Finatts`) |
+| BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
+| Wav2Vec2 features | Included in Spark-TTS-0.5B |
+| Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |
+---
+## ⚠️ Known Issues
+- **Detokenization error** — An `AxisSizeError` in `einx` can occur during inference when the generated global token count doesn't match the expected quantizer codebook dimensions (`q [c] d, b n q -> q b n d`). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format. A workaround is being investigated.
+- **Single speaker** — Fine-tuned on a single voice (Elise); multi-speaker capabilities from the base model may be degraded.
+- **English only** — Only tested with English text inputs.
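The shape mismatch described above can be caught before detokenization by checking the generated token counts. This is a minimal sketch, not the project's fix: `parse_bicodec_tokens` is a hypothetical helper, the default of 32 global tokens is an assumption about the base model's BiCodec configuration, and the `<|...|>` wrapping in the sample string is assumed from the prompt format.

```python
import re


def parse_bicodec_tokens(output_text, expected_global=32):
    """Extract BiCodec token IDs from decoded LLM output and sanity-check them.

    expected_global=32 is an assumption; adjust it to match the codec
    configuration of the BiCodec checkpoint you actually load.
    """
    semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
    global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]
    if not semantic_ids:
        raise ValueError("no semantic tokens were generated")
    if len(global_ids) != expected_global:
        raise ValueError(
            f"expected {expected_global} global tokens, got {len(global_ids)}"
        )
    return global_ids, semantic_ids


# Synthetic output string standing in for tokenizer.decode(...) on real generations.
sample = "".join(f"<|bicodec_global_{i}|>" for i in range(32))
sample += "<|bicodec_semantic_7|><|bicodec_semantic_9|>"
g, s = parse_bicodec_tokens(sample)
print(len(g), s)
```

Raising early with a clear message makes it easier to tell whether a failure comes from the LLM producing too few global tokens or from the tensor shapes passed to BiCodec.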
+---
+## ⚠️ Limitations
+- **Single speaker model** — optimized for the Elise voice character
+- **16kHz output** — not yet upsampled to 24kHz/48kHz
+- **Requires Spark-TTS codebase** — BiCodec tokenizer is needed for both training and inference
+- **ROCm-specific** — trained on AMD MI300X; CUDA users may need minor adjustments
+- **Short training** — only 2 epochs / 150 steps; additional training may improve quality
+---
+## 🔮 What's Next
+- 🐛 **Fix inference** — resolve the `einx` AxisSizeError in detokenization
+- 🎭 **Emotion tags** — add explicit emotion control (`[happy]`, `[sad]`, `[surprised]`)
+- 📈 **Extended training** — more epochs with larger/diverse datasets
+- 🔊 **Super-resolution** — upsample to 24kHz/48kHz for higher fidelity
+- 🗣️ **Multi-speaker** — train on multiple voices for speaker-switchable TTS
+---
+## 📜 License
+Apache 2.0 — consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).
+---
+<div align="center">
+**Built with ❤️ by [Featherlabs](https://huggingface.co/Featherlabs)**
+*Operated by Owlkun*
+</div>