Featherlabs committed on
Commit fecaa9f · verified · 1 Parent(s): bd2293a

Update README.md

  1. README.md +203 -153
README.md CHANGED
@@ -1,200 +1,250 @@
  ---
- library_name: transformers
  tags:
- - unsloth
  ---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
-
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model

- Use the code below to get started with the model.

- [More Information Needed]

- ## Training Details

- ### Training Data

- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]

- ### Results

- [More Information Needed]

- #### Summary



- ## Model Examination [optional]

- <!-- Relevant interpretability work for the model goes here -->

- [More Information Needed]

- ## Environmental Impact

- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

- ## Technical Specifications [optional]

- ### Model Architecture and Objective

- [More Information Needed]

- ### Compute Infrastructure

- [More Information Needed]

- #### Hardware

- [More Information Needed]

- #### Software

- [More Information Needed]

- ## Citation [optional]

- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

- **BibTeX:**

- [More Information Needed]

- **APA:**

- [More Information Needed]

- ## Glossary [optional]

- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

- [More Information Needed]

- ## More Information [optional]

- [More Information Needed]

- ## Model Card Authors [optional]

- [More Information Needed]

- ## Model Card Contact

- [More Information Needed]
 
  ---
+ language:
+ - en
+ license: apache-2.0
+ base_model: SparkAudio/Spark-TTS-0.5B
+ datasets:
+ - MrDragonFox/Elise
  tags:
+ - tts
+ - text-to-speech
+ - spark-tts
+ - voice-cloning
+ - unsloth
+ - trl
+ - sft
+ - featherlabs
+ - audio
+ library_name: transformers
+ pipeline_tag: text-to-speech
  ---

+ <div align="center">

+ # 🔊 Finatts

+ ### *High-fidelity voice cloning: fine-tuned Spark-TTS*

+ **Text-to-Speech · Voice Cloning · Emotion Synthesis · BiCodec**

+ [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+ [![Base Model](https://img.shields.io/badge/Base-Spark--TTS--0.5B-purple)](https://huggingface.co/SparkAudio/Spark-TTS-0.5B)
+ [![Dataset](https://img.shields.io/badge/Dataset-MrDragonFox%2FElise-green)](https://huggingface.co/datasets/MrDragonFox/Elise)
+ [![Parameters](https://img.shields.io/badge/Params-507M-orange)](https://huggingface.co/Featherlabs/Finatts)

+ *Built by [Featherlabs](https://huggingface.co/Featherlabs) · Operated by Owlkun*

+ </div>

+ ---

+ ## ✨ What is Finatts?

+ Finatts is a **507M-parameter text-to-speech model** fine-tuned for **high-fidelity single-speaker voice cloning**. It is built on top of [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B) and trained on the [Elise](https://huggingface.co/datasets/MrDragonFox/Elise) dataset, a curated collection of ~1,200 voice samples (~3 hours) with rich emotional range.

+ Spark-TTS uses a novel **BiCodec** architecture that decomposes speech into:
+ - **Global tokens**: speaker identity, timbre, and style
+ - **Semantic tokens**: linguistic content and prosody

+ This separation enables zero-shot voice cloning and controllable speech synthesis.

+ ### 🎯 Built For

+ | Capability | Description |
+ |:---:|---|
+ | 🎙️ **Voice Cloning** | Clone a specific voice from reference audio samples |
+ | 🎭 **Emotion Synthesis** | Generate speech with varied emotional tones |
+ | 📝 **Text-to-Speech** | Convert text to natural, expressive speech |
+ | 🔊 **High-Fidelity Audio** | 16kHz output with BiCodec tokenization |

+ ---

+ ## 🏋️ Training Details
+
+ <table>
+ <tr><td><b>Property</b></td><td><b>Value</b></td></tr>
+ <tr><td>Base model</td><td><a href="https://huggingface.co/SparkAudio/Spark-TTS-0.5B">SparkAudio/Spark-TTS-0.5B</a></td></tr>
+ <tr><td>LLM backbone</td><td>Qwen2-0.5B (507M params)</td></tr>
+ <tr><td>Dataset</td><td><a href="https://huggingface.co/datasets/MrDragonFox/Elise">MrDragonFox/Elise</a> (1,195 samples, ~3h)</td></tr>
+ <tr><td>Training type</td><td>Full Supervised Fine-Tuning (SFT)</td></tr>
+ <tr><td>Epochs</td><td>2</td></tr>
+ <tr><td>Batch size</td><td>8 (effective 16 with grad accum)</td></tr>
+ <tr><td>Learning rate</td><td>1e-4</td></tr>
+ <tr><td>Warmup steps</td><td>20</td></tr>
+ <tr><td>Context length</td><td>4,096 tokens</td></tr>
+ <tr><td>Precision</td><td>BF16</td></tr>
+ <tr><td>Optimizer</td><td>AdamW (torch fused)</td></tr>
+ <tr><td>LR scheduler</td><td>Cosine</td></tr>
+ <tr><td>Framework</td><td>Unsloth + TRL (SFTTrainer)</td></tr>
+ <tr><td>Hardware</td><td>AMD MI300X (192GB HBM3)</td></tr>
+ </table>
+
+ ### 📊 Training Metrics
+
+ | Metric | Value |
+ |:---|:---:|
+ | **Final loss** | 5.827 |
+ | **Training time** | 83s (1.4 min) |
+ | **Peak VRAM** | 18.8 GB (9.8% of 192GB) |
+ | **Trainable params** | 506,634,112 (100%) |
+ | **Total steps** | 150 |
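
The 150 total steps can be cross-checked against the dataset size and batch settings in the tables above. The following is a minimal sketch, assuming a single device, a gradient accumulation factor of 2 (inferred from "8, effective 16"; the source does not state it), and that a partial final batch still counts as one step. The helper name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_samples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps when each epoch's partial final batch still counts as a step."""
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs


per_device_batch = 8   # from the training table
grad_accum = 2         # assumption: inferred from "effective 16"
effective_batch = per_device_batch * grad_accum

steps = total_optimizer_steps(num_samples=1195, effective_batch=effective_batch, epochs=2)
print(steps)  # 150
```

Under these assumptions, 1,195 samples at an effective batch of 16 give 75 steps per epoch, which matches the reported 150 steps for 2 epochs.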
+
+ ### Training Loss Curve
+
+ The loss drops from 6.90 at step 1 to about 5.7 by step 50, then plateaus between 5.62 and 5.83 through step 150:
+
+ | Step | Loss | Step | Loss | Step | Loss |
+ |:---:|:---:|:---:|:---:|:---:|:---:|
+ | 1 | 6.90 | 50 | 5.70 | 100 | 5.72 |
+ | 10 | 6.85 | 60 | 5.62 | 110 | 5.77 |
+ | 20 | 6.34 | 70 | 5.76 | 120 | 5.72 |
+ | 30 | 5.90 | 80 | 5.71 | 130 | 5.79 |
+ | 40 | 5.92 | 90 | 5.79 | 150 | 5.83 |
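
One way to read the logged values is to find the best step and the spread once the initial drop is over. This is an illustrative sketch that uses only the numbers from the table above. The step-50 cutoff for the "plateau" is an arbitrary choice made here, not something the source defines.

```python
# (step, loss) pairs copied from the loss table
logged = [
    (1, 6.90), (10, 6.85), (20, 6.34), (30, 5.90), (40, 5.92),
    (50, 5.70), (60, 5.62), (70, 5.76), (80, 5.71), (90, 5.79),
    (100, 5.72), (110, 5.77), (120, 5.72), (130, 5.79), (150, 5.83),
]

# Step with the lowest logged loss
best_step, best_loss = min(logged, key=lambda pair: pair[1])

# Last logged point
final_step, final_loss = logged[-1]

# Losses from step 50 onward (cutoff chosen for illustration only)
plateau = [loss for step, loss in logged if step >= 50]
plateau_spread = round(max(plateau) - min(plateau), 2)

print(f"best: {best_loss} at step {best_step}")
print(f"final: {final_loss} at step {final_step}")
print(f"spread after step 50: {plateau_spread}")
```

The lowest logged loss is not the final one, so saving intermediate checkpoints and comparing audio quality by ear may be more informative than the final loss alone.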
 
+ ---

+ ## 🚀 Quick Start
+
+ ### Prerequisites
+
+ ```bash
+ pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
+ pip install "transformers<=5.2.0,>=4.51.3" "trl<=0.24.0,>=0.18.2"
+ pip install omegaconf einx "datasets>=3.4.1,<4.4.0" soundfile
+
+ # Clone Spark-TTS for BiCodec tokenizer
+ git clone https://github.com/SparkAudio/Spark-TTS
+ ```
+
+ ### Inference
+
+ ```python
+ import re
+ import sys
+
+ import numpy as np
+ import soundfile as sf
+ import torch
+ from huggingface_hub import snapshot_download
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Make the cloned Spark-TTS repo importable (see Prerequisites)
+ sys.path.append("Spark-TTS")
+ from sparktts.models.audio_tokenizer import BiCodecTokenizer
+
+ # PyTorch ROCm builds also expose the GPU under the "cuda" device name
+ device = "cuda"
+
+ # Load model
+ model_id = "Featherlabs/Finatts"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ )
+
+ # Load BiCodec for audio detokenization
+ snapshot_download("unsloth/Spark-TTS-0.5B", local_dir="Spark-TTS-0.5B")
+ audio_tokenizer = BiCodecTokenizer("Spark-TTS-0.5B", device)
+
+ # Generate speech
+ text = "Hey there, my name is Elise! Nice to meet you."
+ prompt = f"<|task_tts|><|start_content|>{text}<|end_content|><|start_global_token|>"
+
+ inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
+ generated = model.generate(
+     **inputs,
+     max_new_tokens=2048,
+     do_sample=True,
+     temperature=0.8,
+     top_k=50,
+     top_p=1.0,
+ )
+
+ # Decode only the newly generated tokens, keeping the special audio tokens
+ new_tokens = generated[0][inputs.input_ids.shape[1]:]
+ output_text = tokenizer.decode(new_tokens, skip_special_tokens=False)
+ semantic_ids = [int(t) for t in re.findall(r"bicodec_semantic_(\d+)", output_text)]
+ global_ids = [int(t) for t in re.findall(r"bicodec_global_(\d+)", output_text)]
+
+ # Convert to audio
+ pred_semantic = torch.tensor(semantic_ids).long().unsqueeze(0).to(device)
+ pred_global = torch.tensor(global_ids).long().unsqueeze(0).unsqueeze(0).to(device)
+
+ wav = audio_tokenizer.detokenize(pred_global, pred_semantic)
+ # Assumption: detokenize may return a tensor or a NumPy array depending on
+ # the Spark-TTS version, so both cases are handled before writing the file
+ if isinstance(wav, torch.Tensor):
+     wav = wav.detach().float().cpu().numpy()
+ sf.write("output.wav", np.asarray(wav).squeeze(), 16000)
+ print("✅ Saved output.wav")
+ ```

+ ---

+ ## 🏗️ Architecture

+ Spark-TTS uses a unique approach that separates speech into two token streams:

+ ```
+ Text Input → [LLM Backbone] → Global Tokens (speaker identity)
+                             → Semantic Tokens (content + prosody)
+                                         ↓
+                               [BiCodec Decoder] → Waveform
+ ```

+ | Component | Details |
+ |:---|:---|
+ | **LLM** | Qwen2-0.5B (507M params), generates audio token sequences |
+ | **BiCodec** | Neural audio codec with global + semantic tokenization |
+ | **Wav2Vec2** | `wav2vec2-large-xlsr-53`, feature extraction for tokenization |
+ | **Sample rate** | 16kHz |
+ | **Token types** | `bicodec_global_*` (speaker) + `bicodec_semantic_*` (content) |

+ ---

+ ## 📦 Model Files

+ This repository contains only the fine-tuned LLM weights. Inference needs the following components:

+ | File | Source |
+ |:---|:---|
+ | LLM weights | This repo (`Featherlabs/Finatts`) |
+ | BiCodec model | [`unsloth/Spark-TTS-0.5B`](https://huggingface.co/unsloth/Spark-TTS-0.5B) |
+ | Wav2Vec2 features | Included in Spark-TTS-0.5B |
+ | Spark-TTS code | [SparkAudio/Spark-TTS](https://github.com/SparkAudio/Spark-TTS) |

+ ---

+ ## ⚠️ Known Issues

+ - **Detokenization error**: An `AxisSizeError` in `einx` can occur during inference when the generated global token count doesn't match the expected quantizer codebook dimensions (`q [c] d, b n q -> q b n d`). This is a shape mismatch between the model's generated tokens and BiCodec's expected input format. A workaround is being investigated.
+ - **Single speaker**: Fine-tuned on a single voice (Elise); multi-speaker capabilities from the base model may be degraded.
+ - **English only**: Only tested with English text inputs.
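
Since the detokenization error is described as a token-count mismatch, one possible mitigation is to validate the parsed tokens before they reach BiCodec and fail with a readable message instead of an `einx` shape error. This is a sketch only, not a confirmed fix. The helper `parse_bicodec_tokens` is hypothetical, and the default of 32 global tokens is an assumption that should be checked against the BiCodec configuration in use.

```python
import re

SEMANTIC_RE = re.compile(r"bicodec_semantic_(\d+)")
GLOBAL_RE = re.compile(r"bicodec_global_(\d+)")


def parse_bicodec_tokens(output_text: str, expected_global: int = 32):
    """Extract global and semantic token ids and fail early on bad counts.

    expected_global=32 is an assumption; confirm it against the BiCodec config.
    """
    semantic_ids = [int(t) for t in SEMANTIC_RE.findall(output_text)]
    global_ids = [int(t) for t in GLOBAL_RE.findall(output_text)]
    if not semantic_ids:
        raise ValueError("no semantic tokens were generated")
    if len(global_ids) != expected_global:
        raise ValueError(
            f"expected {expected_global} global tokens, got {len(global_ids)}"
        )
    return global_ids, semantic_ids


# Synthetic decoder output, built only to exercise the helper
sample = "".join(f"<|bicodec_global_{i}|>" for i in range(32))
sample += "".join(f"<|bicodec_semantic_{i}|>" for i in (7, 8, 9))

global_ids, semantic_ids = parse_bicodec_tokens(sample)
print(len(global_ids), semantic_ids)
```

When the check fails, regenerating with a different sampling seed is one option; whether that resolves the underlying issue is not established by the source.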
 
+ ---

+ ## ⚠️ Limitations

+ - **Single speaker model**: optimized for the Elise voice character
+ - **16kHz output**: not yet upsampled to 24kHz/48kHz
+ - **Requires Spark-TTS codebase**: BiCodec tokenizer is needed for both training and inference
+ - **ROCm-specific**: trained on AMD MI300X; CUDA users may need minor adjustments
+ - **Short training**: only 2 epochs / 150 steps; additional training may improve quality

+ ---

+ ## 🔮 What's Next

+ - 🐛 **Fix inference**: resolve the `einx` AxisSizeError in detokenization
+ - 🎭 **Emotion tags**: add explicit emotion control (`[happy]`, `[sad]`, `[surprised]`)
+ - 📈 **Extended training**: more epochs with larger/diverse datasets
+ - 🔊 **Super-resolution**: upsample to 24kHz/48kHz for higher fidelity
+ - 🗣️ **Multi-speaker**: train on multiple voices for speaker-switchable TTS

+ ---

+ ## 📜 License

+ Apache 2.0, consistent with [Spark-TTS-0.5B](https://huggingface.co/SparkAudio/Spark-TTS-0.5B).

+ ---

+ <div align="center">

+ **Built with ❤️ by [Featherlabs](https://huggingface.co/Featherlabs)**

+ *Operated by Owlkun*

+ </div>