---
language:
  - en
tags:
  - tts
  - text-to-speech
  - safetensors
  - cake
license: apache-2.0
base_model: YatharthS/LuxTTS
---

# LuxTTS (Safetensors / FP16)

This is a converted version of [YatharthS/LuxTTS](https://huggingface.co/YatharthS/LuxTTS), a flow-matching based text-to-speech model. All credit for the original model, training, and research goes to the original authors.

## What changed

The original PyTorch checkpoints (`model.pt` and `vocoder/vocos.bin`) have been converted to **safetensors** format in **float16** precision for use with [Cake](https://github.com/evilsocket/cake). The conversion applies the following transformations:

- **Format**: `.pt` / `.bin` → `.safetensors` (safer, faster loading, memory-mappable).
- **Precision**: FP32 → FP16, reducing total size from ~530 MB to ~266 MB.
- **Key remapping**: The nested `fm_decoder.encoders.{stack}.layers.{layer}` hierarchy is flattened to `fm_decoder.layers.{flat_index}` using the stack sizes `[2, 2, 4, 4, 4]` (16 layers total). Similarly, `text_encoder.encoders.0.layers` is flattened to `text_encoder.layers`. Per-stack components (`time_emb`, `downsample`, `out_combiner`) are reorganized under `fm_decoder.stack_time_emb`, `fm_decoder.downsample`, and `fm_decoder.out_combiner` respectively.
- **Config**: An `architectures` field and the feature extraction parameters (`n_fft`, `hop_length`, `n_mels`, `sample_rate`) are added to `config.json`.
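
The layer flattening described under **Key remapping** can be sketched roughly as follows. This is an illustrative reconstruction based only on the key patterns and stack sizes listed above, not the actual conversion script. The function name `remap_key` and the parameter suffixes in the usage lines (such as `self_attn.weight`) are hypothetical, and the per-stack components (`time_emb`, `downsample`, `out_combiner`) are left untouched here because their exact source key names are not given.

```python
import re
from itertools import accumulate

# Stack sizes stated in this README: 5 stacks, 16 layers in total.
STACK_SIZES = [2, 2, 4, 4, 4]
# Flat index of the first layer in each stack: [0, 2, 4, 8, 12].
STACK_OFFSETS = [0, *accumulate(STACK_SIZES)][:-1]

# Pattern for the nested decoder layer keys described above.
FM_LAYER = re.compile(r"^fm_decoder\.encoders\.(\d+)\.layers\.(\d+)\.(.+)$")
TEXT_PREFIX = "text_encoder.encoders.0.layers."


def remap_key(key: str) -> str:
    """Map a nested checkpoint key to its flattened name (hypothetical helper)."""
    match = FM_LAYER.match(key)
    if match:
        stack, layer, rest = int(match.group(1)), int(match.group(2)), match.group(3)
        if stack >= len(STACK_SIZES) or layer >= STACK_SIZES[stack]:
            raise ValueError(f"unexpected stack/layer index in key: {key}")
        # Flat index = layers in all earlier stacks + position within this stack.
        return f"fm_decoder.layers.{STACK_OFFSETS[stack] + layer}.{rest}"
    if key.startswith(TEXT_PREFIX):
        # The text encoder has a single stack, so only the prefix changes.
        return "text_encoder.layers." + key[len(TEXT_PREFIX):]
    # Anything else is passed through unchanged in this sketch.
    return key


# Usage with made-up parameter suffixes:
print(remap_key("fm_decoder.encoders.2.layers.3.self_attn.weight"))
print(remap_key("text_encoder.encoders.0.layers.1.norm.bias"))
```

With stack sizes `[2, 2, 4, 4, 4]`, the third stack (index 2) starts at flat index 4, so its last layer (index 3) lands at `fm_decoder.layers.7`.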

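No weights were retrained or fine-tuned. The conversion changes only the file format, key names, and numeric precision; the FP32 → FP16 cast is the only step that alters weight values, by rounding each one to the nearest half-precision number.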
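
To give a feel for the size of the rounding introduced by the FP16 cast, here is a minimal standard-library sketch. It is not part of the conversion pipeline; the helper `to_fp16` is a hypothetical name, and the sample values are arbitrary rather than real model weights. Python floats are 64-bit, so this only approximates an FP32 → FP16 cast, but the resulting half-precision values follow the same IEEE 754 format.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (hypothetical helper)."""
    # The "e" format code packs a float as a 2-byte half-precision number.
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Arbitrary example values, not taken from the model.
for w in (0.5, 0.1, 0.001, 3.14159):
    h = to_fp16(w)
    rel_err = abs(h - w) / abs(w)
    print(f"{w!r:>10} -> {h!r:<22} relative error {rel_err:.2e}")
```

Values that are exact in half precision (such as `0.5`) survive unchanged. Others are rounded, with a relative error of at most 2^-11 (about 0.05%) for magnitudes in the normal FP16 range.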

## Model details

| Component | File | Size |
|---|---|---|
| Main model (flow-matching decoder + text encoder) | `model.safetensors` | 235 MB |
| Vocoder (Vocos) | `vocos.safetensors` | 31 MB |

- **Architecture**: Flow-matching TTS with conformer-based decoder (16 layers across 5 stacks) and 4-layer text encoder
- **Vocoder**: Vocos (iSTFT-based, 8 layers, 512 dim)
- **Sample rate**: 24 kHz (with 48 kHz upsampler head)
- **Vocabulary**: 360 tokens (characters + punctuation)

## Original project

- **Model**: [YatharthS/LuxTTS](https://huggingface.co/YatharthS/LuxTTS)
- **License**: Apache 2.0