---
language:
- en
tags:
- tts
- text-to-speech
- safetensors
- cake
license: apache-2.0
base_model: YatharthS/LuxTTS
---
# LuxTTS (Safetensors / FP16)
This is a converted version of [YatharthS/LuxTTS](https://huggingface.co/YatharthS/LuxTTS), a flow-matching based text-to-speech model. All credit for the original model, training, and research goes to the original authors.
## What changed
The original PyTorch checkpoints (`model.pt` and `vocoder/vocos.bin`) have been converted to **safetensors** format in **float16** precision for use with [Cake](https://github.com/evilsocket/cake). The conversion applies the following transformations:
- **Format**: `.pt` / `.bin` → `.safetensors` (safer, faster loading, memory-mappable).
- **Precision**: FP32 → FP16, reducing total size from ~530 MB to ~266 MB.
- **Key remapping**: The nested `fm_decoder.encoders.{stack}.layers.{layer}` hierarchy is flattened to `fm_decoder.layers.{flat_index}` using the stack sizes `[2, 2, 4, 4, 4]` (16 layers total). Similarly, `text_encoder.encoders.0.layers` is flattened to `text_encoder.layers`. Per-stack components (`time_emb`, `downsample`, `out_combiner`) are reorganized under `fm_decoder.stack_time_emb`, `fm_decoder.downsample`, and `fm_decoder.out_combiner` respectively.
- **Config**: An `architectures` field and the feature-extraction parameters (`n_fft`, `hop_length`, `n_mels`, `sample_rate`) are added to `config.json`.
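
The layer flattening described in the key-remapping step can be sketched roughly as follows. This is an illustration, not the actual conversion script: it assumes the flat index is the stack's starting offset plus the layer's position within the stack (stacks concatenated in order), and the function name `flatten_key` and the parameter suffixes used in the usage note are hypothetical. The per-stack components (`time_emb`, `downsample`, `out_combiner`) are left out because their original key layout is not spelled out here.

```python
import re

# Stack sizes quoted above: 2 + 2 + 4 + 4 + 4 = 16 layers in total.
STACK_SIZES = [2, 2, 4, 4, 4]

# Flat index of the first layer of each stack: [0, 2, 4, 8, 12].
# Assumption: stacks are concatenated in order when flattening.
STACK_OFFSETS = [sum(STACK_SIZES[:i]) for i in range(len(STACK_SIZES))]

_FM_LAYER = re.compile(r"^fm_decoder\.encoders\.(\d+)\.layers\.(\d+)\.(.+)$")
_TEXT_LAYER = re.compile(r"^text_encoder\.encoders\.0\.layers\.(.+)$")


def flatten_key(key: str) -> str:
    """Map a nested checkpoint key to its flattened form (hypothetical helper)."""
    m = _FM_LAYER.match(key)
    if m:
        stack, layer, rest = int(m.group(1)), int(m.group(2)), m.group(3)
        if stack >= len(STACK_SIZES) or layer >= STACK_SIZES[stack]:
            raise ValueError(f"layer index out of range in key: {key}")
        return f"fm_decoder.layers.{STACK_OFFSETS[stack] + layer}.{rest}"

    m = _TEXT_LAYER.match(key)
    if m:
        # The text encoder has a single stack, so only the prefix changes.
        return f"text_encoder.layers.{m.group(1)}"

    # Keys that match neither pattern are passed through unchanged.
    return key
```

Under these assumptions, `flatten_key("fm_decoder.encoders.2.layers.3.norm.weight")` would return `"fm_decoder.layers.7.norm.weight"`, since the third stack starts at flat index 4.
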
No weights were retrained or fine-tuned. Apart from the rounding introduced by the FP32 → FP16 cast, the conversion leaves the weights unchanged.
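
To give a feel for the size of the FP16 rounding, here is a small standard-library sketch. It does not touch the checkpoint; the helper names `to_fp16` and `fp16_relative_error` are made up for illustration, and it only applies to values inside FP16's normal range.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    Raises OverflowError if x is too large to represent in FP16.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_relative_error(x: float) -> float:
    """Relative error introduced by storing a non-zero x in FP16."""
    return abs(to_fp16(x) - x) / abs(x)


# FP32 uses 4 bytes per value and FP16 uses 2, which is why the
# converted files are roughly half the size of the originals.
BYTES_FP32 = struct.calcsize("<f")
BYTES_FP16 = struct.calcsize("<e")

# FP16 keeps an 11-bit significand (10 stored bits plus an implicit one),
# so the relative rounding error of a normal-range value is at most 2**-11.
MAX_RELATIVE_ERROR = 2.0 ** -11
```

For example, `fp16_relative_error(0.1)` stays below `MAX_RELATIVE_ERROR`, while values that are exact in FP16, such as `0.5`, round-trip unchanged.
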
## Model details
| Component | File | Size |
|---|---|---|
| Main model (flow-matching decoder + text encoder) | `model.safetensors` | 235 MB |
| Vocoder (Vocos) | `vocos.safetensors` | 31 MB |
- **Architecture**: Flow-matching TTS with conformer-based decoder (16 layers across 5 stacks) and 4-layer text encoder
- **Vocoder**: Vocos (iSTFT-based, 8 layers, 512 dim)
- **Sample rate**: 24 kHz (with 48 kHz upsampler head)
- **Vocabulary**: 360 tokens (characters + punctuation)
## Original project
- **Model**: [YatharthS/LuxTTS](https://huggingface.co/YatharthS/LuxTTS)
- **License**: Apache 2.0