---
language:
- en
tags:
- tts
- text-to-speech
- safetensors
- cake
license: apache-2.0
base_model: YatharthS/LuxTTS
---

# LuxTTS (Safetensors / FP16)

This is a converted version of [YatharthS/LuxTTS](https://huggingface.co/YatharthS/LuxTTS), a flow-matching based text-to-speech model. All credit for the original model, training, and research goes to the original authors.

## What changed

The original PyTorch checkpoints (`model.pt` and `vocoder/vocos.bin`) have been converted to **safetensors** format in **float16** precision for use with [Cake](https://github.com/evilsocket/cake). The conversion applies the following transformations:

- **Format**: `.pt` / `.bin` → `.safetensors` (safer, faster to load, memory-mappable).
- **Precision**: FP32 → FP16, reducing total size from ~530 MB to ~266 MB.
- **Key remapping**: The nested `fm_decoder.encoders.{stack}.layers.{layer}` hierarchy is flattened to `fm_decoder.layers.{flat_index}` using the stack sizes `[2, 2, 4, 4, 4]` (16 layers total). Similarly, `text_encoder.encoders.0.layers` is flattened to `text_encoder.layers`. The per-stack components `time_emb`, `downsample`, and `out_combiner` are moved to `fm_decoder.stack_time_emb`, `fm_decoder.downsample`, and `fm_decoder.out_combiner` respectively.
- **Config**: An `architectures` field and the feature extraction parameters (`n_fft`, `hop_length`, `n_mels`, `sample_rate`) are added to `config.json`.
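
The layer flattening described under **Key remapping** can be sketched in a few lines of Python. This is an illustration, not the actual conversion script: it assumes the flat index counts layers in stack order (stack 0 first), and the helper name `remap_key` and the parameter suffixes in the usage note are hypothetical. The per-stack components (`time_emb`, `downsample`, `out_combiner`) are deliberately left out because their exact target key layout is not spelled out here.

```python
import re

# Stack sizes taken from the description above; they sum to 16 layers.
STACK_SIZES = [2, 2, 4, 4, 4]

# Flat index of each stack's first layer: [0, 2, 4, 8, 12].
# Assumption: layers are numbered in stack order.
STACK_OFFSETS = [sum(STACK_SIZES[:i]) for i in range(len(STACK_SIZES))]

_FM_LAYER = re.compile(r"^fm_decoder\.encoders\.(\d+)\.layers\.(\d+)\.(.+)$")
_TEXT_LAYER = re.compile(r"^text_encoder\.encoders\.0\.layers\.(.+)$")


def remap_key(key: str) -> str:
    """Map a nested layer key to its flattened name (illustrative only)."""
    m = _FM_LAYER.match(key)
    if m:
        stack, layer, rest = int(m.group(1)), int(m.group(2)), m.group(3)
        if stack >= len(STACK_SIZES) or layer >= STACK_SIZES[stack]:
            raise ValueError(f"layer index out of range in {key!r}")
        return f"fm_decoder.layers.{STACK_OFFSETS[stack] + layer}.{rest}"
    m = _TEXT_LAYER.match(key)
    if m:
        return f"text_encoder.layers.{m.group(1)}"
    # Per-stack components and all other keys are not handled in this sketch.
    return key
```

Under these assumptions, `remap_key("fm_decoder.encoders.2.layers.3.norm.bias")` yields `fm_decoder.layers.7.norm.bias`, because stacks 0 and 1 together contribute the first four flat indices.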

No weights were retrained or fine-tuned. Apart from the rounding introduced by the FP32 → FP16 cast, this is a lossless format conversion.
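
To give a feel for the size of that rounding, the sketch below round-trips a few values through IEEE 754 half precision using only the Python standard library. The sample values are arbitrary and are not taken from the model; the conversion itself presumably used a tensor library rather than `struct`, so treat this purely as an illustration of FP16 rounding.

```python
import struct


def to_fp16_and_back(x: float) -> float:
    """Round x to IEEE 754 half precision and return it as a Python float."""
    # The "e" format character packs a 16-bit binary16 float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Arbitrary sample values, chosen only to show the rounding behaviour.
for value in (0.5, 0.1, 1e-3, 1234.5678):
    rounded = to_fp16_and_back(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value!r:>12} -> {rounded!r:<24} relative error {rel_err:.2e}")
```

For values in the normal FP16 range the relative error stays at or below 2^-11 (about 4.9e-4). Magnitudes above 65504 cannot be represented as finite FP16 numbers at all.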

## Model details

| Component | File | Size |
|---|---|---|
| Main model (flow-matching decoder + text encoder) | `model.safetensors` | 235 MB |
| Vocoder (Vocos) | `vocos.safetensors` | 31 MB |

- **Architecture**: Flow-matching TTS with conformer-based decoder (16 layers across 5 stacks) and 4-layer text encoder
- **Vocoder**: Vocos (iSTFT-based, 8 layers, 512 dim)
- **Sample rate**: 24 kHz (with 48 kHz upsampler head)
- **Vocabulary**: 360 tokens (characters + punctuation)
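
One way to check the files listed above without loading any weights is to read the safetensors header directly. A safetensors file begins with an 8-byte little-endian length followed by a JSON header that maps tensor names to their dtype, shape, and data offsets. The sketch below relies only on that layout; the function names are hypothetical, and the file path in the usage note assumes the file sits in the current directory.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header: dict) -> dict:
    """Count tensors and elements per dtype string (for example "F16")."""
    summary = {}
    for info in header.values():
        count = 1
        for dim in info["shape"]:
            count *= dim
        entry = summary.setdefault(info["dtype"], {"tensors": 0, "elements": 0})
        entry["tensors"] += 1
        entry["elements"] += count
    return summary
```

Calling `summarize(read_safetensors_header("model.safetensors"))` should list `F16` as the dominant dtype if the weights were cast as described, and the keys of the header show the flattened names such as `fm_decoder.layers.{flat_index}`.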

## Original project

- **Model**: [YatharthS/LuxTTS](https://huggingface.co/YatharthS/LuxTTS)
- **License**: Apache 2.0