# LuxTTS (Safetensors / FP16)
This is a converted version of YatharthS/LuxTTS, a flow-matching based text-to-speech model. All credit for the original model, training, and research goes to the original authors.
## What changed
The original PyTorch checkpoint (`model.pt` and `vocoder/vocos.bin`) has been converted to safetensors format in float16 precision for use with Cake. The conversion applies the following transformations:

- **Format**: `.pt`/`.bin` → `.safetensors` (safer, faster loading, memory-mappable).
- **Precision**: FP32 → FP16, reducing total size from ~530 MB to ~266 MB.
- **Key remapping**: The nested `fm_decoder.encoders.{stack}.layers.{layer}` hierarchy is flattened to `fm_decoder.layers.{flat_index}` using the stack sizes `[2, 2, 4, 4, 4]` (16 layers total). Similarly, `text_encoder.encoders.0.layers` is flattened to `text_encoder.layers`. Per-stack components (`time_emb`, `downsample`, `out_combiner`) are reorganized under `fm_decoder.stack_time_emb`, `fm_decoder.downsample`, and `fm_decoder.out_combiner` respectively.
- **Config**: An `architectures` field and feature extraction parameters (`n_fft`, `hop_length`, `n_mels`, `sample_rate`) are added to `config.json`.
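The layer-index flattening can be sketched in a few lines. This is an illustrative reconstruction, not the actual conversion script; the helper names and the exact regex are assumptions, but the index arithmetic follows the stack sizes `[2, 2, 4, 4, 4]` stated above.

```python
import re

# Layers per fm_decoder stack, as described above (16 layers total).
STACK_SIZES = [2, 2, 4, 4, 4]

def flat_layer_index(stack: int, layer: int) -> int:
    """Map a (stack, layer) pair in the nested hierarchy to a flat index."""
    return sum(STACK_SIZES[:stack]) + layer

def remap_key(key: str) -> str:
    """Flatten fm_decoder.encoders.{stack}.layers.{layer}.* tensor names."""
    m = re.match(r"fm_decoder\.encoders\.(\d+)\.layers\.(\d+)\.(.+)", key)
    if m:
        stack, layer, rest = int(m.group(1)), int(m.group(2)), m.group(3)
        return f"fm_decoder.layers.{flat_layer_index(stack, layer)}.{rest}"
    return key  # keys outside the nested hierarchy pass through unchanged
```

For example, `remap_key("fm_decoder.encoders.2.layers.1.norm.weight")` yields `fm_decoder.layers.5.norm.weight`, since stacks 0 and 1 contribute two layers each.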
No weights were retrained or fine-tuned; this is a lossless format conversion (modulo FP32 → FP16 quantization).
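The FP32 → FP16 cast rounds each weight to the nearest IEEE 754 half-precision value, which keeps roughly three decimal digits of precision. A quick stdlib-only illustration of the per-weight rounding error (no torch dependency; the helper name is made up for this sketch):

```python
import struct

def fp16_roundtrip(x: float) -> float:
    """Round a float to the nearest IEEE 754 half-precision (FP16) value."""
    # struct format "e" packs/unpacks a 16-bit half-precision float.
    return struct.unpack("e", struct.pack("e", x))[0]

# Values exactly representable in FP16 survive unchanged; others pick up
# a small relative error from the 10-bit mantissa.
exact = fp16_roundtrip(0.5)          # 0.5 is exact in FP16
rounded = fp16_roundtrip(0.1)        # 0.1 is not; error is on the order of 1e-5
```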
## Model details
| Component | File | Size |
|---|---|---|
| Main model (flow-matching decoder + text encoder) | `model.safetensors` | 235 MB |
| Vocoder (Vocos) | `vocos.safetensors` | 31 MB |
- Architecture: Flow-matching TTS with conformer-based decoder (16 layers across 5 stacks) and 4-layer text encoder
- Vocoder: Vocos (iSTFT-based, 8 layers, 512 dim)
- Sample rate: 24 kHz (with 48 kHz upsampler head)
- Vocabulary: 360 tokens (characters + punctuation)
## Original project
- Model: YatharthS/LuxTTS
- License: Apache 2.0