LuxTTS (Safetensors / FP16)

This is a converted version of YatharthS/LuxTTS, a flow-matching based text-to-speech model. All credit for the original model, training, and research goes to the original authors.

What changed

The original PyTorch checkpoint (model.pt and vocoder/vocos.bin) has been converted to safetensors format in float16 precision for use with Cake. The conversion applies the following transformations:

  • Format: .pt / .bin → .safetensors (safer, faster loading, memory-mappable).
  • Precision: FP32 → FP16, reducing total size from ~530 MB to ~266 MB.
  • Key remapping: The nested fm_decoder.encoders.{stack}.layers.{layer} hierarchy is flattened to fm_decoder.layers.{flat_index} using the stack sizes [2, 2, 4, 4, 4] (16 layers total). Similarly, text_encoder.encoders.0.layers is flattened to text_encoder.layers. Per-stack components (time_emb, downsample, out_combiner) are reorganized under fm_decoder.stack_time_emb, fm_decoder.downsample, and fm_decoder.out_combiner respectively.
  • Config: architectures field and feature extraction parameters (n_fft, hop_length, n_mels, sample_rate) are added to config.json.
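The key flattening above can be sketched in a few lines. This is a minimal illustration, not the actual conversion script; the exact original key names and the `model.pt` path are assumptions based on the mapping listed here:

```python
import os
import re

# Per-stack layer counts from the conversion notes: 2 + 2 + 4 + 4 + 4 = 16 layers.
STACK_SIZES = [2, 2, 4, 4, 4]
# Starting flat index of each stack: [0, 2, 4, 8, 12].
OFFSETS = [sum(STACK_SIZES[:i]) for i in range(len(STACK_SIZES))]

def remap_key(key: str) -> str:
    """Flatten nested checkpoint keys into the converted layout."""
    m = re.match(r"fm_decoder\.encoders\.(\d+)\.layers\.(\d+)\.(.+)", key)
    if m:
        stack, layer, rest = int(m.group(1)), int(m.group(2)), m.group(3)
        return f"fm_decoder.layers.{OFFSETS[stack] + layer}.{rest}"
    # The text encoder has a single stack, so only the prefix changes.
    return key.replace("text_encoder.encoders.0.layers", "text_encoder.layers")

if os.path.exists("model.pt"):
    # Actual conversion: load the PyTorch checkpoint, remap keys, cast, save.
    import torch
    from safetensors.torch import save_file

    state = torch.load("model.pt", map_location="cpu")
    converted = {remap_key(k): v.half() for k, v in state.items()}
    save_file(converted, "model.safetensors")
```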

No weights were retrained or fine-tuned; this is a lossless format conversion (modulo FP32 → FP16 quantization).
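The FP32 → FP16 cast is the only lossy step. A quick standard-library sketch of what half precision preserves per weight (using `struct`'s IEEE 754 half-float format, no ML framework needed):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision, as the cast does per weight."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

w = 0.123456789
q = to_fp16(w)
# FP16 keeps ~3 decimal digits: relative error stays below 2**-10
# for values in range, while storage halves (2 bytes vs 4 per weight).
assert abs(w - q) / abs(w) < 2 ** -10
```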

Model details

  Component                                         | File              | Size
  Main model (flow-matching decoder + text encoder) | model.safetensors | 235 MB
  Vocoder (Vocos)                                   | vocos.safetensors | 31 MB
  • Architecture: Flow-matching TTS with conformer-based decoder (16 layers across 5 stacks) and 4-layer text encoder
  • Vocoder: Vocos (iSTFT-based, 8 layers, 512 dim)
  • Sample rate: 24 kHz (with 48 kHz upsampler head)
  • Vocabulary: 360 tokens (characters + punctuation)
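Because a safetensors file begins with a JSON header (an 8-byte little-endian length, then the header itself), the converted layout can be inspected with the standard library alone. A minimal sketch; the tiny file it writes is synthetic demo data, not the real checkpoint:

```python
import json
import struct

def read_header(path: str) -> dict:
    """Parse a safetensors header: 8-byte little-endian length, then JSON."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        return json.loads(f.read(header_len))

# Build a tiny synthetic file in the same layout to demonstrate the reader.
header = {"demo.weight": {"dtype": "F16", "shape": [2], "data_offsets": [0, 4]}}
blob = json.dumps(header).encode("utf-8")
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(blob)))
    f.write(blob)
    f.write(b"\x00" * 4)  # 2 fp16 values = 4 bytes of tensor data

info = read_header("demo.safetensors")
print(info["demo.weight"]["dtype"])  # F16
```

Running the same reader against model.safetensors lists every flattened key with its dtype and shape, which is a handy way to verify the 16-layer decoder layout.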

Original project

See YatharthS/LuxTTS on Hugging Face for the original model and research.
