	---
	language:
	- en
	tags:
	- tts
	- text-to-speech
	- safetensors
	- cake
	license: apache-2.0
	base_model: YatharthS/LuxTTS
	---

	# LuxTTS (Safetensors / FP16)

	This is a converted version of [YatharthS/LuxTTS](https://huggingface.co/YatharthS/LuxTTS), a flow-matching based text-to-speech model. All credit for the original model, training, and research goes to the original authors.

	## What changed

	The original PyTorch checkpoints (`model.pt` and `vocoder/vocos.bin`) have been converted to safetensors format in float16 precision for use with [Cake](https://github.com/evilsocket/cake). The conversion applies the following transformations:

	- Format: `.pt` / `.bin` → `.safetensors` (safer, faster loading, memory-mappable).
	- Precision: FP32 → FP16, reducing total size from ~530 MB to ~266 MB.
	- Key remapping: The nested `fm_decoder.encoders.{stack}.layers.{layer}` hierarchy is flattened to `fm_decoder.layers.{flat_index}` using the stack sizes `[2, 2, 4, 4, 4]` (16 layers total). Similarly, `text_encoder.encoders.0.layers` is flattened to `text_encoder.layers`. Per-stack components (`time_emb`, `downsample`, `out_combiner`) are reorganized under `fm_decoder.stack_time_emb`, `fm_decoder.downsample`, and `fm_decoder.out_combiner` respectively.
	- Config: An `architectures` field and the feature-extraction parameters (`n_fft`, `hop_length`, `n_mels`, `sample_rate`) are added to `config.json`.
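
The layer flattening described in the key-remapping item can be sketched in a few lines of Python. This illustrates the index arithmetic only and is not the actual conversion script: `remap_key`, `STACK_SIZES`, and `STACK_OFFSETS` are hypothetical names, and stack-major ordering (all layers of stack 0 first, then stack 1, and so on) is an assumption. The per-stack components (`time_emb`, `downsample`, `out_combiner`) are passed through unchanged here because their exact source key names are not listed above.

```python
import re
from itertools import accumulate

# Layers per stack, as stated in the model card (16 layers total).
STACK_SIZES = [2, 2, 4, 4, 4]
# Flat index of the first layer of each stack: [0, 2, 4, 8, 12].
STACK_OFFSETS = [0, *accumulate(STACK_SIZES)][:-1]

_FM_LAYER = re.compile(r"^fm_decoder\.encoders\.(\d+)\.layers\.(\d+)\.(.+)$")
_TEXT_PREFIX = "text_encoder.encoders.0.layers."


def remap_key(key: str) -> str:
    """Map a nested checkpoint key to its flattened name (hypothetical helper)."""
    m = _FM_LAYER.match(key)
    if m:
        stack, layer, rest = int(m.group(1)), int(m.group(2)), m.group(3)
        if stack >= len(STACK_SIZES) or layer >= STACK_SIZES[stack]:
            raise ValueError(f"unexpected stack/layer index in {key!r}")
        # Assumption: flat index = layers in all earlier stacks + index within this stack.
        return f"fm_decoder.layers.{STACK_OFFSETS[stack] + layer}.{rest}"
    if key.startswith(_TEXT_PREFIX):
        return "text_encoder.layers." + key[len(_TEXT_PREFIX):]
    # Everything else is left as-is in this sketch.
    return key
```

Under these assumptions, layer 3 of stack 2 lands at flat index 4 + 3 = 7, and the last layer of the last stack lands at index 15.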

	No weights were retrained or fine-tuned. Apart from the FP32→FP16 cast, which rounds each weight to half precision, the conversion does not change any values.
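
To get a feel for how much the FP32→FP16 step can change a single value, the following sketch rounds a few numbers to IEEE 754 half precision using only the Python standard library. The sample values are arbitrary, not real weights, and Python floats are 64-bit rather than FP32, so this only approximates what happens to an actual checkpoint. `to_fp16` and `relative_error` are hypothetical helper names.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest half-precision value and return it as a Python float."""
    # The "e" format code is IEEE 754 binary16; packing performs the rounding.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relative_error(x: float) -> float:
    """Relative difference between x and its half-precision rounding."""
    return abs(to_fp16(x) - x) / abs(x)


for w in (0.5, 0.1, 0.0123, 3.14159):
    print(f"{w!r:>8} -> {to_fp16(w)!r:<22} rel. error {relative_error(w):.2e}")
```

Values that are exactly representable (such as 0.5) survive unchanged. For other values in the normal FP16 range the relative rounding error is at most 2^-11 (about 4.9e-4), while very small magnitudes lose more precision.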

	## Model details

	| Component | File | Size |
	|---|---|---|
	| Main model (flow-matching decoder + text encoder) | `model.safetensors` | 235 MB |
	| Vocoder (Vocos) | `vocos.safetensors` | 31 MB |
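
One way to check the tensor names and dtypes of the files in the table above, without installing anything, is to read the safetensors header directly. A safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing each tensor. The helper names below (`read_safetensors_header`, `summarize`) are hypothetical, and this is a minimal sketch rather than a replacement for the `safetensors` library.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return the per-tensor entries of a safetensors header (name -> info)."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header: dict):
    """Return the sorted set of dtypes and the total tensor payload in bytes."""
    dtypes = sorted({info["dtype"] for info in header.values()})
    total = sum(info["data_offsets"][1] - info["data_offsets"][0]
                for info in header.values())
    return dtypes, total


# Example usage, assuming model.safetensors is in the current directory:
#   header = read_safetensors_header("model.safetensors")
#   print(summarize(header))
#   print(sorted(header)[:10])
```

For a file converted as described here, the dtype list would be expected to contain `F16`, and the key names should show the flattened `fm_decoder.layers.*` layout.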

	- Architecture: Flow-matching TTS with conformer-based decoder (16 layers across 5 stacks) and 4-layer text encoder
	- Vocoder: Vocos (iSTFT-based, 8 layers, 512 dim)
	- Sample rate: 24 kHz (with 48 kHz upsampler head)
	- Vocabulary: 360 tokens (characters + punctuation)

	## Original project

	- Model: [YatharthS/LuxTTS](https://huggingface.co/YatharthS/LuxTTS)
	- License: Apache 2.0