SlopTTS
SlopTTS is an experimental text-to-speech system built around a custom finite scalar quantization (FSQ) speech codec and a flow-based generation pipeline. The repository contains training and inference code for the codec, duration modeling, speaker conditioning, and latent flow modeling.
The project is intended for research and experimentation in neural speech synthesis and speech representation learning.
Features
- Custom FSQ speech codec pipeline
- Flow-based latent generation
- Speaker-aware synthesis components
- Duration prediction modules
- End-to-end training scripts
- Inference utilities for speech generation
- Modular architecture for experimentation and research
Repository Structure
Configs/ Configuration files
Data/ Dataset-related resources
Modules/ Model components
train_codec_hybrid_temporal.py Codec training
train_codec_mel_speaker.py Mel + speaker codec training
train_codec_speaker.py Speaker codec training
train_duration_predictor_context.py Duration prediction training
train_fsq_flow_convnext.py Flow model training
train_predictors_speaker_flow_context_temporal.py Predictor training
infer_sloptts.py Main TTS inference
infer_fsq_flow.py Flow inference
models.py Core model definitions
models_speaker.py Speaker modules
models_mel_speaker.py Mel-speaker models
losses.py Training losses
optimizers.py Optimizer utilities
utils.py Helper functions
Project Overview
SlopTTS follows a multi-stage speech generation pipeline. Text is first converted into intermediate representations, which the duration and contextual prediction modules then transform. A custom FSQ-based codec provides a compact representation of speech, and flow-based models generate the latent representations from which speech is reconstructed.
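FSQ (finite scalar quantization) bounds each latent dimension and rounds it to one of a small, fixed number of levels, so the codebook is implied by the grid rather than learned. The following is a minimal sketch of that idea in plain Python, not the repository's codec: the function names, the tanh-based bounding, and the level counts used in the example are all assumptions chosen for illustration.

```python
import math


def fsq_quantize(z, levels):
    """Quantize one latent vector with a simplified FSQ scheme.

    z      : list of floats, one value per latent dimension
    levels : list of ints, number of quantization levels per dimension
    Returns (codes, index): per-dimension integer codes and one flat index.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")

    codes = []
    for value, n_levels in zip(z, levels):
        # Bound the value to [0, 1] with tanh, then scale to [0, n_levels - 1].
        unit = (math.tanh(value) + 1.0) / 2.0
        # Round half up (avoids Python's banker's rounding in round()).
        codes.append(math.floor(unit * (n_levels - 1) + 0.5))

    # Combine the per-dimension codes into one mixed-radix index.
    index, stride = 0, 1
    for code, n_levels in zip(codes, levels):
        index += code * stride
        stride *= n_levels
    return codes, index


def fsq_dequantize(codes, levels):
    """Map integer codes back to values on a uniform grid in [-1, 1]."""
    return [2.0 * code / (n_levels - 1) - 1.0
            for code, n_levels in zip(codes, levels)]


# Hypothetical configuration: 4 latent dimensions, 8 * 5 * 5 * 5 = 1000 codes.
example_levels = [8, 5, 5, 5]
example_codes, example_index = fsq_quantize([0.3, -1.2, 0.0, 2.5], example_levels)
example_values = fsq_dequantize(example_codes, example_levels)
```

During training, the rounding step is usually paired with a straight-through gradient estimator. That part is omitted here because it needs an autograd framework such as PyTorch.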
The repository separates codec training, predictor training, and speech generation into independent stages, allowing each component to be improved or replaced individually.
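To make the flow-based generation stage more concrete, here is a small sketch of how a latent can be sampled by integrating a velocity field from noise toward data. This README does not say which flow formulation the models use, so the sketch assumes a flow-matching-style setup with straight-line paths. `euler_sample` and `toy_velocity` are hypothetical names, and the toy field stands in for a trained network.

```python
import random


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays strictly below 1 inside the loop
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target):
    """Stand-in for a trained model (assumption, not the repository's network).

    For straight-line paths x_t = (1 - t) * x0 + t * x1, the velocity that
    carries x_t to the endpoint x1 is (x1 - x_t) / (1 - t).
    """
    def velocity_fn(x, t):
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity_fn


random.seed(0)
target_latent = [0.5, -1.0, 2.0]                   # pretend "data" latent
noise = [random.gauss(0.0, 1.0) for _ in target_latent]
sample = euler_sample(toy_velocity(target_latent), noise, num_steps=10)
```

With a real model, `velocity_fn` would be the network's prediction conditioned on text, duration, and speaker features, and the number of Euler steps would trade speed against quality.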
Training
Training is organized into multiple stages:
- Codec training
- Speaker-aware representation learning
- Duration prediction
- Flow model training
- Predictor training
- End-to-end synthesis evaluation
Individual training scripts are provided for each stage.
Example:
python train_codec_hybrid_temporal.py
or
python train_fsq_flow_convnext.py
The configuration files in Configs/ can be adjusted to suit dataset size, hardware resources, and training objectives.
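Because each stage has its own script, a small driver can run several stages in sequence. The sketch below is one way to do that. The script names come from the Repository Structure listing, but the stage labels, the ordering, and the helper functions are assumptions, and the speaker codec scripts are left out because their place in the sequence is not stated here.

```python
import subprocess
import sys

# (label, script) pairs. Labels and order are assumptions for illustration.
STAGES = [
    ("codec", "train_codec_hybrid_temporal.py"),
    ("duration", "train_duration_predictor_context.py"),
    ("flow", "train_fsq_flow_convnext.py"),
    ("predictors", "train_predictors_speaker_flow_context_temporal.py"),
]


def build_commands(stages, python=sys.executable, start_at=None):
    """Return one command per stage, optionally skipping earlier stages."""
    labels = [label for label, _ in stages]
    if start_at is not None and start_at not in labels:
        raise ValueError(f"unknown stage: {start_at}")
    first = labels.index(start_at) if start_at is not None else 0
    return [[python, script] for _, script in stages[first:]]


def run_stages(commands, dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence if a stage exits with an error.
            subprocess.run(command, check=True)


# Dry run: only prints the commands that would be executed.
run_stages(build_commands(STAGES, start_at="flow"), dry_run=True)
```

Keeping `dry_run=True` as the default makes it easy to confirm the sequence before committing GPU time to it.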
Inference
To generate speech:
python infer_sloptts.py
Additional flow-based inference utilities are available:
python infer_fsq_flow.py
Refer to the configuration files in Configs/ for model paths and runtime settings.
Requirements
Recommended:
- Python 3.10+
- PyTorch
- CUDA-capable GPU
- NumPy
- SciPy
Additional dependencies may be listed in the project environment configuration.
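A quick way to confirm that the recommended requirements are in place is a short check script. The following is a sketch only: `check_environment` is a hypothetical helper, not part of the repository, and it assumes the packages are importable as `torch`, `numpy`, and `scipy`.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10), packages=("torch", "numpy", "scipy")):
    """Report whether the interpreter and packages match the recommendations."""
    report = {"python_ok": sys.version_info[:2] >= min_python}

    # find_spec returns None when a top-level package is not installed.
    for name in packages:
        report[name] = importlib.util.find_spec(name) is not None

    # Only query CUDA when PyTorch is actually installed.
    report["cuda"] = False
    if report.get("torch"):
        try:
            import torch
            report["cuda"] = bool(torch.cuda.is_available())
        except ImportError:
            report["torch"] = False
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```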
Current Status
This repository is under active development and may contain experimental components. Interfaces, training procedures, and model architectures may change between versions.
Intended Use
SlopTTS is intended for:
- Speech synthesis research
- Codec-based TTS experiments
- Representation learning research
- Multilingual speech generation studies
- Custom voice and speaker modeling research
Disclaimer
This project is provided for research and educational purposes. Generated speech quality depends on the training data, configuration, and model checkpoints used.