SlopTTS

SlopTTS is an experimental text-to-speech system built around a custom finite scalar quantization (FSQ) speech codec and a flow-based generation pipeline. The repository contains training and inference code for speech synthesis, codec training, duration modeling, speaker conditioning, and latent flow modeling.

The project is intended for research and experimentation in neural speech synthesis and speech representation learning.

Features

  • Custom FSQ speech codec pipeline
  • Flow-based latent generation
  • Speaker-aware synthesis components
  • Duration prediction modules
  • End-to-end training scripts
  • Inference utilities for speech generation
  • Modular architecture for experimentation and research

Repository Structure

Configs/                          Configuration files
Data/                             Dataset-related resources
Modules/                          Model components

train_codec_hybrid_temporal.py    Codec training
train_codec_mel_speaker.py        Mel + speaker codec training
train_codec_speaker.py            Speaker codec training

train_duration_predictor_context.py
                                  Duration prediction training

train_fsq_flow_convnext.py        Flow model training

train_predictors_speaker_flow_context_temporal.py
                                  Predictor training

infer_sloptts.py                  Main TTS inference
infer_fsq_flow.py                 Flow inference

models.py                         Core model definitions
models_speaker.py                 Speaker modules
models_mel_speaker.py             Mel-speaker models

losses.py                         Training losses
optimizers.py                     Optimizer utilities
utils.py                          Helper functions

Project Overview

SlopTTS follows a multi-stage speech generation pipeline. Text is processed into intermediate representations, which are transformed through duration and contextual prediction modules. A custom FSQ-based codec is used to represent speech efficiently, while flow-based models generate coherent latent representations for speech reconstruction.

The repository separates codec training, predictor training, and speech generation into independent stages, allowing each component to be improved or replaced individually.
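
For readers unfamiliar with FSQ, the sketch below shows the general idea of finite scalar quantization on a single latent vector: each dimension is bounded, rounded to a small fixed grid, and the per-dimension codes are packed into one codebook index. It is a minimal illustration, not the codec in this repository. The function name `fsq_quantize`, the level counts, and the restriction to odd level counts are all assumptions made for the example.

```python
import math


def fsq_quantize(z, levels):
    """Quantize one latent vector with finite scalar quantization (sketch).

    z: list of floats, one per latent dimension.
    levels: list of odd ints, the number of grid points per dimension.
    Returns (codes, index). Each code is an integer in
    [-(L - 1) / 2, (L - 1) / 2]; index is a single integer in
    [0, product of levels).
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    index = 0
    for value, n_levels in zip(z, levels):
        if n_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (n_levels - 1) // 2
        # Bound the value to (-half, half) with tanh, then round to the grid.
        code = round(math.tanh(value) * half)
        codes.append(code)
        # Mixed-radix packing: shift the code to [0, n_levels) and accumulate.
        index = index * n_levels + (code + half)
    return codes, index


if __name__ == "__main__":
    # Hypothetical 3-dimensional latent with 5 levels per dimension.
    print(fsq_quantize([0.0, 10.0, -10.0], [5, 5, 5]))
```

In a trained codec the rounding step is usually paired with a straight-through gradient estimator so the encoder can still be optimized; that part is omitted here because it depends on the training framework.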

Training

Training is organized into multiple stages:

  1. Codec training
  2. Speaker-aware representation learning
  3. Duration prediction
  4. Flow model training
  5. Predictor training
  6. End-to-end synthesis evaluation

Individual training scripts are provided for each stage.

Example:

python train_codec_hybrid_temporal.py

or

python train_fsq_flow_convnext.py

Adjust the configuration files in Configs/ to match your dataset size, hardware resources, and training objectives.
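
To run several stages back to back, a small driver script can help. The sketch below is one possible way to do it. The mapping from stages to scripts and their order are assumptions based on the stage list and file names above, and the scripts are invoked without arguments, as in the examples. `TRAINING_STAGES`, `build_commands`, and `run_stages` are hypothetical names, not part of the repository.

```python
import subprocess
import sys

# Assumed stage-to-script mapping. The order follows the stage list in this
# README; it is a guess, not a documented requirement.
TRAINING_STAGES = [
    ("codec", "train_codec_hybrid_temporal.py"),
    ("speaker", "train_codec_speaker.py"),
    ("duration", "train_duration_predictor_context.py"),
    ("flow", "train_fsq_flow_convnext.py"),
    ("predictors", "train_predictors_speaker_flow_context_temporal.py"),
]


def build_commands(stages=TRAINING_STAGES, python=sys.executable):
    """Return one command (an argument list) per stage, in order."""
    return [[python, script] for _, script in stages]


def run_stages(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True makes subprocess raise CalledProcessError on a non-zero
    exit code, so a failed stage stops the remaining ones.
    """
    for command in commands:
        print("running:", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the commands that would be executed.
    run_stages(build_commands(), dry_run=True)
```

Keeping the dry run as the default makes it easy to review the planned order before committing GPU time to it.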

Inference

To generate speech:

python infer_sloptts.py

Additional flow-based inference utilities are available:

python infer_fsq_flow.py

Refer to the configuration files for model paths and runtime settings.

Requirements

Recommended:

  • Python 3.10+
  • PyTorch
  • CUDA-capable GPU
  • NumPy
  • SciPy

Additional dependencies may be listed in the project environment configuration.
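
A quick way to check an environment against the list above is sketched below. It only tests the Python version and whether the packages can be found. It does not check package versions or GPU availability. The function name `check_environment` is hypothetical; the import names `torch`, `numpy`, and `scipy` are the standard ones for PyTorch, NumPy, and SciPy.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "numpy", "scipy")):
    """Report whether Python is 3.10+ and which packages are importable."""
    report = {"python_ok": sys.version_info >= (3, 10)}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```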

Current Status

This repository is actively developed and may contain experimental components. Interfaces, training procedures, and model architectures can change between versions.

Intended Use

SlopTTS is intended for:

  • Speech synthesis research
  • Codec-based TTS experiments
  • Representation learning research
  • Multilingual speech generation studies
  • Custom voice and speaker modeling research

Disclaimer

This project is provided for research and educational purposes. Generated speech quality depends on the training data, configuration, and model checkpoints used.
