Mesko TTS

Mesko TTS is MesklinTech's dedicated text-to-speech research project.

We are actively training Mesko TTS as a fast, streaming-oriented speech model. This repository currently shares the architecture and training code while the full voice system continues to improve.

MesklinTech is open to collaboration with researchers, engineers, product teams, and supporters who are interested in efficient real-time speech AI. To connect with us or support the work, visit:

https://mesklintech.com

Mission

MesklinTech is building practical AI systems from first principles: compact, efficient, understandable models that can run outside large-lab infrastructure. Mesko TTS is our speech effort: a fast, streaming-oriented TTS stack designed around sparse routing, explicit acoustic control, and low-latency inference.

Our goal is to build a world-class fast streaming TTS system for real-time assistants, accessibility products, education tools, creator workflows, and business voice interfaces.

Current Status

Status: untrained architecture release / training in progress

What is available now:

  • TTS architecture source code
  • sparse semantic encoder
  • speaker encoder
  • duration, pitch, and energy predictors
  • sparse acoustic decoder
  • sparse neural vocoder code
  • LJSpeech training scripts and config structure

Note: no trained model weights are attached to this repository yet.

What is not ready yet:

  • production-quality speech checkpoint
  • production-grade trained neural vocoder release
  • standardized MOS, WER, and speaker-similarity benchmarks
  • long-form streaming quality validation

Architecture Direction

Mesko TTS is built around:

  • low-rank Q/K/V projections
  • causal sparse candidate attention
  • local, memory, landmark, and content candidate routing
  • laminar excitatory/inhibitory refinement
  • explicit speaker conditioning
  • explicit duration, pitch, and energy modeling
  • compact acoustic decoding
  • streaming-oriented state/cache structure
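
As a rough illustration of causal sparse candidate attention, the sketch below builds the set of key positions a single query position is allowed to attend to. It is not taken from the Mesko TTS source: the function name `causal_candidates`, the parameters `window` and `landmark_stride`, and the choice of evenly spaced landmarks are all assumptions made for this example, and the memory and content candidate routes are left out.

```python
def causal_candidates(t, window=4, landmark_stride=8):
    """Return the sorted key positions that query position t may attend to.

    Hypothetical helper, not the repository's API. It combines two of the
    candidate types named above:
      - local: the last `window` positions up to and including t
      - landmark: every `landmark_stride`-th position from the start
    Every candidate is <= t, so the selection stays causal.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    if window < 1 or landmark_stride < 1:
        raise ValueError("window and landmark_stride must be positive")
    local = range(max(0, t - window + 1), t + 1)
    landmarks = range(0, t + 1, landmark_stride)
    return sorted(set(local) | set(landmarks))


# Position 10 sees its 3 most recent positions plus landmarks 0, 4, 8.
print(causal_candidates(10, window=3, landmark_stride=4))  # [0, 4, 8, 9, 10]
```

Under these assumptions each query scores at most `window + t // landmark_stride + 1` keys rather than all `t + 1`, which is the kind of saving a sparse candidate scheme is after.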

The intended model path is:

  1. Reference mel -> speaker encoder
  2. Text tokens -> sparse semantic encoder
  3. Semantic states + speaker latent -> FiLM conditioning
  4. Duration predictor -> length regulation
  5. Pitch and energy predictors -> frame-level controls
  6. Frame states + speaker + pitch + energy -> sparse acoustic decoder
  7. Acoustic energy/gating head -> mel spectrogram
  8. Trained neural vocoder -> waveform
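
Steps 3 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the names `film` and `length_regulate` are hypothetical, the scale and shift vectors `gamma` and `beta` are passed in directly (in a real model they would typically be predicted from the speaker latent), and durations are given as integer frame counts per token.

```python
def film(states, gamma, beta):
    """Feature-wise linear modulation: scale and shift every feature.

    states: list of token-level feature vectors
    gamma, beta: per-feature scale and shift (assumed given here)
    """
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in states]


def length_regulate(states, durations):
    """Repeat each token-level state durations[i] times to get frame-level states."""
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per state")
    frames = []
    for state, d in zip(states, durations):
        if d < 0:
            raise ValueError("durations must be non-negative")
        frames.extend(list(state) for _ in range(d))
    return frames


tokens = [[1.0, 2.0], [3.0, 4.0]]              # two tokens, two features each
conditioned = film(tokens, gamma=[2.0, 0.5], beta=[0.0, 1.0])
frames = length_regulate(conditioned, durations=[2, 1])
print(frames)  # [[2.0, 2.0], [2.0, 2.0], [6.0, 3.0]]
```

The frame-level sequence has one entry per output frame, so its length equals the sum of the durations; the pitch and energy controls in step 5 would then be predicted at this frame resolution.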

Weights

No trained weights are attached to this repository yet.

Until a full text-to-mel and vocoder training run is complete, this repository should be treated as source code and architecture documentation, not as a finished voice model.

Responsible Use

Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.
