Mesko TTS

Mesko TTS is MesklinTech's dedicated text-to-speech research project.

We are actively training Mesko TTS as a fast, streaming-oriented speech model. This repository currently shares the architecture and training code while training of the full voice system is still in progress.

MesklinTech is open to collaboration with researchers, engineers, product teams, and supporters who are interested in efficient real-time speech AI. To connect with us or support the work, visit:

https://mesklintech.com

Mission

MesklinTech is building practical AI systems from first principles: compact, efficient, understandable models that can run outside large-lab infrastructure. Mesko TTS is our speech effort: a fast, streaming-oriented TTS stack designed around sparse routing, explicit acoustic control, and low-latency inference.

Our goal is to build a world-class fast streaming TTS system for real-time assistants, accessibility products, education tools, creator workflows, and business voice interfaces.

Current Status

Status: untrained architecture release / training in progress

What is available now:

TTS architecture source code
sparse semantic encoder
speaker encoder
duration, pitch, and energy predictors
sparse acoustic decoder
sparse neural vocoder code
LJSpeech training scripts and config structure

No trained model weights are attached to this repository yet.

What is not ready yet:

production-quality speech checkpoint
production-grade trained neural vocoder release
standardized MOS / WER / speaker-similarity benchmark
long-form streaming quality validation

Architecture Direction

Mesko TTS is built around:

low-rank Q/K/V projections
causal sparse candidate attention
local, memory, landmark, and content candidate routing
laminar excitatory/inhibitory refinement
explicit speaker conditioning
explicit duration, pitch, and energy modeling
compact acoustic decoding
streaming-oriented state/cache structure
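
Two items from the list above can be illustrated with a small, dependency-free sketch. It is not taken from the Mesko TTS source: the function names, the rank of 64, the window size, and the fixed-stride landmark rule are all assumptions chosen for illustration. The first function compares weight counts for a full projection against a two-factor low-rank projection; the second builds a causal candidate set for one query position from a local window plus strided landmarks (memory and content routing are not modeled here).

```python
def low_rank_param_count(d_model: int, rank: int) -> tuple[int, int]:
    """Return (full, low_rank) weight counts for one projection, biases ignored.

    A full projection uses a d_model x d_model matrix. A low-rank projection
    factors it into d_model x rank followed by rank x d_model.
    """
    full = d_model * d_model
    low_rank = d_model * rank + rank * d_model
    return full, low_rank


def causal_candidates(t: int, window: int, landmark_stride: int) -> list[int]:
    """Key positions that query position t may attend to.

    Hypothetical rule: the last `window` positions up to and including t,
    plus every `landmark_stride`-th position from 0. Nothing after t is
    ever included, so the candidate set is causal.
    """
    local = set(range(max(0, t - window + 1), t + 1))
    landmarks = set(range(0, t + 1, landmark_stride))
    return sorted(local | landmarks)


print(low_rank_param_count(512, 64))  # (262144, 65536)
print(causal_candidates(10, 3, 4))    # [0, 4, 8, 9, 10]
```

Under these assumed sizes, the low-rank factorization uses a quarter of the weights of the full projection, and the query at position 10 scores 5 candidates instead of 11.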

The intended model path is:

Reference mel -> speaker encoder
Text tokens -> sparse semantic encoder
Semantic states + speaker latent -> FiLM conditioning
Duration predictor -> length regulation
Pitch and energy predictors -> frame-level controls
Frame states + speaker + pitch + energy -> sparse acoustic decoder
Acoustic energy/gating head -> mel spectrogram
Trained neural vocoder -> waveform
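
As a rough illustration of two steps in this path, the sketch below shows FiLM conditioning and duration-based length regulation on plain Python lists. It is a minimal sketch under stated assumptions, not the repository's implementation: the function names are hypothetical, and `gamma` and `beta` are passed in directly, whereas in a FiLM layer they would normally be produced from the conditioning vector (here, the speaker latent) by a learned projection.

```python
def film(states, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature.

    states: list of token-level feature vectors.
    gamma, beta: per-feature scale and shift (assumed to come from the
    speaker latent; supplied directly in this sketch).
    """
    return [[g * x + b for x, g, b in zip(state, gamma, beta)]
            for state in states]


def length_regulate(states, durations):
    """Repeat each token-level state durations[i] times to get frame-level states."""
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per token state")
    frames = []
    for state, n in zip(states, durations):
        frames.extend(list(state) for _ in range(n))
    return frames


token_states = [[1.0, 2.0], [3.0, 4.0]]   # two tokens, two features each
gamma, beta = [2.0, 0.5], [0.0, 1.0]      # hypothetical speaker-derived values

conditioned = film(token_states, gamma, beta)
frames = length_regulate(conditioned, [2, 1])  # token 0 lasts 2 frames, token 1 lasts 1
print(frames)  # [[2.0, 2.0], [2.0, 2.0], [6.0, 3.0]]
```

The frame-level sequence has one entry per predicted frame, which is the resolution at which pitch and energy controls would then be applied.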

Weights

No trained weights are attached to this repository yet.

Until a full text-to-mel and vocoder training run is complete, this repository should be treated as source code and architecture documentation, not as a finished voice model.

Responsible Use

Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.
