Mesko TTS
Mesko TTS is MesklinTech's dedicated text-to-speech research project.
We are actively training Mesko TTS as a fast, streaming-oriented speech model. This repository currently shares the architecture and training code while the full voice system is still being trained.
MesklinTech is open to collaboration with researchers, engineers, product teams, and supporters who are interested in efficient real-time speech AI. To connect with us or support the work, visit:
Mission
MesklinTech is building practical AI systems from first principles: compact, efficient, understandable models that can run outside large-lab infrastructure. Mesko TTS is our speech effort: a fast, streaming-oriented TTS stack designed around sparse routing, explicit acoustic control, and low-latency inference.
Our goal is to build a world-class fast streaming TTS system for real-time assistants, accessibility products, education tools, creator workflows, and business voice interfaces.
Current Status
Status: untrained architecture release / training in progress
What is available now:
- TTS architecture source code
- sparse semantic encoder
- speaker encoder
- duration, pitch, and energy predictors
- sparse acoustic decoder
- sparse neural vocoder code
- LJSpeech training scripts and config structure
What is not ready yet:
- trained model weights (none are attached to this repository yet)
- production-quality speech checkpoint
- production-grade trained neural vocoder release
- standardized MOS / WER / speaker-similarity benchmark
- long-form streaming quality validation
Architecture Direction
Mesko TTS is built around:
- low-rank Q/K/V projections
- causal sparse candidate attention
- local, memory, landmark, and content candidate routing
- laminar excitatory/inhibitory refinement
- explicit speaker conditioning
- explicit duration, pitch, and energy modeling
- compact acoustic decoding
- streaming-oriented state/cache structure
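To make "causal sparse candidate attention" more concrete, here is a minimal sketch of how a per-query candidate set could be built from local and landmark positions. It is an illustration under stated assumptions, not the repository's routing code: the function name `causal_candidates`, the parameters `local_window` and `landmark_stride`, and the fixed-stride landmark rule are all hypothetical, and the memory and content routes are left out.

```python
def causal_candidates(t, local_window=4, landmark_stride=8):
    """Return the sorted key positions that query position t may attend to.

    Hypothetical rule (an assumption, not Mesko TTS's actual routing):
    the last `local_window` positions up to and including t, plus every
    `landmark_stride`-th position from 0 up to t.
    No returned index exceeds t, so the pattern is causal.
    """
    local = set(range(max(0, t - local_window + 1), t + 1))
    landmarks = set(range(0, t + 1, landmark_stride))
    return sorted(local | landmarks)


if __name__ == "__main__":
    for t in (0, 5, 20):
        print(t, causal_candidates(t))
    # 0 [0]
    # 5 [0, 2, 3, 4, 5]
    # 20 [0, 8, 16, 17, 18, 19, 20]
```

With a rule like this, each query scores only a handful of keys rather than every earlier position, which is the kind of property that makes a bounded streaming cache plausible.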
The intended model path is:
- Reference mel -> speaker encoder
- Text tokens -> sparse semantic encoder
- Semantic states + speaker latent -> FiLM conditioning
- Duration predictor -> length regulation
- Pitch and energy predictors -> frame-level controls
- Frame states + speaker + pitch + energy -> sparse acoustic decoder
- Acoustic energy/gating head -> mel spectrogram
- Trained neural vocoder -> waveform
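Two steps in the path above, FiLM conditioning and length regulation, can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the repository's implementation: the function names `film` and `length_regulate` are hypothetical, and `gamma` and `beta` are fixed numbers here, whereas a real model would presumably derive them from the speaker latent with learned layers.

```python
def film(states, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel.

    states: list of per-token feature vectors.
    gamma, beta: per-channel scale and shift (assumed to come from the
    speaker latent in a real model; hard-coded in this sketch).
    """
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in states]


def length_regulate(states, durations):
    """Repeat each token state durations[i] times to get frame-level states."""
    if len(states) != len(durations):
        raise ValueError("need one duration per token state")
    frames = []
    for state, d in zip(states, durations):
        # Negative predictions are clamped to zero frames.
        frames.extend(list(state) for _ in range(max(0, int(d))))
    return frames


if __name__ == "__main__":
    token_states = [[1.0, 2.0], [3.0, 4.0]]   # two tokens, two channels
    conditioned = film(token_states, gamma=[2.0, 0.5], beta=[0.0, 1.0])
    frames = length_regulate(conditioned, durations=[2, 3])
    print(conditioned)  # [[2.0, 2.0], [6.0, 3.0]]
    print(len(frames))  # 5
```

The frame count equals the sum of the predicted durations, so the duration predictor directly sets the length of the mel spectrogram the decoder has to produce.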
Weights
No trained weights are attached to this repository yet.
Until a full text-to-mel and vocoder training run is complete, this repository should be treated as source code and architecture documentation, not as a finished voice model.
Responsible Use
Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.