| --- |
| language: |
| - en |
| license: other |
| library_name: pytorch |
| pipeline_tag: text-to-speech |
| tags: |
| - text-to-speech |
| - streaming-tts |
| - sparse-attention |
| - low-rank |
| - cpu-first |
| - mesko-tts |
| - mesklintech |
| datasets: |
| - keithito/lj_speech |
| --- |
| |
| # Mesko TTS |
|
|
| Mesko TTS is MesklinTech's dedicated text-to-speech research project. |
|
|
| Mesko TTS is being trained as a fast, streaming-oriented speech model. This repository currently shares the architecture and training code; trained weights are not yet included. |
|
|
| MesklinTech is open to collaboration with researchers, engineers, product teams, and supporters who are interested in efficient real-time speech AI. To connect with us or support the work, visit: |
|
|
| **https://mesklintech.com** |
|
|
| ## Mission |
|
|
| MesklinTech is building practical AI systems from first principles: compact, efficient, understandable models that can run outside large-lab infrastructure. Mesko TTS is our speech effort: a fast, streaming-oriented TTS stack designed around sparse routing, explicit acoustic control, and low-latency inference. |
|
|
| Our goal is to build a world-class fast streaming TTS system for real-time assistants, accessibility products, education tools, creator workflows, and business voice interfaces. |
|
|
| ## Current Status |
|
|
| Status: **untrained architecture release / training in progress** |
|
|
| What is available now: |
|
|
| - TTS architecture source code |
| - sparse semantic encoder |
| - speaker encoder |
| - duration, pitch, and energy predictors |
| - sparse acoustic decoder |
| - sparse neural vocoder code |
| - LJSpeech training scripts and config structure |
|
| No trained model weights are attached to this repository yet. |
|
|
| What is not ready yet: |
|
|
| - production-quality speech checkpoint |
| - production-grade trained neural vocoder release |
| - standardized MOS, WER, and speaker-similarity benchmarks |
| - long-form streaming quality validation |
|
|
| ## Architecture Direction |
|
|
| Mesko TTS is built around: |
|
|
| - low-rank Q/K/V projections |
| - causal sparse candidate attention |
| - local, memory, landmark, and content candidate routing |
| - laminar excitatory/inhibitory refinement |
| - explicit speaker conditioning |
| - explicit duration, pitch, and energy modeling |
| - compact acoustic decoding |
| - streaming-oriented state/cache structure |
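
To make two of the items above concrete, here is a minimal, dependency-free sketch. It is not taken from the Mesko TTS source. The function names, the window size, and the landmark stride are all hypothetical, and memory and content routing are left out because they depend on learned state. The first function compares weight counts for a dense projection and a rank-`r` factorization; the second builds a causal candidate set from a local window plus periodic landmarks.

```python
from __future__ import annotations


def low_rank_param_count(d_model: int, rank: int) -> tuple[int, int]:
    """Return (dense, low_rank) weight counts for one d_model x d_model projection."""
    dense = d_model * d_model
    # Factorized form: a (d_model x rank) down-projection followed by
    # a (rank x d_model) up-projection. Biases are ignored.
    low_rank = 2 * d_model * rank
    return dense, low_rank


def causal_candidates(t: int, window: int, landmark_stride: int) -> list[int]:
    """Key positions that query position t may attend to (illustrative only).

    Combines a local window ending at t with landmark positions placed every
    landmark_stride steps. Every candidate is <= t, so the set is causal.
    """
    local = range(max(0, t - window + 1), t + 1)
    landmarks = range(0, t + 1, landmark_stride)
    return sorted(set(local) | set(landmarks))


# Assumed example sizes, chosen only for illustration.
dense, factored = low_rank_param_count(d_model=512, rank=64)
candidates = causal_candidates(t=10, window=3, landmark_stride=4)
print(dense, factored)  # 262144 65536
print(candidates)       # [0, 4, 8, 9, 10]
```

With these assumed sizes the factorized projection uses a quarter of the dense weights, and the query at position 10 scores 5 keys instead of 11. The candidate set never looks ahead of `t`, which is what allows frame-by-frame streaming.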
|
|
| The intended model path is: |
|
|
| 1. Reference mel -> speaker encoder |
| 2. Text tokens -> sparse semantic encoder |
| 3. Semantic states + speaker latent -> FiLM conditioning |
| 4. Duration predictor -> length regulation |
| 5. Pitch and energy predictors -> frame-level controls |
| 6. Frame states + speaker + pitch + energy -> sparse acoustic decoder |
| 7. Acoustic energy/gating head -> mel spectrogram |
| 8. Trained neural vocoder -> waveform |
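
Steps 3 and 4 of this path can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's implementation. FiLM is shown in its standard form (a per-feature scale `gamma` and shift `beta`), and length regulation is shown as simple repetition of each token state by its duration. In a real model `gamma` and `beta` would be predicted from the speaker latent and the durations would come from the duration predictor; here they are hard-coded, and the helper names `film` and `length_regulate` are hypothetical.

```python
from __future__ import annotations


def film(state: list[float], gamma: list[float], beta: list[float]) -> list[float]:
    """Feature-wise linear modulation: scale and shift each feature of one state."""
    return [g * s + b for s, g, b in zip(state, gamma, beta)]


def length_regulate(states: list[list[float]], durations: list[int]) -> list[list[float]]:
    """Expand token-level states to frame level by repeating state i durations[i] times."""
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per token state")
    frames: list[list[float]] = []
    for state, n_frames in zip(states, durations):
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Two token states with two features each (made-up values).
token_states = [[1.0, 2.0], [3.0, 4.0]]
# Stand-ins for values a speaker-conditioned layer would predict.
gamma, beta = [2.0, 0.5], [0.0, 1.0]

conditioned = [film(s, gamma, beta) for s in token_states]
frames = length_regulate(conditioned, durations=[2, 1])
print(frames)  # [[2.0, 2.0], [2.0, 2.0], [6.0, 3.0]]
```

The first token is held for two frames and the second for one, so two token states become three frame states. The later pitch, energy, and decoder stages operate on that frame-level sequence.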
|
|
| ## Weights |
|
|
| No trained weights are attached to this repository yet. |
|
|
| Until a full text-to-mel and vocoder training run is complete, this repository should be treated as source code and architecture documentation, not as a finished voice model. |
|
|
| ## Responsible Use |
|
|
| Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care. |
|
|