
	---
	language:
	- en
	license: other
	library_name: pytorch
	pipeline_tag: text-to-speech
	tags:
	- text-to-speech
	- streaming-tts
	- sparse-attention
	- low-rank
	- cpu-first
	- mesko-tts
	- mesklintech
	datasets:
	- keithito/lj_speech
	---

	# Mesko TTS

	Mesko TTS is MesklinTech's dedicated text-to-speech research project.

	We are actively training Mesko TTS as a fast, streaming-oriented speech model. This repository currently contains the architecture and training code only; the full voice system is still in training.

	MesklinTech is open to collaboration with researchers, engineers, product teams, and supporters who are interested in efficient real-time speech AI. To connect with us or support the work, visit:

	https://mesklintech.com

	## Mission

	MesklinTech is building practical AI systems from first principles: compact, efficient, understandable models that can run outside large-lab infrastructure. Mesko TTS is our speech effort: a fast, streaming-oriented TTS stack designed around sparse routing, explicit acoustic control, and low-latency inference.

	Our goal is to build a world-class fast streaming TTS system for real-time assistants, accessibility products, education tools, creator workflows, and business voice interfaces.

	## Current Status

	Status: untrained architecture release / training in progress

	What is available now:

	- TTS architecture source code
	- sparse semantic encoder
	- speaker encoder
	- duration, pitch, and energy predictors
	- sparse acoustic decoder
	- sparse neural vocoder code
	- LJSpeech training scripts and config structure

	No trained model weights are attached to this repository yet.

	What is not ready yet:

	- production-quality speech checkpoint
	- production-grade trained neural vocoder release
	- standardized MOS / WER / speaker-similarity benchmark results
	- long-form streaming quality validation

	## Architecture Direction

	Mesko TTS is built around:

	- low-rank Q/K/V projections
	- causal sparse candidate attention
	- local, memory, landmark, and content candidate routing
	- laminar excitatory/inhibitory refinement
	- explicit speaker conditioning
	- explicit duration, pitch, and energy modeling
	- compact acoustic decoding
	- streaming-oriented state/cache structure
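
To make the first few items concrete, here is a minimal sketch in plain Python. It is not the repository's code. It assumes that a low-rank projection factors a `d_model x d_model` weight matrix into a `d_model x rank` factor and a `rank x d_model` factor, and that each query position attends only to a small causal candidate set built from a local window plus evenly spaced landmark positions. The function names, the window size, and the landmark stride are hypothetical, and memory and content routing are left out.

```python
from __future__ import annotations


def low_rank_param_count(d_model: int, rank: int) -> tuple[int, int]:
    """Return (dense, low_rank) weight counts for one square projection.

    Assumption: the low-rank form is W ~= A @ B with
    A of shape (d_model, rank) and B of shape (rank, d_model).
    """
    dense = d_model * d_model
    low_rank = 2 * d_model * rank
    return dense, low_rank


def causal_candidates(t: int, window: int = 4, landmark_stride: int = 8) -> list[int]:
    """Candidate key positions for query position t (hypothetical routing rule).

    Combines a local causal window with evenly spaced landmark positions
    and never returns a position greater than t.
    """
    local = range(max(0, t - window + 1), t + 1)
    landmarks = range(0, t + 1, landmark_stride)
    return sorted(set(local) | set(landmarks))


if __name__ == "__main__":
    dense, low_rank = low_rank_param_count(d_model=512, rank=64)
    print(dense, low_rank)        # 262144 65536
    print(causal_candidates(20))  # [0, 8, 16, 17, 18, 19, 20]
```

With these assumed sizes, the factorized projection stores a quarter of the dense weights, and the candidate set gains one landmark every `landmark_stride` positions instead of one key per position.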

	The intended model path is:

	1. Reference mel -> speaker encoder
	2. Text tokens -> sparse semantic encoder
	3. Semantic states + speaker latent -> FiLM conditioning
	4. Duration predictor -> length regulation
	5. Pitch and energy predictors -> frame-level controls
	6. Frame states + speaker + pitch + energy -> sparse acoustic decoder
	7. Acoustic energy/gating head -> mel spectrogram
	8. Trained neural vocoder -> waveform
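
Steps 3 and 4 of the path above can be illustrated with a small, dependency-free sketch. It assumes that FiLM conditioning applies a per-channel scale and shift to each semantic state, and that length regulation repeats each token state by its predicted integer duration, as in FastSpeech-style models. Here the scale and shift are passed in directly; in a real model they would be predicted from the speaker latent. The helper names `film` and `length_regulate` are hypothetical, and the repository's actual modules may differ.

```python
from __future__ import annotations


def film(states: list[list[float]], gamma: list[float], beta: list[float]) -> list[list[float]]:
    """Apply a per-channel scale (gamma) and shift (beta) to every state vector."""
    return [[g * x + b for x, g, b in zip(state, gamma, beta)] for state in states]


def length_regulate(states: list[list[float]], durations: list[int]) -> list[list[float]]:
    """Expand token-level states to frame level by repeating each state.

    durations[i] is the number of frames assigned to token i;
    values below 1 drop the token entirely.
    """
    if len(states) != len(durations):
        raise ValueError("states and durations must have the same length")
    frames: list[list[float]] = []
    for state, duration in zip(states, durations):
        frames.extend([state] * max(0, int(duration)))
    return frames


if __name__ == "__main__":
    token_states = [[1.0, 2.0], [3.0, 4.0]]
    conditioned = film(token_states, gamma=[2.0, 0.5], beta=[0.0, 1.0])
    print(conditioned)  # [[2.0, 2.0], [6.0, 3.0]]
    frame_states = length_regulate(conditioned, durations=[2, 3])
    print(len(frame_states))  # 5
```

The frame-level sequence produced this way is what the pitch and energy predictors and the acoustic decoder would consume in the later steps.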

	## Weights

	No trained weights are attached to this repository yet.

	Until a full text-to-mel and vocoder training run is complete, this repository should be treated as source code and architecture documentation, not as a finished voice model.

	## Responsible Use

	Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.