mesklintech committed on
Commit 4a29310 · verified · 1 Parent(s): 741d8be

Rename to Mesko TTS and mark checkpoint as training in progress

README.md CHANGED
@@ -6,7 +6,6 @@ library_name: pytorch
6
  pipeline_tag: text-to-speech
7
  tags:
8
  - text-to-speech
9
- - voice-cloning
10
  - streaming-tts
11
  - sparse-attention
12
  - low-rank
@@ -17,54 +16,57 @@ datasets:
17
  - keithito/lj_speech
18
  ---
19
 
20
- # Mesko TTS Model
21
 
22
- Mesko TTS Model is a research-stage text-to-speech model from MesklinTech, built for fast, controllable, streaming-friendly speech generation.
23
 
24
- The project explores a sparse-energy architecture for TTS: low-rank projections, sparse temporal routing, laminar refinement, explicit speaker conditioning, duration/pitch/energy controls, and a compact acoustic decoder. The goal is to move toward world-class, low-latency TTS that can run efficiently and eventually support real-time voice products.
25
 
26
- ## Company Story
27
 
28
- MesklinTech is building practical AI systems from first principles, with a focus on models that are efficient, understandable, and useful outside large-lab infrastructure. Mesko TTS is part of that mission: a speech model designed around speed, controllability, and a clear path toward streaming voice generation.
29
 
30
- Our long-term vision is to create a world-class TTS stack that can support real-time assistants, education tools, accessibility products, creator workflows, and voice interfaces for businesses. We are still early, but the foundation is intentionally different: compact sparse modules, explicit acoustic controls, and a design that prioritizes fast inference rather than only scaling parameter count.
31
 
32
- We are actively looking for collaborators, partners, and aligned supporters who want to help build a faster, more accessible TTS future. To learn more or support the work, visit:
33
 
34
  **https://mesklintech.com**
35
 
36
- ## What Is Included
37
 
38
- This repository contains a compact LJSpeech-trained checkpoint and the source needed to load and inspect the model.
39
 
40
- - `model.safetensors`: model-only weights converted from local checkpoint `step_8000.pt`
41
- - `config.json`: Mesko/BioVoice TTS architecture and training config
42
- - `tokenizer.json`: tokenizer used for the LJSpeech training run
43
- - `training_summary.json`: checkpoint metrics summary
44
- - `training_metrics_step_8000.json`: metrics saved with the selected checkpoint
45
- - `bio_voice_tts/`: TTS model, audio features, datasets, training, inference, streaming, and vocoder code
46
- - `bio_llm/`: shared sparse-energy model utilities used by the project
47
 
48
- The optimizer state and intermediate checkpoints were intentionally removed to keep the repository small.
49
 
50
- ## Checkpoint Metrics
51
 
52
- Selected checkpoint: `step_8000`
 
 
 
53
 
54
- | Metric | Value |
- |---|---:|
- | loss | 1.7827 |
- | mel_loss / mel_mae | 1.6115 |
- | duration_loss | 0.0077 |
- | pitch_loss | 0.3701 |
- | energy_loss | 1.3269 |
- | speaker_cosine proxy | 1.0000 |
62
 
63
- These are internal training metrics from the local run. They are not a substitute for standardized MOS, WER, speaker-verification EER, or human preference evaluation.
64
 
65
- ## Architecture
66
 
67
- The model path is:
68
 
69
  1. Reference mel -> speaker encoder
70
  2. Text tokens -> sparse semantic encoder
@@ -73,72 +75,15 @@ The model path is:
73
  5. Pitch and energy predictors -> frame-level controls
74
  6. Frame states + speaker + pitch + energy -> sparse acoustic decoder
75
  7. Acoustic energy/gating head -> mel spectrogram
76
- 8. Optional sparse neural vocoder -> waveform
77
-
78
- Core design ideas:
79
-
80
- - low-rank Q/K/V projections
81
- - causal sparse candidate attention
82
- - local, memory, landmark, and content candidate routing
83
- - laminar excitatory/inhibitory refinement
84
- - explicit duration, pitch, and energy modeling
85
- - compact acoustic decoding
86
- - streaming-oriented structure
87
-
88
- ## Minimal Loading Example
89
-
90
- ```python
- import json
- import torch
- from safetensors.torch import load_file
-
- from bio_voice_tts import BioVoiceConfig, BioVoiceTTS
-
- def merge_dataclass(instance, payload):
-     for key, value in payload.items():
-         current = getattr(instance, key)
-         if hasattr(current, "__dataclass_fields__") and isinstance(value, dict):
-             merge_dataclass(current, value)
-         else:
-             setattr(instance, key, value)
-     return instance
-
- config = merge_dataclass(BioVoiceConfig(), json.load(open("config.json")))
- model = BioVoiceTTS(config)
- model.load_state_dict(load_file("model.safetensors"), strict=False)
- model.eval()
-
- token_ids = torch.randint(0, config.semantic.vocab_size, (1, 16))
- reference_mel = torch.randn(1, 128, config.audio.n_mels)
-
- with torch.no_grad():
-     outputs = model(token_ids, reference_mel)
-
- print(outputs["mel"].shape)
- ```
119
-
120
- For waveform synthesis, pass `outputs["mel"]` through `bio_voice_tts.vocoder.sparse_vocoder.SparseNeuralVocoder`. A separately trained vocoder checkpoint is recommended for production-quality audio.
121
-
122
- ## Current Status
123
-
124
- Mesko TTS Model is an early research checkpoint, not a finished production voice product.
125
-
126
- Current strengths:
127
 
128
- - compact checkpoint
129
- - fast architecture direction
130
- - sparse, inspectable model internals
131
- - explicit acoustic controls
132
- - source included for research and extension
133
 
134
- Current limitations:
135
 
136
- - trained on LJSpeech-style single-speaker data
137
- - no standardized public MOS/WER/EER benchmark yet
138
- - text-to-mel checkpoint; production waveform quality requires a matching vocoder
139
- - some configuration fields are still scaffolded for future work
140
 
141
  ## Responsible Use
142
 
143
- Do not use this model to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.
144
 
 
6
  pipeline_tag: text-to-speech
7
  tags:
8
  - text-to-speech
 
9
  - streaming-tts
10
  - sparse-attention
11
  - low-rank
 
16
  - keithito/lj_speech
17
  ---
18
 
19
+ # Mesko TTS
20
 
21
+ Mesko TTS is MesklinTech's dedicated text-to-speech research project.
22
 
23
+ This repository is currently published as an architecture and training-code release. The previous small checkpoint was useful for code smoke tests, but it did not include a properly trained production vocoder and produced noisy waveform audio. We have therefore marked the project as **not production-trained yet** and are continuing training before publishing a listenable release checkpoint.
24
 
25
+ ## Mission
26
 
27
+ MesklinTech is building practical AI systems from first principles: compact, efficient, understandable models that can run outside large-lab infrastructure. Mesko TTS is our speech effort: a fast, streaming-oriented TTS stack designed around sparse routing, explicit acoustic control, and low-latency inference.
28
 
29
+ Our goal is to build a world-class fast streaming TTS system for real-time assistants, accessibility products, education tools, creator workflows, and business voice interfaces.
30
 
31
+ We are actively looking for collaborators, partners, and aligned supporters who want to help us move from research prototype to a polished voice system. To learn more or support the work, visit:
32
 
33
  **https://mesklintech.com**
34
 
35
+ ## Current Status
36
 
37
+ Status: **research / training in progress**
38
 
39
+ What is available now:
40
 
41
+ - TTS architecture source code
42
+ - sparse semantic encoder
43
+ - speaker encoder
44
+ - duration, pitch, and energy predictors
45
+ - sparse acoustic decoder
46
+ - sparse neural vocoder code
47
+ - LJSpeech training scripts and config structure
48
 
49
+ What is not ready yet:
50
 
51
+ - production-quality speech checkpoint
52
+ - trained neural vocoder release
53
+ - standardized MOS / WER / speaker-similarity benchmark
54
+ - long-form streaming quality validation
55
 
56
+ ## Architecture Direction
57
 
58
+ Mesko TTS is built around:
59
 
60
+ - low-rank Q/K/V projections
61
+ - causal sparse candidate attention
62
+ - local, memory, landmark, and content candidate routing
63
+ - laminar excitatory/inhibitory refinement
64
+ - explicit speaker conditioning
65
+ - explicit duration, pitch, and energy modeling
66
+ - compact acoustic decoding
67
+ - streaming-oriented state/cache structure
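
Reviewer note on the first bullet: low-rank projections are usually motivated by parameter and compute savings. The arithmetic below is a minimal sketch of that trade-off for a single projection matrix, assuming the common factorization W ≈ A·B. The dimensions (256 with rank 32) and the function names are hypothetical and are not taken from this repository's config or code.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a full d_in x d_out projection (bias ignored)."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights when the projection is factored as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)


# Hypothetical sizes, chosen only for illustration.
d_model, rank = 256, 32
full = dense_params(d_model, d_model)
factored = low_rank_params(d_model, d_model, rank)
ratio = factored / full

print(full, factored, ratio)
```

For a square projection the factored form is smaller whenever the rank is below half the model width, so the saving shrinks quickly as the rank grows.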
68
 
69
+ The intended model path is:
70
 
71
  1. Reference mel -> speaker encoder
72
  2. Text tokens -> sparse semantic encoder
 
75
  5. Pitch and energy predictors -> frame-level controls
76
  6. Frame states + speaker + pitch + energy -> sparse acoustic decoder
77
  7. Acoustic energy/gating head -> mel spectrogram
78
+ 8. Trained neural vocoder -> waveform
79
 
80
+ ## Why The Checkpoint Was Removed
81
 
82
+ The previous uploaded checkpoint could generate mel tensors, but audio generated through a fallback Griffin-Lim renderer was noisy. That is not acceptable for a public TTS release. A real TTS release needs a trained waveform vocoder or a high-quality external vocoder path.
83
 
84
+ Until that is ready, this repository should be treated as source code and architecture documentation, not as a finished voice model.
85
 
86
  ## Responsible Use
87
 
88
+ Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.
89
 
model.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:d06b15977562a2d15dd9fef7e013cea65bb147f436750f4b34fa61b85bc48a8f
3
- size 5835812
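
Reviewer note: the three deleted lines above are a Git LFS pointer, not the weights themselves. As a rough sketch, the pointer can be parsed as space-separated key/value lines to recover the hash algorithm and the size of the removed file. The helper name `parse_lfs_pointer` and the returned field names are hypothetical, introduced only for this illustration.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d06b15977562a2d15dd9fef7e013cea65bb147f436750f4b34fa61b85bc48a8f
size 5835812
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a small dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / (1024 * 1024)
print(pointer["hash_algo"], len(pointer["digest"]), round(size_mib, 2))
```

The size field puts the removed `model.safetensors` at a little under 6 MiB, consistent with the README's earlier description of a compact checkpoint.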
training_metrics_step_8000.json DELETED
@@ -1,12 +0,0 @@
1
- {
2
- "step": 8000,
3
- "metrics": {
4
- "loss": 1.7827284336090088,
5
- "mel_loss": 1.611472487449646,
6
- "duration_loss": 0.007744799368083477,
7
- "pitch_loss": 0.37013131380081177,
8
- "energy_loss": 1.3269374370574951,
9
- "mel_mae": 1.611472487449646,
10
- "speaker_cosine": 1.0
11
- }
12
- }
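
Reviewer note: the deleted metrics file carries the full-precision values behind the table that was removed from the README. The sketch below rounds them to four decimals to show the correspondence; it embeds the JSON from this diff as a string and does not attempt to reconstruct how the total loss was weighted, which the diff does not state.

```python
import json

# Contents of the deleted training_metrics_step_8000.json, copied from the diff.
RAW = """
{
  "step": 8000,
  "metrics": {
    "loss": 1.7827284336090088,
    "mel_loss": 1.611472487449646,
    "duration_loss": 0.007744799368083477,
    "pitch_loss": 0.37013131380081177,
    "energy_loss": 1.3269374370574951,
    "mel_mae": 1.611472487449646,
    "speaker_cosine": 1.0
  }
}
"""

metrics = json.loads(RAW)["metrics"]

# Round to the four decimals used in the old README table.
rounded = {name: round(value, 4) for name, value in metrics.items()}

# mel_loss and mel_mae are the same number here, which matches the
# README table listing them on a single "mel_loss / mel_mae" row.
mel_rows_identical = metrics["mel_loss"] == metrics["mel_mae"]

for name, value in rounded.items():
    print(f"{name}: {value:.4f}")
```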
training_summary.json DELETED
@@ -1,14 +0,0 @@
1
- {
2
- "published_checkpoint": "step_8000",
3
- "step": 8000,
4
- "metrics": {
5
- "loss": 1.7827284336090088,
6
- "mel_loss": 1.611472487449646,
7
- "duration_loss": 0.007744799368083477,
8
- "pitch_loss": 0.37013131380081177,
9
- "energy_loss": 1.3269374370574951,
10
- "mel_mae": 1.611472487449646,
11
- "speaker_cosine": 1.0
12
- },
13
- "note": "model.safetensors contains model weights only; optimizer state was intentionally omitted to keep the Hugging Face repo small."
14
- }