Rename to Mesko TTS and mark checkpoint as training in progress
- README.md +38 -93
- model.safetensors +0 -3
- training_metrics_step_8000.json +0 -12
- training_summary.json +0 -14
README.md
CHANGED
|
@@ -6,7 +6,6 @@ library_name: pytorch
|
|
| 6 |
pipeline_tag: text-to-speech
|
| 7 |
tags:
|
| 8 |
- text-to-speech
|
| 9 |
-
- voice-cloning
|
| 10 |
- streaming-tts
|
| 11 |
- sparse-attention
|
| 12 |
- low-rank
|
|
@@ -17,54 +16,57 @@ datasets:
|
|
| 17 |
- keithito/lj_speech
|
| 18 |
---
|
| 19 |
|
| 20 |
-
# Mesko TTS
|
| 21 |
|
| 22 |
-
Mesko TTS
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
##
|
| 27 |
|
| 28 |
-
MesklinTech is building practical AI systems from first principles,
|
| 29 |
|
| 30 |
-
Our
|
| 31 |
|
| 32 |
-
We are actively looking for collaborators, partners, and aligned supporters who want to help
|
| 33 |
|
| 34 |
**https://mesklintech.com**
|
| 35 |
|
| 36 |
-
##
|
| 37 |
|
| 38 |
-
|
| 39 |
|
| 40 |
-
|
| 41 |
-
- `config.json`: Mesko/BioVoice TTS architecture and training config
|
| 42 |
-
- `tokenizer.json`: tokenizer used for the LJSpeech training run
|
| 43 |
-
- `training_summary.json`: checkpoint metrics summary
|
| 44 |
-
- `training_metrics_step_8000.json`: metrics saved with the selected checkpoint
|
| 45 |
-
- `bio_voice_tts/`: TTS model, audio features, datasets, training, inference, streaming, and vocoder code
|
| 46 |
-
- `bio_llm/`: shared sparse-energy model utilities used by the project
|
| 47 |
|
| 48 |
-
|
| 49 |
|
| 50 |
-
|
| 51 |
|
| 52 |
-
|
| 53 |
|
| 54 |
-
|
| 55 |
-
|---|---:|
|
| 56 |
-
| loss | 1.7827 |
|
| 57 |
-
| mel_loss / mel_mae | 1.6115 |
|
| 58 |
-
| duration_loss | 0.0077 |
|
| 59 |
-
| pitch_loss | 0.3701 |
|
| 60 |
-
| energy_loss | 1.3269 |
|
| 61 |
-
| speaker_cosine proxy | 1.0000 |
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
|
| 66 |
|
| 67 |
-
The model path is:
|
| 68 |
|
| 69 |
1. Reference mel -> speaker encoder
|
| 70 |
2. Text tokens -> sparse semantic encoder
|
|
@@ -73,72 +75,15 @@ The model path is:
|
|
| 73 |
5. Pitch and energy predictors -> frame-level controls
|
| 74 |
6. Frame states + speaker + pitch + energy -> sparse acoustic decoder
|
| 75 |
7. Acoustic energy/gating head -> mel spectrogram
|
| 76 |
-
8.
|
| 77 |
-
|
| 78 |
-
Core design ideas:
|
| 79 |
-
|
| 80 |
-
- low-rank Q/K/V projections
|
| 81 |
-
- causal sparse candidate attention
|
| 82 |
-
- local, memory, landmark, and content candidate routing
|
| 83 |
-
- laminar excitatory/inhibitory refinement
|
| 84 |
-
- explicit duration, pitch, and energy modeling
|
| 85 |
-
- compact acoustic decoding
|
| 86 |
-
- streaming-oriented structure
|
| 87 |
-
|
| 88 |
-
## Minimal Loading Example
|
| 89 |
-
|
| 90 |
-
```python
|
| 91 |
-
import json
|
| 92 |
-
import torch
|
| 93 |
-
from safetensors.torch import load_file
|
| 94 |
-
|
| 95 |
-
from bio_voice_tts import BioVoiceConfig, BioVoiceTTS
|
| 96 |
-
|
| 97 |
-
def merge_dataclass(instance, payload):
|
| 98 |
-
    for key, value in payload.items():
|
| 99 |
-
        current = getattr(instance, key)
|
| 100 |
-
        if hasattr(current, "__dataclass_fields__") and isinstance(value, dict):
|
| 101 |
-
            merge_dataclass(current, value)
|
| 102 |
-
        else:
|
| 103 |
-
            setattr(instance, key, value)
|
| 104 |
-
    return instance
|
| 105 |
-
|
| 106 |
-
config = merge_dataclass(BioVoiceConfig(), json.load(open("config.json")))
|
| 107 |
-
model = BioVoiceTTS(config)
|
| 108 |
-
model.load_state_dict(load_file("model.safetensors"), strict=False)
|
| 109 |
-
model.eval()
|
| 110 |
-
|
| 111 |
-
token_ids = torch.randint(0, config.semantic.vocab_size, (1, 16))
|
| 112 |
-
reference_mel = torch.randn(1, 128, config.audio.n_mels)
|
| 113 |
-
|
| 114 |
-
with torch.no_grad():
|
| 115 |
-
    outputs = model(token_ids, reference_mel)
|
| 116 |
-
|
| 117 |
-
print(outputs["mel"].shape)
|
| 118 |
-
```
|
| 119 |
-
|
| 120 |
-
For waveform synthesis, pass `outputs["mel"]` through `bio_voice_tts.vocoder.sparse_vocoder.SparseNeuralVocoder`. A separately trained vocoder checkpoint is recommended for production-quality audio.
|
| 121 |
-
|
| 122 |
-
## Current Status
|
| 123 |
-
|
| 124 |
-
Mesko TTS Model is an early research checkpoint, not a finished production voice product.
|
| 125 |
-
|
| 126 |
-
Current strengths:
|
| 127 |
|
| 128 |
-
|
| 129 |
-
- fast architecture direction
|
| 130 |
-
- sparse, inspectable model internals
|
| 131 |
-
- explicit acoustic controls
|
| 132 |
-
- source included for research and extension
|
| 133 |
|
| 134 |
-
|
| 135 |
|
| 136 |
-
|
| 137 |
-
- no standardized public MOS/WER/EER benchmark yet
|
| 138 |
-
- text-to-mel checkpoint; production waveform quality requires a matching vocoder
|
| 139 |
-
- some configuration fields are still scaffolded for future work
|
| 140 |
|
| 141 |
## Responsible Use
|
| 142 |
|
| 143 |
-
Do not use this
|
| 144 |
|
|
|
|
| 6 |
pipeline_tag: text-to-speech
|
| 7 |
tags:
|
| 8 |
- text-to-speech
|
|
|
|
| 9 |
- streaming-tts
|
| 10 |
- sparse-attention
|
| 11 |
- low-rank
|
|
|
|
| 16 |
- keithito/lj_speech
|
| 17 |
---
|
| 18 |
|
| 19 |
+
# Mesko TTS
|
| 20 |
|
| 21 |
+
Mesko TTS is MesklinTech's dedicated text-to-speech research project.
|
| 22 |
|
| 23 |
+
This repository is currently published as an architecture and training-code release. The previous small checkpoint was useful for code smoke tests, but it did not include a properly trained production vocoder and produced noisy waveform audio. We have therefore marked the project as **not production-trained yet** and are continuing training before publishing a listenable release checkpoint.
|
| 24 |
|
| 25 |
+
## Mission
|
| 26 |
|
| 27 |
+
MesklinTech is building practical AI systems from first principles: compact, efficient, understandable models that can run outside large-lab infrastructure. Mesko TTS is our speech effort: a fast, streaming-oriented TTS stack designed around sparse routing, explicit acoustic control, and low-latency inference.
|
| 28 |
|
| 29 |
+
Our goal is to build a world-class fast streaming TTS system for real-time assistants, accessibility products, education tools, creator workflows, and business voice interfaces.
|
| 30 |
|
| 31 |
+
We are actively looking for collaborators, partners, and aligned supporters who want to help us move from research prototype to a polished voice system. To learn more or support the work, visit:
|
| 32 |
|
| 33 |
**https://mesklintech.com**
|
| 34 |
|
| 35 |
+
## Current Status
|
| 36 |
|
| 37 |
+
Status: **research / training in progress**
|
| 38 |
|
| 39 |
+
What is available now:
|
| 40 |
|
| 41 |
+
- TTS architecture source code
|
| 42 |
+
- sparse semantic encoder
|
| 43 |
+
- speaker encoder
|
| 44 |
+
- duration, pitch, and energy predictors
|
| 45 |
+
- sparse acoustic decoder
|
| 46 |
+
- sparse neural vocoder code
|
| 47 |
+
- LJSpeech training scripts and config structure
|
| 48 |
|
| 49 |
+
What is not ready yet:
|
| 50 |
|
| 51 |
+
- production-quality speech checkpoint
|
| 52 |
+
- trained neural vocoder release
|
| 53 |
+
- standardized MOS / WER / speaker-similarity benchmark
|
| 54 |
+
- long-form streaming quality validation
|
| 55 |
|
| 56 |
+
## Architecture Direction
|
| 57 |
|
| 58 |
+
Mesko TTS is built around:
|
| 59 |
|
| 60 |
+
- low-rank Q/K/V projections
|
| 61 |
+
- causal sparse candidate attention
|
| 62 |
+
- local, memory, landmark, and content candidate routing
|
| 63 |
+
- laminar excitatory/inhibitory refinement
|
| 64 |
+
- explicit speaker conditioning
|
| 65 |
+
- explicit duration, pitch, and energy modeling
|
| 66 |
+
- compact acoustic decoding
|
| 67 |
+
- streaming-oriented state/cache structure
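To make the "low-rank Q/K/V projections" item concrete, here is a minimal sketch of the weight-count arithmetic behind factorizing a projection through a narrow rank. The sizes `d_model` and `rank` are hypothetical values chosen for illustration; they are not taken from `config.json`, and the helper names are not part of the project code.

```python
def dense_projection_params(d_model: int) -> int:
    """Weights in one full d_model x d_model projection (bias ignored)."""
    return d_model * d_model


def low_rank_projection_params(d_model: int, rank: int) -> int:
    """Weights in a factorized projection: (d_model x rank) then (rank x d_model)."""
    return d_model * rank + rank * d_model


# Hypothetical sizes for illustration only; not read from the project's config.
d_model, rank = 256, 32

# Three projections (Q, K, V) in each case.
dense_qkv = 3 * dense_projection_params(d_model)
low_rank_qkv = 3 * low_rank_projection_params(d_model, rank)

print(dense_qkv, low_rank_qkv, dense_qkv / low_rank_qkv)
```

With these assumed sizes the factorized form uses a quarter of the weights. The saving grows as `rank` shrinks relative to `d_model`, at the cost of limiting what each projection can represent.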
|
| 68 |
|
| 69 |
+
The intended model path is:
|
| 70 |
|
| 71 |
1. Reference mel -> speaker encoder
|
| 72 |
2. Text tokens -> sparse semantic encoder
|
|
|
|
| 75 |
5. Pitch and energy predictors -> frame-level controls
|
| 76 |
6. Frame states + speaker + pitch + energy -> sparse acoustic decoder
|
| 77 |
7. Acoustic energy/gating head -> mel spectrogram
|
| 78 |
+
8. Trained neural vocoder -> waveform
|
| 79 |
|
| 80 |
+
## Why The Checkpoint Was Removed
|
| 81 |
|
| 82 |
+
The previous uploaded checkpoint could generate mel tensors, but audio generated through a fallback Griffin-Lim renderer was noisy. That is not acceptable for a public TTS release. A real TTS release needs a trained waveform vocoder or a high-quality external vocoder path.
|
| 83 |
|
| 84 |
+
Until that is ready, this repository should be treated as source code and architecture documentation, not as a finished voice model.
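One general reason magnitude-only rendering is hard: a spectrogram-style representation stores magnitudes, and Griffin-Lim has to estimate the missing phase. The toy sketch below is not the project's renderer and does not reproduce its noise. Under that stated assumption, it only shows that two different frames can share identical DFT magnitudes. The frame values are arbitrary.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return |X[k]| for the discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


# Toy frame (arbitrary values) and a circularly shifted copy of it.
frame = [0.0, 1.0, 0.5, -0.25, -1.0, 0.3, 0.0, 0.2]
shifted = frame[3:] + frame[:3]

mags_frame = dft_magnitudes(frame)
mags_shifted = dft_magnitudes(shifted)

# A circular shift only changes phase, so the magnitudes agree
# even though the two waveforms are different.
same_magnitudes = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(mags_frame, mags_shifted)
)
print(same_magnitudes, frame != shifted)
```

Because magnitude alone does not determine the waveform, a renderer must either estimate phase iteratively or learn the mapping, which is the role a trained vocoder plays in step 8 of the model path.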
|
| 85 |
|
| 86 |
## Responsible Use
|
| 87 |
|
| 88 |
+
Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.
|
| 89 |
|
model.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:d06b15977562a2d15dd9fef7e013cea65bb147f436750f4b34fa61b85bc48a8f
|
| 3 |
-
size 5835812
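The three deleted lines are a Git LFS pointer, not the weights themselves: the pointer records the spec version, a content hash, and the byte size of the real file. As a small sketch, the pointer can be read as `key value` pairs. The helper name `parse_lfs_pointer` is hypothetical and is not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into a dict entry."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer contents copied from the deleted model.safetensors entry.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d06b15977562a2d15dd9fef7e013cea65bb147f436750f4b34fa61b85bc48a8f
size 5835812
"""

pointer = parse_lfs_pointer(pointer_text)
hash_algorithm, _, digest = pointer["oid"].partition(":")
size_mib = int(pointer["size"]) / (1024 * 1024)

print(hash_algorithm, len(digest), f"{size_mib:.2f} MiB")
```

The size field shows the removed checkpoint was under 6 MiB, consistent with the README's description of it as a small checkpoint used for code smoke tests.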
|
training_metrics_step_8000.json
DELETED
|
@@ -1,12 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"step": 8000,
|
| 3 |
-
"metrics": {
|
| 4 |
-
"loss": 1.7827284336090088,
|
| 5 |
-
"mel_loss": 1.611472487449646,
|
| 6 |
-
"duration_loss": 0.007744799368083477,
|
| 7 |
-
"pitch_loss": 0.37013131380081177,
|
| 8 |
-
"energy_loss": 1.3269374370574951,
|
| 9 |
-
"mel_mae": 1.611472487449646,
|
| 10 |
-
"speaker_cosine": 1.0
|
| 11 |
-
}
|
| 12 |
-
}
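As a cross-check between this deleted file and the metrics table removed from README.md, the sketch below parses the same JSON payload and formats each metric to four decimal places. It is an illustration only; nothing in the diff indicates the project generated the table this way.

```python
import json

# Payload copied from the deleted training_metrics_step_8000.json.
metrics_json = """
{
  "step": 8000,
  "metrics": {
    "loss": 1.7827284336090088,
    "mel_loss": 1.611472487449646,
    "duration_loss": 0.007744799368083477,
    "pitch_loss": 0.37013131380081177,
    "energy_loss": 1.3269374370574951,
    "mel_mae": 1.611472487449646,
    "speaker_cosine": 1.0
  }
}
"""

payload = json.loads(metrics_json)
metrics = payload["metrics"]

# One markdown table row per metric, rounded to four decimals.
rows = [f"| {name} | {value:.4f} |" for name, value in metrics.items()]
print("\n".join(rows))

# mel_loss and mel_mae are stored as the same number in this file.
print(metrics["mel_loss"] == metrics["mel_mae"])
```

The rounded values agree with the removed README rows. Since `mel_loss` and `mel_mae` are identical, that also accounts for the README combining them into a single `mel_loss / mel_mae` row.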
|
training_summary.json
DELETED
|
@@ -1,14 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"published_checkpoint": "step_8000",
|
| 3 |
-
"step": 8000,
|
| 4 |
-
"metrics": {
|
| 5 |
-
"loss": 1.7827284336090088,
|
| 6 |
-
"mel_loss": 1.611472487449646,
|
| 7 |
-
"duration_loss": 0.007744799368083477,
|
| 8 |
-
"pitch_loss": 0.37013131380081177,
|
| 9 |
-
"energy_loss": 1.3269374370574951,
|
| 10 |
-
"mel_mae": 1.611472487449646,
|
| 11 |
-
"speaker_cosine": 1.0
|
| 12 |
-
},
|
| 13 |
-
"note": "model.safetensors contains model weights only; optimizer state was intentionally omitted to keep the Hugging Face repo small."
|
| 14 |
-
}