Rename to Mesko TTS and mark checkpoint as training in progress

Files changed (4)

README.md +38 -93
model.safetensors +0 -3
training_metrics_step_8000.json +0 -12
training_summary.json +0 -14

README.md CHANGED

@@ -6,7 +6,6 @@ library_name: pytorch
 pipeline_tag: text-to-speech
 tags:
 - text-to-speech
-- voice-cloning
 - streaming-tts
 - sparse-attention
 - low-rank
@@ -17,54 +16,57 @@ datasets:
 - keithito/lj_speech
 ---
-# Mesko TTS Model
-Mesko TTS Model is a research-stage text-to-speech model from MesklinTech, built for fast, controllable, streaming-friendly speech generation.
-The project explores a sparse-energy architecture for TTS: low-rank projections, sparse temporal routing, laminar refinement, explicit speaker conditioning, duration/pitch/energy controls, and a compact acoustic decoder. The goal is to move toward world-class, low-latency TTS that can run efficiently and eventually support real-time voice products.
-## Company Story
-MesklinTech is building practical AI systems from first principles, with a focus on models that are efficient, understandable, and useful outside large-lab infrastructure. Mesko TTS is part of that mission: a speech model designed around speed, controllability, and a clear path toward streaming voice generation.
-Our long-term vision is to create a world-class TTS stack that can support real-time assistants, education tools, accessibility products, creator workflows, and voice interfaces for businesses. We are still early, but the foundation is intentionally different: compact sparse modules, explicit acoustic controls, and a design that prioritizes fast inference rather than only scaling parameter count.
-We are actively looking for collaborators, partners, and aligned supporters who want to help build a faster, more accessible TTS future. To learn more or support the work, visit:
 **https://mesklintech.com**
-## What Is Included
-This repository contains a compact LJSpeech-trained checkpoint and the source needed to load and inspect the model.
-- `model.safetensors`: model-only weights converted from local checkpoint `step_8000.pt`
-- `config.json`: Mesko/BioVoice TTS architecture and training config
-- `tokenizer.json`: tokenizer used for the LJSpeech training run
-- `training_summary.json`: checkpoint metrics summary
-- `training_metrics_step_8000.json`: metrics saved with the selected checkpoint
-- `bio_voice_tts/`: TTS model, audio features, datasets, training, inference, streaming, and vocoder code
-- `bio_llm/`: shared sparse-energy model utilities used by the project
-The optimizer state and intermediate checkpoints were intentionally removed to keep the repository small.
-## Checkpoint Metrics
-Selected checkpoint: `step_8000`
-| Metric | Value |
-|---|---:|
-| loss | 1.7827 |
-| mel_loss / mel_mae | 1.6115 |
-| duration_loss | 0.0077 |
-| pitch_loss | 0.3701 |
-| energy_loss | 1.3269 |
-| speaker_cosine proxy | 1.0000 |
-These are internal training metrics from the local run. They are not a substitute for standardized MOS, WER, speaker-verification EER, or human preference evaluation.
-## Architecture
-The model path is:
 1. Reference mel -> speaker encoder
 2. Text tokens -> sparse semantic encoder
@@ -73,72 +75,15 @@ The model path is:
 5. Pitch and energy predictors -> frame-level controls
 6. Frame states + speaker + pitch + energy -> sparse acoustic decoder
 7. Acoustic energy/gating head -> mel spectrogram
-8. Optional sparse neural vocoder -> waveform
-Core design ideas:
-- low-rank Q/K/V projections
-- causal sparse candidate attention
-- local, memory, landmark, and content candidate routing
-- laminar excitatory/inhibitory refinement
-- explicit duration, pitch, and energy modeling
-- compact acoustic decoding
-- streaming-oriented structure
-## Minimal Loading Example
-```python
-import json
-import torch
-from safetensors.torch import load_file
-from bio_voice_tts import BioVoiceConfig, BioVoiceTTS
-def merge_dataclass(instance, payload):
-    for key, value in payload.items():
-        current = getattr(instance, key)
-        if hasattr(current, "__dataclass_fields__") and isinstance(value, dict):
-            merge_dataclass(current, value)
-        else:
-            setattr(instance, key, value)
-    return instance
-config = merge_dataclass(BioVoiceConfig(), json.load(open("config.json")))
-model = BioVoiceTTS(config)
-model.load_state_dict(load_file("model.safetensors"), strict=False)
-model.eval()
-token_ids = torch.randint(0, config.semantic.vocab_size, (1, 16))
-reference_mel = torch.randn(1, 128, config.audio.n_mels)
-with torch.no_grad():
-    outputs = model(token_ids, reference_mel)
-print(outputs["mel"].shape)
-```
-For waveform synthesis, pass `outputs["mel"]` through `bio_voice_tts.vocoder.sparse_vocoder.SparseNeuralVocoder`. A separately trained vocoder checkpoint is recommended for production-quality audio.
-## Current Status
-Mesko TTS Model is an early research checkpoint, not a finished production voice product.
-Current strengths:
-- compact checkpoint
-- fast architecture direction
-- sparse, inspectable model internals
-- explicit acoustic controls
-- source included for research and extension
-Current limitations:
-- trained on LJSpeech-style single-speaker data
-- no standardized public MOS/WER/EER benchmark yet
-- text-to-mel checkpoint; production waveform quality requires a matching vocoder
-- some configuration fields are still scaffolded for future work
 ## Responsible Use
-Do not use this model to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.

 pipeline_tag: text-to-speech
 tags:
 - text-to-speech
 - streaming-tts
 - sparse-attention
 - low-rank
 - keithito/lj_speech
 ---
+# Mesko TTS
+Mesko TTS is MesklinTech's dedicated text-to-speech research project.
+This repository is currently published as an architecture and training-code release. The previous small checkpoint was useful for code smoke tests, but it did not include a properly trained production vocoder and produced noisy waveform audio. We have therefore marked the project as **not production-trained yet** and are continuing training before publishing a listenable release checkpoint.
+## Mission
+MesklinTech is building practical AI systems from first principles: compact, efficient, understandable models that can run outside large-lab infrastructure. Mesko TTS is our speech effort: a fast, streaming-oriented TTS stack designed around sparse routing, explicit acoustic control, and low-latency inference.
+Our goal is to build a world-class fast streaming TTS system for real-time assistants, accessibility products, education tools, creator workflows, and business voice interfaces.
+We are actively looking for collaborators, partners, and aligned supporters who want to help us move from research prototype to a polished voice system. To learn more or support the work, visit:
 **https://mesklintech.com**
+## Current Status
+Status: **research / training in progress**
+What is available now:
+- TTS architecture source code
+- sparse semantic encoder
+- speaker encoder
+- duration, pitch, and energy predictors
+- sparse acoustic decoder
+- sparse neural vocoder code
+- LJSpeech training scripts and config structure
+What is not ready yet:
+- production-quality speech checkpoint
+- trained neural vocoder release
+- standardized MOS / WER / speaker-similarity benchmark
+- long-form streaming quality validation
+## Architecture Direction
+Mesko TTS is built around:
+- low-rank Q/K/V projections
+- causal sparse candidate attention
+- local, memory, landmark, and content candidate routing
+- laminar excitatory/inhibitory refinement
+- explicit speaker conditioning
+- explicit duration, pitch, and energy modeling
+- compact acoustic decoding
+- streaming-oriented state/cache structure
+The intended model path is:
 1. Reference mel -> speaker encoder
 2. Text tokens -> sparse semantic encoder
 5. Pitch and energy predictors -> frame-level controls
 6. Frame states + speaker + pitch + energy -> sparse acoustic decoder
 7. Acoustic energy/gating head -> mel spectrogram
+8. Trained neural vocoder -> waveform
+## Why The Checkpoint Was Removed
+The previous uploaded checkpoint could generate mel tensors, but audio generated through a fallback Griffin-Lim renderer was noisy. That is not acceptable for a public TTS release. A real TTS release needs a trained waveform vocoder or a high-quality external vocoder path.
+Until that is ready, this repository should be treated as source code and architecture documentation, not as a finished voice model.
 ## Responsible Use
+Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.

model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d06b15977562a2d15dd9fef7e013cea65bb147f436750f4b34fa61b85bc48a8f
-size 5835812
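
Note on the hunk above: the three deleted lines are a Git LFS pointer file (spec v1), not the weights themselves. The pointer records only a hash of the stored object and its size in bytes. As a minimal sketch, the pointer can be parsed as follows. The helper name `parse_lfs_pointer` and the returned keys are illustrative assumptions and not part of this repository or of any Git LFS tooling.

```python
# Hypothetical helper (not part of the repo): parse a Git LFS v1 pointer.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:d06b15977562a2d15dd9fef7e013cea65bb147f436750f4b34fa61b85bc48a8f
size 5835812
"""


def parse_lfs_pointer(text):
    """Turn each 'key value' line of an LFS pointer into a dict entry."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid field has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algo"], pointer["size_bytes"])
print(f"{pointer['size_bytes'] / 2**20:.2f} MiB")
```

The `size` field shows that the removed checkpoint was about 5.8 MB, which matches the README's earlier description of it as a compact checkpoint.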

training_metrics_step_8000.json DELETED

@@ -1,12 +0,0 @@
-{
-  "step": 8000,
-  "metrics": {
-    "loss": 1.7827284336090088,
-    "mel_loss": 1.611472487449646,
-    "duration_loss": 0.007744799368083477,
-    "pitch_loss": 0.37013131380081177,
-    "energy_loss": 1.3269374370574951,
-    "mel_mae": 1.611472487449646,
-    "speaker_cosine": 1.0
-  }
-}
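
Note on the hunk above: the metrics table removed from README.md in this commit appears to be the values from this deleted JSON rounded to four decimal places. The sketch below checks that. `README_TABLE` is transcribed by hand from the removed table rows, and the variable names are illustrative assumptions.

```python
import json

# Contents of the deleted training_metrics_step_8000.json, copied from the diff.
METRICS_JSON = """
{
  "step": 8000,
  "metrics": {
    "loss": 1.7827284336090088,
    "mel_loss": 1.611472487449646,
    "duration_loss": 0.007744799368083477,
    "pitch_loss": 0.37013131380081177,
    "energy_loss": 1.3269374370574951,
    "mel_mae": 1.611472487449646,
    "speaker_cosine": 1.0
  }
}
"""

# Values as printed in the removed README table (hand-transcribed).
README_TABLE = {
    "loss": "1.7827",
    "mel_loss": "1.6115",
    "duration_loss": "0.0077",
    "pitch_loss": "0.3701",
    "energy_loss": "1.3269",
    "speaker_cosine": "1.0000",
}

metrics = json.loads(METRICS_JSON)["metrics"]

# Collect any metric whose 4-decimal rendering differs from the README value.
mismatches = {
    name: (shown, f"{metrics[name]:.4f}")
    for name, shown in README_TABLE.items()
    if f"{metrics[name]:.4f}" != shown
}
print("mismatches:", mismatches)

# The README listed 'mel_loss / mel_mae' as one row; the JSON stores both keys.
print("mel_loss == mel_mae:", metrics["mel_loss"] == metrics["mel_mae"])
```

The JSON stores `mel_loss` and `mel_mae` with identical values, which is why the README reported them in a single row.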

training_summary.json DELETED

@@ -1,14 +0,0 @@
-{
-  "published_checkpoint": "step_8000",
-  "step": 8000,
-  "metrics": {
-    "loss": 1.7827284336090088,
-    "mel_loss": 1.611472487449646,
-    "duration_loss": 0.007744799368083477,
-    "pitch_loss": 0.37013131380081177,
-    "energy_loss": 1.3269374370574951,
-    "mel_mae": 1.611472487449646,
-    "speaker_cosine": 1.0
-  },
-  "note": "model.safetensors contains model weights only; optimizer state was intentionally omitted to keep the Hugging Face repo small."
-}