mesklintech committed on
Commit 09b5e55 · verified · 1 Parent(s): 4a29310

Add experimental real-data vocoder training artifacts

README.md CHANGED
@@ -20,7 +20,7 @@ datasets:
 
  Mesko TTS is MesklinTech's dedicated text-to-speech research project.
 
- This repository is currently published as an architecture and training-code release. The previous small checkpoint was useful for code smoke tests, but it did not include a properly trained production vocoder and produced noisy waveform audio. We have therefore marked the project as **not production-trained yet** and are continuing training before publishing a listenable release checkpoint.
+ This repository is currently published as an architecture and training-code release with experimental checkpoints. The first small checkpoint was useful for code smoke tests, but it did not include a properly trained production vocoder and produced noisy waveform audio through fallback rendering. We have therefore marked the project as **not production-trained yet** and are continuing training before publishing a polished voice release.
 
  ## Mission
 
@@ -45,11 +45,13 @@ What is available now:
  - sparse acoustic decoder
  - sparse neural vocoder code
  - LJSpeech training scripts and config structure
+ - experimental text-to-mel checkpoint
+ - experimental real-data vocoder checkpoint
 
  What is not ready yet:
 
  - production-quality speech checkpoint
- - trained neural vocoder release
+ - production-grade trained neural vocoder release
  - standardized MOS / WER / speaker-similarity benchmark
  - long-form streaming quality validation
 
@@ -77,13 +79,20 @@ The intended model path is:
  7. Acoustic energy/gating head -> mel spectrogram
  8. Trained neural vocoder -> waveform
 
- ## Why The Checkpoint Was Removed
+ ## Experimental Checkpoints
 
- The previous uploaded checkpoint could generate mel tensors, but audio generated through a fallback Griffin-Lim renderer was noisy. That is not acceptable for a public TTS release. A real TTS release needs a trained waveform vocoder or a high-quality external vocoder path.
+ The current experimental files are:
 
- Until that is ready, this repository should be treated as source code and architecture documentation, not as a finished voice model.
+ - `experimental/text_to_mel_step_8000.safetensors`
+ - `experimental/vocoder_realdata_step_3000.safetensors`
+ - `experimental/vocoder_config.json`
+ - `experimental/vocoder_metrics_step_3000.json`
+ - `samples/hello_world_i_am_fine_vocoder_step3000.wav`
+
+ The vocoder checkpoint was trained on real LJSpeech mel/waveform segments for a short in-session run. It is better aligned than the previous random-data smoke path, but it is still an early experimental checkpoint and should not be treated as final voice quality.
+
+ Until a longer vocoder and text-to-mel training run is complete, this repository should be treated as source code, architecture documentation, and experimental research weights, not as a finished voice model.
 
  ## Responsible Use
 
  Do not use this project to impersonate people, clone voices without consent, commit fraud, or create misleading audio. Voice technology should be built and used with permission, transparency, and care.
-
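
The experimental `.safetensors` files named in the README can be inspected without building the model. The sketch below is a minimal, standard-library-only reader that assumes the usual safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper names `read_safetensors_header` and `summarize` are illustrative and not part of this repository, and the commit does not document the tensor names, so the script simply lists whatever the file contains.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict.

    Assumes the standard layout: an unsigned 64-bit little-endian header
    length, then that many bytes of UTF-8 JSON.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Map tensor name -> (dtype, shape), skipping the optional metadata entry."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"
    }


if __name__ == "__main__":
    # Path taken from the README; adjust to wherever the file was downloaded.
    ckpt = Path("experimental/vocoder_realdata_step_3000.safetensors")
    if ckpt.exists():
        for name, (dtype, shape) in summarize(read_safetensors_header(ckpt)).items():
            print(f"{name}: {dtype} {shape}")
```

If the file on disk is still a Git LFS pointer rather than the downloaded weights, the header parse will fail, which is a quick signal that the large file has not been fetched yet.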
 
experimental/text_to_mel_step_8000.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d4f3632f75d048cdbd37829e33909e99883d47fc99ce418aa5523469d7c86a91
+ size 5835780
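
The three lines above are a Git LFS pointer, not the weights themselves: they record the SHA-256 digest and byte size of the real file. As a rough sketch, a downloaded checkpoint could be checked against its pointer as follows. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this example.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into (oid_hex, size_in_bytes)."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, oid = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algo}")
    return oid, int(fields["size"])


def matches_pointer(data, pointer_text):
    """Check that raw file bytes match the pointer's size and SHA-256 digest."""
    oid, size = parse_lfs_pointer(pointer_text)
    return len(data) == size and hashlib.sha256(data).hexdigest() == oid


# Pointer content copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:d4f3632f75d048cdbd37829e33909e99883d47fc99ce418aa5523469d7c86a91
size 5835780
"""

oid, size = parse_lfs_pointer(POINTER)
print(f"expected digest {oid[:12]}..., expected size {size} bytes")
```

To verify a local copy, read it in binary mode and pass the bytes to `matches_pointer`. At roughly 5.8 MB, reading the whole file into memory is acceptable here.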
experimental/vocoder_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "n_mels": 80,
+   "channels": 64,
+   "residual_layers": 4,
+   "upsample_scales": [
+     8,
+     5,
+     3,
+     2
+   ],
+   "sample_rate": 24000
+ }
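
One way to read this config, assuming the vocoder follows the common design in which each mel frame is upsampled by the product of `upsample_scales` (the commit does not confirm this), is to work out how many waveform samples one mel frame covers. The helper `waveform_length` is illustrative only.

```python
import json
from math import prod

# Config values copied from experimental/vocoder_config.json above.
CONFIG_JSON = """
{
  "n_mels": 80,
  "channels": 64,
  "residual_layers": 4,
  "upsample_scales": [8, 5, 3, 2],
  "sample_rate": 24000
}
"""

config = json.loads(CONFIG_JSON)

# Assumption: total upsampling factor = product of the per-stage scales.
samples_per_frame = prod(config["upsample_scales"])          # 8 * 5 * 3 * 2 = 240
frame_ms = 1000 * samples_per_frame / config["sample_rate"]  # 240 / 24000 s = 10 ms


def waveform_length(n_frames):
    """Expected output samples for n_frames mel frames, under the assumption above."""
    return n_frames * samples_per_frame


print(samples_per_frame, frame_ms, waveform_length(100))  # prints: 240 10.0 24000
```

Under that assumption, the mel features would need to be extracted with a hop of 240 samples at 24 kHz for frames and audio to line up.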
experimental/vocoder_metrics_step_3000.json ADDED
The diff for this file is too large to render. See raw diff
 
experimental/vocoder_realdata_step_3000.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:35c83979648836ca7d1651e598aeafb6c4e8d65a5e15fddf0bd58953a0f4c41e
+ size 968980
samples/hello_world_i_am_fine_vocoder_step3000.wav ADDED
Binary file (57.2 kB). View file
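
The sample's sample rate, bit depth, and duration are not stated in the commit. A small sketch using Python's standard `wave` module could report them, assuming the file is plain PCM WAV (the `wave` module rejects other encodings such as IEEE float). The helper name `wav_info` is hypothetical.

```python
import os
import wave


def wav_info(path):
    """Return (channels, sample_width_bytes, sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return w.getnchannels(), w.getsampwidth(), rate, frames / rate


if __name__ == "__main__":
    # Path taken from the commit; adjust to the local download location.
    sample = "samples/hello_world_i_am_fine_vocoder_step3000.wav"
    if os.path.exists(sample):
        channels, width, rate, seconds = wav_info(sample)
        print(f"{channels} channel(s), {8 * width}-bit, {rate} Hz, {seconds:.2f} s")
```

Comparing the reported rate with `sample_rate` in `experimental/vocoder_config.json` is a quick consistency check.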