IsGarrido committed on
Commit 349743b · verified · 1 Parent(s): 10e0f12

Upload supertonic version gender_classifier_en_modernbert_base

.gitignore ADDED
@@ -0,0 +1,4 @@
+ window.json
+ filter_bank.json
+ style_extractor.onnx
+ *.npy
LICENSE ADDED
@@ -0,0 +1,209 @@
+ BigScience Open RAIL-M License
+ dated August 18, 2022
+
+ Section I: PREAMBLE
+
+ This Open RAIL-M License was created by BigScience, a collaborative open innovation project aimed at
+ the responsible development and use of large multilingual datasets and Large Language Models
+ (“LLMs”). While a similar license was originally designed for the BLOOM model, we decided to adapt it
+ and create this license in order to propose a general open and responsible license applicable to other
+ machine learning based AI models (e.g. multimodal generative models).
+ In short, this license strives for both the open and responsible downstream use of the accompanying
+ model. When it comes to the open character, we took inspiration from open source permissive licenses
+ regarding the grant of IP rights. Referring to the downstream responsible use, we added use-based
+ restrictions not permitting the use of the Model in very specific scenarios, in order for the licensor to be
+ able to enforce the license in case potential misuses of the Model may occur. Even though downstream
+ derivative versions of the model could be released under different licensing terms, the latter will always
+ have to include - at minimum - the same use-based restrictions as the ones in the original license (this
+ license).
+ The development and use of artificial intelligence (“AI”), does not come without concerns. The world has
+ witnessed how AI techniques may, in some instances, become risky for the public in general. These risks
+ come in many forms, from racial discrimination to the misuse of sensitive information.
+ BigScience believes in the intersection between open and responsible AI development; thus, this License
+ aims to strike a balance between both in order to enable responsible open-science in the field of AI.
+ This License governs the use of the model (and its derivatives) and is informed by the model card
+ associated with the model.
+
+ NOW THEREFORE, You and Licensor agree as follows:
+
+ 1. Definitions
+ (a) "License" means the terms and conditions for use, reproduction, and Distribution as defined in
+ this document.
+ (b) “Data” means a collection of information and/or content extracted from the dataset used with the
+ Model, including to train, pretrain, or otherwise evaluate the Model. The Data is not licensed under
+ this License.
+ (c) “Output” means the results of operating a Model as embodied in informational content resulting
+ therefrom.
+ (d) “Model” means any accompanying machine-learning based assemblies (including checkpoints),
+ consisting of learnt weights, parameters (including optimizer states), corresponding to the model
+ architecture as embodied in the Complementary Material, that have been trained or tuned, in whole or
+ in part on the Data, using the Complementary Material.
+ (e) “Derivatives of the Model” means all modifications to the Model, works based on the Model, or any
+ other model which is created or initialized by transfer of patterns of the weights, parameters,
+ activations or output of the Model, to the other model, in order to cause the other model to perform
+ similarly to the Model, including - but not limited to - distillation methods entailing the use of
+ intermediate data representations or methods based on the generation of synthetic data by the Model
+ for training the other model.
+ (f) “Complementary Material” means the accompanying source code and scripts used to define,
+ run, load, benchmark or evaluate the Model, and used to prepare data for training or evaluation, if
+ any. This includes any accompanying documentation, tutorials, examples, etc, if any.
+ (g) “Distribution” means any transmission, reproduction, publication or other sharing of the Model or
+ Derivatives of the Model to a third party, including providing the Model as a hosted service made
+ available by electronic or other remote means - e.g. API-based or web access.
+ (h) “Licensor” means the copyright owner or entity authorized by the copyright owner that is
+ granting the License, including the persons or entities that may have rights in the Model and/or
+ distributing the Model.
+ (i) "You" (or "Your") means an individual or Legal Entity exercising permissions granted by this
+ License and/or making use of the Model for whichever purpose and in any field of use, including
+ usage of the Model in an end-use application - e.g. chatbot, translator, image generator.
+ (j) “Third Parties” means individuals or legal entities that are not under common control with
+ Licensor or You.
+ (k) "Contribution" means any work of authorship, including the original version of the Model and
+ any modifications or additions to that Model or Derivatives of the Model thereof, that is
+ intentionally submitted to Licensor for inclusion in the Model by the copyright owner or by an
+ individual or Legal Entity authorized to submit on behalf of the copyright owner. For the
+ purposes of this definition,
+ “submitted” means any form of electronic, verbal, or written
+ communication sent to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems, and issue tracking
+ systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and
+ improving the Model, but excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+ (l) "Contributor" means Licensor and any individual or Legal Entity on behalf of whom a
+ Contribution has been received by Licensor and subsequently incorporated within the Model.
+
+
+ Section II: INTELLECTUAL PROPERTY RIGHTS
+
+ Both copyright and patent grants apply to the Model, Derivatives of the Model and Complementary
+ Material. The Model and Derivatives of the Model are subject to additional terms as described in Section III.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor
+ hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the
+ Complementary Material, the Model, and Derivatives of the Model.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of this License and where and as
+ applicable, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge,
+ royalty-free, irrevocable (except as stated in this paragraph) patent license to make, have made, use, offer
+ to sell, sell, import, and otherwise transfer the Model and the Complementary Material, where such
+ license applies only to those patent claims licensable by such Contributor that are necessarily infringed by
+ their Contribution(s) alone or by combination of their Contribution(s) with the Model to which such
+ Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim
+ or counterclaim in a lawsuit) alleging that the Model and/or Complementary Material or a Contribution
+ incorporated within the Model and/or Complementary Material constitutes direct or contributory patent
+ infringement, then any patent licenses granted to You under this License for the Model and/or Work shall
+ terminate as of the date such litigation is asserted or filed.
+ Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION
+
+ 4. Distribution and Redistribution. You may host for Third Party remote access purposes (e.g.
+ software-as-a-service), reproduce and distribute copies of the Model or Derivatives of the Model thereof
+ in any medium, with or without modifications, provided that You meet the following conditions:
+
+ a. Use-based restrictions as referenced in paragraph 5 MUST be included as an enforceable provision
+ by You in any type of legal agreement (e.g. a license) governing the use and/or distribution of the
+ Model or Derivatives of the Model, and You shall give notice to subsequent users You Distribute to,
+ that the Model or Derivatives of the Model are subject to paragraph 5. This provision does not apply
+ to the use of Complementary Material.
+
+ b. You must give any Third Party recipients of the Model or Derivatives of the Model a copy of this
+ License;
+
+ c. You must cause any modified files to carry prominent notices stating that You changed the files;
+
+ d. You must retain all copyright, patent, trademark, and attribution notices excluding those notices
+ that do not pertain to any part of the Model, Derivatives of the Model.
+ You may add Your own copyright statement to Your modifications and may provide additional or
+ different license terms and conditions - respecting paragraph 4.a.
+ - for use, reproduction, or Distribution
+ of Your modifications, or for any such Derivatives of the Model as a whole, provided Your use,
+ reproduction, and Distribution of the Model otherwise complies with the conditions stated in this License.
+
+ 5. Use-based restrictions. The restrictions set forth in Attachment A are considered Use-based restrictions.
+ Therefore You cannot use the Model and the Derivatives of the Model for the specified restricted uses. You
+ may use the Model subject to this License, including only for lawful purposes and in accordance with the
+ License. Use may include creating any content with, finetuning, updating, running, training, evaluating and/or
+ reparametrizing the Model. You shall require all of Your users who use the Model or a Derivative of the Model
+ to comply with the terms of this paragraph (paragraph 5).
+
+ 6. The Output You Generate. Except as set forth herein, Licensor claims no rights in the Output You
+ generate using the Model. You are accountable for the Output you generate and its subsequent uses. No
+ use of the output can contravene any provision as stated in the License.
+
+ Section IV: OTHER PROVISIONS
+
+ 7. Updates and Runtime Restrictions. To the maximum extent permitted by law, Licensor reserves the
+ right to restrict (remotely or otherwise) usage of the Model in violation of this License, update the Model
+ through electronic means, or modify the Output of the Model based on updates. You shall undertake
+ reasonable efforts to use the latest version of the Model.
+
+ 8. Trademarks and related. Nothing in this License permits You to make use of Licensors’ trademarks,
+ trade names, logos or to otherwise suggest endorsement or misrepresent the relationship between the
+ parties; and any rights not expressly granted herein are reserved by the Licensors.
+
+ 9. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides
+ the Model and the Complementary Material (and each Contributor provides its Contributions) on an "AS
+ IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
+ including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT,
+ MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for
+ determining the appropriateness of using or redistributing the Model, Derivatives of the Model, and the
+ Complementary Material and assume any risks associated with Your exercise of permissions under this
+ License.
+
+ 10. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence),
+ contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or
+ agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect,
+ special, incidental, or consequential damages of any character arising as a result of this License or out of
+ the use or inability to use the Model and the Complementary Material (including but not limited to
+ damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other
+ commercial damages or losses), even if such Contributor has been advised of the possibility of such
+ damages.
+
+ 11. Accepting Warranty or Additional Liability. While redistributing the Model, Derivatives of the
+ Model and the Complementary Material thereof, You may choose to offer, and charge a fee for, acceptance
+ of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License.
+ However, in accepting such obligations, You may act only on Your own behalf and on Your sole
+ responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and
+ hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor
+ by reason of your accepting any such warranty or additional liability.
+
+ 12. If any provision of this License is held to be invalid, illegal or unenforceable, the remaining
+ provisions shall be unaffected thereby and remain valid as if such provision had not been set forth herein.
+
+ END OF TERMS AND CONDITIONS
+
+ Attachment A
+
+ Use Restrictions
+
+ You agree not to use the Model or Derivatives of the Model:
+ (a) In any way that violates any applicable national, federal, state, local or international law
+ or regulation;
+ (b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any
+ way;
+ (c) To generate or disseminate verifiably false information and/or content with the purpose of
+ harming others;
+ (d) To generate or disseminate personal identifiable information that can be used to harm an
+ individual;
+ (e) To generate or disseminate information and/or content (e.g. images, code, posts, articles),
+ and place the information and/or content in any context (e.g. bot generating tweets)
+ without expressly and intelligibly disclaiming that the information and/or content is
+ machine generated;
+ (f) To defame, disparage or otherwise harass others;
+ (g) To impersonate or attempt to impersonate (e.g. deepfakes) others without their consent;
+ (h) For fully automated decision making that adversely impacts an individual’s legal rights or
+ otherwise creates or modifies a binding, enforceable obligation;
+ (i) For any use intended to or which has the effect of discriminating against or harming
+ individuals or groups based on online or offline social behavior or known or predicted
+ personal or personality characteristics;
+ (j) To exploit any of the vulnerabilities of a specific group of persons based on their age,
+ social, physical or mental characteristics, in order to materially distort the behavior of a
+ person pertaining to that group in a manner that causes or is likely to cause that person or
+ another person physical or psychological harm;
+ (k) For any use intended to or which has the effect of discriminating against individuals or
+ groups based on legally protected characteristics or categories;
+ (l) To provide medical advice and medical results interpretation;
+ (m) To generate or disseminate information for the purpose to be used for administration of
+ justice, law enforcement, immigration or asylum processes, such as predicting an
+ individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal
+ relationships between assertions made in documents, indiscriminate and
+ arbitrarily-targeted use).
README.md ADDED
@@ -0,0 +1,162 @@
+ ---
+ license: openrail
+ language:
+ - en
+ pipeline_tag: text-to-speech
+ library_name: supertonic
+ ---
+
+ # Supertonic — Lightning Fast, On-Device TTS
+
+ **Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
+
+ > 🎧 **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo) or the [**Hugging Face app**](https://huggingface.co/spaces/akhaliq/supertonic), or get started with pre-trained models from the [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic).
+
+ > 🛠 **GitHub Repository**
+ > The easiest way to use Supertonic is through the official GitHub repository:
+ > https://github.com/supertone-inc/supertonic
+ > You’ll find example code for multiple languages there.
+
+ ### Table of Contents
+
+ - [Why Supertonic?](#why-supertonic)
+ - [Language Support](#language-support)
+ - [Getting Started](#getting-started)
+ - [Performance](#performance)
+ - [Citation](#citation)
+ - [License](#license)
+
+ ## Why Supertonic?
+
+ - **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro)—unmatched by any other TTS system
+ - **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance with a minimal footprint
+ - **📱 On-Device Capable**: **Complete privacy** and **zero latency**—all processing happens locally on your device
+ - **🎨 Natural Text Handling**: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
+ - **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters to match your specific needs
+ - **🧩 Flexible Deployment**: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends
+
+
+ ## Language Support
+
+ We provide ready-to-use TTS inference examples across multiple ecosystems:
+
+ | Language/Platform | Path | Description |
+ |-------------------|------|-------------|
+ | **Python** | `py/` | ONNX Runtime inference |
+ | **Node.js** | `nodejs/` | Server-side JavaScript |
+ | **Browser** | `web/` | WebGPU/WASM inference |
+ | **Java** | `java/` | Cross-platform JVM |
+ | **C++** | `cpp/` | High-performance C++ |
+ | **C#** | `csharp/` | .NET ecosystem |
+ | **Go** | `go/` | Go implementation |
+ | **Swift** | `swift/` | macOS applications |
+ | **iOS** | `ios/` | Native iOS apps |
+ | **Rust** | `rust/` | Memory-safe systems |
+ | **Flutter** | `flutter/` | Cross-platform apps |
+
+ > For detailed usage instructions, please refer to the README.md in each language directory.
+
+ ## Getting Started
+
+ First, clone the repository:
+
+ ```bash
+ git clone https://github.com/supertone-inc/supertonic.git
+ cd supertonic
+ ```
+
+ ### Prerequisites
+
+ Before running the examples, download the ONNX models and preset voices, and place them in the `assets` directory:
+
+ ```bash
+ git clone https://huggingface.co/Supertone/supertonic assets
+ ```
+
+ > **Note:** The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
+ > - macOS: `brew install git-lfs && git lfs install`
+ > - Generic: see `https://git-lfs.com` for installers
+
+
+ ### Technical Details
+
+ - **Runtime**: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
+ - **Browser Support**: onnxruntime-web for client-side inference
+ - **Batch Processing**: Supports batch inference for improved throughput
+ - **Audio Output**: Outputs 16-bit WAV files
+
+ ## Performance
+
+ We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).
+
+ **Metrics:**
+ - **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate the audio. Higher is better.
+ - **Real-time Factor (RTF)**: Measures synthesis time relative to the duration of the generated audio. Lower is better (e.g., an RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
+
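Both metrics fall out of a single timed synthesis run. A minimal sketch of the arithmetic (the function name and the example numbers are illustrative, not taken from the repository):

```python
def tts_metrics(n_chars: int, synthesis_seconds: float, audio_seconds: float):
    """Compute the two throughput metrics used in the tables below."""
    chars_per_second = n_chars / synthesis_seconds  # higher is better
    rtf = synthesis_seconds / audio_seconds         # lower is better
    return chars_per_second, rtf

# Illustrative run: 152 input characters synthesized in 0.12 s,
# producing 9.2 s of audio.
cps, rtf = tts_metrics(152, 0.12, 9.2)
print(f"chars/s = {cps:.0f}, RTF = {rtf:.3f}")
```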
+ ### Characters per Second
+ | System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
+ |--------|-----------------|----------------|-----------------|
+ | **Supertonic** (M4 Pro - CPU) | 912 | 1048 | 1263 |
+ | **Supertonic** (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
+ | **Supertonic** (RTX4090) | 2615 | 6548 | 12164 |
+ | `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
+ | `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
+ | `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
+ | `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
+ | `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
+ | `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |
+
+ > **Notes:**
+ > `API` = Cloud-based API services (measured from Seoul)
+ > `Open` = Open-source models
+ > Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
+ > Supertonic (RTX4090): Tested with the PyTorch model
+ > Kokoro: Tested on M4 Pro CPU with ONNX
+ > NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
+
+ ### Real-time Factor
+
+ | System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
+ |--------|-----------------|----------------|-----------------|
+ | **Supertonic** (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
+ | **Supertonic** (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
+ | **Supertonic** (RTX4090) | 0.005 | 0.002 | 0.001 |
+ | `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
+ | `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
+ | `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
+ | `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
+ | `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
+ | `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |
+
+ <details>
+ <summary><b>Additional Performance Data (5-step inference)</b></summary>
+
+ <br>
+
+ **Characters per Second (5-step)**
+
+ | System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
+ |--------|-----------------|----------------|-----------------|
+ | **Supertonic** (M4 Pro - CPU) | 596 | 691 | 850 |
+ | **Supertonic** (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
+ | **Supertonic** (RTX4090) | 1286 | 3757 | 6242 |
+
+ **Real-time Factor (5-step)**
+
+ | System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
+ |--------|-----------------|----------------|-----------------|
+ | **Supertonic** (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
+ | **Supertonic** (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
+ | **Supertonic** (RTX4090) | 0.011 | 0.004 | 0.002 |
+
+ </details>
+
+ ## License
+
+ This project’s sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.
+
+ The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic/blob/main/LICENSE) file for details.
+
+ This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.
+
+ Copyright (c) 2025 Supertone Inc.
config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "model_name": "Supertonic",
+   "model_type": "onnx",
+   "description": "This is a stub config for Hugging Face download counting. The actual model is located at onnx/"
+ }
onnx/duration_predictor.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b861580c56a0cba2a2b82aa697ecb3c5a163c3240c60a0ddfac369d21d054092
+ size 1500789
onnx/text_encoder.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ba0c8ea74aeb5df00d21a89b8d47c71317f47120232e3deef95024dba37dbd88
+ size 27348373
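The two `.onnx` entries above are Git LFS pointer files (spec v1), not the model weights themselves; each records the SHA-256 and byte size of the real file stored in LFS. A minimal sketch of reading such a pointer (the helper name is ours, not part of Git LFS):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file (spec v1) into its key/value fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:b861580c56a0cba2a2b82aa697ecb3c5a163c3240c60a0ddfac369d21d054092
size 1500789"""

info = parse_lfs_pointer(pointer)
print(info["size"])  # byte size of the real object stored in LFS
```

If a clone was made without `git lfs install`, these pointers are what end up on disk instead of the models, which is why the README's Git LFS note matters.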
onnx/tts.json ADDED
@@ -0,0 +1,316 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "tts_version": "v1.5.0",
3
+ "split": "opensource-en",
4
+ "ttl_ckpt_path": "unknown.pt",
5
+ "dp_ckpt_path": "unknown.pt",
6
+ "ae_ckpt_path": "unknown.pt",
7
+ "ttl_train": "unknown",
8
+ "dp_train": "unknown",
9
+ "ae_train": "unknown",
10
+ "ttl": {
11
+ "latent_dim": 24,
12
+ "chunk_compress_factor": 6,
13
+ "batch_expander": {
14
+ "n_batch_expand": 6
15
+ },
16
+ "normalizer": {
17
+ "scale": 0.25
18
+ },
19
+ "text_encoder": {
20
+ "char_dict_path": "resources/metadata/char_dict/opensource-en/char_dict.json",
21
+ "text_embedder": {
22
+ "char_dict_path": "resources/metadata/char_dict/opensource-en/char_dict.json",
23
+ "char_emb_dim": 256
24
+ },
25
+ "convnext": {
26
+ "idim": 256,
27
+ "ksz": 5,
28
+ "intermediate_dim": 1024,
29
+ "num_layers": 6,
30
+ "dilation_lst": [
31
+ 1,
32
+ 1,
33
+ 1,
34
+ 1,
35
+ 1,
36
+ 1
37
+ ]
38
+ },
39
+ "attn_encoder": {
40
+ "hidden_channels": 256,
41
+ "filter_channels": 1024,
42
+ "n_heads": 4,
43
+ "n_layers": 4,
44
+ "p_dropout": 0.0
45
+ },
46
+ "proj_out": {
47
+ "idim": 256,
48
+ "odim": 256
49
+ }
50
+ },
51
+ "flow_matching": {
52
+ "sig_min": 0
53
+ },
54
+ "style_encoder": {
55
+ "proj_in": {
56
+ "ldim": 24,
57
+ "chunk_compress_factor": 6,
58
+ "odim": 256
59
+ },
60
+ "convnext": {
61
+ "idim": 256,
62
+ "ksz": 5,
63
+ "intermediate_dim": 1024,
64
+ "num_layers": 6,
65
+ "dilation_lst": [
66
+ 1,
67
+ 1,
68
+ 1,
69
+ 1,
70
+ 1,
71
+ 1
72
+ ]
73
+ },
74
+ "style_token_layer": {
75
+ "input_dim": 256,
76
+ "n_style": 50,
77
+ "style_key_dim": 256,
78
+ "style_value_dim": 256,
79
+ "prototype_dim": 256,
80
+ "n_units": 256,
81
+ "n_heads": 2
82
+ }
83
+ },
84
+ "speech_prompted_text_encoder": {
85
+ "text_dim": 256,
86
+ "style_dim": 256,
87
+ "n_units": 256,
88
+ "n_heads": 2
89
+ },
90
+ "uncond_masker": {
91
+ "prob_both_uncond": 0.04,
92
+ "prob_text_uncond": 0.01,
93
+ "std": 0.1,
94
+ "text_dim": 256,
95
+ "n_style": 50,
96
+ "style_key_dim": 256,
97
+ "style_value_dim": 256
98
+ },
99
+ "vector_field": {
100
+ "proj_in": {
101
+ "ldim": 24,
102
+ "chunk_compress_factor": 6,
103
+ "odim": 512
104
+ },
105
+ "time_encoder": {
106
+ "time_dim": 64,
107
+ "hdim": 256
108
+ },
109
+ "main_blocks": {
110
+ "n_blocks": 4,
111
+ "time_cond_layer": {
112
+ "idim": 512,
113
+ "time_dim": 64
114
+ },
115
+ "style_cond_layer": {
116
+ "idim": 512,
117
+ "style_dim": 256
118
+ },
119
+ "text_cond_layer": {
120
+ "idim": 512,
121
+ "text_dim": 256,
122
+ "n_heads": 4,
123
+ "use_residual": true,
124
+ "rotary_base": 10000,
125
+ "rotary_scale": 10
126
+ },
127
+ "convnext_0": {
128
+ "idim": 512,
129
+ "ksz": 5,
130
+ "intermediate_dim": 1024,
131
+ "num_layers": 4,
132
+ "dilation_lst": [
133
+ 1,
134
+ 2,
135
+ 4,
136
+ 8
137
+ ]
138
+ },
139
+ "convnext_1": {
140
+ "idim": 512,
141
+ "ksz": 5,
142
+ "intermediate_dim": 1024,
143
+ "num_layers": 1,
144
+ "dilation_lst": [
145
+ 1
146
+ ]
147
+ },
148
+ "convnext_2": {
149
+ "idim": 512,
150
+ "ksz": 5,
151
+ "intermediate_dim": 1024,
152
+ "num_layers": 1,
153
+ "dilation_lst": [
154
+ 1
155
+ ]
156
+ }
157
+ },
158
+ "last_convnext": {
159
+ "idim": 512,
160
+ "ksz": 5,
161
+ "intermediate_dim": 1024,
162
+ "num_layers": 4,
163
+ "dilation_lst": [
164
+ 1,
165
+ 1,
166
+ 1,
167
+ 1
168
+ ]
169
+ },
170
+ "proj_out": {
171
+ "idim": 512,
172
+ "chunk_compress_factor": 6,
173
+ "ldim": 24
174
+ }
175
+ }
176
+ },
177
+ "ae": {
178
+ "sample_rate": 44100,
179
+ "n_delay": 0,
180
+ "base_chunk_size": 512,
181
+ "chunk_compress_factor": 1,
182
+ "ldim": 24,
183
+ "encoder": {
184
+ "spec_processor": {
185
+ "n_fft": 2048,
186
+ "win_length": 2048,
187
+ "hop_length": 512,
188
+ "n_mels": 228,
189
+ "sample_rate": 44100,
190
+ "eps": 1e-05,
191
+ "norm_mean": 0.0,
192
+ "norm_std": 1.0
193
+ },
194
+ "ksz_init": 7,
195
+ "ksz": 7,
196
+ "num_layers": 10,
197
+ "dilation_lst": [
198
+ 1,
199
+ 1,
200
+ 1,
201
+ 1,
202
+ 1,
203
+ 1,
204
+ 1,
205
+ 1,
206
+ 1,
207
+ 1
208
+ ],
209
+ "intermediate_dim": 2048,
210
+ "idim": 1253,
211
+ "hdim": 512,
212
+ "odim": 24
213
+ },
214
+ "decoder": {
215
+ "ksz_init": 7,
216
+ "ksz": 7,
217
+ "num_layers": 10,
218
+ "dilation_lst": [
219
+ 1,
220
+ 2,
221
+ 4,
222
+ 1,
223
+ 2,
224
+ 4,
225
+ 1,
226
+ 1,
227
+ 1,
228
+ 1
229
+ ],
230
+ "intermediate_dim": 2048,
231
+ "idim": 24,
232
+ "hdim": 512,
233
+ "head": {
234
+ "idim": 512,
235
+ "hdim": 2048,
236
+ "odim": 512,
237
+ "ksz": 3
238
+ }
239
+ }
240
+ },
241
+ "dp": {
242
+ "latent_dim": 24,
243
+ "chunk_compress_factor": 6,
244
+ "normalizer": {
245
+ "scale": 1.0
246
+ },
247
+ "sentence_encoder": {
248
+ "char_emb_dim": 64,
249
+ "char_dict_path": "resources/metadata/char_dict/opensource-en/char_dict.json",
250
+ "text_embedder": {
251
+ "char_dict_path": "resources/metadata/char_dict/opensource-en/char_dict.json",
252
+ "char_emb_dim": 64
253
+ },
254
+ "convnext": {
255
+ "idim": 64,
256
+ "ksz": 5,
257
+ "intermediate_dim": 256,
258
+ "num_layers": 6,
259
+ "dilation_lst": [
260
+ 1,
261
+ 1,
262
+ 1,
263
+ 1,
264
+ 1,
265
+ 1
266
+ ]
267
+ },
268
+ "attn_encoder": {
269
+ "hidden_channels": 64,
270
+ "filter_channels": 256,
271
+ "n_heads": 2,
272
+ "n_layers": 2,
273
+ "p_dropout": 0.0
274
+ },
275
+ "proj_out": {
276
+ "idim": 64,
277
+ "odim": 64
278
+ }
279
+ },
280
+ "style_encoder": {
281
+ "proj_in": {
282
+ "ldim": 24,
283
+ "chunk_compress_factor": 6,
284
+ "odim": 64
285
+ },
286
+ "convnext": {
287
+ "idim": 64,
288
+ "ksz": 5,
289
+ "intermediate_dim": 256,
290
+ "num_layers": 4,
291
+ "dilation_lst": [
292
+ 1,
293
+ 1,
294
+ 1,
295
+ 1
296
+ ]
297
+ },
298
+ "style_token_layer": {
299
+ "input_dim": 64,
300
+ "n_style": 8,
301
+ "style_key_dim": 0,
302
+ "style_value_dim": 16,
303
+ "prototype_dim": 64,
304
+ "n_units": 64,
305
+ "n_heads": 2
306
+ }
307
+ },
308
+ "predictor": {
309
+ "sentence_dim": 64,
310
+ "n_style": 8,
311
+ "style_dim": 16,
312
+ "hdim": 128,
313
+ "n_layer": 2
314
+ }
315
+ }
316
+ }
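The JSON config above can be consumed as ordinary JSON. A minimal sketch (config fragment inlined for illustration, not read from the repo) that checks one invariant visible in the diff — the autoencoder decoder's `idim` must match the latent `ldim` the encoder emits:

```python
import json

# A small fragment of the "ae" config shown above, inlined for illustration.
config = json.loads("""
{
  "ae": {
    "sample_rate": 44100,
    "ldim": 24,
    "encoder": {"idim": 1253, "hdim": 512, "odim": 24},
    "decoder": {"idim": 24, "hdim": 512}
  }
}
""")

ae = config["ae"]
# The decoder consumes the 24-dim latent the encoder produces.
assert ae["decoder"]["idim"] == ae["encoder"]["odim"] == ae["ldim"]
print(ae["sample_rate"])  # 44100
```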
onnx/tts.yml ADDED
@@ -0,0 +1,223 @@
+ tts_version: "v1.5.0"
2
+
3
+ split: "opensource-en"
4
+
5
+ ttl_ckpt_path: "unknown.pt"
6
+
7
+ dp_ckpt_path: "unknown.pt"
8
+
9
+ ae_ckpt_path: "unknown.pt"
10
+
11
+ ttl_train: "unknown"
12
+
13
+ dp_train: "unknown"
14
+
15
+ ae_train: "unknown"
16
+
17
+ ttl:
18
+ latent_dim: 24
19
+ chunk_compress_factor: 6
20
+ batch_expander:
21
+ n_batch_expand: 6
22
+ normalizer:
23
+ scale: 0.25
24
+ text_encoder:
25
+ char_dict_path: "resources/metadata/char_dict/opensource-en/char_dict.json"
26
+ text_embedder:
27
+ char_dict_path: "resources/metadata/char_dict/opensource-en/char_dict.json"
28
+ char_emb_dim: 256
29
+ convnext:
30
+ idim: 256
31
+ ksz: 5
32
+ intermediate_dim: 1024
33
+ num_layers: 6
34
+ dilation_lst: [1, 1, 1, 1, 1, 1]
35
+ attn_encoder:
36
+ hidden_channels: 256
37
+ filter_channels: 1024
38
+ n_heads: 4
39
+ n_layers: 4
40
+ p_dropout: 0.0
41
+ proj_out:
42
+ idim: 256
43
+ odim: 256
44
+ flow_matching:
45
+ sig_min: 0
46
+ style_encoder:
47
+ proj_in:
48
+ ldim: 24
49
+ chunk_compress_factor: 6
50
+ odim: 256
51
+ convnext:
52
+ idim: 256
53
+ ksz: 5
54
+ intermediate_dim: 1024
55
+ num_layers: 6
56
+ dilation_lst: [1, 1, 1, 1, 1, 1]
57
+ style_token_layer:
58
+ input_dim: 256
59
+ n_style: 50
60
+ style_key_dim: 256
61
+ style_value_dim: 256
62
+ prototype_dim: 256
63
+ n_units: 256
64
+ n_heads: 2
65
+ speech_prompted_text_encoder:
66
+ text_dim: 256
67
+ style_dim: 256
68
+ n_units: 256
69
+ n_heads: 2
70
+ uncond_masker:
71
+ prob_both_uncond: 0.04
72
+ prob_text_uncond: 0.01
73
+ std: 0.1
74
+ text_dim: 256
75
+ n_style: 50
76
+ style_key_dim: 256
77
+ style_value_dim: 256
78
+ vector_field:
79
+ proj_in:
80
+ ldim: 24
81
+ chunk_compress_factor: 6
82
+ odim: 512
83
+ time_encoder:
84
+ time_dim: 64
85
+ hdim: 256
86
+ main_blocks:
87
+ n_blocks: 4
88
+ time_cond_layer:
89
+ idim: 512
90
+ time_dim: 64
91
+ style_cond_layer:
92
+ idim: 512
93
+ style_dim: 256
94
+ text_cond_layer:
95
+ idim: 512
96
+ text_dim: 256
97
+ n_heads: 4
98
+ use_residual: True
99
+ rotary_base: 10000
100
+ rotary_scale: 10
101
+ convnext_0:
102
+ idim: 512
103
+ ksz: 5
104
+ intermediate_dim: 1024
105
+ num_layers: 4
106
+ dilation_lst: [1, 2, 4, 8]
107
+ convnext_1:
108
+ idim: 512
109
+ ksz: 5
110
+ intermediate_dim: 1024
111
+ num_layers: 1
112
+ dilation_lst: [1]
113
+ convnext_2:
114
+ idim: 512
115
+ ksz: 5
116
+ intermediate_dim: 1024
117
+ num_layers: 1
118
+ dilation_lst: [1]
119
+ last_convnext:
120
+ idim: 512
121
+ ksz: 5
122
+ intermediate_dim: 1024
123
+ num_layers: 4
124
+ dilation_lst: [1, 1, 1, 1]
125
+ proj_out:
126
+ idim: 512
127
+ chunk_compress_factor: 6
128
+ ldim: 24
129
+
130
+ ae:
131
+ sample_rate: 44100
132
+ n_delay: 0
133
+ base_chunk_size: 512
134
+ chunk_compress_factor: 1
135
+ ldim: 24
136
+ encoder:
137
+ spec_processor:
138
+ n_fft: 2048
139
+ win_length: 2048
140
+ hop_length: 512
141
+ n_mels: 228
142
+ sample_rate: 44100
143
+ eps: 1e-05
144
+ norm_mean: 0.0
145
+ norm_std: 1.0
146
+ ksz_init: 7
147
+ ksz: 7
148
+ num_layers: 10
149
+ dilation_lst: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
150
+ intermediate_dim: 2048
151
+ idim: 1253
152
+ hdim: 512
153
+ odim: 24
154
+ decoder:
155
+ ksz_init: 7
156
+ ksz: 7
157
+ num_layers: 10
158
+ dilation_lst: [1, 2, 4, 1, 2, 4, 1, 1, 1, 1]
159
+ intermediate_dim: 2048
160
+ idim: 24
161
+ hdim: 512
162
+ head:
163
+ idim: 512
164
+ hdim: 2048
165
+ odim: 512
166
+ ksz: 3
167
+
168
+ dp:
169
+ latent_dim: 24
170
+ chunk_compress_factor: 6
171
+ normalizer:
172
+ scale: 1.0
173
+ sentence_encoder:
174
+ char_emb_dim: 64
175
+ char_dict_path: "resources/metadata/char_dict/opensource-en/char_dict.json"
176
+ text_embedder:
177
+ char_dict_path: "resources/metadata/char_dict/opensource-en/char_dict.json"
178
+ char_emb_dim: 64
179
+ convnext:
180
+ idim: 64
181
+ ksz: 5
182
+ intermediate_dim: 256
183
+ num_layers: 6
184
+ dilation_lst: [1, 1, 1, 1, 1, 1]
185
+ attn_encoder:
186
+ hidden_channels: 64
187
+ filter_channels: 256
188
+ n_heads: 2
189
+ n_layers: 2
190
+ p_dropout: 0.0
191
+ proj_out:
192
+ idim: 64
193
+ odim: 64
194
+ style_encoder:
195
+ proj_in:
196
+ ldim: 24
197
+ chunk_compress_factor: 6
198
+ odim: 64
199
+ convnext:
200
+ idim: 64
201
+ ksz: 5
202
+ intermediate_dim: 256
203
+ num_layers: 4
204
+ dilation_lst: [1, 1, 1, 1]
205
+ style_token_layer:
206
+ input_dim: 64
207
+ n_style: 8
208
+ style_key_dim: 0
209
+ style_value_dim: 16
210
+ prototype_dim: 64
211
+ n_units: 64
212
+ n_heads: 2
213
+ predictor:
214
+ sentence_dim: 64
215
+ n_style: 8
216
+ style_dim: 16
217
+ hdim: 128
218
+ n_layer: 2
219
+
220
+ unicode_indexer_path: "/data/public/model/supertonic/tts/v1.5.0/opensource-en/onnx/unicode_indexer.npy"
221
+ unicode_indexer_json_path: "/data/public/model/supertonic/tts/v1.5.0/opensource-en/onnx/unicode_indexer.json"
222
+ window_path: "/data/public/model/supertonic/tts/v1.5.0/opensource-en/onnx/window.json"
223
+ filter_bank_path: "/data/public/model/supertonic/tts/v1.5.0/opensource-en/onnx/filter_bank.json"
onnx/unicode_indexer.json ADDED
The diff for this file is too large to render. See raw diff
 
onnx/vector_estimator.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b3f82ecd2e9decc4e2236048b03628a1c1d5f14a792ba274a59b7325107aa6a6
+ size 132471364
onnx/vocoder.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:19bd51f47a186069c752403518a40f7ea4c647455056d2511f7249691ecddf7c
+ size 101405066
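The `*.onnx` entries above are Git LFS pointer files, not the model weights themselves; the actual blobs are fetched by OID from LFS storage. A pointer is a tiny key-value text file and is trivially parsed (pointer inlined here from the `vocoder.onnx` diff above):

```python
# Git LFS pointer text, as shown in the diff for onnx/vocoder.onnx.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:19bd51f47a186069c752403518a40f7ea4c647455056d2511f7249691ecddf7c
size 101405066
"""

# Each line is "key value"; split on the first space.
meta = dict(line.split(" ", 1) for line in pointer.strip().splitlines())
algo, digest = meta["oid"].split(":", 1)

print(algo)          # sha256
print(meta["size"])  # 101405066
```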
voice_styles/F1.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F2.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F3.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F4.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F5.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M1.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M2.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M3.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M4.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M5.json ADDED
The diff for this file is too large to render. See raw diff
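The commit ships ten preset voice styles, five female (F1–F5) and five male (M1–M5). Since the naming is regular, the paths can be generated rather than hard-coded — a small sketch (relative paths assumed; the JSON schema of each style file is not shown in this diff):

```python
# Enumerate the voice style files added in this commit: F1..F5, M1..M5.
styles = [f"voice_styles/{sex}{i}.json" for sex in ("F", "M") for i in range(1, 6)]

print(len(styles))  # 10
print(styles[0])    # voice_styles/F1.json
```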