ANLGBOY committed on
Commit 75e6727 · 1 Parent(s): 53f3023
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.img filter=lfs diff=lfs merge=lfs -text
+*.jpg filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,5 @@
window.json
filter_bank.json
style_extractor.onnx
*.yml
*.npy
LICENSE ADDED
@@ -0,0 +1,209 @@
BigScience Open RAIL-M License
dated August 18, 2022

Section I: PREAMBLE

This Open RAIL-M License was created by BigScience, a collaborative open innovation project aimed at the responsible development and use of large multilingual datasets and Large Language Models (“LLMs”). While a similar license was originally designed for the BLOOM model, we decided to adapt it and create this license in order to propose a general open and responsible license applicable to other machine learning based AI models (e.g. multimodal generative models).

In short, this license strives for both the open and responsible downstream use of the accompanying model. When it comes to the open character, we took inspiration from open source permissive licenses regarding the grant of IP rights. Referring to the downstream responsible use, we added use-based restrictions not permitting the use of the Model in very specific scenarios, in order for the licensor to be able to enforce the license in case potential misuses of the Model may occur. Even though downstream derivative versions of the model could be released under different licensing terms, the latter will always have to include - at minimum - the same use-based restrictions as the ones in the original license (this license).

The development and use of artificial intelligence (“AI”) does not come without concerns. The world has witnessed how AI techniques may, in some instances, become risky for the public in general. These risks come in many forms, from racial discrimination to the misuse of sensitive information.

BigScience believes in the intersection between open and responsible AI development; thus, this License aims to strike a balance between both in order to enable responsible open-science in the field of AI. This License governs the use of the model (and its derivatives) and is informed by the model card associated with the model.

NOW THEREFORE, You and Licensor agree as follows:

1. Definitions
(a) "License" means the terms and conditions for use, reproduction, and Distribution as defined in this document.
(b) “Data” means a collection of information and/or content extracted from the dataset used with the Model, including to train, pretrain, or otherwise evaluate the Model. The Data is not licensed under this License.
(c) “Output” means the results of operating a Model as embodied in informational content resulting therefrom.
(d) “Model” means any accompanying machine-learning based assemblies (including checkpoints), consisting of learnt weights, parameters (including optimizer states), corresponding to the model architecture as embodied in the Complementary Material, that have been trained or tuned, in whole or in part on the Data, using the Complementary Material.
(e) “Derivatives of the Model” means all modifications to the Model, works based on the Model, or any other model which is created or initialized by transfer of patterns of the weights, parameters, activations or output of the Model, to the other model, in order to cause the other model to perform similarly to the Model, including - but not limited to - distillation methods entailing the use of intermediate data representations or methods based on the generation of synthetic data by the Model for training the other model.
(f) “Complementary Material” means the accompanying source code and scripts used to define, run, load, benchmark or evaluate the Model, and used to prepare data for training or evaluation, if any. This includes any accompanying documentation, tutorials, examples, etc, if any.
(g) “Distribution” means any transmission, reproduction, publication or other sharing of the Model or Derivatives of the Model to a third party, including providing the Model as a hosted service made available by electronic or other remote means - e.g. API-based or web access.
(h) “Licensor” means the copyright owner or entity authorized by the copyright owner that is granting the License, including the persons or entities that may have rights in the Model and/or distributing the Model.
(i) "You" (or "Your") means an individual or Legal Entity exercising permissions granted by this License and/or making use of the Model for whichever purpose and in any field of use, including usage of the Model in an end-use application - e.g. chatbot, translator, image generator.
(j) “Third Parties” means individuals or legal entities that are not under common control with Licensor or You.
(k) "Contribution" means any work of authorship, including the original version of the Model and any modifications or additions to that Model or Derivatives of the Model thereof, that is intentionally submitted to Licensor for inclusion in the Model by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Model, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
(l) "Contributor" means Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Model.

Section II: INTELLECTUAL PROPERTY RIGHTS

Both copyright and patent grants apply to the Model, Derivatives of the Model and Complementary Material. The Model and Derivatives of the Model are subject to additional terms as described in Section III.

2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Complementary Material, the Model, and Derivatives of the Model.

3. Grant of Patent License. Subject to the terms and conditions of this License and where and as applicable, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this paragraph) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model and the Complementary Material, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Model to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model and/or Complementary Material or a Contribution incorporated within the Model and/or Complementary Material constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for the Model and/or Work shall terminate as of the date such litigation is asserted or filed.

Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION

4. Distribution and Redistribution. You may host for Third Party remote access purposes (e.g. software-as-a-service), reproduce and distribute copies of the Model or Derivatives of the Model thereof in any medium, with or without modifications, provided that You meet the following conditions:

a. Use-based restrictions as referenced in paragraph 5 MUST be included as an enforceable provision by You in any type of legal agreement (e.g. a license) governing the use and/or distribution of the Model or Derivatives of the Model, and You shall give notice to subsequent users You Distribute to, that the Model or Derivatives of the Model are subject to paragraph 5. This provision does not apply to the use of Complementary Material.

b. You must give any Third Party recipients of the Model or Derivatives of the Model a copy of this License;

c. You must cause any modified files to carry prominent notices stating that You changed the files;

d. You must retain all copyright, patent, trademark, and attribution notices excluding those notices that do not pertain to any part of the Model, Derivatives of the Model.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions - respecting paragraph 4.a. - for use, reproduction, or Distribution of Your modifications, or for any such Derivatives of the Model as a whole, provided Your use, reproduction, and Distribution of the Model otherwise complies with the conditions stated in this License.

5. Use-based restrictions. The restrictions set forth in Attachment A are considered Use-based restrictions. Therefore You cannot use the Model and the Derivatives of the Model for the specified restricted uses. You may use the Model subject to this License, including only for lawful purposes and in accordance with the License. Use may include creating any content with, finetuning, updating, running, training, evaluating and/or reparametrizing the Model. You shall require all of Your users who use the Model or a Derivative of the Model to comply with the terms of this paragraph (paragraph 5).

6. The Output You Generate. Except as set forth herein, Licensor claims no rights in the Output You generate using the Model. You are accountable for the Output you generate and its subsequent uses. No use of the output can contravene any provision as stated in the License.

Section IV: OTHER PROVISIONS

7. Updates and Runtime Restrictions. To the maximum extent permitted by law, Licensor reserves the right to restrict (remotely or otherwise) usage of the Model in violation of this License, update the Model through electronic means, or modify the Output of the Model based on updates. You shall undertake reasonable efforts to use the latest version of the Model.

8. Trademarks and related. Nothing in this License permits You to make use of Licensors’ trademarks, trade names, logos or to otherwise suggest endorsement or misrepresent the relationship between the parties; and any rights not expressly granted herein are reserved by the Licensors.

9. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Model and the Complementary Material (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Model, Derivatives of the Model, and the Complementary Material and assume any risks associated with Your exercise of permissions under this License.

10. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Model and the Complementary Material (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

11. Accepting Warranty or Additional Liability. While redistributing the Model, Derivatives of the Model and the Complementary Material thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

12. If any provision of this License is held to be invalid, illegal or unenforceable, the remaining provisions shall be unaffected thereby and remain valid as if such provision had not been set forth herein.

END OF TERMS AND CONDITIONS

Attachment A

Use Restrictions

You agree not to use the Model or Derivatives of the Model:
(a) In any way that violates any applicable national, federal, state, local or international law or regulation;
(b) For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
(c) To generate or disseminate verifiably false information and/or content with the purpose of harming others;
(d) To generate or disseminate personal identifiable information that can be used to harm an individual;
(e) To generate or disseminate information and/or content (e.g. images, code, posts, articles), and place the information and/or content in any context (e.g. bot generating tweets) without expressly and intelligibly disclaiming that the information and/or content is machine generated;
(f) To defame, disparage or otherwise harass others;
(g) To impersonate or attempt to impersonate (e.g. deepfakes) others without their consent;
(h) For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
(i) For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
(j) To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
(k) For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories;
(l) To provide medical advice and medical results interpretation;
(m) To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use).
README.md ADDED
@@ -0,0 +1,123 @@
---
license: openrail
language:
- en
- ko
- es
- pt
- fr
pipeline_tag: text-to-speech
tags:
- text-to-speech
- speech-synthesis
- tts
- onnx
library_name: supertonic
---

# Supertonic 2 — Lightning-Fast, On-Device, Multilingual TTS

![Supertonic Preview](img/supertonic_preview_0.1.jpg)

<p align="center">
<a href="https://huggingface.co/spaces/Supertone/supertonic-2"><img src="https://img.shields.io/badge/🤗_Demo-Hugging_Face-yellow?style=for-the-badge" alt="Demo"></a>
<a href="https://github.com/supertone-inc/supertonic"><img src="https://img.shields.io/badge/💻_Code-GitHub-black?style=for-the-badge&logo=github" alt="Code"></a>
</p>

**Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
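The pipeline ships as four ONNX graphs under `onnx/` in this repository. A minimal loading sketch with ONNX Runtime is below; the stage ordering in the comments is inferred from the file names (the authoritative call order, and the per-graph input/output tensor names, live in the GitHub code linked above):

```python
from pathlib import Path

# The four ONNX graphs added to this repo; ordering inferred from names.
PIPELINE_FILES = [
    "text_encoder.onnx",        # text -> encoder states
    "duration_predictor.onnx",  # predicts per-token durations
    "vector_estimator.onnx",    # flow-matching vector-field estimator
    "vocoder.onnx",             # latents -> 44.1 kHz waveform
]

def load_pipeline(model_dir: str) -> dict:
    """Create one ONNX Runtime session per pipeline stage (CPU provider)."""
    import onnxruntime as ort  # requires `pip install onnxruntime`
    sessions = {}
    for name in PIPELINE_FILES:
        path = Path(model_dir) / name
        if not path.exists():
            raise FileNotFoundError(path)
        sessions[name] = ort.InferenceSession(
            str(path), providers=["CPUExecutionProvider"]
        )
    return sessions
```

On GPU-capable builds of `onnxruntime`, swapping in a different execution provider (e.g. CUDA) is the usual way to accelerate the same graphs.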

## What's New in Supertonic 2

**Supertonic 2** extends multilingual capabilities while maintaining the same inference speed and efficiency as the original.

### 🌍 Multilingual Support

| Language | Code |
|----------|------|
| English | `en` |
| Korean | `ko` |
| Spanish | `es` |
| Portuguese | `pt` |
| French | `fr` |

### ⚡ Same Speed, More Languages

- **No speed degradation**: Supertonic 2 delivers the same ultra-fast inference speed as the original—up to **167× faster than real-time**
- **Efficient architecture**: Only **66M parameters**, optimized for on-device deployment
- **Cross-language consistency**: All supported languages share the same model architecture and inference pipeline
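As a rough sanity check on the 66M-parameter figure, the four ONNX files added in this commit sum to about 263 MB; assuming float32 weights dominate the file contents (ONNX files also carry graph metadata, so this is only a back-of-the-envelope estimate), that works out to roughly 66M parameters:

```python
# Git LFS sizes (bytes) of the ONNX files listed in this repository.
onnx_sizes = {
    "duration_predictor.onnx": 1_521_526,
    "text_encoder.onnx": 27_431_318,
    "vector_estimator.onnx": 132_471_364,
    "vocoder.onnx": 101_405_066,
}

total_bytes = sum(onnx_sizes.values())
approx_params = total_bytes / 4  # assumes 4-byte float32 weights

print(f"{total_bytes / 1e6:.1f} MB ≈ {approx_params / 1e6:.0f}M parameters")
```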

## Performance

We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).

**Metrics:**
- **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
- **Real-time Factor (RTF)**: Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., an RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
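Both metrics can be computed directly from a single synthesis run; a minimal sketch (the timing values in the example are illustrative, not measurements):

```python
def chars_per_second(num_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by generation time (higher is better)."""
    return num_chars / synthesis_seconds

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Time to synthesize relative to the audio's duration (lower is better)."""
    return synthesis_seconds / audio_seconds

# Matches the definition above: generating one second of audio
# in 0.1 seconds gives an RTF of 0.1.
print(real_time_factor(synthesis_seconds=0.1, audio_seconds=1.0))  # 0.1
```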

### Characters per Second

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 912 | 1048 | 1263 |
| **Supertonic** (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| **Supertonic** (RTX4090) | 2615 | 6548 | 12164 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |

> **Notes:**
> - `API` = cloud-based API services (measured from Seoul)
> - `Open` = open-source models
> - Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): tested with ONNX
> - Supertonic (RTX4090): tested with the PyTorch model
> - Kokoro: tested on M4 Pro CPU with ONNX
> - NeuTTS Air: tested on M4 Pro CPU with Q8-GGUF

### Real-time Factor

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| **Supertonic** (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| **Supertonic** (RTX4090) | 0.005 | 0.002 | 0.001 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |
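The "up to 167× faster than real-time" figure quoted earlier is consistent with the best on-device RTF in this table: speedup over real time is simply the reciprocal of the RTF.

```python
# Speedup over real time = 1 / RTF.
rtf_webgpu_long = 0.006  # Supertonic, M4 Pro WebGPU, 266-char input (table above)

speedup = 1 / rtf_webgpu_long
print(f"{speedup:.0f}x faster than real-time")  # 167x faster than real-time
```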

<details>
<summary><b>Additional Performance Data (5-step inference)</b></summary>

<br>

**Characters per Second (5-step)**

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 596 | 691 | 850 |
| **Supertonic** (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| **Supertonic** (RTX4090) | 1286 | 3757 | 6242 |

**Real-time Factor (5-step)**

| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| **Supertonic** (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| **Supertonic** (RTX4090) | 0.011 | 0.004 | 0.002 |

</details>

## License

This project’s sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.

The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic-2/blob/main/LICENSE) file for details.

This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.

Copyright (c) 2026 Supertone Inc.
config.json ADDED
@@ -0,0 +1,5 @@
{
  "model_name": "Supertonic 2",
  "model_type": "onnx",
  "description": "This is a stub config for Hugging Face download counting. The actual model is located at onnx/"
}
img/supertonic_preview_0.1.jpg ADDED

Git LFS Details

  • SHA256: 4648945559928f84ad00aa91c76ef6bf1d29f60617f81114e49afaa8c4f390df
  • Pointer size: 131 Bytes
  • Size of remote file: 785 kB
onnx/duration_predictor.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6d556b3691165c364be91dc0bd894656b5949f5acd2750d8ec2f954010845011
size 1521526
onnx/text_encoder.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dd5f535ed629f7df86071043e15f541ce1b2ab7f1bdbce4c7892b307bca79fa3
size 27431318
onnx/tts.json ADDED
@@ -0,0 +1,316 @@
{
  "tts_version": "v1.6.0",
  "split": "opensource-multilingual",
  "ttl_ckpt_path": "unknown.pt",
  "dp_ckpt_path": "unknown.pt",
  "ae_ckpt_path": "unknown.pt",
  "ttl_train": "unknown",
  "dp_train": "unknown",
  "ae_train": "unknown",
  "ttl": {
    "latent_dim": 24,
    "chunk_compress_factor": 6,
    "batch_expander": { "n_batch_expand": 6 },
    "normalizer": { "scale": 0.25 },
    "text_encoder": {
      "char_dict_path": "resources/metadata/char_dict/opensource-multilingual2/char_dict.json",
      "text_embedder": {
        "char_dict_path": "resources/metadata/char_dict/opensource-multilingual2/char_dict.json",
        "char_emb_dim": 256
      },
      "convnext": { "idim": 256, "ksz": 5, "intermediate_dim": 1024, "num_layers": 6, "dilation_lst": [1, 1, 1, 1, 1, 1] },
      "attn_encoder": { "hidden_channels": 256, "filter_channels": 1024, "n_heads": 4, "n_layers": 4, "p_dropout": 0.1 },
      "proj_out": { "idim": 256, "odim": 256 }
    },
    "flow_matching": { "sig_min": 0 },
    "style_encoder": {
      "proj_in": { "ldim": 24, "chunk_compress_factor": 6, "odim": 256 },
      "convnext": { "idim": 256, "ksz": 5, "intermediate_dim": 1024, "num_layers": 6, "dilation_lst": [1, 1, 1, 1, 1, 1] },
      "style_token_layer": { "input_dim": 256, "n_style": 50, "style_key_dim": 256, "style_value_dim": 256, "prototype_dim": 256, "n_units": 256, "n_heads": 2 }
    },
    "speech_prompted_text_encoder": { "text_dim": 256, "style_dim": 256, "n_units": 256, "n_heads": 2 },
    "uncond_masker": { "prob_both_uncond": 0.04, "prob_text_uncond": 0.01, "std": 0.1, "text_dim": 256, "n_style": 50, "style_key_dim": 256, "style_value_dim": 256 },
    "vector_field": {
      "proj_in": { "ldim": 24, "chunk_compress_factor": 6, "odim": 512 },
      "time_encoder": { "time_dim": 64, "hdim": 256 },
      "main_blocks": {
        "n_blocks": 4,
        "time_cond_layer": { "idim": 512, "time_dim": 64 },
        "style_cond_layer": { "idim": 512, "style_dim": 256 },
        "text_cond_layer": { "idim": 512, "text_dim": 256, "n_heads": 4, "use_residual": true, "rotary_base": 10000, "rotary_scale": 10 },
        "convnext_0": { "idim": 512, "ksz": 5, "intermediate_dim": 1024, "num_layers": 4, "dilation_lst": [1, 2, 4, 8] },
        "convnext_1": { "idim": 512, "ksz": 5, "intermediate_dim": 1024, "num_layers": 1, "dilation_lst": [1] },
        "convnext_2": { "idim": 512, "ksz": 5, "intermediate_dim": 1024, "num_layers": 1, "dilation_lst": [1] }
      },
      "last_convnext": { "idim": 512, "ksz": 5, "intermediate_dim": 1024, "num_layers": 4, "dilation_lst": [1, 1, 1, 1] },
      "proj_out": { "idim": 512, "chunk_compress_factor": 6, "ldim": 24 }
    }
  },
  "ae": {
    "sample_rate": 44100,
    "n_delay": 0,
    "base_chunk_size": 512,
    "chunk_compress_factor": 1,
    "ldim": 24,
    "encoder": {
      "spec_processor": { "n_fft": 2048, "win_length": 2048, "hop_length": 512, "n_mels": 228, "sample_rate": 44100, "eps": 1e-05, "norm_mean": 0.0, "norm_std": 1.0 },
      "ksz_init": 7,
      "ksz": 7,
      "num_layers": 10,
      "dilation_lst": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
      "intermediate_dim": 2048,
      "idim": 1253,
      "hdim": 512,
      "odim": 24
    },
    "decoder": {
      "ksz_init": 7,
      "ksz": 7,
      "num_layers": 10,
      "dilation_lst": [1, 2, 4, 1, 2, 4, 1, 1, 1, 1],
      "intermediate_dim": 2048,
      "idim": 24,
      "hdim": 512,
      "head": { "idim": 512, "hdim": 2048, "odim": 512, "ksz": 3 }
    }
  },
  "dp": {
    "latent_dim": 24,
    "chunk_compress_factor": 6,
    "normalizer": { "scale": 1.0 },
    "sentence_encoder": {
      "char_emb_dim": 64,
      "char_dict_path": "resources/metadata/char_dict/opensource-multilingual2/char_dict.json",
      "text_embedder": {
        "char_dict_path": "resources/metadata/char_dict/opensource-multilingual2/char_dict.json",
        "char_emb_dim": 64
      },
      "convnext": { "idim": 64, "ksz": 5, "intermediate_dim": 256, "num_layers": 6, "dilation_lst": [1, 1, 1, 1, 1, 1] },
      "attn_encoder": { "hidden_channels": 64, "filter_channels": 256, "n_heads": 2, "n_layers": 2, "p_dropout": 0.0 },
      "proj_out": { "idim": 64, "odim": 64 }
    },
    "style_encoder": {
      "proj_in": { "ldim": 24, "chunk_compress_factor": 6, "odim": 64 },
      "convnext": { "idim": 64, "ksz": 5, "intermediate_dim": 256, "num_layers": 4, "dilation_lst": [1, 1, 1, 1] },
      "style_token_layer": { "input_dim": 64, "n_style": 8, "style_key_dim": 0, "style_value_dim": 16, "prototype_dim": 64, "n_units": 64, "n_heads": 2 }
    },
    "predictor": { "sentence_dim": 64, "n_style": 8, "style_dim": 16, "hdim": 128, "n_layer": 2 }
  }
}
onnx/unicode_indexer.json ADDED
The diff for this file is too large to render. See raw diff
 
onnx/vector_estimator.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:105e9d66fd8756876b210a6b4aa03fc393b1eaca3a8dadcc8d9a3bc785c86a35
size 132471364
onnx/vocoder.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:19bd51f47a186069c752403518a40f7ea4c647455056d2511f7249691ecddf7c
size 101405066
voice_styles/F1.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F2.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F3.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F4.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/F5.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M1.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M2.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M3.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M4.json ADDED
The diff for this file is too large to render. See raw diff
 
voice_styles/M5.json ADDED
The diff for this file is too large to render. See raw diff