---
base_model:
- Supertone/supertonic
pipeline_tag: text-to-speech
---

# Supertonic Quantized INT8 — Offline TTS (Shadow0482)

This repository contains **INT8 optimized ONNX models** for the Supertonic Text-To-Speech pipeline. These models are quantized versions of the official Supertonic models and are designed for **offline, low-latency, CPU-friendly inference**.

FP16 versions exist for experimentation, but the vocoder currently contains a type mismatch (`float32` vs `float16`) in a `Div` node, so FP16 inference is **not stable**. **INT8 is therefore the recommended format** for real-world offline use.

---

# 🚀 Features

### ✔ 100% Offline Execution
No network access needed. Load the ONNX models directly with ONNX Runtime.

### ✔ Full Supertonic Inference Stack
- Text Encoder
- Duration Predictor
- Vector Estimator
- Vocoder

### ✔ INT8 Dynamic Quantization
- Dramatically reduced model sizes
- CPU-friendly inference
- Very low memory usage
- Compatible with the ONNX Runtime `CPUExecutionProvider`

### ✔ Comparable Audio Quality
Produces clearly understandable speech while running drastically faster on CPUs.

---

# 📦 Repository Structure

```
int8_dynamic/
    duration_predictor.int8.onnx
    text_encoder.int8.onnx
    vector_estimator.int8.onnx
    vocoder.int8.onnx

fp16/
    (experimental FP16 models — vocoder currently unstable)
```

Only the **int8_dynamic** directory is guaranteed stable.

---

# 🔊 Test Sentence Used in Benchmark

```
Greetings! You are listening to your newly quantized model. I have been squished, squeezed, compressed, minimized, optimized, digitized, and lightly traumatized to save disk space. The testing framework automatically verifies my integrity, measures how much weight I lost, and checks if I can still talk without glitching into a robot dolphin. If you can hear this clearly, the quantization ritual was a complete success.
```

---

# 📈 Benchmark Summary (CPU)

| Model | Precision | Time (s) | Output | Status |
|-------|-----------|---------:|--------|--------|
| INT8 Dynamic | int8 | ~3.0–7.0 (varies) | `*.wav` | ✅ OK |
| FP32 (baseline) | float32 | ~2–4× slower than INT8 | `*.wav` | ✅ OK |
| FP16 | mixed | — | — | 🚫 Cannot load vocoder |

---

# 🖥️ Offline Inference Guide (Python)

Below is a clean Python script to run **fully offline INT8 inference**.

---

# 🧩 Requirements

```
pip install onnxruntime numpy soundfile
```

---

# 📜 offline_tts_int8.py

```python
import onnxruntime as ort
import numpy as np
import json
import soundfile as sf
from pathlib import Path

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")  # folder containing *.int8.onnx
VOICE_STYLE = "assets/voice_styles/M1.json"

text_encoder_path = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
tokenizer = json.load(open(unicode_path))

def encode_text(text: str):
    ids = []
    for ch in text:
        if ch in tokenizer["token2idx"]:
            ids.append(tokenizer["token2idx"][ch])
        else:
            ids.append(tokenizer["token2idx"][""])  # unknown character: fall back
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path):
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"]
    )

sess_text = load_session(text_encoder_path)
sess_dur = load_session(duration_pred_path)
sess_vec = load_session(vector_estimator_path)
sess_voc = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl
    }
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp
    }
)[0]

# Per-token frame counts (>= 1 each). Shown for completeness; this
# simplified flow passes the encoder output straight through without
# explicit length regulation.
durations = np.maximum(dur_out.astype(int), 1)

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER → WAV
# ---------------------------------------------------------
wav = sess_voc.run(None,
                   {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```

---

# 🎧 Output

After running:

```
python offline_tts_int8.py
```

you will get:

```
output_int8.wav
```

which is playable offline on any system.

---

# 📝 Notes

* Only the **INT8** models are stable and recommended.
* The FP16 vocoder currently fails due to a type mismatch in a `Div` node.
* No internet connection is required for INT8 inference.
* These models are well suited to embedded or low-spec machines.

---

# 📄 License

The models follow Supertone's licensing terms. The quantized versions are covered by the same license.
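---

# 🔬 What Dynamic INT8 Quantization Does (sketch)

For readers curious what "INT8 dynamic quantization" means for the models above, here is a rough, self-contained NumPy illustration of a symmetric per-tensor INT8 round-trip. This is a conceptual sketch, not the actual ONNX Runtime quantizer; `quantize` and `dequantize` are illustrative helper names, not library functions:

```python
import numpy as np

def quantize(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from INT8 values."""
    return q.astype(np.float32) * scale

# A random tensor standing in for one float32 weight initializer.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)

q, scale = quantize(w)
w_hat = dequantize(q, scale)

# Storage drops 4x (int8 vs float32); rounding error stays within half a step.
assert q.nbytes * 4 == w.nbytes
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
print(f"max abs error: {np.max(np.abs(w - w_hat)):.6f} (scale={scale:.6f})")
```

In actual tooling, ONNX Runtime's `onnxruntime.quantization.quantize_dynamic` applies this idea to each weight tensor offline, while activation scales are computed on the fly at inference time — which is why no calibration dataset is needed.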