---
base_model:
- Supertone/supertonic
pipeline_tag: text-to-speech
---

# Supertonic Quantized INT8 — Offline TTS (Shadow0482)

This repository contains **INT8-quantized ONNX models** for the Supertonic text-to-speech
pipeline. They are quantized versions of the official Supertonic models, intended for
**offline, low-latency, CPU-friendly inference**.

FP16 versions are included for experimentation, but the FP16 vocoder has a type mismatch
(`float32` vs `float16`) in a `Div` node and currently fails to load, so FP16 inference is **not stable**.  
**INT8 is therefore the recommended format** for real-world offline use.

---

# 🚀 Features

### ✔ 100% Offline Execution  
No network needed. Load ONNX models directly using ONNX Runtime.

### ✔ Full Supertonic Inference Stack  
- Text Encoder  
- Duration Predictor  
- Vector Estimator  
- Vocoder  

### ✔ INT8 Dynamic Quantization  
- Reduces model sizes dramatically  
- CPU-friendly inference  
- Very low memory usage  
- Compatible with ONNX Runtime CPUExecutionProvider  
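
For context, dynamic INT8 quantization of this kind can be reproduced with ONNX Runtime's `quantize_dynamic`. The sketch below is an assumption about the workflow, not the exact script used to build this repository; the FP32 source directory and the helper names `int8_output_path` and `quantize_directory` are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def int8_output_path(fp32_path: Path, out_dir: Path) -> Path:
    """Map e.g. fp32/vocoder.onnx to int8_dynamic/vocoder.int8.onnx."""
    return out_dir / f"{fp32_path.stem}.int8.onnx"


def quantize_directory(src_dir: Path, out_dir: Path) -> list[Path]:
    """Dynamically quantize every *.onnx file in src_dir to INT8 weights."""
    # Imported lazily so the path helper works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for fp32_path in sorted(src_dir.glob("*.onnx")):
        target = int8_output_path(fp32_path, out_dir)
        # Weights are stored as INT8; activations are quantized at run time.
        quantize_dynamic(str(fp32_path), str(target), weight_type=QuantType.QInt8)
        written.append(target)
    return written
```

A call such as `quantize_directory(Path("fp32"), Path("int8_dynamic"))` would produce the file layout described in the next section, assuming the FP32 exports live in a local `fp32/` folder.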

### ✔ Intelligible Speech Output  
Produces understandable speech while running roughly 2–4× faster than the FP32 baseline on CPU.

---

# 📦 Repository Structure

```
int8_dynamic/
    duration_predictor.int8.onnx
    text_encoder.int8.onnx
    vector_estimator.int8.onnx
    vocoder.int8.onnx

fp16/
    (experimental FP16 models; vocoder currently unstable)
```

Only the **INT8 directory** is guaranteed stable.

---

# 🔊 Test Sentence Used in Benchmark

```

Greetings! You are listening to your newly quantized model.
I have been squished, squeezed, compressed, minimized, optimized,
digitized, and lightly traumatized to save disk space.
The testing framework automatically verifies my integrity,
measures how much weight I lost,
and checks if I can still talk without glitching into a robot dolphin.
If you can hear this clearly, the quantization ritual was a complete success.

```

---

# 📈 Benchmark Summary (CPU)

| Model | Precision | Synthesis time | Output | Status |
|-------|-----------|---------------:|--------|--------|
| INT8 Dynamic | int8 | ~3.0–7.0 s (varies) | `*.wav` | ✅ OK |
| FP32 (baseline) | float32 | ~2–4× slower than INT8 | `*.wav` | ✅ OK |
| FP16 | mixed | n/a (❌ failed) | none | 🚫 Cannot load vocoder |
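
Timings like these depend heavily on the CPU and on input length. A minimal way to reproduce them, assuming you wrap your own synthesis call in a function, is to time it with `time.perf_counter` and compare on-disk model sizes. The helpers `time_call` and `total_size_mb` are hypothetical names for illustration, not part of this repository.

```python
import time
from pathlib import Path


def time_call(fn, *args, repeats: int = 3):
    """Run fn(*args) several times; return (last_result, best_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def total_size_mb(model_dir: Path) -> float:
    """Sum the sizes of all *.onnx files in model_dir, in megabytes."""
    return sum(p.stat().st_size for p in model_dir.glob("*.onnx")) / 1e6
```

Taking the best of several runs reduces noise from one-off costs such as cold caches on the first call.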

---

# 🖥️ Offline Inference Guide (Python)

Below is a clean Python script to run **fully offline INT8 inference**.

---

# 🧩 Requirements

```
pip install onnxruntime numpy soundfile
```

---

# 📜 offline_tts_int8.py

```python
import onnxruntime as ort
import numpy as np
import json
import soundfile as sf
from pathlib import Path

# ---------------------------------------------------------
# 1) CONFIG
# ---------------------------------------------------------
MODEL_DIR = Path("int8_dynamic")   # folder containing *.int8.onnx
# Note: this script does not read VOICE_STYLE; the style inputs below are zero placeholders.
VOICE_STYLE = "assets/voice_styles/M1.json"

text_encoder_path      = MODEL_DIR / "text_encoder.int8.onnx"
duration_pred_path     = MODEL_DIR / "duration_predictor.int8.onnx"
vector_estimator_path  = MODEL_DIR / "vector_estimator.int8.onnx"
vocoder_path           = MODEL_DIR / "vocoder.int8.onnx"

TEST_TEXT = (
    "Hello! This is the INT8 offline version of Supertonic speaking. "
    "Everything you hear right now is running fully offline."
)

# ---------------------------------------------------------
# 2) TOKENIZER LOADING
# ---------------------------------------------------------
unicode_path = Path("assets/onnx/unicode_indexer.json")
with open(unicode_path, encoding="utf-8") as f:
    tokenizer = json.load(f)

def encode_text(text: str) -> np.ndarray:
    """Map each character to its token id, falling back to <unk>."""
    token2idx = tokenizer["token2idx"]
    unk_id = token2idx["<unk>"]
    ids = [token2idx.get(ch, unk_id) for ch in text]
    return np.array([ids], dtype=np.int64)

# ---------------------------------------------------------
# 3) LOAD MODELS (CPU)
# ---------------------------------------------------------
def load_session(model_path):
    return ort.InferenceSession(
        str(model_path),
        providers=["CPUExecutionProvider"]
    )

sess_text = load_session(text_encoder_path)
sess_dur  = load_session(duration_pred_path)
sess_vec  = load_session(vector_estimator_path)
sess_voc  = load_session(vocoder_path)

# ---------------------------------------------------------
# 4) RUN TEXT ENCODER
# ---------------------------------------------------------
text_ids = encode_text(TEST_TEXT)
text_mask = np.ones((1, 1, text_ids.shape[1]), dtype=np.float32)
style_ttl = np.zeros((1, 50, 256), dtype=np.float32)

text_out = sess_text.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_ttl": style_ttl
    }
)[0]

# ---------------------------------------------------------
# 5) RUN DURATION PREDICTOR
# ---------------------------------------------------------
style_dp = np.zeros((1, 8, 16), dtype=np.float32)

dur_out = sess_dur.run(
    None,
    {
        "text_ids": text_ids,
        "text_mask": text_mask,
        "style_dp": style_dp
    }
)[0]

# Clamp each predicted duration to at least 1.
# Note: the steps below do not consume `durations`.
durations = np.maximum(dur_out.astype(int), 1)

# ---------------------------------------------------------
# 6) VECTOR ESTIMATOR
# ---------------------------------------------------------
# The input name "latent" must match the exported graph; verify with sess_vec.get_inputs().
latent = sess_vec.run(None, {"latent": text_out})[0]

# ---------------------------------------------------------
# 7) VOCODER → WAV
# ---------------------------------------------------------
wav = sess_voc.run(None, {"latent": latent})[0][0]

sf.write("output_int8.wav", wav, 24000)
print("Saved: output_int8.wav")
```
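
The input names passed to each `run` call in the script (`text_ids`, `text_mask`, `latent`, and so on) have to match the names baked into the exported graphs, and a mismatch raises an error at run time. A minimal sketch for checking them uses ONNX Runtime's `get_inputs()` and `get_outputs()`; the helper names `describe_io` and `print_model_io` are hypothetical.

```python
def describe_io(session):
    """Return {"inputs": [...], "outputs": [...]} as (name, type, shape) tuples.

    Works with any object exposing get_inputs()/get_outputs(), such as an
    onnxruntime.InferenceSession.
    """
    def summarize(args):
        return [(a.name, a.type, list(a.shape)) for a in args]

    return {
        "inputs": summarize(session.get_inputs()),
        "outputs": summarize(session.get_outputs()),
    }


def print_model_io(model_dir="int8_dynamic"):
    """Print the expected inputs and outputs of every *.onnx file in model_dir."""
    from pathlib import Path
    import onnxruntime as ort  # imported lazily; only needed for real models

    for path in sorted(Path(model_dir).glob("*.onnx")):
        sess = ort.InferenceSession(str(path), providers=["CPUExecutionProvider"])
        print(path.name, describe_io(sess))
```

Running `print_model_io()` from the repository root lists what each of the four models expects, which is a quick way to debug shape or name errors before touching the inference code.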

---

# 🎧 Output

After running:

```
python offline_tts_int8.py
```

You will get:

```
output_int8.wav
```

Playable offline on any system.
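
To confirm the file was written correctly without opening an audio player, you can read its header with Python's standard `wave` module. This sketch assumes the file is integer PCM (the `wave` module cannot read floating-point WAV files); the helper name `wav_summary` is hypothetical.

```python
import wave


def wav_summary(path):
    """Return (sample_rate_hz, n_channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate
```

For example, `wav_summary("output_int8.wav")` reports the sample rate and duration, so an empty or truncated file shows up as a duration near zero.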

---

# 📝 Notes

* Only the **INT8** models are stable & recommended.
* FP16 vocoder currently fails due to a type mismatch in a `Div` node.
* No internet connection is required for INT8 inference.
* These models are ideal for embedded or low-spec machines.

---

# 📄 License

These quantized models are derived from Supertone's models and follow the same licensing terms.