---
license: other
base_model: BigBlueCeiling/MisoTTS-bf16
tags:
- text-to-speech
- quantized
- torchao
---

# MisoTTS int4 (BigBlueCeiling)

A weight-only **int4** quantization of
[BigBlueCeiling/MisoTTS-bf16](https://huggingface.co/BigBlueCeiling/MisoTTS-bf16),
produced with torchao (`int4_weight_only`). Only the `Linear` layers of the
backbone and decoder are quantized; the embeddings, output heads, and projection
stay in bf16.

> **Experimental, and lower quality.** Use int8 or bf16 if your card has the VRAM for them. The weights in this file were packed in tinygemm layout on an Ampere (sm_86) GPU; if the file does not load on your architecture, the serving core falls back to quantizing the bf16 weights at load time.

## What it is for

Lowering the hardware floor. Quantization here is a **memory** lever, not a speed
one: MisoTTS decodes one frame at a time, and those tiny per-step matmuls cannot
keep the GPU's low-precision tensor cores fed, so the int4 weights are
dequantized to bf16 for each matmul. You get the VRAM saving, not a throughput win.

- **Fits:** ~12 GB VRAM cards (RTX 3060 12G, 4070, ...)
- **Quality:** Noticeably degraded: mean CER 0.18, WER 0.26, UTMOS 2.93 (bf16 reaches UTMOS 3.94). Long utterances fare worst, with long-clip CER up to ~0.5. Acceptable only as a last-resort "runs at all" tier.
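
To make the "memory lever" point concrete, here is a rough back-of-the-envelope sketch of weight storage at different bit widths. Everything numeric in it is an assumption for illustration: the parameter count is a placeholder (this card does not state how many parameters sit in the quantized `Linear` layers), and the group size of 128 with one bf16 scale and one bf16 zero point per group is assumed rather than taken from this checkpoint.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for n_params at a given bit width."""
    return n_params * bits_per_weight / 8 / 2**30


def int4_effective_bits(group_size: int = 128,
                        scale_bits: int = 16,
                        zero_bits: int = 16) -> float:
    """Bits per weight for grouped int4: 4 payload bits plus the per-group
    scale and zero point amortized over the group (assumed layout)."""
    return 4 + (scale_bits + zero_bits) / group_size


# Hypothetical count of quantized Linear parameters; NOT a figure from this card.
N_QUANTIZED_PARAMS = 8e9

bf16_gib = weight_gib(N_QUANTIZED_PARAMS, 16)
int4_gib = weight_gib(N_QUANTIZED_PARAMS, int4_effective_bits())

print(f"bf16 Linear weights: {bf16_gib:.1f} GiB")
print(f"int4 Linear weights: {int4_gib:.1f} GiB")
print(f"reduction factor:    {bf16_gib / int4_gib:.2f}x")
```

The estimate covers only the quantized layers. The embeddings, output heads, and projection remain in bf16, and activations and caches add to the total, so real VRAM use will be higher than the int4 figure alone.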

## Use

This checkpoint is a `torch.save`'d torchao state_dict (`model.pt`). The serving
core in the [MisoTTS repo](https://github.com/eoffermann/MisoTTS) pulls it
automatically when GPU-sense detects a matching VRAM tier. To load it directly:

```python
from generator import load_miso_8b  # from the MisoTTS repo
gen = load_miso_8b("cuda", model_path_or_repo_id="BigBlueCeiling/MisoTTS-int4",
                   prequantized=True)
```

Requires `torch>=2.7` and a matching torchao. Loading unpickles a torchao tensor
subclass, so it runs with `weights_only=False`, which can execute arbitrary code
from the file; load only checkpoints you trust.
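
A quick way to check the version requirement before loading is sketched below. It uses only the standard library; the helper names are made up for this example and are not part of the MisoTTS repo. It reports the installed torchao version without judging it, since this card does not say which torchao release counts as "matching".

```python
from __future__ import annotations

import re
from importlib import metadata


def parse_release(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.7.1+cu126' -> (2, 7, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version: str, minimum: tuple[int, ...] = (2, 7)) -> bool:
    """True if the numeric release is at least `minimum`.
    Pre-release and local suffixes (dev, rc, +cu126) are ignored."""
    return parse_release(version) >= minimum


def installed_version(package: str) -> str | None:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("torch", "torchao"):
    found = installed_version(name)
    if found is None:
        print(f"{name}: not installed")
    elif name == "torch":
        status = "ok" if meets_minimum(found) else "too old, need >=2.7"
        print(f"{name}: {found} ({status})")
    else:
        print(f"{name}: {found}")
```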

Model and original inference code are MisoLabs' work; see the upstream license.