---
license: other
base_model: BigBlueCeiling/MisoTTS-bf16
tags:
- text-to-speech
- quantized
- torchao
---

# MisoTTS int4 (BigBlueCeiling)

A weight-only **int4** quantization of [BigBlueCeiling/MisoTTS-bf16](https://huggingface.co/BigBlueCeiling/MisoTTS-bf16), produced with torchao (`int4_weight_only`). Only the backbone/decoder Linear layers are quantized; the embeddings, output heads, and projection stay bf16.

> EXPERIMENTAL and lower quality. Use int8 or bf16 if your card fits. This file is tinygemm-packed on an Ampere (sm_86) GPU; the serving core falls back to quantizing the bf16 weights at load if it does not load on your architecture.

## What it is for

Lowering the hardware floor. Quantization here is a **memory** lever, not a speed one: MisoTTS decodes one frame at a time, and those tiny per-step matmuls cannot feed the GPU's low-precision tensor cores, so int4 dequantizes to bf16 for the matmul. You get the VRAM saving, not a throughput win.

- **Fits:** ~12 GB VRAM cards (RTX 3060 12G, 4070, ...)
- **Quality:** Noticeably degraded: mean CER 0.18, WER 0.26, UTMOS 2.93 (vs bf16 UTMOS 3.94). Worst on long utterances (long-clip CER up to ~0.5). Acceptable only as a last-resort 'runs at all' tier.

## Use

This checkpoint is a `torch.save`'d torchao state_dict (`model.pt`). The serving core in the [MisoTTS repo](https://github.com/eoffermann/MisoTTS) pulls it automatically when GPU-sense detects a matching VRAM tier. To load it directly:

```python
from generator import load_miso_8b  # from the MisoTTS repo
gen = load_miso_8b("cuda", model_path_or_repo_id="BigBlueCeiling/MisoTTS-int4", prequantized=True)
```

Requires torch>=2.7 and a matching torchao (loading unpickles a torchao tensor subclass, so `weights_only=False` is used; load only checkpoints you trust).

Model and original inference code are MisoLabs' work; see the upstream license.
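
## Rough memory arithmetic

To make the "memory lever" point under *What it is for* concrete, here is a minimal back-of-the-envelope sketch of weight storage at bf16 versus grouped int4. It is not taken from the MisoTTS code. The parameter count (8 billion, suggested only by the name `load_miso_8b`), the group size of 128, and the 4 bytes of per-group metadata (one bf16 scale plus one bf16 zero point) are all assumptions for illustration. The helper `weight_bytes` is hypothetical.

```python
from typing import Optional


def weight_bytes(n_params: int, bits: int,
                 group_size: Optional[int] = None,
                 meta_bytes_per_group: int = 4) -> int:
    """Approximate bytes needed to store n_params weights at `bits` bits each.

    With group_size set, adds per-group metadata (assumed: bf16 scale +
    bf16 zero point = 4 bytes per group).
    """
    payload = n_params * bits // 8
    if group_size is None:
        return payload
    n_groups = -(-n_params // group_size)  # ceiling division
    return payload + n_groups * meta_bytes_per_group


# ASSUMPTION: 8e9 quantizable parameters; the real count is not stated here.
N_PARAMS = 8_000_000_000

bf16_bytes = weight_bytes(N_PARAMS, bits=16)
int4_bytes = weight_bytes(N_PARAMS, bits=4, group_size=128)

print(f"bf16 weights: {bf16_bytes / 2**30:.2f} GiB")
print(f"int4 weights: {int4_bytes / 2**30:.2f} GiB")
```

Real usage will be higher than the int4 figure: the embeddings, output heads, and projection stay bf16, and activations, caches, and the audio codec also need VRAM.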