---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---
# Soprano: Instant, Ultra‑Realistic Text‑to‑Speech
---
## Overview
**Soprano** is an ultra‑lightweight, open‑source text‑to‑speech (TTS) model designed for real‑time, high‑fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy.
With only **80M parameters**, Soprano generates audio roughly **2000× faster than real time** (its quoted real‑time factor, RTF), producing **10 hours of audio in under 20 seconds**. Soprano also uses a **seamless streaming** technique that delivers real‑time synthesis with latency under **15 ms**, multiple orders of magnitude faster than existing TTS pipelines.
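
As a quick sanity check, the headline figures can be related with simple arithmetic. The sketch below only restates the quoted values (10 hours of audio, 20 seconds of compute); it does not measure anything, and the variable names are illustrative.

```python
# Relate the quoted throughput figures; both values come from the text above.
audio_seconds = 10 * 60 * 60   # 10 hours of generated audio
compute_seconds = 20           # quoted upper bound on generation time

# Speed-up over real time: seconds of audio produced per second of compute.
speedup = audio_seconds / compute_seconds
print(f"{speedup:.0f}x faster than real time")  # 1800x, in line with the ~2000x figure

# The conventional RTF (compute time / audio duration) is the reciprocal.
rtf = compute_seconds / audio_seconds
print(f"conventional RTF: {rtf:.5f}")
```

Because the 20 seconds is an upper bound, 1800× is a lower bound on the implied speed-up.
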
This repository contains the **model weights** for Soprano. The LLM uses a standard Qwen3 architecture, and the decoder is a Vocos model fine-tuned on the LLM's output hidden states.
GitHub: https://github.com/ekwek1/soprano
Model Demo: https://huggingface.co/spaces/ekwek/Soprano-TTS
---
## Installation
**Requirements**: Linux or Windows with a CUDA‑enabled GPU (CPU support coming soon).
### One‑line install
```bash
pip install soprano-tts
```
### Install from source
```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
```
> **Note**: Soprano uses **LMDeploy** to accelerate inference by default. If LMDeploy cannot be installed in your environment, Soprano can fall back to the HuggingFace **transformers** backend (with slower performance). To enable this, pass `backend='transformers'` when creating the TTS model.
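
One way to apply this fallback automatically is to check whether LMDeploy is importable before constructing the model. This is a minimal sketch: the helper `backend_kwargs` is hypothetical, it assumes LMDeploy is installed under the import name `lmdeploy`, and only the `backend='transformers'` argument comes from the note above. The model construction is left as a comment so the snippet runs without a GPU.

```python
import importlib.util

def backend_kwargs():
    """Return constructor kwargs: empty if LMDeploy is importable, else the fallback."""
    # Assumption: LMDeploy installs under the import name "lmdeploy".
    if importlib.util.find_spec("lmdeploy") is not None:
        return {}  # keep Soprano's default backend
    return {"backend": "transformers"}  # slower fallback described in the note

kwargs = backend_kwargs()
print(kwargs)

# Usage (requires soprano-tts and a CUDA-enabled GPU):
# from soprano import SopranoTTS
# model = SopranoTTS(**kwargs)
```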
---
## Usage
```python
from soprano import SopranoTTS
model = SopranoTTS()
```
### Basic inference
```python
out = model.infer("Hello world!")
```
### Save output to a file
```python
out = model.infer("Hello world!", "out.wav")
```
### Custom sampling parameters
```python
out = model.infer(
    "Hello world!",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)
```
### Batched inference
```python
out = model.infer_batch(["Hello world!"] * 10)
```
#### Save batch outputs to a directory
```python
out = model.infer_batch(["Hello world!"] * 10, "/dir")
```
### Streaming inference
```python
import torch

stream = model.infer_stream("Hello world!", chunk_size=1)

# Audio chunks can be accessed via an iterator
chunks = []
for chunk in stream:
    chunks.append(chunk)

# Concatenate the chunks into a single waveform tensor
out = torch.cat(chunks)
```
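
To check streaming latency on your own hardware, you can time how long the iterator takes to yield its first chunk. The helper below is a hypothetical sketch that works with any Python iterator; a dummy generator stands in for `model.infer_stream(...)` so the snippet runs without a GPU. If the real call does work before returning its iterator, start the timer before that call instead.

```python
import time

def time_to_first_chunk(stream):
    """Return (latency_seconds, first_chunk, iterator over the remaining chunks)."""
    it = iter(stream)
    start = time.perf_counter()
    first = next(it)  # blocks until the first chunk is produced
    return time.perf_counter() - start, first, it

def dummy_stream():
    # Stand-in for model.infer_stream("Hello world!", chunk_size=1).
    for i in range(3):
        time.sleep(0.01)  # pretend each chunk takes about 10 ms to generate
        yield [float(i)]

latency, first, rest = time_to_first_chunk(dummy_stream())
remaining = list(rest)
print(f"first chunk after {latency * 1000:.1f} ms, {len(remaining)} more chunks")
```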
---
## Key Features
### 1. High‑fidelity 32 kHz audio
Soprano synthesizes speech at **32 kHz**, delivering clarity that is perceptually indistinguishable from 44.1 kHz audio and significantly higher quality than the 24 kHz output used by many existing TTS models.
### 2. Vocos‑based neural decoder
Instead of slow diffusion decoders, Soprano uses a **Vocos‑based decoder**, enabling **orders‑of‑magnitude faster** waveform generation while maintaining comparable perceptual quality.
### 3. Seamless real‑time streaming
Soprano leverages the decoder’s finite receptive field to losslessly stream audio with **ultra‑low latency**. The streamed output is acoustically identical to offline synthesis, enabling interactive applications with sub‑frame delays.
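
The idea that a finite receptive field permits lossless streaming can be illustrated with a toy example. The sketch below is not Soprano's decoder: it uses a small causal FIR filter as a stand-in, and the names `fir` and `stream_fir` are hypothetical. Because each output depends only on the last few inputs, carrying that much history between chunks reproduces the offline result exactly.

```python
def fir(x, taps, history):
    """Causal FIR filter: each output sees only the last len(taps) inputs."""
    k = len(taps)
    padded = history + x  # history must hold the k - 1 samples preceding x
    return [
        sum(taps[j] * padded[n + k - 1 - j] for j in range(k))
        for n in range(len(x))
    ]

def stream_fir(x, taps, chunk_size):
    """Filter x chunk by chunk, carrying over only the required context."""
    history = [0.0] * (len(taps) - 1)
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        yield fir(chunk, taps, history)
        # Keep the most recent len(taps) - 1 samples for the next chunk.
        history = (history + chunk)[len(chunk):]

signal = [float(i % 5) for i in range(10)]
taps = [0.5, 0.3, 0.2]

offline = fir(signal, taps, [0.0] * (len(taps) - 1))
streamed = [y for chunk in stream_fir(signal, taps, chunk_size=4) for y in chunk]
print(streamed == offline)  # True: chunked output matches offline output
```

A decoder whose receptive field also looks ahead would additionally need to wait for a few future frames before emitting each chunk; that case is not covered here.
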
### 4. State‑of‑the‑art neural audio codec
Speech is represented using a **neural codec** that compresses audio to **~15 tokens/sec** at just **0.2 kbps**, allowing extremely fast generation and efficient memory usage without sacrificing quality.
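
The two quoted figures imply a per-token bit budget, which the following arithmetic sketch works out. It assumes 1 kbps = 1000 bits per second; the implied codebook size is only a derived estimate, not a documented property of the codec.

```python
bitrate_bps = 200        # 0.2 kbps, as quoted (assuming 1 kbps = 1000 bit/s)
tokens_per_second = 15   # ~15 tokens/sec, as quoted

# Average number of bits each token can carry at this bitrate.
bits_per_token = bitrate_bps / tokens_per_second
print(f"{bits_per_token:.1f} bits per token")  # 13.3

# Rough number of distinct token values that budget could address (estimate only).
implied_codebook_size = 2 ** bits_per_token
print(f"about {implied_codebook_size:,.0f} distinct token values")

# Tokens the LLM must generate for one minute of speech.
tokens_per_minute = tokens_per_second * 60
print(tokens_per_minute, "tokens per minute")  # 900
```
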
### 5. Sentence‑level streaming for infinite context
Each sentence is generated independently, enabling **effectively infinite generation length** while maintaining stability and real‑time performance for long‑form generation.
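
Sentence-level generation can be sketched as splitting the input text and synthesizing each piece on its own. This card does not show Soprano's actual splitting logic, so the regex splitter and the names `split_sentences`, `synthesize_long`, and `fake_synthesize` below are hypothetical; the stub stands in for a real call such as `model.infer`.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def synthesize_long(text, synthesize):
    """Generate each sentence independently and yield the results in order."""
    for sentence in split_sentences(text):
        yield synthesize(sentence)

def fake_synthesize(sentence):
    # Stand-in for a real TTS call; returns the sentence length instead of audio.
    return len(sentence)

text = "Soprano is fast. Is it small? Yes! It has 80M parameters."
sentences = split_sentences(text)
print(sentences)
print(list(synthesize_long(text, fake_synthesize)))
```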
---
## License
This project is licensed under the **Apache-2.0** license.