---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---
# Soprano: Instant, Ultra‑Realistic Text‑to‑Speech
<!-- Embedded demo video (placeholder) -->
<p align="center">
</p>
---
## Overview
**Soprano** is an ultra‑lightweight, open‑source text‑to‑speech (TTS) model designed for real‑time, high‑fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy.
With only **80M parameters**, Soprano achieves a real‑time factor (RTF) of **~2000×**, capable of generating **10 hours of audio in under 20 seconds**. Soprano uses a **seamless streaming** technique that enables true real‑time synthesis with first‑audio latency under **15 ms**, orders of magnitude faster than existing TTS pipelines.
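As a quick sanity check on these figures (pure arithmetic, not a benchmark): 10 hours of audio generated in 20 seconds corresponds to a real‑time factor of 36000 / 20 = 1800, in line with the quoted ~2000×.

```python
# Sanity-check the quoted throughput: 10 hours of audio in 20 seconds.
audio_seconds = 10 * 3600       # 10 h of generated audio
wall_clock_seconds = 20         # generation time
rtf = audio_seconds / wall_clock_seconds
print(rtf)  # 1800.0 -- consistent with the ~2000x real-time factor
```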
This repository contains the **model weights** for Soprano. The LLM backbone uses a standard Qwen3 architecture, and the decoder is a Vocos model fine-tuned on the LLM's output hidden states.
GitHub: https://github.com/ekwek1/soprano
Model Demo: https://huggingface.co/spaces/ekwek/Soprano-TTS
---
## Installation
**Requirements**: Linux or Windows, CUDA‑enabled GPU required (CPU support coming soon).
### One‑line install
```bash
pip install soprano-tts
```
### Install from source
```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
```
> **Note**: Soprano uses **LMDeploy** to accelerate inference by default. If LMDeploy cannot be installed in your environment, Soprano can fall back to the HuggingFace **transformers** backend (with slower performance). To enable this, pass `backend='transformers'` when creating the TTS model.
---
## Usage
```python
from soprano import SopranoTTS
model = SopranoTTS()
```
### Basic inference
```python
out = model.infer("Hello world!")
```
### Save output to a file
```python
out = model.infer("Hello world!", "out.wav")
```
### Custom sampling parameters
```python
out = model.infer(
    "Hello world!",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)
```
### Batched inference
```python
out = model.infer_batch(["Hello world!"] * 10)
```
#### Save batch outputs to a directory
```python
out = model.infer_batch(["Hello world!"] * 10, "/dir")
```
### Streaming inference
```python
import torch
stream = model.infer_stream("Hello world!", chunk_size=1)
# Audio chunks can be accessed via an iterator
chunks = []
for chunk in stream:
    chunks.append(chunk)
out = torch.cat(chunks)
```
---
## Key Features
### 1. High‑fidelity 32 kHz audio
Soprano synthesizes speech at **32 kHz**, delivering clarity that is perceptually indistinguishable from 44.1 kHz audio and significantly higher quality than the 24 kHz output used by many existing TTS models.
### 2. Vocos‑based neural decoder
Instead of slow diffusion decoders, Soprano uses a **Vocos‑based decoder**, enabling **orders‑of‑magnitude faster** waveform generation while maintaining comparable perceptual quality.
### 3. Seamless real‑time streaming
Soprano leverages the decoder’s finite receptive field to losslessly stream audio with **ultra‑low latency**. The streamed output is acoustically identical to offline synthesis, enabling interactive applications with sub‑frame delays.
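The idea behind lossless streaming with a finite receptive field can be illustrated with a toy model (this is an illustration of the principle, not Soprano's actual decoder): if the output at step *t* depends on at most *R* past inputs, feeding each chunk with *R − 1* frames of left context and discarding the warm‑up region reproduces the offline output exactly.

```python
R = 3  # receptive field of the toy "decoder"

def decode(frames):
    # Toy causal decoder: moving average over the last R frames.
    return [sum(frames[max(0, i - R + 1): i + 1]) / min(i + 1, R)
            for i in range(len(frames))]

def decode_streaming(frames, chunk=4):
    # Decode chunk by chunk, prepending R-1 frames of left context
    # and dropping the context warm-up from each chunk's output.
    out = []
    for start in range(0, len(frames), chunk):
        ctx = max(0, start - (R - 1))
        piece = decode(frames[ctx:start + chunk])
        out.extend(piece[start - ctx:])
    return out

frames = list(range(10))
assert decode_streaming(frames) == decode(frames)  # streamed == offline
```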
### 4. State‑of‑the‑art neural audio codec
Speech is represented using a **neural codec** that compresses audio to **~15 tokens/sec** at just **0.2 kbps**, allowing extremely fast generation and efficient memory usage without sacrificing quality.
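A back‑of‑the‑envelope check of these codec numbers (a sketch, not the actual codec — the implied codebook size is an inference, not stated by the authors): at ~15 tokens/sec and 0.2 kbps, each token carries roughly 200 / 15 ≈ 13.3 bits, which would correspond to a codebook of about 2¹³ = 8192 entries.

```python
# Relate the stated token rate and bitrate to bits per token.
tokens_per_sec = 15
bits_per_sec = 200          # 0.2 kbps
bits_per_token = bits_per_sec / tokens_per_sec
approx_codebook = 2 ** round(bits_per_token)
print(bits_per_token, approx_codebook)  # ~13.3 bits/token, ~8192 entries
```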
### 5. Sentence‑level streaming for infinite context
Each sentence is generated independently, enabling **effectively infinite generation length** while maintaining stability and real‑time performance for long‑form generation.
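A minimal sketch of the sentence‑level idea (assuming a naive regex splitter, not Soprano's actual segmentation): split the input into sentences, synthesize each independently, and concatenate the audio, so total output length is not bounded by the model's context window.

```python
import re

def split_sentences(text):
    # Naive splitter: break on ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sentences = split_sentences("First sentence. Second one! A third?")
# Each sentence would then be synthesized on its own and the resulting
# audio chunks concatenated in order.
```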
---
## License
This project is licensed under the **Apache-2.0** license.