---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---
  
# Soprano: Instant, Ultra‑Realistic Text‑to‑Speech



---

## Overview

**Soprano** is an ultra‑lightweight, open‑source text‑to‑speech (TTS) model designed for real‑time, high‑fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy.

With only **80M parameters**, Soprano achieves a real‑time factor (RTF) of **~2000×**, generating **10 hours of audio in under 20 seconds**. Its **seamless streaming** technique brings streaming latency down to **<15 ms**, multiple orders of magnitude faster than existing TTS pipelines.
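As a rough sanity check, the throughput figure quoted above can be reproduced with simple arithmetic. The sketch below only uses the two numbers stated in the text; the variable names are illustrative. Since the generation time is given as an upper bound, the result is a lower bound on the speed-up, which is consistent with the rounded ~2000× figure.

```python
# Back-of-the-envelope check of the throughput claim quoted above.
audio_seconds = 10 * 60 * 60   # 10 hours of generated audio
wall_clock_seconds = 20        # upper bound on generation time ("under 20 seconds")

# Speed-up over real time: seconds of audio produced per second of compute.
speedup = audio_seconds / wall_clock_seconds
print(f"{speedup:.0f}x real time")  # 1800x real time
```

Note that some sources define RTF the other way round (processing time divided by audio duration, so lower is better); here it is used as a speed-up factor.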

This repository contains the **model weights** for Soprano. The LLM uses a standard Qwen3 architecture, and the decoder is a Vocos model fine-tuned on the output hidden states of the LLM.

GitHub: https://github.com/ekwek1/soprano

Model Demo: https://huggingface.co/spaces/ekwek/Soprano-TTS

---

## Installation

**Requirements**: Linux or Windows with a CUDA‑enabled GPU (CPU support coming soon).

### One‑line install

```bash
pip install soprano-tts
```

### Install from source

```bash
git clone https://github.com/ekwek1/soprano.git
cd soprano
pip install -e .
```

> **Note**: Soprano uses **LMDeploy** to accelerate inference by default. If LMDeploy cannot be installed in your environment, Soprano can fall back to the Hugging Face **transformers** backend (with slower performance). To enable this, pass `backend='transformers'` when creating the TTS model.
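One way to pick the backend at runtime is to check whether LMDeploy is importable before constructing the model. The following is a minimal sketch, not part of the Soprano API: `backend_kwargs` is a hypothetical helper, and it assumes the package is importable under the name `lmdeploy`. Only the `backend='transformers'` argument comes from the note above.

```python
import importlib.util


def backend_kwargs():
    """Return extra keyword arguments for the model constructor.

    Hypothetical helper: if LMDeploy is not importable, request the
    transformers backend; otherwise return no overrides so that the
    library's default backend is used.
    """
    if importlib.util.find_spec("lmdeploy") is None:
        return {"backend": "transformers"}
    return {}


kwargs = backend_kwargs()
print(kwargs)

# With soprano-tts installed, the model could then be created as:
#     from soprano import SopranoTTS
#     model = SopranoTTS(**kwargs)
```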

---

## Usage

```python
from soprano import SopranoTTS

model = SopranoTTS()
```

### Basic inference

```python
out = model.infer("Hello world!")
```

### Save output to a file

```python
out = model.infer("Hello world!", "out.wav")
```

### Custom sampling parameters

```python
out = model.infer(
    "Hello world!",
    temperature=0.3,
    top_p=0.95,
    repetition_penalty=1.2,
)
```

### Batched inference

```python
out = model.infer_batch(["Hello world!"] * 10)
```

#### Save batch outputs to a directory

```python
out = model.infer_batch(["Hello world!"] * 10, "/dir")
```

### Streaming inference

```python
import torch

stream = model.infer_stream("Hello world!", chunk_size=1)

# Audio chunks can be accessed via an iterator
chunks = []
for chunk in stream:
    chunks.append(chunk)

out = torch.cat(chunks)
```
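To see what streaming latency looks like on your own hardware, you can time how long the first chunk takes to arrive. The helper below is an illustrative sketch that works with any iterator; `time_to_first_chunk` and `fake_stream` are hypothetical names, and the stand-in generator exists only so the example runs without a GPU.

```python
import time


def time_to_first_chunk(stream):
    """Return (seconds until the first chunk arrived, first chunk, iterator over the rest)."""
    iterator = iter(stream)
    start = time.perf_counter()
    first = next(iterator)
    return time.perf_counter() - start, first, iterator


# Stand-in generator so the sketch runs anywhere; swap in
# model.infer_stream("Hello world!", chunk_size=1) to time the real model.
def fake_stream():
    for i in range(3):
        yield [float(i)]


latency, first, rest = time_to_first_chunk(fake_stream())
remaining = list(rest)
print(f"first chunk after {latency * 1000:.3f} ms, {len(remaining)} more chunks followed")
```

The timer starts once the stream object already exists, so any setup done inside the `infer_stream` call itself is not counted; start the timer before that call if you want to include it.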

---

## Key Features

### 1. High‑fidelity 32 kHz audio

Soprano synthesizes speech at **32 kHz**, delivering clarity that is perceptually indistinguishable from 44.1 kHz audio and significantly higher quality than the 24 kHz output used by many existing TTS models.

### 2. Vocos‑based neural decoder

Instead of slow diffusion decoders, Soprano uses a **Vocos‑based decoder**, enabling **orders‑of‑magnitude faster** waveform generation while maintaining comparable perceptual quality.

### 3. Seamless real‑time streaming


Soprano leverages the decoder’s finite receptive field to losslessly stream audio with **ultra‑low latency**. The streamed output is acoustically identical to offline synthesis, enabling interactive applications with sub‑frame delays.
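The idea behind lossless streaming with a finite receptive field can be illustrated with a toy example. In the sketch below, a small symmetric FIR filter stands in for the decoder (an assumption made purely for illustration; the actual Vocos decoder and Soprano's streaming code are far more involved). Because each output sample depends on only `r` input frames on either side, decoding chunk by chunk with `r` frames of context reproduces the offline output exactly.

```python
def decode(frames, weights):
    """Toy 'decoder': a symmetric FIR filter with zero padding at both ends."""
    r = len(weights) // 2
    padded = [0.0] * r + list(frames) + [0.0] * r
    return [
        sum(w * padded[t + k] for k, w in enumerate(weights))
        for t in range(len(frames))
    ]


def decode_streaming(frames, weights, chunk_size):
    """Decode chunk by chunk, giving each chunk r frames of context per side."""
    r = len(weights) // 2
    # The full input is padded up front for brevity; a real stream would
    # instead wait until r frames of lookahead have arrived for each chunk.
    padded = [0.0] * r + list(frames) + [0.0] * r
    out = []
    for start in range(0, len(frames), chunk_size):
        stop = min(start + chunk_size, len(frames))
        window = padded[start : stop + 2 * r]  # chunk plus r frames on each side
        out.extend(
            sum(w * window[t + k] for k, w in enumerate(weights))
            for t in range(stop - start)
        )
    return out


frames = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 3.0]
weights = [0.25, 0.5, 0.25]  # receptive field of 3 frames (r = 1)

offline = decode(frames, weights)
streamed = decode_streaming(frames, weights, chunk_size=2)
print(offline == streamed)
```

The only cost of streaming in this toy setting is the lookahead: a chunk cannot be emitted until the `r` frames after it are available.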

### 4. State‑of‑the‑art neural audio codec

Speech is represented using a **neural codec** that compresses audio to **~15 tokens/sec** at just **0.2 kbps**, allowing extremely fast generation and efficient memory usage without sacrificing quality.
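For a sense of scale, the two figures above imply a per-token bit budget and a token count for a given duration. This is simple arithmetic on the quoted (approximate) numbers, taking 1 kbps as 1000 bits per second; it says nothing about how the codec is actually implemented.

```python
bits_per_second = 200     # 0.2 kbps, with 1 kbps = 1000 bits/s
tokens_per_second = 15    # approximate token rate quoted above

# Average number of bits each token carries at this bitrate.
bits_per_token = bits_per_second / tokens_per_second
print(f"{bits_per_token:.1f} bits per token")  # 13.3 bits per token

# Number of tokens needed to represent one minute of speech.
tokens_per_minute = tokens_per_second * 60
print(f"{tokens_per_minute} tokens per minute")  # 900 tokens per minute
```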

### 5. Sentence‑level streaming for infinite context

Each sentence is generated independently, enabling **effectively infinite generation length** while maintaining stability and real‑time performance for long‑form generation.
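To illustrate what sentence-level processing means in practice, the sketch below splits a passage into sentences with a naive regular expression. This is an assumption-laden illustration, not Soprano's own splitting logic: `split_sentences` is a hypothetical helper, and the rule it uses (break after `.`, `!`, or `?` followed by whitespace) will mishandle abbreviations and similar cases.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


text = "Hello world! This is a long passage. Each sentence is synthesized on its own."
sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

Because no sentence depends on the audio generated for another, the work per sentence stays bounded no matter how long the full text is.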

---

## License

This project is licensed under the **Apache-2.0** license.