|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-to-speech |
|
|
--- |
|
|
|
|
|
# Soprano: Instant, Ultra‑Realistic Text‑to‑Speech |
|
|
|
|
|
|
|
|
<!-- Embedded demo video (placeholder) --> |
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## Overview |
|
|
|
|
|
**Soprano** is an ultra‑lightweight, open‑source text‑to‑speech (TTS) model designed for real‑time, high‑fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy. |
|
|
|
|
|
With only **80M parameters**, Soprano achieves a real‑time factor (RTF) of **~2000×**, generating **10 hours of audio in under 20 seconds**. Its **seamless streaming** technique delivers the first audio in **under 15 ms**, multiple orders of magnitude lower latency than existing TTS pipelines.
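
A quick back‑of‑the‑envelope check of how those figures fit together (simple arithmetic on the numbers above, not a benchmark):

```python
# 10 hours of audio generated in roughly 20 seconds of wall-clock time.
audio_seconds = 10 * 60 * 60      # 36,000 s of audio
wall_clock_seconds = 20
rtf = audio_seconds / wall_clock_seconds
print(rtf)  # 1800.0 -> on the order of ~2000x real time
```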
|
|
|
|
|
This repository contains the **model weights** for Soprano. The LLM uses a standard Qwen3 architecture, and the decoder is a Vocos model fine-tuned on the LLM's output hidden states.
|
|
|
|
|
GitHub: https://github.com/ekwek1/soprano
|
|
|
|
|
Model Demo: https://huggingface.co/spaces/ekwek/Soprano-TTS |
|
|
|
|
|
--- |
|
|
|
|
|
## Installation |
|
|
|
|
|
**Requirements**: Linux or Windows with a CUDA‑enabled GPU (CPU support coming soon).
|
|
|
|
|
### One‑line install |
|
|
|
|
|
```bash |
|
|
pip install soprano-tts |
|
|
``` |
|
|
|
|
|
### Install from source |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/ekwek1/soprano.git |
|
|
cd soprano |
|
|
pip install -e . |
|
|
``` |
|
|
|
|
|
> **Note**: Soprano uses **LMDeploy** to accelerate inference by default. If LMDeploy cannot be installed in your environment, Soprano can fall back to the HuggingFace **transformers** backend (with slower performance). To enable this, pass `backend='transformers'` when creating the TTS model. |
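
For example, following the note above, the fallback backend is selected when constructing the model:

```python
from soprano import SopranoTTS

# Fall back to the HuggingFace transformers backend when LMDeploy is unavailable.
model = SopranoTTS(backend="transformers")
```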
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from soprano import SopranoTTS |
|
|
|
|
|
model = SopranoTTS() |
|
|
``` |
|
|
|
|
|
### Basic inference |
|
|
|
|
|
```python |
|
|
out = model.infer("Hello world!") |
|
|
``` |
|
|
|
|
|
### Save output to a file |
|
|
|
|
|
```python |
|
|
out = model.infer("Hello world!", "out.wav") |
|
|
``` |
|
|
|
|
|
### Custom sampling parameters |
|
|
|
|
|
```python |
|
|
out = model.infer( |
|
|
"Hello world!", |
|
|
temperature=0.3, |
|
|
top_p=0.95, |
|
|
repetition_penalty=1.2, |
|
|
) |
|
|
``` |
|
|
|
|
|
### Batched inference |
|
|
|
|
|
```python |
|
|
out = model.infer_batch(["Hello world!"] * 10) |
|
|
``` |
|
|
|
|
|
#### Save batch outputs to a directory |
|
|
|
|
|
```python |
|
|
out = model.infer_batch(["Hello world!"] * 10, "/dir") |
|
|
``` |
|
|
|
|
|
### Streaming inference |
|
|
|
|
|
```python |
|
|
import torch |
|
|
|
|
|
stream = model.infer_stream("Hello world!", chunk_size=1) |
|
|
|
|
|
# Audio chunks can be accessed via an iterator |
|
|
chunks = [] |
|
|
for chunk in stream: |
|
|
chunks.append(chunk) |
|
|
|
|
|
out = torch.cat(chunks) |
|
|
``` |
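
The stream can also be consumed incrementally for live playback. Below is a minimal sketch using `sounddevice` (not a Soprano dependency, shown only for illustration), assuming each chunk is a mono float tensor at Soprano's 32 kHz output rate:

```python
import sounddevice as sd

player = sd.OutputStream(samplerate=32000, channels=1, dtype="float32")
player.start()

for chunk in model.infer_stream("Hello world!", chunk_size=1):
    # Play each chunk as soon as it is generated.
    player.write(chunk.cpu().numpy().astype("float32"))

player.stop()
player.close()
```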
|
|
|
|
|
--- |
|
|
|
|
|
## Key Features |
|
|
|
|
|
### 1. High‑fidelity 32 kHz audio |
|
|
|
|
|
Soprano synthesizes speech at **32 kHz**, delivering clarity that is perceptually indistinguishable from 44.1 kHz audio and significantly higher quality than the 24 kHz output used by many existing TTS models. |
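
For reference, a minimal sketch of writing a synthesized waveform to disk at this rate with `torchaudio` (assuming `out` returned by `model.infer` is a 1‑D float tensor; the exact output type is an assumption here):

```python
import torchaudio

out = model.infer("Hello world!")
# torchaudio expects a (channels, samples) tensor; Soprano outputs 32 kHz audio.
torchaudio.save("hello.wav", out.cpu().unsqueeze(0), sample_rate=32000)
```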
|
|
|
|
|
### 2. Vocos‑based neural decoder |
|
|
|
|
|
Instead of slow diffusion decoders, Soprano uses a **Vocos‑based decoder**, enabling **orders‑of‑magnitude faster** waveform generation while maintaining comparable perceptual quality. |
|
|
|
|
|
### 3. Seamless real‑time streaming |
|
|
|
|
|
|
|
|
Soprano leverages the decoder’s finite receptive field to losslessly stream audio with **ultra‑low latency**. The streamed output is acoustically identical to offline synthesis, enabling interactive applications with sub‑frame delays. |
|
|
|
|
|
### 4. State‑of‑the‑art neural audio codec |
|
|
|
|
|
Speech is represented using a **neural codec** that compresses audio to **~15 tokens/sec** at just **0.2 kbps**, allowing extremely fast generation and efficient memory usage without sacrificing quality. |
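
As a rough illustration of what those figures imply (simple arithmetic on the numbers above, not a specification of the codec):

```python
# Rough arithmetic on the published codec figures (~15 tokens/s at ~0.2 kbps).
tokens_per_second = 15
bitrate_bps = 200  # 0.2 kbps

bits_per_token = bitrate_bps / tokens_per_second   # ~13.3 bits per token
tokens_per_minute = tokens_per_second * 60         # ~900 tokens for a minute of speech
print(bits_per_token, tokens_per_minute)
```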
|
|
|
|
|
### 5. Sentence‑level streaming for infinite context |
|
|
|
|
|
Each sentence is generated independently, enabling **effectively infinite generation length** while maintaining stability and real‑time performance for long‑form generation. |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the **Apache-2.0** license. |