|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-to-speech |
|
|
--- |
|
|
|
|
|
# Soprano: Instant, Ultra‑Realistic Text‑to‑Speech |
|
|
|
|
|
|
|
|
<!-- Embedded demo video (placeholder) --> |
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## Overview |
|
|
|
|
|
**Soprano** is an ultra‑lightweight, open‑source text‑to‑speech (TTS) model designed for real‑time, high‑fidelity speech synthesis at unprecedented speed, all while remaining compact and easy to deploy. |
|
|
|
|
|
With only **80M parameters**, Soprano achieves a real‑time factor (RTF) of **~2000×**, generating **10 hours of audio in under 20 seconds**. Its **seamless streaming** technique delivers the first audio in **under 15 ms**, multiple orders of magnitude lower latency than existing TTS pipelines.
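
A quick back‑of‑the‑envelope check of how those figures fit together (simple arithmetic on the numbers above, not a benchmark):

```python
# 10 hours of audio generated in roughly 20 seconds of wall-clock time.
audio_seconds = 10 * 60 * 60      # 36,000 s of audio
wall_clock_seconds = 20
rtf = audio_seconds / wall_clock_seconds
print(rtf)  # 1800.0 -> on the order of ~2000x real time
```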
|
|
|
|
|
This repository contains the **model weights** for Soprano. The LLM uses a standard Qwen3 architecture, and the decoder is a Vocos model fine-tuned on the LLM's output hidden states.
|
|
|
|
|
GitHub: https://github.com/ekwek1/soprano
|
|
|
|
|
Model Demo: https://huggingface.co/spaces/ekwek/Soprano-TTS |
|
|
|
|
|
--- |
|
|
|
|
|
## Installation |
|
|
|
|
|
**Requirements**: Linux or Windows with a CUDA‑enabled GPU (CPU support coming soon).
|
|
|
|
|
### One‑line install |
|
|
|
|
|
```bash |
|
|
pip install soprano-tts |
|
|
``` |
|
|
|
|
|
### Install from source |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/ekwek1/soprano.git |
|
|
cd soprano |
|
|
pip install -e . |
|
|
``` |
|
|
|
|
|
> **Note**: Soprano uses **LMDeploy** to accelerate inference by default. If LMDeploy cannot be installed in your environment, Soprano can fall back to the HuggingFace **transformers** backend (with slower performance). To enable this, pass `backend='transformers'` when creating the TTS model. |
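
For example, following the note above, the fallback backend is selected when constructing the model:

```python
from soprano import SopranoTTS

# Fall back to the HuggingFace transformers backend when LMDeploy is unavailable.
model = SopranoTTS(backend="transformers")
```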
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from soprano import SopranoTTS |
|
|
|
|
|
model = SopranoTTS() |
|
|
``` |
|
|
|
|
|
### Basic inference |
|
|
|
|
|
```python |
|
|
out = model.infer("Hello world!") |
|
|
``` |
|
|
|
|
|
### Save output to a file |
|
|
|
|
|
```python |
|
|
out = model.infer("Hello world!", "out.wav") |
|
|
``` |
|
|
|
|
|
### Custom sampling parameters |
|
|
|
|
|
```python |
|
|
out = model.infer( |
|
|
"Hello world!", |
|
|
temperature=0.3, |
|
|
top_p=0.95, |
|
|
repetition_penalty=1.2, |
|
|
) |
|
|
``` |
|
|
|
|
|
### Batched inference |
|
|
|
|
|
```python |
|
|
out = model.infer_batch(["Hello world!"] * 10) |
|
|
``` |
|
|
|
|
|
#### Save batch outputs to a directory |
|
|
|
|
|
```python |
|
|
out = model.infer_batch(["Hello world!"] * 10, "/dir") |
|
|
``` |
|
|
|
|
|
### Streaming inference |
|
|
|
|
|
```python |
|
|
import torch |
|
|
|
|
|
stream = model.infer_stream("Hello world!", chunk_size=1) |
|
|
|
|
|
# Audio chunks can be accessed via an iterator |
|
|
chunks = [] |
|
|
for chunk in stream: |
|
|
chunks.append(chunk) |
|
|
|
|
|
out = torch.cat(chunks) |
|
|
``` |
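
The stream can also be consumed incrementally for live playback. Below is a minimal sketch using `sounddevice` (not a Soprano dependency, shown only for illustration), assuming each chunk is a mono float tensor at Soprano's 32 kHz output rate:

```python
import sounddevice as sd

player = sd.OutputStream(samplerate=32000, channels=1, dtype="float32")
player.start()

for chunk in model.infer_stream("Hello world!", chunk_size=1):
    # Play each chunk as soon as it is generated.
    player.write(chunk.cpu().numpy().astype("float32"))

player.stop()
player.close()
```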
|
|
|
|
|
--- |
|
|
|
|
|
## Key Features |
|
|
|
|
|
### 1. High‑fidelity 32 kHz audio |
|
|
|
|
|
Soprano synthesizes speech at **32 kHz**, delivering clarity that is perceptually indistinguishable from 44.1 kHz audio and significantly higher quality than the 24 kHz output used by many existing TTS models. |
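
For reference, a minimal sketch of writing a synthesized waveform to disk at this rate with `torchaudio` (assuming `out` returned by `model.infer` is a 1‑D float tensor; the exact output type is an assumption here):

```python
import torchaudio

out = model.infer("Hello world!")
# torchaudio expects a (channels, samples) tensor; Soprano outputs 32 kHz audio.
torchaudio.save("hello.wav", out.cpu().unsqueeze(0), sample_rate=32000)
```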
|
|
|
|
|
### 2. Vocos‑based neural decoder |
|
|
|
|
|
Instead of slow diffusion decoders, Soprano uses a **Vocos‑based decoder**, enabling **orders‑of‑magnitude faster** waveform generation while maintaining comparable perceptual quality. |
|
|
|
|
|
### 3. Seamless real‑time streaming |
|
|
|
|
|
|
|
|
Soprano leverages the decoder’s finite receptive field to losslessly stream audio with **ultra‑low latency**. The streamed output is acoustically identical to offline synthesis, enabling interactive applications with sub‑frame delays. |
|
|
|
|
|
### 4. State‑of‑the‑art neural audio codec |
|
|
|
|
|
Speech is represented using a **neural codec** that compresses audio to **~15 tokens/sec** at just **0.2 kbps**, allowing extremely fast generation and efficient memory usage without sacrificing quality. |
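
As a rough illustration of what those figures imply (simple arithmetic on the numbers above, not a specification of the codec):

```python
# Rough arithmetic on the published codec figures (~15 tokens/s at ~0.2 kbps).
tokens_per_second = 15
bitrate_bps = 200  # 0.2 kbps

bits_per_token = bitrate_bps / tokens_per_second   # ~13.3 bits per token
tokens_per_minute = tokens_per_second * 60         # ~900 tokens for a minute of speech
print(bits_per_token, tokens_per_minute)
```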
|
|
|
|
|
### 5. Sentence‑level streaming for infinite context |
|
|
|
|
|
Each sentence is generated independently, enabling **effectively infinite generation length** while maintaining stability and real‑time performance for long‑form generation. |
|
|
|
|
|
--- |
|
|
|
|
|
## License |
|
|
|
|
|
This project is licensed under the **Apache-2.0** license. |