Continue-TTS / README.md
SVECTOR-OFFICIAL's picture
Update README.md
2edab05 verified
---
license: apache-2.0
pipeline_tag: text-to-speech
tags:
- voice
- speech
- text-to-speech
- audio
---
<p align="center">
<img alt="Continue-TTS" src="https://github.com/SVECTOR-CORPORATION/Continue-TTS/blob/main/continue-tts-image-banner.jpg?raw=true" width="800">
</p>
# Continue-TTS
### Text-to-Speech Model Based on Continue-1-OSS
<div align="left" style="line-height: 1;">
<a href="https://spec-chat.tech" target="_blank" style="margin: 2px;">
<img alt="SVECTOR" src="https://img.shields.io/badge/💬%20Spec%20Chat-Spec%20Chat-blue?style=plastic" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/SVECTOR-CORPORATION" target="_blank" style="margin: 2px;">
<img alt="SVECTOR" src="https://img.shields.io/badge/🤗%20Hugging%20Face-SVECTOR-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/SVECTOR-CORPORATION/Continue-TTS/blob/main/LICENSE" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue?color=1e88e5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://github.com/SVECTOR-CORPORATION/Continue-TTS" target="_blank" style="margin: 2px;">
<img alt="GitHub" src="https://img.shields.io/badge/GitHub-Continue--TTS-181717?logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
## Introduction
We are thrilled to introduce **Continue-TTS**, a fine-tuned text-to-speech model based on the **Continue-1-OSS** architecture, developed by SVECTOR. This model is specifically trained for high-quality speech synthesis and delivers exceptional voice generation capabilities.
**Continue-TTS** is engineered to provide:
- **Natural Speech:** Human-like intonation, emotion, and rhythm that rivals commercial solutions
- **8 Unique Voices:** Diverse voice options with distinct personalities and characteristics
- **Real-time Generation:** Low-latency streaming for interactive applications (~200ms)
- **Emotional Expression:** Built-in support for laughter, sighs, gasps, and other natural emotions
- **Open Source:** Fully accessible under Apache 2.0 license for research and commercial use
This model is based on the **Continue-1-OSS** architecture and combines the power of large language models with neural audio codecs to generate exceptionally natural speech from text.
<audio controls src="https://ik.imagekit.io/svector/efd3e807-49a4-463b-af6d-4069acf7ff3a.wav"></audio>
```
The sun was setting behind the mountains, painting the sky with soft shades of orange and violet.
She stood there quietly, breathing in the moment. <sigh>
Sometimes, the smallest moments are the ones that change everything.
```
<audio controls src="https://ik.imagekit.io/svector/c99ff697-291a-4fb7-940a-56b523b9f286.wav?updatedAt=1762362454065"></audio>
```
<sigh>
Not every journey is loud.
Some begin quietly… inside.
But once they begin, they never stop.
We continue.
```
### Model Specifications
- **Base Architecture:** Continue-1-OSS
- **Type:** Text-to-Speech (TTS) Model
- **Parameters:** 3 Billion
- **Audio Codec:** SNAC (24kHz)
- **Context Length:** 131,072 tokens
- **Vocabulary:** 156,940 tokens (including 28,672 audio tokens)
- **License:** Apache 2.0
- **Voices:** 8 (Nova, Aurora, Stellar, Atlas, Orion, Luna, Phoenix, Ember)
## Requirements
To use Continue-TTS, install the required dependencies:
```bash
pip install transformers torch
pip install snac # Audio codec
pip install vllm==0.7.3 # For fast inference (optional but recommended)
```
## Quickstart
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "SVECTOR-CORPORATION/Continue-TTS"
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
# Prepare text with voice
text = "Hello! I am Continue-TTS, a text-to-speech model based on Continue-1-OSS."
voice = "nova" # Choose: nova, aurora, stellar, atlas, orion, luna, phoenix, ember
# Format prompt (TTS format)
adapted_prompt = f"{voice}: {text}"
prompt_tokens = tokenizer(adapted_prompt, return_tensors="pt")
start_token = torch.tensor([[128259]], dtype=torch.int64)
end_tokens = torch.tensor([[128009, 128260, 128261, 128257]], dtype=torch.int64)
input_ids = torch.cat([start_token, prompt_tokens.input_ids, end_tokens], dim=1)
# Generate audio tokens
outputs = model.generate(
input_ids.to(model.device),
max_new_tokens=1200,
temperature=0.6,
top_p=0.8,
repetition_penalty=1.3,
eos_token_id=49158, # TTS stop token
do_sample=True
)
# Decode tokens (audio codes can be decoded using SNAC decoder)
generated_tokens = tokenizer.decode(outputs[0], skip_special_tokens=False)
```
### Using Continue-TTS Package (Recommended)
For easier usage with audio generation, use the Continue-TTS package:
```bash
pip install continue-speech
```
```python
from continue_tts import Continue1Model
import wave
# Initialize model
model = Continue1Model(model_name="SVECTOR-CORPORATION/Continue-TTS", max_model_len=2048)
# Generate speech
text = "Welcome to Continue-TTS! This model is built on Continue-1-OSS."
audio_chunks = model.generate_speech(prompt=text, voice="nova")
# Save to file
with wave.open("output.wav", "wb") as wf:
wf.setnchannels(1)
wf.setsampwidth(2)
wf.setframerate(24000)
for chunk in audio_chunks:
wf.writeframes(chunk)
```
## Available Voices
Continue-TTS includes 8 professionally designed voices:
| Voice | Gender | Description |
|-------|--------|-------------|
| **nova** | Female | Conversational and natural, perfect for general use |
| **aurora** | Female | Warm and friendly, excellent for storytelling |
| **stellar** | Female | Energetic and bright, great for upbeat content |
| **atlas** | Male | Deep and authoritative, ideal for narration |
| **orion** | Male | Friendly and casual, perfect for conversational content |
| **luna** | Female | Soft and gentle, excellent for calm narration |
| **phoenix** | Male | Dynamic and expressive, great for engaging content |
| **ember** | Female | Warm and engaging, perfect for emotional expression |
## Advanced Features
### Emotion Tags
Add natural emotions to your speech:
```python
text = "This is incredible! <laugh> I can't believe how natural it sounds. <gasp>"
```
**Supported emotions:**
- `<laugh>` - Natural laughter
- `<chuckle>` - Light laugh
- `<sigh>` - Expressive sigh
- `<gasp>` - Surprised gasp
- `<cough>` - Cough sound
- `<yawn>` - Yawn
- `<groan>` - Groan
- `<sniffle>` - Sniffle
### Custom Generation Parameters
Fine-tune generation quality:
```python
audio = model.generate_speech(
prompt="Your text here",
voice="nova",
temperature=0.6, # Lower = more consistent, Higher = more varied
top_p=0.8, # Nucleus sampling threshold
max_tokens=1200, # Maximum audio length
repetition_penalty=1.3 # Prevent token repetition
)
```
## Use Cases
Continue-TTS excels at:
- **Audiobook Narration:** Natural storytelling with emotional expression
- **Virtual Assistants:** Conversational AI with personality
- **Accessibility:** Text-to-speech for visually impaired users
- **Content Creation:** Voiceovers for videos, podcasts, and presentations
- **Gaming:** Dynamic character voices and dialogue
- **Education:** Interactive learning materials with voice
- **Customer Service:** Natural-sounding automated responses
## Performance
- **Quality:** State-of-the-art natural speech synthesis
- **Latency:** ~200ms for streaming generation (GPU)
- **Speed:** Real-time on GPU, slower on CPU
- **Memory:** ~7GB GPU RAM (FP16), ~14GB (FP32)
- **Sample Rate:** 24kHz (high quality audio)
## Model Architecture
Continue-TTS is built on the Continue-1-OSS and combines:
- **Base Model:** Continue-1-OSS (LLaMA-based, 3.3B parameters)
- **Audio Codec:** SNAC multi-scale neural audio codec
- **Token Structure:** 7 audio tokens per frame (hierarchical encoding)
- **Training:** Fine-tuned on few hours of diverse speech data
The model generates audio tokens autoregressively, which are then decoded into waveforms using the SNAC neural codec.
## Training
Continue-TTS was fine-tuned on the Continue-1-OSS using:
- High-quality speech datasets covering diverse accents and styles
- Multi-speaker recordings for voice diversity
- Emotional speech data for expressive synthesis
- Conversational and narrative content
Training utilized:
- Continue-1-OSS as base
- Custom tokenizer with 28,672 audio tokens
- Multi-stage training (pretraining + fine-tuning)
- Optimized for naturalness and emotion
## Limitations
As with any TTS model, Continue-TTS has certain limitations:
- **Pronunciation:** May struggle with unusual names, technical terms, or non-English words
- **Consistency:** Long-form generation may have minor quality variations
- **Accents:** Primarily trained on specific accent patterns
- **Compute:** Requires GPU for real-time generation (CPU is slower)
- **Language:** Currently optimized for English
## Ethical Considerations
SVECTOR is committed to responsible AI development. Users should:
- **Transparency:** Disclose when audio is AI-generated
- **Consent:** Do not clone voices without explicit permission
- **Verification:** Implement safeguards against deepfakes and misinformation
- **Attribution:** Credit the model when used in public projects
- **Responsible Use:** Avoid generating harmful, deceptive, or illegal content
## License
This model is released under the **Apache License 2.0**. See the [LICENSE](https://huggingface.co/SVECTOR-CORPORATION/Continue-TTS/blob/main/LICENSE) file for complete details.
## Acknowledgments
Continue-1-OSS builds upon advances in neural speech synthesis, large language models, and neural audio codecs. We thank the open-source community for their contributions to these foundational technologies.
---
<p align="center">
<i>Developed by <a href="https://www.svector.co.in">SVECTOR</a></i>
</p>