---
license: apache-2.0
pipeline_tag: text-to-speech
tags:
- voice
- speech
- text-to-speech
- audio
---
<p align="center">
<img alt="Continue-TTS" src="https://github.com/SVECTOR-CORPORATION/Continue-TTS/blob/main/continue-tts-image-banner.jpg?raw=true" width="800">
</p>
# Continue-TTS
### Text-to-Speech Model Based on Continue-1-OSS
<div align="left" style="line-height: 1;">
<a href="https://spec-chat.tech" target="_blank" style="margin: 2px;">
<img alt="SVECTOR" src="https://img.shields.io/badge/💬%20Spec%20Chat-Spec%20Chat-blue?style=plastic" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/SVECTOR-CORPORATION" target="_blank" style="margin: 2px;">
<img alt="SVECTOR" src="https://img.shields.io/badge/🤗%20Hugging%20Face-SVECTOR-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/SVECTOR-CORPORATION/Continue-TTS/blob/main/LICENSE" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue?color=1e88e5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://github.com/SVECTOR-CORPORATION/Continue-TTS" target="_blank" style="margin: 2px;">
<img alt="GitHub" src="https://img.shields.io/badge/GitHub-Continue--TTS-181717?logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
## Introduction
We are thrilled to introduce **Continue-TTS**, a fine-tuned text-to-speech model based on the **Continue-1-OSS** architecture, developed by SVECTOR. This model is specifically trained for high-quality speech synthesis and delivers exceptional voice generation capabilities.
**Continue-TTS** is engineered to provide:
- **Natural Speech:** Human-like intonation, emotion, and rhythm that rivals commercial solutions
- **8 Unique Voices:** Diverse voice options with distinct personalities and characteristics
- **Real-time Generation:** Low-latency streaming for interactive applications (~200ms)
- **Emotional Expression:** Built-in support for laughter, sighs, gasps, and other natural emotions
- **Open Source:** Fully accessible under Apache 2.0 license for research and commercial use
The model pairs a large language model backbone (**Continue-1-OSS**) with a neural audio codec to generate natural-sounding speech from text.
<audio controls src="https://ik.imagekit.io/svector/efd3e807-49a4-463b-af6d-4069acf7ff3a.wav"></audio>
```
The sun was setting behind the mountains, painting the sky with soft shades of orange and violet.
She stood there quietly, breathing in the moment. <sigh>
Sometimes, the smallest moments are the ones that change everything.
```
<audio controls src="https://ik.imagekit.io/svector/c99ff697-291a-4fb7-940a-56b523b9f286.wav?updatedAt=1762362454065"></audio>
```
<sigh>
Not every journey is loud.
Some begin quietly… inside.
But once they begin, they never stop.
We continue.
```
### Model Specifications
- **Base Architecture:** Continue-1-OSS
- **Type:** Text-to-Speech (TTS) Model
- **Parameters:** ~3 Billion (3.3B)
- **Audio Codec:** SNAC (24kHz)
- **Context Length:** 131,072 tokens
- **Vocabulary:** 156,940 tokens (including 28,672 audio tokens)
- **License:** Apache 2.0
- **Voices:** 8 (Nova, Aurora, Stellar, Atlas, Orion, Luna, Phoenix, Ember)
## Requirements
To use Continue-TTS, install the required dependencies:
```bash
pip install transformers torch
pip install snac # Audio codec
pip install vllm==0.7.3 # For fast inference (optional but recommended)
```
## Quickstart
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "SVECTOR-CORPORATION/Continue-TTS"
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
# Prepare text with voice
text = "Hello! I am Continue-TTS, a text-to-speech model based on Continue-1-OSS."
voice = "nova" # Choose: nova, aurora, stellar, atlas, orion, luna, phoenix, ember
# Format prompt (TTS format)
adapted_prompt = f"{voice}: {text}"
prompt_tokens = tokenizer(adapted_prompt, return_tensors="pt")
start_token = torch.tensor([[128259]], dtype=torch.int64)
end_tokens = torch.tensor([[128009, 128260, 128261, 128257]], dtype=torch.int64)
input_ids = torch.cat([start_token, prompt_tokens.input_ids, end_tokens], dim=1)
# Generate audio tokens
outputs = model.generate(
    input_ids.to(model.device),
    max_new_tokens=1200,
    temperature=0.6,
    top_p=0.8,
    repetition_penalty=1.3,
    eos_token_id=49158,  # TTS stop token
    do_sample=True
)
# Decode the generated IDs to a token string (the audio codes in it still need the SNAC decoder to become a waveform)
generated_tokens = tokenizer.decode(outputs[0], skip_special_tokens=False)
```
### Using Continue-TTS Package (Recommended)
For easier usage with audio generation, use the Continue-TTS package:
```bash
pip install continue-speech
```
```python
from continue_tts import Continue1Model
import wave
# Initialize model
model = Continue1Model(model_name="SVECTOR-CORPORATION/Continue-TTS", max_model_len=2048)
# Generate speech
text = "Welcome to Continue-TTS! This model is built on Continue-1-OSS."
audio_chunks = model.generate_speech(prompt=text, voice="nova")
# Save to file
with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)      # mono
    wf.setsampwidth(2)      # 16-bit PCM
    wf.setframerate(24000)  # 24 kHz
    for chunk in audio_chunks:
        wf.writeframes(chunk)
```
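To check the streaming latency on your own hardware, the save loop can be wrapped in a small helper that also records how long the first chunk takes to arrive. This is a minimal sketch: `write_wav_stream` is a hypothetical helper, not part of the Continue-TTS package, and it assumes only that the generator yields 16-bit mono PCM `bytes` at 24 kHz, as the example above implies.

```python
import time
import wave
from typing import Iterable, Tuple

SAMPLE_RATE = 24_000  # from the model card: 24 kHz output
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM), as in setsampwidth(2)


def write_wav_stream(chunks: Iterable[bytes], path: str) -> Tuple[float, float]:
    """Write PCM chunks to `path`.

    Returns (seconds_of_audio_written, seconds_until_first_chunk).
    The second value is NaN if the iterable yields nothing.
    """
    start = time.monotonic()
    first_chunk_delay = float("nan")
    seen_first = False
    total_frames = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            if not seen_first:
                first_chunk_delay = time.monotonic() - start
                seen_first = True
            wf.writeframes(chunk)
            total_frames += len(chunk) // SAMPLE_WIDTH
    return total_frames / SAMPLE_RATE, first_chunk_delay
```

Passing the generator returned by `model.generate_speech(...)` as `chunks` would give both the clip length and the time to first audio for that request.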
## Available Voices
Continue-TTS includes 8 professionally designed voices:
| Voice | Gender | Description |
|-------|--------|-------------|
| **nova** | Female | Conversational and natural, perfect for general use |
| **aurora** | Female | Warm and friendly, excellent for storytelling |
| **stellar** | Female | Energetic and bright, great for upbeat content |
| **atlas** | Male | Deep and authoritative, ideal for narration |
| **orion** | Male | Friendly and casual, perfect for conversational content |
| **luna** | Female | Soft and gentle, excellent for calm narration |
| **phoenix** | Male | Dynamic and expressive, great for engaging content |
| **ember** | Female | Warm and engaging, perfect for emotional expression |
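
Because the voice is selected purely by a text prefix (`"{voice}: {text}"` in the Basic Usage example), a typo in the voice name will not raise an error on its own. A small guard can catch that early. This is a sketch only: `build_prompt` is a hypothetical helper, and lowercasing the name is an assumption based on the lowercase names in the table above.

```python
VOICES = ("nova", "aurora", "stellar", "atlas", "orion", "luna", "phoenix", "ember")


def build_prompt(text: str, voice: str = "nova") -> str:
    """Return the 'voice: text' prompt, rejecting unknown voice names."""
    name = voice.strip().lower()  # assumption: voice names are lowercase
    if name not in VOICES:
        raise ValueError(f"unknown voice {voice!r}; choose one of {', '.join(VOICES)}")
    return f"{name}: {text}"


prompt = build_prompt("Hello from Continue-TTS.", voice="atlas")
```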
## Advanced Features
### Emotion Tags
Add natural emotions to your speech:
```python
text = "This is incredible! <laugh> I can't believe how natural it sounds. <gasp>"
```
**Supported emotions:**
- `<laugh>` - Natural laughter
- `<chuckle>` - Light laugh
- `<sigh>` - Expressive sigh
- `<gasp>` - Surprised gasp
- `<cough>` - Cough sound
- `<yawn>` - Yawn
- `<groan>` - Groan
- `<sniffle>` - Sniffle
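
A tag outside this list is likely to be read aloud or ignored rather than rendered as a sound, so it can help to check input text before synthesis. The following is a minimal sketch; `unsupported_tags` is a hypothetical helper, and treating tags as case-sensitive is an assumption.

```python
import re

SUPPORTED_TAGS = {"laugh", "chuckle", "sigh", "gasp", "cough", "yawn", "groan", "sniffle"}
TAG_PATTERN = re.compile(r"<([A-Za-z]+)>")


def unsupported_tags(text: str) -> list:
    """Return every <tag> in `text` that is not in the supported list."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in SUPPORTED_TAGS]


text = "This is incredible! <laugh> I can't believe how natural it sounds. <gasp>"
bad = unsupported_tags(text)  # empty list: both tags are supported
```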
### Custom Generation Parameters
Fine-tune generation quality:
```python
audio = model.generate_speech(
    prompt="Your text here",
    voice="nova",
    temperature=0.6,         # Lower = more consistent, Higher = more varied
    top_p=0.8,               # Nucleus sampling threshold
    max_tokens=1200,         # Maximum audio length
    repetition_penalty=1.3   # Prevent token repetition
)
```
```
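Since `max_tokens` caps the audio length, it is useful to know roughly how many seconds a given budget buys. The sketch below uses the 7 tokens per frame and 24 kHz sample rate stated in this card; the figure of 2,048 samples per frame is an assumption based on the coarsest hop size of the 24 kHz SNAC configuration, so treat the results as estimates.

```python
import math

TOKENS_PER_FRAME = 7       # from the model card
SAMPLE_RATE = 24_000       # from the model card
SAMPLES_PER_FRAME = 2048   # assumption: coarsest hop of the 24 kHz SNAC codec


def max_audio_seconds(max_tokens: int) -> float:
    """Upper bound on audio length for a token budget (partial frames dropped)."""
    frames = max_tokens // TOKENS_PER_FRAME
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


def tokens_for_seconds(seconds: float) -> int:
    """Smallest token budget that covers `seconds` of audio."""
    frames = math.ceil(seconds * SAMPLE_RATE / SAMPLES_PER_FRAME)
    return frames * TOKENS_PER_FRAME


estimate = max_audio_seconds(1200)   # about 14.6 seconds under these assumptions
budget = tokens_for_seconds(30)      # 2464 tokens for a 30-second clip
```

Under these assumptions the default of 1,200 tokens covers roughly 15 seconds, so longer passages need either a larger budget or splitting into several requests.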
## Use Cases
Continue-TTS excels at:
- **Audiobook Narration:** Natural storytelling with emotional expression
- **Virtual Assistants:** Conversational AI with personality
- **Accessibility:** Text-to-speech for visually impaired users
- **Content Creation:** Voiceovers for videos, podcasts, and presentations
- **Gaming:** Dynamic character voices and dialogue
- **Education:** Interactive learning materials with voice
- **Customer Service:** Natural-sounding automated responses
## Performance
- **Quality:** State-of-the-art natural speech synthesis
- **Latency:** ~200ms for streaming generation (GPU)
- **Speed:** Real-time on GPU, slower on CPU
- **Memory:** ~7GB GPU RAM (FP16), ~14GB (FP32)
- **Sample Rate:** 24kHz (high quality audio)
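
The memory figures follow from simple arithmetic on the parameter count: bytes per parameter times the number of parameters, plus runtime overhead. A rough sketch, using the 3.3B figure from the Model Architecture section (`weight_memory_gb` is a hypothetical helper, and it counts weights only, not activations or the KV cache):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


fp16_gb = weight_memory_gb(3.3e9, "fp16")  # about 6.6 GB of weights
fp32_gb = weight_memory_gb(3.3e9, "fp32")  # about 13.2 GB of weights
```

The gap between these numbers and the ~7GB / ~14GB quoted above is the headroom needed for activations and cache during generation.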
## Model Architecture
Continue-TTS is built on Continue-1-OSS and combines:
- **Base Model:** Continue-1-OSS (LLaMA-based, 3.3B parameters)
- **Audio Codec:** SNAC multi-scale neural audio codec
- **Token Structure:** 7 audio tokens per frame (hierarchical encoding)
- **Training:** Fine-tuned on a few hours of diverse speech data
The model generates audio tokens autoregressively, which are then decoded into waveforms using the SNAC neural codec.
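
The 28,672 audio tokens in the vocabulary equal 7 × 4,096, which is consistent with each of the 7 positions in a frame having its own 4,096-entry slice of the vocabulary. The sketch below shows how such a flat stream could be split back into three codec levels before decoding. It is illustrative only: the position-to-level mapping and the per-position offset are assumptions borrowed from similar LLM-plus-SNAC pipelines, not confirmed by this card, and `split_frames` is a hypothetical helper. It also assumes the token IDs have already been shifted so that the first audio token is 0.

```python
from typing import List, Sequence, Tuple

TOKENS_PER_FRAME = 7   # from the model card
CODEBOOK_SIZE = 4096   # assumption: 28,672 audio tokens / 7 positions

# Assumption: codec level fed by each of the 7 positions in a frame
# (1 coarse code, 2 mid-level codes, 4 fine codes per frame).
LEVEL_OF_POSITION = (0, 1, 2, 2, 1, 2, 2)


def split_frames(codes: Sequence[int]) -> Tuple[List[int], List[int], List[int]]:
    """De-interleave a flat audio-token stream into three codec levels.

    Trailing tokens that do not fill a whole frame are ignored.
    """
    usable = len(codes) - len(codes) % TOKENS_PER_FRAME
    levels: Tuple[List[int], List[int], List[int]] = ([], [], [])
    for i in range(usable):
        pos = i % TOKENS_PER_FRAME
        value = codes[i] - pos * CODEBOOK_SIZE  # undo the per-position offset
        if not 0 <= value < CODEBOOK_SIZE:
            raise ValueError(f"token {codes[i]} is out of range for position {pos}")
        levels[LEVEL_OF_POSITION[pos]].append(value)
    return levels
```

Each of the three lists would then be turned into a tensor and passed to the codec's decoder to obtain the waveform.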
## Training
Continue-TTS was fine-tuned from Continue-1-OSS using:
- High-quality speech datasets covering diverse accents and styles
- Multi-speaker recordings for voice diversity
- Emotional speech data for expressive synthesis
- Conversational and narrative content
Training utilized:
- Continue-1-OSS as base
- Custom tokenizer with 28,672 audio tokens
- Multi-stage training (pretraining + fine-tuning)
- Optimized for naturalness and emotion
## Limitations
As with any TTS model, Continue-TTS has certain limitations:
- **Pronunciation:** May struggle with unusual names, technical terms, or non-English words
- **Consistency:** Long-form generation may have minor quality variations
- **Accents:** Primarily trained on specific accent patterns
- **Compute:** Requires GPU for real-time generation (CPU is slower)
- **Language:** Currently optimized for English
## Ethical Considerations
SVECTOR is committed to responsible AI development. We ask users to follow these guidelines:
- **Transparency:** Disclose when audio is AI-generated
- **Consent:** Do not clone voices without explicit permission
- **Verification:** Implement safeguards against deepfakes and misinformation
- **Attribution:** Credit the model when used in public projects
- **Responsible Use:** Avoid generating harmful, deceptive, or illegal content
## License
This model is released under the **Apache License 2.0**. See the [LICENSE](https://huggingface.co/SVECTOR-CORPORATION/Continue-TTS/blob/main/LICENSE) file for complete details.
## Acknowledgments
Continue-1-OSS builds upon advances in neural speech synthesis, large language models, and neural audio codecs. We thank the open-source community for their contributions to these foundational technologies.
---
<p align="center">
<i>Developed by <a href="https://www.svector.co.in">SVECTOR</a></i>
</p>