---
license: apache-2.0
pipeline_tag: text-to-speech
tags:
- voice
- speech
- text-to-speech
- audio
---

<p align="center">
<img alt="Continue-TTS" src="https://github.com/SVECTOR-CORPORATION/Continue-TTS/blob/main/continue-tts-image-banner.jpg?raw=true" width="800">
</p>

# Continue-TTS

### Text-to-Speech Model Based on Continue-1-OSS

<div align="left" style="line-height: 1;">
<a href="https://spec-chat.tech" target="_blank" style="margin: 2px;">
<img alt="SVECTOR" src="https://img.shields.io/badge/💬%20Spec%20Chat-Spec%20Chat-blue?style=plastic" style="display: inline-block; vertical-align: middle;"/>
</a>

<a href="https://huggingface.co/SVECTOR-CORPORATION" target="_blank" style="margin: 2px;">
<img alt="SVECTOR" src="https://img.shields.io/badge/🤗%20Hugging%20Face-SVECTOR-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>

<a href="https://huggingface.co/SVECTOR-CORPORATION/Continue-TTS/blob/main/LICENSE" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-blue?color=1e88e5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>

<a href="https://github.com/SVECTOR-CORPORATION/Continue-TTS" target="_blank" style="margin: 2px;">
<img alt="GitHub" src="https://img.shields.io/badge/GitHub-Continue--TTS-181717?logo=github&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>

## Introduction

We are thrilled to introduce **Continue-TTS**, a text-to-speech model fine-tuned from the **Continue-1-OSS** architecture and developed by SVECTOR. It is trained specifically for high-quality speech synthesis.

**Continue-TTS** is engineered to provide:

- **Natural Speech:** Human-like intonation, emotion, and rhythm that rivals commercial solutions
- **8 Unique Voices:** Diverse voice options with distinct personalities and characteristics
- **Real-time Generation:** Low-latency streaming for interactive applications (~200ms)
- **Emotional Expression:** Built-in support for laughter, sighs, gasps, and other natural emotions
- **Open Source:** Fully accessible under the Apache 2.0 license for research and commercial use

The model combines a large language model with a neural audio codec to generate natural-sounding speech from text.

<audio controls src="https://ik.imagekit.io/svector/efd3e807-49a4-463b-af6d-4069acf7ff3a.wav"></audio>

```
The sun was setting behind the mountains, painting the sky with soft shades of orange and violet.
She stood there quietly, breathing in the moment. <sigh>
Sometimes, the smallest moments are the ones that change everything.
```

<audio controls src="https://ik.imagekit.io/svector/c99ff697-291a-4fb7-940a-56b523b9f286.wav?updatedAt=1762362454065"></audio>

```
<sigh>
Not every journey is loud.
Some begin quietly… inside.
But once they begin, they never stop.
We continue.
```

### Model Specifications

- **Base Architecture:** Continue-1-OSS
- **Type:** Text-to-Speech (TTS) Model
- **Parameters:** ~3 billion (3.3B)
- **Audio Codec:** SNAC (24kHz)
- **Context Length:** 131,072 tokens
- **Vocabulary:** 156,940 tokens (including 28,672 audio tokens)
- **License:** Apache 2.0
- **Voices:** 8 (Nova, Aurora, Stellar, Atlas, Orion, Luna, Phoenix, Ember)

## Requirements

To use Continue-TTS, install the required dependencies:

```bash
pip install transformers torch
pip install snac  # Audio codec
pip install vllm==0.7.3  # For fast inference (optional but recommended)
```

## Quickstart

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "SVECTOR-CORPORATION/Continue-TTS"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare text with voice
text = "Hello! I am Continue-TTS, a text-to-speech model based on Continue-1-OSS."
voice = "nova"  # Choose: nova, aurora, stellar, atlas, orion, luna, phoenix, ember

# Format prompt (TTS format)
adapted_prompt = f"{voice}: {text}"
prompt_tokens = tokenizer(adapted_prompt, return_tensors="pt")
start_token = torch.tensor([[128259]], dtype=torch.int64)
end_tokens = torch.tensor([[128009, 128260, 128261, 128257]], dtype=torch.int64)
input_ids = torch.cat([start_token, prompt_tokens.input_ids, end_tokens], dim=1)

# Generate audio tokens
outputs = model.generate(
    input_ids.to(model.device),
    max_new_tokens=1200,
    temperature=0.6,
    top_p=0.8,
    repetition_penalty=1.3,
    eos_token_id=49158,  # TTS stop token
    do_sample=True
)

# Decode tokens (audio codes can be decoded using SNAC decoder)
generated_tokens = tokenizer.decode(outputs[0], skip_special_tokens=False)
```
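
The snippet above stops at a flat sequence of generated token IDs. The sketch below shows one way such a sequence could be regrouped into the three code layers that a SNAC decoder works with. It assumes an interleaved 7-token frame (one coarse code, two mid-level codes, four fine codes) with a separate 4,096-entry ID range per frame position, which is consistent with the 28,672 = 7 × 4,096 audio tokens listed in the specifications. The position order and the `audio_token_base` offset are assumptions, not values confirmed by this model card, so check them against the Continue-TTS repository before relying on them.

```python
from typing import List, Tuple

CODEBOOK_SIZE = 4096   # assumed: 28,672 audio tokens / 7 positions per frame
TOKENS_PER_FRAME = 7   # from the model card: 7 audio tokens per frame


def split_frames(
    audio_token_ids: List[int], audio_token_base: int
) -> Tuple[List[int], List[int], List[int]]:
    """Regroup a flat audio-token stream into coarse/mid/fine code lists.

    audio_token_base is a hypothetical parameter: the vocabulary ID of the
    first audio token. The position order inside a frame is also assumed.
    """
    # Ignore a trailing partial frame, if any.
    usable = len(audio_token_ids) - len(audio_token_ids) % TOKENS_PER_FRAME
    coarse: List[int] = []
    mid: List[int] = []
    fine: List[int] = []
    for start in range(0, usable, TOKENS_PER_FRAME):
        frame = audio_token_ids[start:start + TOKENS_PER_FRAME]
        # Remove the vocabulary offset and the per-position offset.
        codes = [
            tok - audio_token_base - pos * CODEBOOK_SIZE
            for pos, tok in enumerate(frame)
        ]
        if any(c < 0 or c >= CODEBOOK_SIZE for c in codes):
            raise ValueError(f"token out of range in frame starting at {start}")
        coarse.append(codes[0])
        mid.extend([codes[1], codes[4]])
        fine.extend([codes[2], codes[3], codes[5], codes[6]])
    return coarse, mid, fine
```

With the `snac` package installed, each of the three lists would then be converted to a tensor with a leading batch dimension and handed to the codec's decode step to obtain the 24 kHz waveform.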

### Using Continue-TTS Package (Recommended)

For easier usage with audio generation, use the Continue-TTS package:

```bash
pip install continue-speech
```

```python
from continue_tts import Continue1Model
import wave

# Initialize model
model = Continue1Model(model_name="SVECTOR-CORPORATION/Continue-TTS", max_model_len=2048)

# Generate speech
text = "Welcome to Continue-TTS! This model is built on Continue-1-OSS."
audio_chunks = model.generate_speech(prompt=text, voice="nova")

# Save to file
with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)
    for chunk in audio_chunks:
        wf.writeframes(chunk)
```
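
Because the example writes 16-bit mono audio at 24,000 Hz, the length of the generated speech follows directly from the number of bytes received. A minimal sketch, assuming each chunk is a `bytes` object of raw PCM as the `writeframes` call implies; the helper name `pcm_duration_seconds` is illustrative and not part of the package:

```python
from typing import Iterable

SAMPLE_RATE = 24_000   # Hz, matches wf.setframerate(24000)
SAMPLE_WIDTH = 2       # bytes per sample, matches wf.setsampwidth(2)
CHANNELS = 1           # mono, matches wf.setnchannels(1)


def pcm_duration_seconds(chunks: Iterable[bytes]) -> float:
    """Return the playback time of raw PCM chunks in seconds."""
    total_bytes = sum(len(chunk) for chunk in chunks)
    return total_bytes / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)
```

If `audio_chunks` is a generator, collect it into a list first so the same chunks can be both measured and written to the file.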

## Available Voices

Continue-TTS includes 8 professionally designed voices:

| Voice | Gender | Description |
|-------|--------|-------------|
| **nova** | Female | Conversational and natural, perfect for general use |
| **aurora** | Female | Warm and friendly, excellent for storytelling |
| **stellar** | Female | Energetic and bright, great for upbeat content |
| **atlas** | Male | Deep and authoritative, ideal for narration |
| **orion** | Male | Friendly and casual, perfect for conversational content |
| **luna** | Female | Soft and gentle, excellent for calm narration |
| **phoenix** | Male | Dynamic and expressive, great for engaging content |
| **ember** | Female | Warm and engaging, perfect for emotional expression |
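
Since the voice is selected by prefixing the text with its name (`f"{voice}: {text}"` in the Basic Usage example), a misspelled name would silently become part of the prompt. The sketch below guards against that; `build_prompt` is an illustrative helper, not part of the Continue-TTS package, and lowercasing the name is an assumption based on the lowercase names used in this card.

```python
# Voice names and genders as listed in the table above.
VOICES = {
    "nova": "Female",
    "aurora": "Female",
    "stellar": "Female",
    "atlas": "Male",
    "orion": "Male",
    "luna": "Female",
    "phoenix": "Male",
    "ember": "Female",
}


def build_prompt(text: str, voice: str = "nova") -> str:
    """Return '<voice>: <text>' after checking the voice name."""
    key = voice.strip().lower()
    if key not in VOICES:
        raise ValueError(
            f"unknown voice {voice!r}; choose one of {sorted(VOICES)}"
        )
    return f"{key}: {text}"
```

For example, `build_prompt("Hello there.", "Atlas")` yields `"atlas: Hello there."`, while an unlisted name raises a `ValueError` before any generation starts.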

## Advanced Features

### Emotion Tags

Add natural emotions to your speech:

```python
text = "This is incredible! <laugh> I can't believe how natural it sounds. <gasp>"
```

**Supported emotions:**
- `<laugh>` - Natural laughter
- `<chuckle>` - Light laugh
- `<sigh>` - Expressive sigh
- `<gasp>` - Surprised gasp
- `<cough>` - Cough sound
- `<yawn>` - Yawn
- `<groan>` - Groan
- `<sniffle>` - Sniffle
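
A tag outside this list is unlikely to be rendered as an emotion, so it can help to check input text before synthesis. A minimal sketch, assuming tags must match the lowercase spellings above exactly; `unsupported_tags` is an illustrative helper, not part of the package:

```python
import re
from typing import List

SUPPORTED_TAGS = {
    "laugh", "chuckle", "sigh", "gasp",
    "cough", "yawn", "groan", "sniffle",
}
# Matches simple angle-bracket tags such as <laugh>.
TAG_PATTERN = re.compile(r"<([A-Za-z]+)>")


def unsupported_tags(text: str) -> List[str]:
    """Return tags found in the text that are not in the supported list."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in SUPPORTED_TAGS]
```

An empty result means every tag in the text is one of the eight supported emotions.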

### Custom Generation Parameters

Fine-tune generation quality:

```python
audio = model.generate_speech(
    prompt="Your text here",
    voice="nova",
    temperature=0.6,  # Lower = more consistent, Higher = more varied
    top_p=0.8,  # Nucleus sampling threshold
    max_tokens=1200,  # Maximum audio length
    repetition_penalty=1.3  # Prevent token repetition
)
```

## Use Cases

Continue-TTS excels at:

- **Audiobook Narration:** Natural storytelling with emotional expression
- **Virtual Assistants:** Conversational AI with personality
- **Accessibility:** Text-to-speech for visually impaired users
- **Content Creation:** Voiceovers for videos, podcasts, and presentations
- **Gaming:** Dynamic character voices and dialogue
- **Education:** Interactive learning materials with voice
- **Customer Service:** Natural-sounding automated responses

## Performance

- **Quality:** State-of-the-art natural speech synthesis
- **Latency:** ~200ms for streaming generation (GPU)
- **Speed:** Real-time on GPU, slower on CPU
- **Memory:** ~7GB GPU RAM (FP16), ~14GB (FP32)
- **Sample Rate:** 24kHz (high quality audio)
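
The memory figures are close to what the weights alone occupy. A back-of-the-envelope sketch, assuming the 3.3B parameter count listed for the base model and decimal gigabytes; activations, the KV cache, and the SNAC decoder come on top of this:

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 3.3e9  # assumed from the base model's stated size

fp16_gb = weight_memory_gb(NUM_PARAMS, 2)  # 2 bytes per FP16/BF16 weight
fp32_gb = weight_memory_gb(NUM_PARAMS, 4)  # 4 bytes per FP32 weight
print(f"FP16: {fp16_gb:.1f} GB, FP32: {fp32_gb:.1f} GB")
```

This gives about 6.6 GB and 13.2 GB, which lines up with the ~7GB and ~14GB quoted above once a small runtime overhead is included.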

## Model Architecture

Continue-TTS is built on Continue-1-OSS and combines:
- **Base Model:** Continue-1-OSS (LLaMA-based, 3.3B parameters)
- **Audio Codec:** SNAC multi-scale neural audio codec
- **Token Structure:** 7 audio tokens per frame (hierarchical encoding)
- **Training:** Fine-tuned on a few hours of diverse speech data

The model generates audio tokens autoregressively; the SNAC neural codec then decodes them into waveforms.
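
The frame structure also gives a rough way to translate a token budget into audio length. The sketch below assumes each 7-token frame covers 2,048 samples at 24 kHz (about 11.7 frames per second), which is the coarse frame rate of the publicly released 24 kHz SNAC checkpoint. This model card does not state that figure, so treat it as an assumption.

```python
import math

SAMPLE_RATE = 24_000        # Hz, from the model card
TOKENS_PER_FRAME = 7        # from the model card
SAMPLES_PER_FRAME = 2_048   # assumed: coarse SNAC frame length at 24 kHz


def tokens_to_seconds(num_audio_tokens: int) -> float:
    """Approximate audio duration produced by a number of audio tokens."""
    frames = num_audio_tokens // TOKENS_PER_FRAME  # partial frames are ignored
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE


def seconds_to_tokens(seconds: float) -> int:
    """Approximate token budget needed for a target duration."""
    frames = math.ceil(seconds * SAMPLE_RATE / SAMPLES_PER_FRAME)
    return frames * TOKENS_PER_FRAME
```

Under these assumptions, a budget of 1,200 tokens corresponds to roughly 14.6 seconds of audio, so longer passages need a larger token limit or have to be split into several requests.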

## Training

Continue-TTS was fine-tuned from Continue-1-OSS using:
- High-quality speech datasets covering diverse accents and styles
- Multi-speaker recordings for voice diversity
- Emotional speech data for expressive synthesis
- Conversational and narrative content

Training used:
- Continue-1-OSS as base
- Custom tokenizer with 28,672 audio tokens
- Multi-stage training (pretraining + fine-tuning)
- Optimized for naturalness and emotion

## Limitations

As with any TTS model, Continue-TTS has certain limitations:

- **Pronunciation:** May struggle with unusual names, technical terms, or non-English words
- **Consistency:** Long-form generation may have minor quality variations
- **Accents:** Primarily trained on specific accent patterns
- **Compute:** Requires GPU for real-time generation (CPU is slower)
- **Language:** Currently optimized for English

## Ethical Considerations

SVECTOR is committed to responsible AI development. Users should:

- **Transparency:** Disclose when audio is AI-generated
- **Consent:** Do not clone voices without explicit permission
- **Verification:** Implement safeguards against deepfakes and misinformation
- **Attribution:** Credit the model when used in public projects
- **Responsible Use:** Avoid generating harmful, deceptive, or illegal content

## License

This model is released under the **Apache License 2.0**. See the [LICENSE](https://huggingface.co/SVECTOR-CORPORATION/Continue-TTS/blob/main/LICENSE) file for complete details.

## Acknowledgments

Continue-1-OSS builds upon advances in neural speech synthesis, large language models, and neural audio codecs. We thank the open-source community for their contributions to these foundational technologies.

---

<p align="center">
<i>Developed by <a href="https://www.svector.co.in">SVECTOR</a></i>
</p>