Parle TTS - MOSS-TTS-Realtime (1.7B) on Modal
Deploys the OpenMOSS-Team/MOSS-TTS-Realtime model on Modal.com for real-time TTS with voice cloning.
Features
- 20 supported languages (pt, en, es, fr, de, it, ja, zh, ko, ru, ar, tr, ...)
- Voice cloning from a reference audio file (WAV/MP3)
- Preconfigured voice presets (male1, male2, female1, female2)
- Serverless deployment on Modal.com with an A10G GPU
- Autoscaling, with scale-down after 5 minutes of inactivity
Quick Start
Deploy to Modal
pip install modal
modal setup # login
modal deploy modal_deploy.py
API Endpoints
Health check:
curl https://YOUR-ENDPOINT.modal.run/health
Generate audio (default voice):
curl -X POST https://YOUR-ENDPOINT.modal.run/api/text \
-H "Content-Type: application/json" \
-d '{"text": "Ola mundo!", "language": "pt"}'
Generate audio with a voice preset (male/female):
curl -X POST https://YOUR-ENDPOINT.modal.run/api/text \
-H "Content-Type: application/json" \
-d '{"text": "Ola mundo!", "language": "pt", "voice_preset": "female1"}'
Available voice presets: male1, male2, female1, female2, ref_a, ref_b
Voice cloning with reference audio:
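# Encode the audio as single-line base64 (some base64 builds wrap long lines, which would break the JSON string)
REF_AUDIO=$(base64 < reference.wav | tr -d '\n')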
curl -X POST https://YOUR-ENDPOINT.modal.run/api/text \
-H "Content-Type: application/json" \
-d "{\"text\": \"Ola mundo!\", \"language\": \"pt\", \"reference_audio\": \"$REF_AUDIO\"}"
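The same requests can be made from Python. The following is a minimal sketch using only the standard library, assuming the endpoint accepts exactly the JSON fields shown in the curl examples above (`text`, `language`, `voice_preset`, `reference_audio`). The helper names `build_payload` and `synthesize`, the 120-second timeout, and the `BASE_URL` placeholder are illustrative and not part of this repo.

```python
import base64
import json
import urllib.request
from typing import Optional

# Placeholder taken from the curl examples; replace with your deployed endpoint.
BASE_URL = "https://YOUR-ENDPOINT.modal.run"


def build_payload(text: str, language: str = "pt",
                  voice_preset: Optional[str] = None,
                  reference_audio: Optional[bytes] = None) -> dict:
    """Build the JSON body for POST /api/text (hypothetical helper)."""
    payload = {"text": text, "language": language}
    if voice_preset is not None:
        payload["voice_preset"] = voice_preset
    if reference_audio is not None:
        # b64encode emits no line breaks, so the string embeds safely in JSON.
        payload["reference_audio"] = base64.b64encode(reference_audio).decode("ascii")
    return payload


def synthesize(payload: dict, base_url: str = BASE_URL, timeout: float = 120.0) -> dict:
    """POST the payload to /api/text and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/api/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage (requires a deployed endpoint and a local reference.wav):
#   from pathlib import Path
#   ref = Path("reference.wav").read_bytes()
#   result = synthesize(build_payload("Ola mundo!", "pt", reference_audio=ref))
```

Sending the audio from Python also sidesteps shell argument-length limits, which can be hit when a long base64 string is interpolated into a curl command line.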
Response Format
{
"audio": "<base64 WAV>",
"sample_rate": 24000,
"format": "wav",
"language": "pt",
"text": "Ola mundo!",
"voice_preset": "female1",
"voice_cloned": true,
"duration_seconds": 2.5,
"generation_time": 3.2
}
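To use the response, the `audio` field has to be base64-decoded back into WAV bytes. The sketch below shows one way to do that and to cross-check the reported duration. It assumes the WAV is uncompressed PCM, which is the only encoding Python's `wave` module reads; the function names are illustrative. It also computes a real-time factor from `generation_time` and `duration_seconds`; for the sample response above that is 3.2 / 2.5 = 1.28, meaning generation took longer than the audio lasts.

```python
import base64
import io
import wave


def decode_audio(response: dict) -> bytes:
    """Return the raw WAV bytes carried in the response's "audio" field."""
    return base64.b64decode(response["audio"])


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Compute duration from the WAV header: frame count / sample rate.

    Assumes PCM WAV; wave.open raises wave.Error for other encodings.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(response: dict) -> float:
    """generation_time / duration_seconds; below 1.0 is faster than real time."""
    return response["generation_time"] / response["duration_seconds"]


# Example usage, given a parsed response dict named `result`:
#   wav_bytes = decode_audio(result)
#   with open("output.wav", "wb") as f:
#       f.write(wav_bytes)
```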
Files
| File | Description |
|---|---|
| modal_deploy.py | Deploy script for Modal.com |
| voice-presets/ | Reference audio files for the voice presets (male/female) |
| tests/moss-tts-test.py | Test script |
Base Model
- Model: OpenMOSS-Team/MOSS-TTS-Realtime (1.7B params)
- Codec: OpenMOSS-Team/MOSS-Audio-Tokenizer
- Sample Rate: 24000 Hz
- GPU: NVIDIA A10G (Modal.com)
Part of the Parle Project
This repo is part of the Parle project, a language-learning platform with real-time voice conversation.