Samarth Naik committed
Commit 3135113 · 1 Parent(s): a66eb6f

feat: Add production-ready Text-to-Speech API with Coqui TTS

- Implement FastAPI-based TTS API with xtts_v2 model
- Add support for voice cloning with speaker WAV files
- Include comprehensive error handling and logging
- Add GET and POST endpoints for flexible usage
- Configure for CPU-only inference on Hugging Face Spaces
- Add test suite and documentation
- Update HF Space config from Gradio to FastAPI

Files changed (4)
  1. README.md +104 -7
  2. app.py +209 -0
  3. requirements.txt +27 -0
  4. test_api.py +154 -0
README.md CHANGED
@@ -1,12 +1,109 @@
  ---
- title: Ttlm
- emoji: 🐒
- colorFrom: green
- colorTo: red
- sdk: gradio
- sdk_version: 6.2.0
+ title: Text-to-Speech API
+ emoji: 🗣️
+ colorFrom: blue
+ colorTo: purple
+ sdk: fastapi
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Text-to-Speech API with Coqui TTS
+
+ A production-ready Text-to-Speech API built with FastAPI and Coqui TTS, designed to run on Hugging Face Spaces.
+
+ ## Features
+
+ - **High-Quality TTS**: Uses Coqui's `xtts_v2` multilingual model
+ - **Voice Cloning**: Optional speaker reference for voice cloning
+ - **CPU Optimized**: Runs efficiently on CPU-only environments
+ - **REST API**: Simple GET/POST endpoints
+ - **Production Ready**: Proper error handling, logging, and health checks
+
+ ## API Usage
+
+ ### Simple GET Request
+ ```bash
+ curl "https://your-space-url/tts?text=Hello%20world&language=en"
+ ```
+
+ ### POST with JSON
+ ```bash
+ curl -X POST "https://your-space-url/tts" \
+   -H "Content-Type: application/json" \
+   -d '{"text": "Hello world", "language": "en"}'
+ ```
+
+ ### POST with Voice Cloning
+ ```bash
+ curl -X POST "https://your-space-url/tts" \
+   -F "text=Hello world" \
+   -F "language=en" \
+   -F "speaker_wav=@path/to/speaker.wav"
+ ```
+
+ ## Endpoints
+
+ - `GET /` - Health check
+ - `GET /tts` - Simple text-to-speech conversion
+ - `POST /tts` - Advanced TTS with optional voice cloning
+ - `GET /health` - Detailed health status
+
+ ## Supported Languages
+
+ The XTTS v2 model supports multiple languages including:
+ - English (en)
+ - Spanish (es)
+ - French (fr)
+ - German (de)
+ - Italian (it)
+ - Portuguese (pt)
+ - Polish (pl)
+ - Turkish (tr)
+ - Russian (ru)
+ - Dutch (nl)
+ - Czech (cs)
+ - Arabic (ar)
+ - Chinese (zh-cn)
+ - Japanese (ja)
+ - Hungarian (hu)
+ - Korean (ko)
+
+ ## Response
+
+ All endpoints return a WAV audio file that can be played directly in browsers or audio players.
+
+ ## Local Development
+
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the application
+ python app.py
+ ```
+
+ The API will be available at `http://localhost:7860`
+
+ ## Model Information
+
+ This application uses the `tts_models/multilingual/multi-dataset/xtts_v2` model from Coqui TTS, which provides:
+ - High-quality multilingual speech synthesis
+ - Voice cloning capabilities
+ - CPU-friendly inference
+ - Support for 16+ languages
+
+ ## Error Handling
+
+ The API includes comprehensive error handling for:
+ - Invalid text input
+ - Unsupported file formats
+ - Model loading failures
+ - Audio generation errors
+
+ ## Performance Notes
+
+ - Model loads once at startup (not per request)
+ - Optimized for CPU inference
+ - Temporary files are automatically cleaned up
+ - Response streaming for large audio files
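
The "Temporary files are automatically cleaned up" note in the README depends on the caller doing the cleanup, because `tempfile.NamedTemporaryFile(delete=False)` never removes the file on its own. In the `app.py` added by this commit, the uploaded speaker file is removed in a `finally` block, while the generated output file is only removed on the error path. The following is a minimal sketch of one way to scope a temporary upload so that removal cannot be forgotten. It uses only the standard library, and the helper name `temp_upload` is hypothetical, not something this commit defines.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_upload(data: bytes, suffix: str = ".wav"):
    """Write `data` to a temporary file and remove it when the block exits.

    Hypothetical helper for illustration; not part of app.py.
    """
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        tmp.write(data)
        tmp.close()  # close first so the path can be reopened on any platform
        yield tmp.name
    finally:
        if not tmp.closed:
            tmp.close()
        try:
            os.unlink(tmp.name)
        except FileNotFoundError:
            pass  # the caller already removed the file


if __name__ == "__main__":
    with temp_upload(b"placeholder bytes", suffix=".wav") as path:
        print("inside block, file exists:", os.path.exists(path))
    print("after block, file exists:", os.path.exists(path))
```

This pattern suits the speaker reference, which is only needed while synthesis runs. The response file is a different case: it has to outlive the handler, so it would need to be deleted after the response has been sent, for example through the `background` argument that Starlette's `FileResponse` accepts.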
app.py ADDED
@@ -0,0 +1,209 @@
+ """
+ Text-to-Speech API using Coqui TTS
+ Production-ready FastAPI application for Hugging Face Spaces
+ """
+
+ import os
+ import tempfile
+ import logging
+ from pathlib import Path
+ from typing import Optional
+
+ from fastapi import FastAPI, HTTPException, UploadFile, File, Form
+ from fastapi.responses import FileResponse
+ from pydantic import BaseModel
+ import uvicorn
+
+ # Import TTS
+ try:
+     from TTS.api import TTS
+ except ImportError:
+     raise ImportError("TTS library not found. Please install coqui-tts: pip install coqui-tts")
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Initialize FastAPI app
+ app = FastAPI(
+     title="Text-to-Speech API",
+     description="Production-ready TTS API using Coqui TTS",
+     version="1.0.0"
+ )
+
+ # Global TTS model variable
+ tts_model = None
+
+ # Request models
+ class TTSRequest(BaseModel):
+     text: str
+     language: Optional[str] = "en"
+
+
+ @app.on_event("startup")
+ async def startup_event():
+     """
+     Load the TTS model once at startup to avoid loading it on every request.
+     Using the highest-quality open-source multilingual model.
+     """
+     global tts_model
+     try:
+         logger.info("Loading TTS model...")
+         # Using the high-quality multilingual model that works on CPU
+         model_name = "tts_models/multilingual/multi-dataset/xtts_v2"
+         tts_model = TTS(model_name=model_name, progress_bar=False)
+
+         # Ensure we're using CPU (important for Hugging Face Spaces)
+         if hasattr(tts_model, 'to'):
+             tts_model.to("cpu")
+
+         logger.info("TTS model loaded successfully!")
+     except Exception as e:
+         logger.error(f"Failed to load TTS model: {str(e)}")
+         raise
+
+
+ @app.get("/")
+ async def root():
+     """Health check endpoint"""
+     return {
+         "status": "healthy",
+         "message": "Text-to-Speech API is running",
+         "model": "tts_models/multilingual/multi-dataset/xtts_v2"
+     }
+
+
+ @app.get("/tts")
+ async def tts_get(text: str, language: str = "en"):
+     """
+     Simple GET endpoint for TTS
+     Usage: GET /tts?text=Hello%20world&language=en
+     """
+     if not text or len(text.strip()) == 0:
+         raise HTTPException(status_code=400, detail="Text parameter is required")
+
+     return await generate_speech(text, language)
+
+
+ @app.post("/tts")
+ async def tts_post(
+     request: TTSRequest = None,
+     text: str = Form(None),
+     language: str = Form("en"),
+     speaker_wav: UploadFile = File(None)
+ ):
+     """
+     POST endpoint for TTS with optional voice cloning
+     Accepts JSON body or form data with optional speaker WAV file
+     """
+     # Handle different input formats
+     if request:
+         input_text = request.text
+         input_language = request.language
+     elif text:
+         input_text = text
+         input_language = language
+     else:
+         raise HTTPException(status_code=400, detail="Text is required")
+
+     if not input_text or len(input_text.strip()) == 0:
+         raise HTTPException(status_code=400, detail="Text cannot be empty")
+
+     # Handle speaker WAV file if provided
+     speaker_wav_path = None
+     if speaker_wav:
+         try:
+             # Save uploaded speaker file temporarily
+             speaker_suffix = Path(speaker_wav.filename).suffix if speaker_wav.filename else ".wav"
+             with tempfile.NamedTemporaryFile(delete=False, suffix=speaker_suffix) as tmp_speaker:
+                 content = await speaker_wav.read()
+                 tmp_speaker.write(content)
+                 speaker_wav_path = tmp_speaker.name
+         except Exception as e:
+             logger.error(f"Error processing speaker WAV file: {str(e)}")
+             raise HTTPException(status_code=400, detail="Invalid speaker WAV file")
+
+     try:
+         return await generate_speech(input_text, input_language, speaker_wav_path)
+     finally:
+         # Clean up speaker file
+         if speaker_wav_path and os.path.exists(speaker_wav_path):
+             try:
+                 os.unlink(speaker_wav_path)
+             except OSError:
+                 pass
+
+
+ async def generate_speech(text: str, language: str = "en", speaker_wav_path: str = None):
+     """
+     Generate speech from text using the loaded TTS model
+     """
+     if not tts_model:
+         raise HTTPException(status_code=503, detail="TTS model not loaded")
+
+     try:
+         # Create temporary file for output
+         with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_file:
+             output_path = tmp_file.name
+
+         logger.info(f"Generating speech for text: '{text[:50]}...' in language: {language}")
+
+         # Generate speech
+         if speaker_wav_path and os.path.exists(speaker_wav_path):
+             # Voice cloning with speaker reference
+             logger.info("Using voice cloning with speaker reference")
+             tts_model.tts_to_file(
+                 text=text,
+                 file_path=output_path,
+                 speaker_wav=speaker_wav_path,
+                 language=language
+             )
+         else:
+             # Standard TTS without voice cloning
+             tts_model.tts_to_file(
+                 text=text,
+                 file_path=output_path,
+                 language=language
+             )
+
+         # Verify the file was created and has content
+         if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
+             raise Exception("Generated audio file is empty or was not created")
+
+         logger.info(f"Speech generated successfully, file size: {os.path.getsize(output_path)} bytes")
+
+         # Return the audio file
+         return FileResponse(
+             path=output_path,
+             media_type="audio/wav",
+             filename="generated_speech.wav",
+             headers={
+                 "Content-Disposition": "attachment; filename=generated_speech.wav",
+                 "Cache-Control": "no-cache"
+             }
+         )
+
+     except Exception as e:
+         logger.error(f"Error generating speech: {str(e)}")
+         # Clean up output file on error
+         if 'output_path' in locals() and os.path.exists(output_path):
+             try:
+                 os.unlink(output_path)
+             except OSError:
+                 pass
+         raise HTTPException(status_code=500, detail=f"Failed to generate speech: {str(e)}")
+
+
+ @app.get("/health")
+ async def health_check():
+     """Detailed health check endpoint"""
+     return {
+         "status": "healthy",
+         "model_loaded": tts_model is not None,
+         "model_name": "tts_models/multilingual/multi-dataset/xtts_v2"
+     }
+
+
+ if __name__ == "__main__":
+     # For local development
+     uvicorn.run(app, host="0.0.0.0", port=7860)
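
`tts_post` declares a JSON body model (`TTSRequest`) next to `Form` and `File` parameters. FastAPI's documentation on request forms and files notes that a single path operation cannot receive both, because a request body is encoded either as JSON or as form data. One possible alternative is to read the raw body and branch on the `Content-Type` header. The sketch below shows that branching with the standard library only. The function name `parse_tts_payload` and its return shape are assumptions made for illustration, and multipart bodies (which a `speaker_wav` upload would need) are deliberately left out.

```python
import json
from urllib.parse import parse_qs


def parse_tts_payload(content_type: str, body: bytes) -> dict:
    """Extract `text` and `language` from a JSON or URL-encoded form body.

    Hypothetical helper for illustration; raises ValueError on unusable input.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()

    if media_type == "application/json":
        try:
            data = json.loads(body.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            raise ValueError("Body is not valid JSON") from exc
        if not isinstance(data, dict):
            raise ValueError("JSON body must be an object")
        text = data.get("text")
        language = data.get("language") or "en"
    elif media_type == "application/x-www-form-urlencoded":
        # parse_qs maps each field to a list of values and skips blank ones.
        fields = parse_qs(body.decode("utf-8"))
        text = fields.get("text", [None])[0]
        language = fields.get("language", ["en"])[0]
    else:
        raise ValueError(f"Unsupported content type: {media_type or 'missing'}")

    if not isinstance(text, str) or not text.strip():
        raise ValueError("Text cannot be empty")
    return {"text": text, "language": language}
```

In a handler, a `ValueError` from this function would map naturally onto the 400 responses that `tts_post` already raises for missing or empty text.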
requirements.txt ADDED
@@ -0,0 +1,27 @@
+ # Core web framework
+ fastapi==0.104.1
+ uvicorn[standard]==0.24.0
+
+ # Text-to-Speech engine
+ coqui-tts==0.21.1
+
+ # File handling and HTTP
+ python-multipart==0.0.6
+ python-dateutil==2.8.2
+
+ # Audio processing dependencies (required by Coqui TTS)
+ numpy==1.24.3
+ scipy==1.11.4
+ librosa==0.10.1
+ soundfile==0.12.1
+
+ # Machine learning dependencies
+ torch==2.0.1
+ torchaudio==2.0.2
+
+ # Hugging Face integration
+ transformers==4.35.2
+
+ # Utilities
+ pydantic==2.5.0
+ typing-extensions==4.8.0
test_api.py ADDED
@@ -0,0 +1,154 @@
+ #!/usr/bin/env python3
+ """
+ Simple test script for the Text-to-Speech API
+ Run this to test the API locally
+ """
+
+ import requests
+ import time
+ import os
+
+ # Configuration
+ API_BASE_URL = "http://localhost:7860"
+ TEST_TEXTS = [
+     "Hello world, this is a test of the text to speech API.",
+     "The quick brown fox jumps over the lazy dog.",
+     "Welcome to our production-ready TTS service!"
+ ]
+
+ def test_health_check():
+     """Test the health check endpoints"""
+     print("🔍 Testing health check endpoints...")
+
+     try:
+         # Test root endpoint
+         response = requests.get(f"{API_BASE_URL}/")
+         print(f"GET / - Status: {response.status_code}")
+         print(f"Response: {response.json()}")
+
+         # Test health endpoint
+         response = requests.get(f"{API_BASE_URL}/health")
+         print(f"GET /health - Status: {response.status_code}")
+         print(f"Response: {response.json()}")
+
+     except requests.exceptions.ConnectionError:
+         print("❌ Could not connect to the API. Make sure it's running on localhost:7860")
+         return False
+
+     return True
+
+ def test_get_endpoint():
+     """Test the GET TTS endpoint"""
+     print("\n🎤 Testing GET TTS endpoint...")
+
+     for i, text in enumerate(TEST_TEXTS):
+         try:
+             params = {
+                 "text": text,
+                 "language": "en"
+             }
+
+             print(f"Testing text {i+1}: '{text[:30]}...'")
+             response = requests.get(f"{API_BASE_URL}/tts", params=params)
+
+             if response.status_code == 200:
+                 # Save the audio file
+                 filename = f"test_output_get_{i+1}.wav"
+                 with open(filename, "wb") as f:
+                     f.write(response.content)
+                 print(f"✅ Audio saved as {filename} ({len(response.content)} bytes)")
+             else:
+                 print(f"❌ Error: {response.status_code} - {response.text}")
+
+         except Exception as e:
+             print(f"❌ Exception: {str(e)}")
+
+ def test_post_endpoint():
+     """Test the POST TTS endpoint"""
+     print("\n🎵 Testing POST TTS endpoint...")
+
+     for i, text in enumerate(TEST_TEXTS):
+         try:
+             data = {
+                 "text": text,
+                 "language": "en"
+             }
+
+             print(f"Testing text {i+1}: '{text[:30]}...'")
+             response = requests.post(f"{API_BASE_URL}/tts", json=data)
+
+             if response.status_code == 200:
+                 # Save the audio file
+                 filename = f"test_output_post_{i+1}.wav"
+                 with open(filename, "wb") as f:
+                     f.write(response.content)
+                 print(f"✅ Audio saved as {filename} ({len(response.content)} bytes)")
+             else:
+                 print(f"❌ Error: {response.status_code} - {response.text}")
+
+         except Exception as e:
+             print(f"❌ Exception: {str(e)}")
+
+ def test_form_endpoint():
+     """Test the POST TTS endpoint with form data"""
+     print("\n📋 Testing POST TTS endpoint with form data...")
+
+     try:
+         data = {
+             "text": "This is a test using form data submission.",
+             "language": "en"
+         }
+
+         response = requests.post(f"{API_BASE_URL}/tts", data=data)
+
+         if response.status_code == 200:
+             filename = "test_output_form.wav"
+             with open(filename, "wb") as f:
+                 f.write(response.content)
+             print(f"✅ Audio saved as {filename} ({len(response.content)} bytes)")
+         else:
+             print(f"❌ Error: {response.status_code} - {response.text}")
+
+     except Exception as e:
+         print(f"❌ Exception: {str(e)}")
+
+ def cleanup_test_files():
+     """Clean up generated test files"""
+     print("\n🧹 Cleaning up test files...")
+
+     test_files = [f for f in os.listdir(".") if f.startswith("test_output_") and f.endswith(".wav")]
+
+     for file in test_files:
+         try:
+             os.remove(file)
+             print(f"Removed {file}")
+         except Exception as e:
+             print(f"Could not remove {file}: {str(e)}")
+
+ if __name__ == "__main__":
+     print("🚀 Starting TTS API Test Suite")
+     print("=" * 50)
+
+     # Test health check first
+     if not test_health_check():
+         print("\n❌ Health check failed. Exiting.")
+         exit(1)
+
+     # Wait a moment for the model to be ready
+     print("\n⏳ Waiting for model to be ready...")
+     time.sleep(2)
+
+     # Run tests
+     test_get_endpoint()
+     test_post_endpoint()
+     test_form_endpoint()
+
+     print("\n" + "=" * 50)
+     print("✅ Test suite completed!")
+     print("\nTo clean up test files, run:")
+     print("python test_api.py --cleanup")
+
+     # Check if cleanup flag is provided
+     import sys
+     if "--cleanup" in sys.argv:
+         cleanup_test_files()
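
The test script pauses for a fixed `time.sleep(2)` before calling the TTS endpoints, although `/health` already reports a `model_loaded` flag. A sketch of polling that flag instead is shown here. The names `wait_until_ready` and `fetch_health` are hypothetical, and the fetcher is passed in as a callable so the loop can be exercised without a running server.

```python
import time
from typing import Callable


def wait_until_ready(
    fetch_health: Callable[[], dict],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `fetch_health` until it reports `model_loaded`, or give up.

    Hypothetical helper for illustration. `fetch_health` should return the
    parsed JSON of GET /health; connection errors count as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch_health().get("model_loaded"):
                return True
        except OSError:
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

With `requests`, the fetcher could be something like `lambda: requests.get(f"{API_BASE_URL}/health", timeout=5).json()`. Catching `OSError` is enough for that case because the exceptions raised by `requests` derive from it.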