Shantika committed
Commit 598efec · Parent: d6c8310

Upload full project to Space
IMPLEMENTATION_NOTES.md ADDED
# Implementation Notes

## Architecture Overview

The STT system is built in five progressive steps, each adding functionality on top of the previous one:

1. **Step 1**: Basic offline transcription (Whisper/Vosk)
2. **Step 2**: HTTP API for file uploads
3. **Step 3**: WebSocket streaming for real-time audio
4. **Step 4**: Telephony audio format support (Twilio/Exotel)
5. **Step 5**: Production-ready with stability features

## Key Components

### Audio Processing

- **TelephonyAudioConverter**: Handles format conversion
  - Twilio: 8kHz μ-law → 16kHz PCM
  - Exotel: 8kHz PCM → 16kHz PCM
  - Uses `scipy.signal.resample` for sample-rate conversion
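A minimal sketch of the Twilio path (μ-law decode via the stdlib `audioop` module, upsampling via `scipy.signal.resample`; the function name here is illustrative, not the converter's actual API):

```python
import audioop  # stdlib; deprecated since Python 3.11, removed in 3.13

import numpy as np
from scipy.signal import resample

def twilio_to_pcm16k(mulaw_bytes: bytes) -> np.ndarray:
    """Decode 8kHz mu-law to int16 PCM, then upsample to 16kHz for the STT model."""
    pcm_8k = np.frombuffer(audioop.ulaw2lin(mulaw_bytes, 2), dtype=np.int16)
    # Resample to twice as many samples: 8kHz -> 16kHz
    pcm_16k = resample(pcm_8k.astype(np.float64), len(pcm_8k) * 2)
    return np.clip(pcm_16k, -32768, 32767).astype(np.int16)
```

The Exotel path skips the μ-law decode and only needs the resampling step.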
### Voice Activity Detection (VAD)

- Simple energy-based VAD in Step 5
- Threshold: 0.01 (configurable)
- Frame-based analysis (25ms frames)
- Distinguishes speech from silence
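The energy-based VAD described above can be sketched roughly like this (frame size and threshold follow the values listed; the function names are illustrative):

```python
import numpy as np

FRAME_MS = 25
THRESHOLD = 0.01  # RMS energy threshold on float audio in [-1, 1]

def is_speech(frame: np.ndarray) -> bool:
    """Return True if a frame's RMS energy exceeds the threshold."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > THRESHOLD

def frames(audio: np.ndarray, sample_rate: int = 16000):
    """Split audio into consecutive 25ms frames (400 samples at 16kHz)."""
    n = int(sample_rate * FRAME_MS / 1000)
    for i in range(0, len(audio) - n + 1, n):
        yield audio[i:i + n]
```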
### Audio Buffering

- **AudioBuffer**: Accumulates audio chunks
- Configurable chunk duration (default: 1.0s)
- Minimum interval between transcriptions (0.5s)
- Handles silence timeouts (3.0s)
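An illustrative sketch of the buffering behavior above (the real `AudioBuffer` interface may differ; this shows the chunk-size and minimum-interval gating for 16-bit mono 16kHz audio):

```python
import time

class AudioBuffer:
    """Accumulates raw audio chunks and gates when transcription may run."""

    def __init__(self, chunk_duration=1.0, min_interval=0.5, sample_rate=16000):
        self.buf = bytearray()
        self.chunk_bytes = int(chunk_duration * sample_rate * 2)  # 16-bit mono
        self.min_interval = min_interval
        self.last_transcribe = 0.0

    def add(self, chunk: bytes):
        self.buf.extend(chunk)

    def ready(self) -> bool:
        # Enough audio buffered, and the minimum interval has elapsed
        return (len(self.buf) >= self.chunk_bytes
                and time.monotonic() - self.last_transcribe >= self.min_interval)

    def take(self) -> bytes:
        """Hand the buffered audio to the transcriber and reset the buffer."""
        data, self.buf = bytes(self.buf), bytearray()
        self.last_transcribe = time.monotonic()
        return data
```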
### Duplicate Prevention

- Compares each new transcription with the previous one
- Prevents sending identical text multiple times
- Simple substring matching (can be enhanced)
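The substring check can be as small as this (a sketch of the idea, not the exact code; here a transcription counts as a duplicate if either string contains the other after normalization):

```python
def is_duplicate(new_text: str, previous: str) -> bool:
    """Treat a transcription as duplicate if it repeats the previous one."""
    a, b = new_text.strip().lower(), previous.strip().lower()
    if not a:
        return True  # nothing new to send
    return a in b or b in a
```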
## Things to Consider

### Performance

1. **Model Loading**: Whisper models are loaded per connection (lazy loading)
   - Consider model caching/pooling for production
   - Larger models (medium/large) are more accurate but slower

2. **Chunk Size**: Balance between latency and accuracy
   - Smaller chunks = lower latency but less context
   - Larger chunks = better accuracy but higher latency

3. **Concurrent Connections**: Each connection loads its own model
   - Consider shared model instances across connections
   - Monitor memory usage with many concurrent calls
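The shared-instance idea above can be sketched as a small thread-safe pool (a hypothetical helper, not part of the codebase; in practice the loader would be something like `whisper.load_model`):

```python
import threading

class ModelPool:
    """Thread-safe lazy cache: each model is loaded once and shared."""

    def __init__(self, loader):
        self._loader = loader
        self._models = {}
        self._lock = threading.Lock()

    def get(self, name: str):
        with self._lock:
            if name not in self._models:
                self._models[name] = self._loader(name)
            return self._models[name]
```

Each WebSocket handler would then call `pool.get("base")` instead of loading its own copy.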
### Audio Quality

1. **Sample Rate**: Whisper works best with 16kHz
   - Telephony audio (8kHz) must be upsampled
   - Quality may be reduced compared to native 16kHz recordings

2. **Noise**: Telephony audio often has background noise
   - Consider noise-reduction preprocessing
   - VAD filters silence but not noise

3. **Format Conversion**: μ-law to PCM conversion may introduce artifacts
   - Test with real telephony audio
   - Consider alternative conversion methods if quality is poor

### Stability & Reliability

1. **Disconnections**: Handled gracefully in Step 5
   - Final transcription runs on the remaining buffer
   - Session cleanup on disconnect

2. **Error Handling**: Comprehensive error catching
   - Errors are logged per call
   - Processing continues after individual failures

3. **Logging**: Per-call logging in Step 5
   - Logs stored in `logs/stt.log`
   - Includes `call_id` for tracking

### Scaling Considerations

1. **Model Memory**: Whisper models are large (base ~150MB, large ~3GB)
   - Consider GPU acceleration for faster inference
   - Model quantization reduces memory

2. **API Rate Limiting**: No rate limiting implemented
   - Add rate limiting for production
   - Consider request queuing

3. **Database**: No persistent storage
   - Add a database for call transcripts
   - Store session metadata

4. **Load Balancing**: Single-server implementation
   - Consider multiple workers/instances
   - Use a message queue for audio processing

### Security

1. **Authentication**: No authentication implemented
   - Add API keys/tokens
   - Authenticate WebSocket connections

2. **Input Validation**: Basic validation only
   - Validate audio format/size
   - Rate-limit per client

3. **Data Privacy**: Transcripts are logged
   - Consider encryption for sensitive data
   - Implement data-retention policies

## Testing Recommendations

1. **Unit Tests**: Test audio conversion functions
2. **Integration Tests**: Test WebSocket streaming with real audio
3. **Load Tests**: Test with multiple concurrent connections
4. **Telephony Tests**: Test with actual Twilio/Exotel audio streams

## Future Enhancements

1. **Better VAD**: Use a more sophisticated VAD (e.g., WebRTC VAD)
2. **Streaming Model**: Use streaming-capable models for lower latency
3. **Language Detection**: Auto-detect the spoken language
4. **Speaker Diarization**: Identify different speakers
5. **Punctuation**: Better punctuation in transcripts
6. **Timestamping**: Word-level timestamps
7. **Confidence Scores**: Return per-word confidence scores
README.md CHANGED — the auto-generated Space stub was removed:

```
---
title: NeuralvoiceGPU
emoji: 🔥
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 6.3.0
python_version: '3.12'
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
```

It was replaced with the full project README:
# NeuralVoice AI

A real-time voice AI assistant that combines Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS) for natural phone conversations. Built with FastAPI, Vosk, OpenAI, and Piper TTS, and integrated with Twilio for telephony.

## 🎯 Overview

NeuralVoice AI enables real-time bidirectional voice conversations over phone calls. The system:
- **Listens** to caller speech using Vosk STT
- **Understands** and responds using OpenAI's GPT models
- **Speaks** back using Piper TTS with phone-optimized audio processing
- **Handles** natural conversation flow with barge-in support and voice activity detection

## Features

- **Real-time Speech Recognition**: Vosk-based STT with voice activity detection (VAD)
- **Intelligent Responses**: OpenAI GPT integration for contextual conversations
- **Natural Voice Synthesis**: Piper TTS with phone-optimized audio filters
- **Barge-in Support**: Callers can interrupt the AI mid-sentence
- **WebSocket Streaming**: Low-latency bidirectional audio streaming via Twilio Media Streams
- **Web Dashboard**: React-based frontend for monitoring live call transcripts
- **Production-Ready**: Error handling, keepalives, and session management

## 🏗️ Architecture

```
┌─────────────┐
│   Twilio    │ ← Phone calls
└──────┬──────┘
       │ WebSocket (8kHz μ-law)
       ↓
┌─────────────────────────────────┐
│   FastAPI Backend (Python)      │
│  ┌──────────────────────────┐   │
│  │  STT (Vosk)              │   │
│  │    ↓                     │   │
│  │  LLM (OpenAI GPT)        │   │
│  │    ↓                     │   │
│  │  TTS (Piper + ffmpeg)    │   │
│  └──────────────────────────┘   │
└──────┬──────────────────────────┘
       │
       ├─→ WebSocket → React Frontend (Live Transcripts)
       └─→ WebSocket → Twilio (Audio Playback)
```

## 📁 Project Structure

```
nv2/
├── stt_llm_ttsopenai.py     # Main production server (STT+LLM+TTS pipeline)
├── pipe_method3.py          # Alternative implementation with improved VAD
├── download_models.py       # Script to download Vosk and Piper models
├── requirements.txt         # Python dependencies
├── Dockerfile               # Docker configuration
├── start.sh                 # Startup script
├── IMPLEMENTATION_NOTES.md  # Technical implementation details
├── README.md                # This file
└── web_demo/                # React frontend
    ├── src/
    │   ├── App.jsx                  # Main React app
    │   ├── components/
    │   │   ├── MicrophoneTest.jsx   # STT testing component
    │   │   ├── TextToSpeech.jsx     # TTS testing component
    │   │   └── SttLlmTts.jsx        # Full pipeline testing
    │   └── ...
    ├── package.json
    └── vite.config.js
```

## 🚀 Quick Start

### Prerequisites

- Python 3.8+
- Node.js 16+ (for the frontend)
- ffmpeg (for audio processing)
- Piper TTS binary (or install via package manager)
- OpenAI API key
- Twilio account (for phone integration)

### Installation

1. **Clone the repository**
   ```bash
   git clone https://github.com/NuralVoice-AI-Model/NeuralVoiceAI.git
   cd NeuralVoiceAI
   ```

2. **Install Python dependencies**
   ```bash
   pip install -r requirements.txt
   ```

3. **Download AI models**
   ```bash
   python download_models.py
   ```
   This downloads:
   - Vosk STT model (English, ~1.8GB)
   - Piper TTS model (English, ~50MB)

4. **Install frontend dependencies**
   ```bash
   cd web_demo
   npm install
   cd ..
   ```

5. **Set environment variables**
   ```bash
   export OPENAI_API_KEY="your-openai-api-key"
   export OPENAI_MODEL="gpt-4o-mini"   # or gpt-4, gpt-3.5-turbo
   export PIPER_BIN="piper"            # or full path to the piper binary
   export PIPER_MODEL_PATH="models/piper/en_US-lessac-medium.onnx"
   export VOSK_MODEL_PATH="models/vosk-model-en-us-0.22-lgraph"
   export TWILIO_STREAM_URL="wss://your-domain.com/stream"  # For Twilio integration
   export PORT=8080
   ```

### Running the Application

1. **Start the backend server**
   ```bash
   python stt_llm_ttsopenai.py
   # or
   python pipe_method3.py
   ```
   The server starts on `http://0.0.0.0:8080`.

2. **Start the frontend (optional)**
   ```bash
   cd web_demo
   npm run dev
   ```
   The frontend is available at `http://localhost:5173`.

3. **Configure Twilio (for phone calls)**
   - Set your Twilio voice webhook to `https://your-domain.com/voice`
   - Ensure `TWILIO_STREAM_URL` points to your WebSocket endpoint
   - Use ngrok or similar for local development:
     ```bash
     ngrok http 8080
     # Set TWILIO_STREAM_URL to wss://your-ngrok-url.ngrok.io/stream
     ```

## 🔧 Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `OPENAI_API_KEY` | OpenAI API key (required) | - |
| `OPENAI_MODEL` | OpenAI model to use | `gpt-4o-mini` |
| `VOSK_MODEL_PATH` | Path to Vosk STT model | `models/vosk-model-en-us-0.22-lgraph` |
| `PIPER_BIN` | Path to Piper TTS binary | `piper` |
| `PIPER_MODEL_PATH` | Path to Piper TTS model | - |
| `TWILIO_STREAM_URL` | WebSocket URL for Twilio streams | - |
| `HOST` | Server host | `0.0.0.0` |
| `PORT` | Server port | `8080` |

### Tuning Parameters

In `stt_llm_ttsopenai.py` or `pipe_method3.py`, you can adjust:

- **STT Latency**: `SILENCE_MS`, `STABLE_PARTIAL_MS`
- **VAD Sensitivity**: `RMS_SPEECH_THRESHOLD`, `SPEECH_START_FRAMES`
- **LLM Response**: `SYSTEM_PROMPT`, `max_tokens`, `temperature`
- **TTS Chunking**: `CHUNK_MAX_CHARS`, `CHUNK_END_RE`

## 📡 API Endpoints

### HTTP Endpoints

- `GET /health` - Health check
- `POST /voice` - Twilio webhook (returns TwiML)
- `GET /voice` - Twilio webhook (GET method)

### WebSocket Endpoints

- `WS /stream` - Main audio streaming endpoint for Twilio
- `WS /client-ws` - Frontend client WebSocket for live transcripts

## 🎤 Usage

### Making a Phone Call

1. Configure Twilio to call your `/voice` endpoint
2. The system will:
   - Answer the call
   - Stream audio bidirectionally
   - Transcribe speech in real time
   - Generate AI responses
   - Speak responses back to the caller

### Testing Components

The web dashboard (`web_demo`) provides three testing interfaces:

1. **Microphone Test**: Test STT with your microphone
2. **Text-to-Speech**: Test TTS with custom text
3. **STT-LLM-TTS**: Test the full pipeline

### Example Conversation Flow

```
Caller: "Hello, I need help with my account"
  ↓ [STT: Vosk transcribes]
  ↓ [LLM: OpenAI generates response]
  ↓ [TTS: Piper synthesizes audio]
AI: "I'd be happy to help. What's your account number?"
  ↓ [Caller can interrupt/barge-in at any time]
Caller: "It's 12345"
  ↓ [Process repeats...]
```

## 🐳 Docker Deployment

```bash
docker build -t neuralvoice-ai .
docker run -p 8080:8080 \
  -e OPENAI_API_KEY="your-key" \
  -e PIPER_MODEL_PATH="/app/models/piper/en_US-lessac-medium.onnx" \
  -v $(pwd)/models:/app/models \
  neuralvoice-ai
```

## 🔍 Monitoring

- **Logs**: Check console output for real-time STT, LLM, and TTS logs
- **Web Dashboard**: View live call transcripts in the React frontend
- **Health Endpoint**: `GET /health` for service status

## 🛠️ Development

### Key Components

1. **STT Engine** (`vosk`): Offline speech recognition
2. **LLM Integration** (`openai`): GPT-based conversation
3. **TTS Engine** (`piper`): Neural text-to-speech
4. **Audio Processing** (`audioop`, `ffmpeg`): Format conversion and filtering
5. **WebSocket Handler**: Real-time bidirectional streaming

### Code Flow

1. Twilio sends 8kHz μ-law audio chunks via WebSocket
2. Audio is converted to 16kHz PCM for Vosk
3. Vosk performs real-time transcription
4. VAD detects speech endpoints
5. User utterances trigger OpenAI API calls
6. LLM responses are chunked and sent to Piper TTS
7. TTS audio is processed with phone-optimized filters
8. Audio is converted back to 8kHz μ-law and streamed to Twilio
## 📝 Notes

- **Latency**: Typical end-to-end latency is 1-3 seconds
- **Barge-in**: Callers can interrupt the AI by speaking (detected via VAD)
- **Audio Quality**: Phone-optimized filters (highpass/lowpass/compand) improve clarity
- **Model Size**: The Vosk model is ~1.8GB; ensure sufficient disk space
- **Memory**: Each call uses the Vosk model (cached after first load)

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📄 License

Copyright © 2026 Blink Digital India Pvt Ltd. All rights reserved.

All code in this repository is the property of Blink Digital India Pvt Ltd. Unauthorized copying, modification, distribution, or use of this software, via any medium, is strictly prohibited without express written permission from Blink Digital India Pvt Ltd.

## 🙏 Acknowledgments

- [Vosk](https://alphacephei.com/vosk/) - Speech recognition
- [OpenAI](https://openai.com/) - Language models
- [Piper TTS](https://github.com/rhasspy/piper) - Text-to-speech
- [Twilio](https://www.twilio.com/) - Telephony platform
download_models.py ADDED

```python
import os
import urllib.request
import zipfile

def download_file(url, dest):
    if os.path.exists(dest):
        print(f"File already exists: {dest}")
        return
    print(f"Downloading {url} to {dest}...")
    urllib.request.urlretrieve(url, dest)
    print("Download complete.")

def setup_models():
    # Vosk model
    vosk_dir = "models/vosk-model-en-us-0.22-lgraph"
    if not os.path.exists(vosk_dir):
        os.makedirs("models", exist_ok=True)
        zip_path = "models/vosk-model.zip"
        download_file("https://alphacephei.com/vosk/models/vosk-model-en-us-0.22-lgraph.zip", zip_path)
        print("Extracting Vosk model...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall("models")
        os.remove(zip_path)
        print("Vosk model setup complete.")
    else:
        print("Vosk model already exists.")

    # Piper model
    piper_model_dir = "models/piper"
    os.makedirs(piper_model_dir, exist_ok=True)

    piper_onnx = os.path.join(piper_model_dir, "en_US-lessac-medium.onnx")
    piper_json = os.path.join(piper_model_dir, "en_US-lessac-medium.onnx.json")

    download_file("https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx", piper_onnx)
    download_file("https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx.json", piper_json)

if __name__ == "__main__":
    setup_models()
```
pipe_method3.py ADDED

```python
"""
Twilio Media Streams (bidirectional) + Vosk + OpenAI Answer + Piper -> Twilio playback

What this version does:
- NO intent / NO clarify JSON
- Logs only:
    STT_FINAL> ...
    LLM_ANS> ...
    TTS> ...
- Generation-id safe TTS (no self-cancel on Railway)
- Better phone clarity using ffmpeg filters (highpass/lowpass/compand)
- Proper 20ms pacing + keepalive marks to prevent WS idle timeouts
"""

import asyncio
import base64
import json
import logging
import os
import re
import tempfile
import time
import audioop
import subprocess
import threading
from dataclasses import dataclass, field
from typing import Optional, List, Dict

from fastapi import FastAPI, WebSocket, WebSocketDisconnect, Request
from fastapi.responses import PlainTextResponse, Response
from fastapi.middleware.cors import CORSMiddleware
from vosk import Model, KaldiRecognizer
from openai import OpenAI

# ----------------------------
# Logging
# ----------------------------
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("app")

def P(tag: str, msg: str):
    print(f"{tag} {msg}", flush=True)

# ----------------------------
# Env
# ----------------------------
VOSK_MODEL_PATH = os.getenv("VOSK_MODEL_PATH", "/app/models/vosk-model-en-us-0.22-lgraph").strip()
TWILIO_STREAM_URL = os.getenv("TWILIO_STREAM_URL", "").strip()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "").strip()
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini").strip()

PIPER_BIN = os.getenv("PIPER_BIN", "piper").strip()
PIPER_MODEL_PATH = os.getenv("PIPER_MODEL_PATH", "").strip()

HOST = "0.0.0.0"
PORT = int(os.getenv("PORT", "8080"))

# ----------------------------
# FastAPI
# ----------------------------
app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# ----------------------------
# Audio / Twilio
# ----------------------------
FRAME_MS = 20
INPUT_RATE = 8000
STT_RATE = 16000
BYTES_PER_20MS_MULAW = int(INPUT_RATE * (FRAME_MS / 1000.0))  # 160 bytes @ 8kHz, 20ms

# ----------------------------
# VAD settings
# ----------------------------
RMS_SPEECH_THRESHOLD = 450
SPEECH_START_FRAMES = 3
SPEECH_END_SILENCE_FRAMES = 40  # 800ms
MAX_UTTERANCE_MS = 12000
PARTIAL_EMIT_EVERY_MS = 250

# ----------------------------
# LLM prompt
# ----------------------------
SYSTEM_PROMPT = (
    "You are a phone-call assistant. "
    "Reply in 1 short sentence (max 15 words). "
    "No filler. No greetings unless user greets first."
)

# ----------------------------
# Cached Vosk model
# ----------------------------
_VOSK_MODEL = None

def now_ms() -> int:
    return int(time.time() * 1000)

def build_twiml(stream_url: str) -> str:
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="{stream_url}" />
  </Connect>
  <Pause length="600"/>
</Response>
"""

def split_mulaw_frames(mulaw_bytes: bytes) -> List[bytes]:
    frames = []
    for i in range(0, len(mulaw_bytes), BYTES_PER_20MS_MULAW):
        chunk = mulaw_bytes[i:i + BYTES_PER_20MS_MULAW]
        if len(chunk) < BYTES_PER_20MS_MULAW:
            chunk += b"\xFF" * (BYTES_PER_20MS_MULAW - len(chunk))
        frames.append(chunk)
    return frames

async def drain_queue(q: asyncio.Queue):
    try:
        while True:
            q.get_nowait()
            q.task_done()
    except asyncio.QueueEmpty:
        return

# ----------------------------
# OpenAI
# ----------------------------
def openai_client() -> OpenAI:
    if not OPENAI_API_KEY:
        raise RuntimeError("OPENAI_API_KEY not set")
    return OpenAI(api_key=OPENAI_API_KEY)

def openai_answer_blocking(history: List[Dict], user_text: str) -> str:
    client = openai_client()
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    # short tail context
    tail = history[-6:] if len(history) > 1 else []
    msgs.extend(tail)
    msgs.append({"role": "user", "content": user_text})

    resp = client.chat.completions.create(
        model=OPENAI_MODEL,
        messages=msgs,
        temperature=0.3,
        max_tokens=80,
    )
    ans = (resp.choices[0].message.content or "").strip()
    return ans

# ----------------------------
# Piper TTS -> 8k mulaw (clarity improved)
# ----------------------------
def piper_tts_to_mulaw(text: str) -> bytes:
    if not PIPER_MODEL_PATH:
        raise RuntimeError("PIPER_MODEL_PATH not set")

    text = (text or "").strip()
    if not text:
        return b""

    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as wavf:
        wav_path = wavf.name
    with tempfile.NamedTemporaryFile(suffix=".mulaw", delete=False) as mlf:
        mulaw_path = mlf.name

    try:
        r1 = subprocess.run(
            [PIPER_BIN, "--model", PIPER_MODEL_PATH, "--output_file", wav_path],
            input=text.encode("utf-8"),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        if r1.returncode != 0:
            raise RuntimeError(f"piper rc={r1.returncode} stderr={r1.stderr.decode('utf-8','ignore')[:500]}")

        # Phone-clarity filter chain:
        # - highpass removes rumble
        # - lowpass removes harshness
        # - compand evens volume (helps "clarity" on phone)
        # - dynaudnorm is avoided (can pump / distort at 8k)
        af = "highpass=f=200,lowpass=f=3400,compand=attacks=0:decays=0.3:points=-80/-80|-20/-10|0/-3"

        r2 = subprocess.run(
            ["ffmpeg", "-y", "-i", wav_path,
             "-ac", "1", "-ar", "8000",
             "-af", af,
             "-f", "mulaw", mulaw_path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        if r2.returncode != 0:
            raise RuntimeError(f"ffmpeg rc={r2.returncode} stderr={r2.stderr.decode('utf-8','ignore')[:500]}")

        with open(mulaw_path, "rb") as f:
            data = f.read()

        P("TTS>", f"audio_bytes={len(data)}")
        return data
    finally:
        for p in (wav_path, mulaw_path):
            try:
                os.unlink(p)
            except Exception:
                pass

# ----------------------------
# Call state
# ----------------------------
@dataclass
class CancelFlag:
    is_set: bool = False

    def set(self):
        self.is_set = True

@dataclass
class CallState:
    call_id: str
    stream_sid: str = ""

    # vad
    in_speech: bool = False
    speech_start_count: int = 0
    silence_count: int = 0
    utter_start_ms: int = 0

    rec: Optional[KaldiRecognizer] = None

    # partials
    last_partial: str = ""
    last_partial_emit_ms: int = 0

    # outbound
    outbound_q: asyncio.Queue = field(default_factory=lambda: asyncio.Queue(maxsize=50000))
    outbound_task: Optional[asyncio.Task] = None
    keepalive_task: Optional[asyncio.Task] = None
    mark_i: int = 0

    # speaking / generation
    bot_speaking: bool = False
    cancel_llm: CancelFlag = field(default_factory=CancelFlag)
    tts_generation_id: int = 0

    # conversation history
    history: List[Dict] = field(default_factory=list)
    bot_lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    def bump_tts_generation(self) -> int:
        self.tts_generation_id += 1
        return self.tts_generation_id

# ----------------------------
# Keepalive marks (prevents WS ping timeout)
# ----------------------------
async def twilio_keepalive(ws: WebSocket, st: CallState):
    try:
        while True:
            await asyncio.sleep(10)
            if st.stream_sid:
                st.mark_i += 1
                name = f"ka_{st.mark_i}"
                await ws.send_text(json.dumps({
                    "event": "mark",
                    "streamSid": st.stream_sid,
                    "mark": {"name": name},
                }))
                P("TWILIO>", f"keepalive_mark={name}")
    except asyncio.CancelledError:
        return
    except Exception as e:
        P("SYS>", f"keepalive_error={e}")

# ----------------------------
# HTTP
# ----------------------------
@app.get("/health")
async def health():
    return {"ok": True}

@app.post("/voice")
async def voice(request: Request):
    stream_url = TWILIO_STREAM_URL
    if not stream_url:
        host = request.headers.get("host")
        if host:
            stream_url = f"wss://{host}/stream"
            P("SYS>", f"auto_stream_url={stream_url}")
    if not stream_url:
        return PlainTextResponse("TWILIO_STREAM_URL not set and host not found", status_code=500)
    return Response(content=build_twiml(stream_url), media_type="application/xml")

@app.get("/voice")
async def voice_get(request: Request):
    return await voice(request)

# ----------------------------
# WebSocket /stream
# ----------------------------
@app.websocket("/stream")
async def stream(ws: WebSocket):
    await ws.accept()
    st = CallState(call_id=str(id(ws)))
    st.history = [{"role": "system", "content": SYSTEM_PROMPT}]
    P("SYS>", f"ws_open call_id={st.call_id}")

    global _VOSK_MODEL
    if _VOSK_MODEL is None:
        P("SYS>", f"loading_vosk={VOSK_MODEL_PATH}")
        _VOSK_MODEL = Model(VOSK_MODEL_PATH)
        P("SYS>", "vosk_loaded")

    st.rec = KaldiRecognizer(_VOSK_MODEL, STT_RATE)
    st.rec.SetWords(False)

    st.outbound_task = asyncio.create_task(outbound_sender(ws, st))

    try:
        while True:
            raw = await ws.receive_text()
            msg = json.loads(raw)
            event = msg.get("event")

            if event == "start":
                st.stream_sid = msg["start"]["streamSid"]
                P("TWILIO>", f"start streamSid={st.stream_sid}")

                if st.keepalive_task is None:
                    st.keepalive_task = asyncio.create_task(twilio_keepalive(ws, st))

                # optional short greeting
                asyncio.create_task(speak_text(ws, st, "Hi! How can I help?"))

            elif event == "media":
                mulaw = base64.b64decode(msg["media"]["payload"])
                pcm16_8k = audioop.ulaw2lin(mulaw, 2)
                pcm16_16k, _ = audioop.ratecv(pcm16_8k, 2, 1, INPUT_RATE, STT_RATE, None)

                rms = audioop.rms(pcm16_16k, 2)
                is_speech = rms >= RMS_SPEECH_THRESHOLD

                # barge-in: cancel current bot audio if caller speaks
                if st.bot_speaking and is_speech:
                    await barge_in(ws, st)

                await vad_and_stt(ws, st, pcm16_16k, is_speech)

            elif event == "mark":
                name = (msg.get("mark") or {}).get("name")
                P("TWILIO>", f"mark_received={name}")

            elif event == "stop":
                P("TWILIO>", "stop")
                break

    except WebSocketDisconnect:
        P("SYS>", "ws_disconnect")
    except Exception as e:
        P("SYS>", f"ws_error={e}")
        log.exception("ws_error")
    finally:
        if st.keepalive_task:
            st.keepalive_task.cancel()
        if st.outbound_task:
            st.outbound_task.cancel()
        P("SYS>", "ws_closed")

# ----------------------------
# VAD + STT
# ----------------------------
async def vad_and_stt(ws: WebSocket, st: CallState, pcm16_16k: bytes, is_speech: bool):
    t = now_ms()

    if not st.in_speech:
        if is_speech:
            st.speech_start_count += 1
            if st.speech_start_count >= SPEECH_START_FRAMES:
                st.in_speech = True
                st.silence_count = 0
                st.utter_start_ms = t
                st.speech_start_count = 0
                st.last_partial = ""
                st.last_partial_emit_ms = 0

                st.rec = KaldiRecognizer(_VOSK_MODEL, STT_RATE)
                st.rec.SetWords(False)
        else:
            st.speech_start_count = 0
        return

    st.rec.AcceptWaveform(pcm16_16k)

    # partial logging only (UI comes later)
    if t - st.last_partial_emit_ms >= PARTIAL_EMIT_EVERY_MS:
        st.last_partial_emit_ms = t
        try:
            pj = json.loads(st.rec.PartialResult() or "{}")
            partial = (pj.get("partial") or "").strip()
        except Exception:
            partial = ""
        if partial and partial != st.last_partial:
            st.last_partial = partial
            P("STT_PART>", partial)

    if (t - st.utter_start_ms) > MAX_UTTERANCE_MS:
        await finalize_utterance(ws, st, "max_utterance")
        return

    if is_speech:
        st.silence_count = 0
        return

    st.silence_count += 1
    if st.silence_count >= SPEECH_END_SILENCE_FRAMES:
        await finalize_utterance(ws, st, f"vad_silence_{SPEECH_END_SILENCE_FRAMES*FRAME_MS}ms")

async def finalize_utterance(ws: WebSocket, st: CallState, reason: str):
    if not st.in_speech:
        return

    st.in_speech = False
    st.silence_count = 0
    st.speech_start_count = 0
    st.last_partial = ""

    try:
        j = json.loads(st.rec.FinalResult() or "{}")
    except Exception:
        j = {}

    user_text = (j.get("text") or "").strip()
    if not user_text:
        return

    P("STT_FINAL>", f"{user_text} ({reason})")

    async def bot_job():
        async with st.bot_lock:
            await answer_and_speak(ws, st, user_text)

    asyncio.create_task(bot_job())

# ----------------------------
# LLM Answer -> Speak
# ----------------------------
async def answer_and_speak(ws: WebSocket, st: CallState, user_text: str):
    st.cancel_llm = CancelFlag(False)

    # store user; keep the system prompt plus the last 8 turns
    # (slice from index 1 so the system message is not duplicated)
    st.history.append({"role": "user", "content": user_text})
    st.history = st.history[:1] + st.history[1:][-8:]

    loop = asyncio.get_running_loop()

    def worker():
        return openai_answer_blocking(st.history, user_text)

    ans = await loop.run_in_executor(None, worker)
    ans = (ans or "").strip()
    if not ans:
        ans = "Sorry, I didn’t catch that."

    P("LLM_ANS>", ans)

    # store assistant (same trim rule as above)
    st.history.append({"role": "assistant", "content": ans})
    st.history = st.history[:1] + st.history[1:][-8:]

    await speak_text(ws, st, ans)

# ----------------------------
# Barge-in (clear + drain)
# ----------------------------
async def barge_in(ws: WebSocket, st: CallState):
    st.cancel_llm.set()
    st.bump_tts_generation()  # invalidate older audio

    if st.stream_sid:
        try:
            await ws.send_text(json.dumps({"event": "clear", "streamSid": st.stream_sid}))
            P("TWILIO>", "sent_clear")
        except Exception:
            pass

    await drain_queue(st.outbound_q)
    st.bot_speaking = False

# ----------------------------
# Speak / TTS with generation-id
495
+ # ----------------------------
496
+ async def speak_text(ws: WebSocket, st: CallState, text: str):
497
+ gen = st.bump_tts_generation()
498
+
499
+ # clear previous audio
500
+ if st.stream_sid:
501
+ try:
502
+ await ws.send_text(json.dumps({"event": "clear", "streamSid": st.stream_sid}))
503
+ P("TWILIO>", "sent_clear")
504
+ except Exception:
505
+ pass
506
+ await drain_queue(st.outbound_q)
507
+
508
+ await tts_enqueue(st, text, gen)
509
+
510
+ async def tts_enqueue(st: CallState, text: str, gen: int):
511
+ my_gen = gen
512
+ st.bot_speaking = True
513
+ P("TTS>", f"text={text} gen={my_gen}")
514
+
515
+ loop = asyncio.get_running_loop()
516
+ try:
517
+ mulaw_bytes = await loop.run_in_executor(None, piper_tts_to_mulaw, text)
518
+ except Exception as e:
519
+ P("TTS_ERR>", str(e))
520
+ st.bot_speaking = False
521
+ return
522
+
523
+ if my_gen != st.tts_generation_id:
524
+ P("TTS>", f"discard_gen my_gen={my_gen} current_gen={st.tts_generation_id}")
525
+ return
526
+
527
+ for fr in split_mulaw_frames(mulaw_bytes):
528
+ if my_gen != st.tts_generation_id:
529
+ P("TTS>", f"discard_midstream my_gen={my_gen} current_gen={st.tts_generation_id}")
530
+ return
531
+ await st.outbound_q.put(base64.b64encode(fr).decode("ascii"))
532
+
533
+ await st.outbound_q.put("__END_CHUNK__")
534
+
535
+ async def outbound_sender(ws: WebSocket, st: CallState):
536
+ try:
537
+ while True:
538
+ item = await st.outbound_q.get()
539
+
540
+ if item == "__END_CHUNK__":
541
+ await asyncio.sleep(0.02)
542
+ if st.outbound_q.empty():
543
+ st.bot_speaking = False
544
+ st.outbound_q.task_done()
545
+ continue
546
+
547
+ if not st.stream_sid:
548
+ st.outbound_q.task_done()
549
+ continue
550
+
551
+ await ws.send_text(json.dumps({
552
+ "event": "media",
553
+ "streamSid": st.stream_sid,
554
+ "media": {"payload": item},
555
+ }))
556
+
557
+ st.outbound_q.task_done()
558
+ await asyncio.sleep(FRAME_MS / 1000.0)
559
+
560
+ except asyncio.CancelledError:
561
+ return
562
+ except Exception as e:
563
+ P("SYS>", f"outbound_sender_error={e}")
564
+ log.exception("outbound_sender_error")
565
+
566
+ # ----------------------------
567
+ # main
568
+ # ----------------------------
569
+ if __name__ == "__main__":
570
+ import uvicorn
571
+ P("SYS>", f"starting {HOST}:{PORT}")
572
+ uvicorn.run(app, host=HOST, port=PORT)
start.sh ADDED
@@ -0,0 +1,67 @@
+#!/bin/bash
+
+# Colors
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+echo -e "${GREEN}Starting setup for pipe_method3.py...${NC}"
+
+# 1. Activate Virtual Environment
+if [ -d ".venv" ]; then
+    source .venv/bin/activate
+else
+    echo -e "${YELLOW}Virtual environment not found. Creating one...${NC}"
+    python3 -m venv .venv
+    source .venv/bin/activate
+    pip install fastapi vosk openai uvicorn websockets
+fi
+
+# 2. Set Piper Paths
+export PIPER_BIN="$(pwd)/.venv/bin/piper"
+export PIPER_MODEL_PATH="$(pwd)/models/piper/en_US-lessac-medium.onnx"
+
+# 3. Check OpenAI API Key
+if [ -z "$OPENAI_API_KEY" ]; then
+    echo -e "${YELLOW}OPENAI_API_KEY is not set.${NC}"
+    read -p "Please enter your OpenAI API Key: " OPENAI_API_KEY
+    export OPENAI_API_KEY
+fi
+
+# 4. Setup Ngrok
+echo -e "${GREEN}Checking ngrok...${NC}"
+NGROK_URL=""
+
+# Check if ngrok is already running
+if pgrep -x "ngrok" > /dev/null; then
+    echo "ngrok is already running."
+else
+    echo "Starting ngrok..."
+    ngrok http 8002 > /dev/null &
+    sleep 3
+fi
+
+# Fetch ngrok URL
+NGROK_API_URL="http://127.0.0.1:4040/api/tunnels"
+if command -v curl > /dev/null; then
+    NGROK_URL=$(curl -s $NGROK_API_URL | grep -o '"public_url":"[^"]*' | grep -o '[^"]*$' | head -n 1)
+fi
+
+if [ -z "$NGROK_URL" ]; then
+    echo -e "${YELLOW}Could not automatically fetch ngrok URL.${NC}"
+    echo "Please ensure ngrok is running (ngrok http 8002) and set TWILIO_STREAM_URL manually if needed."
+else
+    # Convert http/https to wss
+    WSS_URL="${NGROK_URL/https/wss}"
+    WSS_URL="${WSS_URL/http/wss}"
+    WSS_URL="$WSS_URL/stream"
+
+    export TWILIO_STREAM_URL="$WSS_URL"
+    echo -e "${GREEN}Twilio Stream URL set to: ${WSS_URL}${NC}"
+    echo -e "${YELLOW}IMPORTANT: Copy the URL above (or the https version for the webhook) to your Twilio Phone Number configuration.${NC}"
+    echo -e "Webhook URL: ${NGROK_URL}/voice"
+fi
+
+# 5. Run Script
+echo -e "${GREEN}Starting pipe_method3.py...${NC}"
+python3 pipe_method3.py
stt_llm_ttsopenai.py ADDED
@@ -0,0 +1,636 @@
+import asyncio
+import base64
+import json
+import logging
+import os
+import re
+import tempfile
+import time
+import audioop
+import subprocess
+from dataclasses import dataclass, field
+from typing import Optional, List, Dict
+
+from fastapi import FastAPI, WebSocket, WebSocketDisconnect, Request
+from fastapi.responses import PlainTextResponse, Response
+from vosk import Model, KaldiRecognizer
+
+from openai import OpenAI
+
+# ----------------------------
+# Logging
+# ----------------------------
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s %(message)s",
+)
+log = logging.getLogger("stt-llm-tts")
+
+# ----------------------------
+# Env
+# ----------------------------
+VOSK_MODEL_PATH = os.getenv("VOSK_MODEL_PATH", "models/vosk-model-en-us-0.22-lgraph")
+TWILIO_STREAM_URL = os.getenv("TWILIO_STREAM_URL")  # must be wss://.../stream
+
+OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
+OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
+
+PIPER_BIN = os.getenv("PIPER_BIN", "piper")
+PIPER_MODEL_PATH = os.getenv("PIPER_MODEL_PATH", "")
+
+HOST = os.getenv("HOST", "0.0.0.0")
+PORT = int(os.getenv("PORT", "8002"))
+
+# ----------------------------
+# Tunables (latency vs accuracy)
+# ----------------------------
+# Endpointing:
+SILENCE_MS = 700         # if no audio frames for this long -> commit utterance
+STABLE_PARTIAL_MS = 650  # if partial text hasn't changed for this long -> commit
+MIN_UTTER_WORDS = 2      # ignore single-word junk
+MAX_UTTER_CHARS = 220    # safety cap
+
+# LLM->TTS chunking:
+CHUNK_MAX_CHARS = 90     # emit TTS chunk if buffer grows past this
+CHUNK_END_RE = re.compile(r"[.!?\n]")
+
+# Twilio pacing:
+FRAME_MS = 20
+MULAW_RATE = 8000
+BYTES_PER_20MS_MULAW = int(MULAW_RATE * (FRAME_MS / 1000.0))  # 160 bytes at 8kHz mulaw
+
+# Filter common garbage utterances:
+SINGLE_WORD_IGNORE = {
+    "the", "a", "an", "yeah", "yes", "no", "okay", "ok", "hmm", "um", "uh"
+}
+
+SYSTEM_PROMPT = (
+    "You are a fast phone-call assistant. "
+    "Reply in 1-2 short sentences. "
+    "Ask only one question at a time. "
+    "Be concise."
+)
+
+app = FastAPI()
+
+# ----------------------------
+# Frontend Clients
+# ----------------------------
+connected_clients: List[WebSocket] = []
+
+
+async def broadcast_transcript(role: str, text: str):
+    """Broadcasts a transcript message to all connected frontend clients."""
+    if not connected_clients:
+        return
+
+    message = {
+        "type": "transcript",
+        "role": role,
+        "text": text,
+        "timestamp": now_ms()
+    }
+
+    disconnected = []
+    for client in connected_clients:
+        try:
+            await client.send_json(message)
+        except Exception:
+            disconnected.append(client)
+
+    for client in disconnected:
+        if client in connected_clients:
+            connected_clients.remove(client)
+
+
+# ----------------------------
+# Helpers
+# ----------------------------
+def now_ms() -> int:
+    return int(time.time() * 1000)
+
+
+def safe_strip_key(key: str) -> str:
+    return (key or "").strip().replace("\r", "").replace("\n", "")
+
+
+def split_mulaw_frames(mulaw_bytes: bytes) -> List[bytes]:
+    frames = []
+    for i in range(0, len(mulaw_bytes), BYTES_PER_20MS_MULAW):
+        chunk = mulaw_bytes[i:i + BYTES_PER_20MS_MULAW]
+        if len(chunk) < BYTES_PER_20MS_MULAW:
+            # pad with silence (mu-law silence is 0xFF)
+            chunk += b"\xFF" * (BYTES_PER_20MS_MULAW - len(chunk))
+        frames.append(chunk)
+    return frames
+
+
+def is_junk_utterance(text: str) -> bool:
+    t = (text or "").strip().lower()
+    if not t:
+        return True
+    if len(t) > MAX_UTTER_CHARS:
+        return False
+    words = [w for w in t.split() if w]
+    if len(words) < MIN_UTTER_WORDS and (t in SINGLE_WORD_IGNORE):
+        return True
+    if len(words) < MIN_UTTER_WORDS and len(t) < 4:
+        return True
+    return False
+
+
+def build_twiml(stream_url: str) -> str:
+    return f"""<?xml version="1.0" encoding="UTF-8"?>
+<Response>
+  <Connect>
+    <Stream url="{stream_url}" />
+  </Connect>
+  <Pause length="600"/>
+</Response>
+"""
+
+
+async def drain_queue(q: asyncio.Queue):
+    try:
+        while True:
+            q.get_nowait()
+            q.task_done()
+    except asyncio.QueueEmpty:
+        return
+
+
+# ----------------------------
+# Piper TTS -> mulaw 8k
+# ----------------------------
+def piper_tts_to_mulaw(text: str) -> bytes:
+    """
+    Generates 8k mulaw raw bytes suitable for Twilio Media Streams.
+    Pipeline: piper -> wav (sample rate depends on the voice) -> ffmpeg -> mulaw 8k raw
+    """
+    if not PIPER_MODEL_PATH:
+        raise RuntimeError("Set PIPER_MODEL_PATH to a valid .onnx voice model")
+    if not shutil_which(PIPER_BIN):
+        raise RuntimeError(f"piper binary not found: {PIPER_BIN} (set PIPER_BIN to full path)")
+
+    text = (text or "").strip()
+    if not text:
+        return b""
+
+    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as wavf:
+        wav_path = wavf.name
+    with tempfile.NamedTemporaryFile(suffix=".mulaw", delete=False) as mlf:
+        mulaw_path = mlf.name
+
+    try:
+        # piper writes a wav file; usage: piper --model <onnx> --output_file out.wav
+        # it reads text from stdin
+        subprocess.run(
+            [PIPER_BIN, "--model", PIPER_MODEL_PATH, "--output_file", wav_path],
+            input=text.encode("utf-8"),
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            check=True
+        )
+
+        # Convert wav -> raw mulaw 8k
+        subprocess.run(
+            [
+                "ffmpeg", "-y",
+                "-i", wav_path,
+                "-ac", "1",
+                "-ar", "8000",
+                "-f", "mulaw",
+                mulaw_path
+            ],
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            check=True
+        )
+
+        with open(mulaw_path, "rb") as f:
+            return f.read()
+
+    finally:
+        try:
+            os.unlink(wav_path)
+        except Exception:
+            pass
+        try:
+            os.unlink(mulaw_path)
+        except Exception:
+            pass
+
+
+def shutil_which(cmd: str) -> Optional[str]:
+    # tiny "which" helper (avoids an extra shutil import)
+    if os.path.isabs(cmd) and os.path.exists(cmd):
+        return cmd
+    for p in os.getenv("PATH", "").split(os.pathsep):
+        full = os.path.join(p, cmd)
+        if os.path.exists(full) and os.access(full, os.X_OK):
+            return full
+    return None
+
+
+# ----------------------------
+# OpenAI streaming
+# ----------------------------
+def openai_stream_tokens_blocking(messages: List[Dict], model: str, cancel_flag: "CancelFlag"):
+    """
+    Blocking generator for streaming tokens.
+    Stops yielding if cancel_flag is set.
+    """
+    key = safe_strip_key(OPENAI_API_KEY)
+    if not key:
+        raise RuntimeError("OPENAI_API_KEY is empty. Set it in your environment.")
+    client = OpenAI(api_key=key)
+
+    stream = client.chat.completions.create(
+        model=model,
+        messages=messages,
+        temperature=0.4,
+        stream=True,
+        max_tokens=180,
+    )
+
+    for event in stream:
+        if cancel_flag.is_set:
+            break
+        delta = event.choices[0].delta
+        if delta and delta.content:
+            yield delta.content
+
+
+class CancelFlag:
+    def __init__(self):
+        self.is_set = False
+
+    def set(self):
+        self.is_set = True
+
+
+# ----------------------------
+# Call state
+# ----------------------------
+@dataclass
+class CallState:
+    call_id: str
+    stream_sid: str = ""
+    # audio
+    last_audio_ms: int = field(default_factory=now_ms)
+    # partial tracking
+    last_partial: str = ""
+    last_partial_change_ms: int = field(default_factory=now_ms)
+    # recognizer
+    rec: Optional[KaldiRecognizer] = None
+    # outbound audio
+    outbound_q: asyncio.Queue = field(default_factory=lambda: asyncio.Queue(maxsize=4000))
+    # barge-in / cancellation
+    bot_speaking: bool = False
+    cancel_llm: CancelFlag = field(default_factory=CancelFlag)
+    cancel_tts: CancelFlag = field(default_factory=CancelFlag)
+    # conversation
+    history: List[Dict] = field(default_factory=list)
+    # tasks
+    outbound_task: Optional[asyncio.Task] = None
+    # lock so only one bot response runs at a time
+    bot_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
+
+    def reset_cancels(self):
+        self.cancel_llm = CancelFlag()
+        self.cancel_tts = CancelFlag()
+
+
+# cache the Vosk model once per process
+_VOSK_MODEL = None
+
+
+# ----------------------------
+# FastAPI endpoints
+# ----------------------------
+@app.get("/health")
+async def health():
+    return {"ok": True}
+
+
+@app.post("/voice")
+async def voice(request: Request):
+    if not TWILIO_STREAM_URL:
+        return PlainTextResponse("TWILIO_STREAM_URL is not set", status_code=500)
+    xml = build_twiml(TWILIO_STREAM_URL)
+    log.info("Returning TwiML:\n%s", xml)
+    return Response(content=xml, media_type="application/xml")
+
+
+@app.websocket("/client-ws")
+async def client_websocket(ws: WebSocket):
+    await ws.accept()
+    connected_clients.append(ws)
+    log.info("Frontend client connected. Total clients: %d", len(connected_clients))
+    try:
+        while True:
+            # Keep connection alive
+            await ws.receive_text()
+    except WebSocketDisconnect:
+        if ws in connected_clients:
+            connected_clients.remove(ws)
+        log.info("Frontend client disconnected. Total clients: %d", len(connected_clients))
+    except Exception as e:
+        log.error("Frontend client error: %s", e)
+        if ws in connected_clients:
+            connected_clients.remove(ws)
+
+
+@app.websocket("/stream")
+async def stream(ws: WebSocket):
+    await ws.accept()
+
+    call_id = str(id(ws))
+    st = CallState(call_id=call_id)
+    st.history = [{"role": "system", "content": SYSTEM_PROMPT}]
+
+    log.info("[%s] connection open", call_id)
+
+    # Load the Vosk model once per process and cache it globally;
+    # loading it per call would add noticeable start latency.
+    global _VOSK_MODEL
+    if _VOSK_MODEL is None:
+        log.info("Loading Vosk model: %s", VOSK_MODEL_PATH)
+        _VOSK_MODEL = Model(VOSK_MODEL_PATH)
+        log.info("Vosk model loaded.")
+    st.rec = KaldiRecognizer(_VOSK_MODEL, 16000)
+    st.rec.SetWords(False)
+
+    # Outbound sender task (pacing)
+    st.outbound_task = asyncio.create_task(outbound_sender(ws, st))
+
+    try:
+        while True:
+            raw = await ws.receive_text()
+            msg = json.loads(raw)
+
+            event = msg.get("event")
+            if event == "start":
+                st.stream_sid = msg["start"]["streamSid"]
+                enc = msg["start"].get("mediaFormat", {}).get("encoding") or msg["start"].get("mediaFormat", {}).get("codec")
+                sr = msg["start"].get("mediaFormat", {}).get("sampleRate")
+                log.info("[%s] start streamSid=%s encoding=%s sr=%s", call_id, st.stream_sid, enc, sr)
+
+                # greet once
+                await speak_text(ws, st, "Hi! How can I help you today?")
+
+            elif event == "media":
+                st.last_audio_ms = now_ms()
+                payload_b64 = msg["media"]["payload"]
+                mulaw = base64.b64decode(payload_b64)
+
+                # decode mulaw 8k -> PCM 16k 16-bit
+                pcm16 = audioop.ulaw2lin(mulaw, 2)  # 16-bit
+                pcm16_16k, _ = audioop.ratecv(pcm16, 2, 1, 8000, 16000, None)
+
+                # feed recognizer
+                if st.rec.AcceptWaveform(pcm16_16k):
+                    j = json.loads(st.rec.Result() or "{}")
+                    text = (j.get("text") or "").strip()
+                    if text:
+                        await on_utterance(ws, st, text, reason="vosk_final")
+                else:
+                    j = json.loads(st.rec.PartialResult() or "{}")
+                    partial = (j.get("partial") or "").strip()
+                    if partial:
+                        await on_partial(ws, st, partial)
+
+                # endpointing checks (silence/stability)
+                await maybe_endpoint(ws, st)
+
+            elif event == "stop":
+                log.info("[%s] stop", call_id)
+                break
+
+    except WebSocketDisconnect:
+        log.info("[%s] websocket disconnected", call_id)
+    except Exception as e:
+        log.exception("[%s] websocket error: %s", call_id, e)
+    finally:
+        # cancel outbound task
+        if st.outbound_task:
+            st.outbound_task.cancel()
+        log.info("[%s] connection closed", call_id)
+
+
+# ----------------------------
+# Endpointing + Barge-in
+# ----------------------------
+async def on_partial(ws: WebSocket, st: CallState, partial: str):
+    # barge-in trigger: user starts speaking while the bot is speaking;
+    # a partial length threshold avoids triggering on noise
+    words = partial.split()
+    if st.bot_speaking and len(words) >= 2:
+        log.info("[%s] BARGE-IN detected (partial=%r)", st.call_id, partial)
+        await barge_in(ws, st)
+
+    if partial != st.last_partial:
+        st.last_partial = partial
+        st.last_partial_change_ms = now_ms()
+        log.info("[%s] partial: %s", st.call_id, partial)
+
+
+async def maybe_endpoint(ws: WebSocket, st: CallState):
+    # stable-partial endpoint
+    if st.last_partial:
+        stable_ms = now_ms() - st.last_partial_change_ms
+        if stable_ms >= STABLE_PARTIAL_MS:
+            # commit partial as utterance
+            text = st.last_partial.strip()
+            st.last_partial = ""
+            if text and not is_junk_utterance(text):
+                await on_utterance(ws, st, text, reason=f"stable_partial_{stable_ms}ms")
+
+    # silence endpoint
+    silence_ms = now_ms() - st.last_audio_ms
+    if silence_ms >= SILENCE_MS and st.last_partial:
+        text = st.last_partial.strip()
+        st.last_partial = ""
+        if text and not is_junk_utterance(text):
+            await on_utterance(ws, st, text, reason=f"silence_{silence_ms}ms")
+
+
+async def on_utterance(ws: WebSocket, st: CallState, text: str, reason: str):
+    text = (text or "").strip()
+    if not text:
+        return
+    if is_junk_utterance(text):
+        log.info("[%s] ignore utterance=%r reason=%s", st.call_id, text, reason)
+        return
+
+    # print to terminal clearly
+    print("\n" + "=" * 70)
+    print(f"STT ({reason}): {text}")
+    print("LLM: ", end="", flush=True)
+
+    # broadcast to frontend
+    await broadcast_transcript("user", text)
+
+    # ensure only one bot response runs at a time
+    async with st.bot_lock:
+        st.reset_cancels()
+        await run_llm_stream_and_tts(ws, st, text)
+
+
+async def barge_in(ws: WebSocket, st: CallState):
+    # cancel ongoing LLM/TTS
+    st.cancel_llm.set()
+    st.cancel_tts.set()
+
+    # stop playback on Twilio side (important)
+    try:
+        await ws.send_text(json.dumps({"event": "clear", "streamSid": st.stream_sid}))
+    except Exception:
+        pass
+
+    # clear our outbound queue
+    await drain_queue(st.outbound_q)
+
+    st.bot_speaking = False
+
+
+# ----------------------------
+# LLM streaming -> chunk -> TTS -> queue -> paced playback
+# ----------------------------
+async def run_llm_stream_and_tts(ws: WebSocket, st: CallState, user_text: str):
+    # build short rolling history
+    st.history.append({"role": "user", "content": user_text})
+    st.history = st.history[:1] + st.history[-8:]  # keep system + last 8 msgs
+
+    loop = asyncio.get_running_loop()
+    token_q: asyncio.Queue = asyncio.Queue()
+
+    def worker():
+        try:
+            for tok in openai_stream_tokens_blocking(st.history, OPENAI_MODEL, st.cancel_llm):
+                if st.cancel_llm.is_set:
+                    break
+                asyncio.run_coroutine_threadsafe(token_q.put(tok), loop)
+        finally:
+            asyncio.run_coroutine_threadsafe(token_q.put(None), loop)
+
+    # start the blocking OpenAI stream in a worker thread; do NOT await the
+    # future here, or tokens would only be consumed after the stream finished
+    loop.run_in_executor(None, worker)
+
+    # read tokens and chunk them
+    buf = ""
+    full = ""
+
+    while True:
+        tok = await token_q.get()
+        if tok is None:
+            break
+        if st.cancel_llm.is_set:
+            break
+
+        full += tok
+        buf += tok
+
+        # print as it streams
+        print(tok, end="", flush=True)
+
+        # chunk rule: punctuation OR length
+        if CHUNK_END_RE.search(buf) or len(buf) >= CHUNK_MAX_CHARS:
+            chunk = buf.strip()
+            buf = ""
+            if chunk:
+                await tts_enqueue(ws, st, chunk)
+
+    # flush remaining
+    rem = buf.strip()
+    if rem and not st.cancel_llm.is_set:
+        await tts_enqueue(ws, st, rem)
+
+    # store assistant message for context (only if not cancelled)
+    if full.strip() and not st.cancel_llm.is_set:
+        st.history.append({"role": "assistant", "content": full.strip()})
+        # broadcast to frontend
+        await broadcast_transcript("assistant", full.strip())
+
+
+async def tts_enqueue(ws: WebSocket, st: CallState, text: str):
+    if st.cancel_tts.is_set:
+        return
+
+    st.bot_speaking = True
+    log.info("[%s] TTS start (chars=%d)", st.call_id, len(text))
+
+    # run piper+ffmpeg in executor (blocking)
+    loop = asyncio.get_running_loop()
+    mulaw_bytes = await loop.run_in_executor(None, piper_tts_to_mulaw, text)
+
+    if st.cancel_tts.is_set:
+        return
+
+    frames = split_mulaw_frames(mulaw_bytes)
+    log.info("[%s] TTS ready (frames=%d)", st.call_id, len(frames))
+
+    # enqueue frames
+    for fr in frames:
+        if st.cancel_tts.is_set:
+            break
+        b64 = base64.b64encode(fr).decode("ascii")
+        await st.outbound_q.put(b64)
+
+    # marker: end of this chunk
+    await st.outbound_q.put("__END_CHUNK__")
+
+
+async def speak_text(ws: WebSocket, st: CallState, text: str):
+    # used for initial greeting
+    await barge_in(ws, st)  # clear any previous audio
+    await tts_enqueue(ws, st, text)
+
+
+async def outbound_sender(ws: WebSocket, st: CallState):
+    """
+    Sends queued audio to Twilio at real-time pace (20ms per frame).
+    Also turns off bot_speaking after a chunk ends and the queue drains.
+    """
+    sent_last_sec = 0
+    sec_tick = time.time()
+
+    try:
+        while True:
+            item = await st.outbound_q.get()
+
+            if item == "__END_CHUNK__":
+                # if the queue is empty after a short moment -> bot not speaking
+                await asyncio.sleep(0.02)
+                if st.outbound_q.empty():
+                    st.bot_speaking = False
+                st.outbound_q.task_done()
+                continue
+
+            # Twilio media message
+            msg = {
+                "event": "media",
+                "streamSid": st.stream_sid,
+                "media": {"payload": item},
+            }
+            await ws.send_text(json.dumps(msg))
+            st.outbound_q.task_done()
+
+            # pacing
+            await asyncio.sleep(FRAME_MS / 1000.0)
+
+            # stats
+            sent_last_sec += 1
+            if time.time() - sec_tick >= 1.0:
+                log.info("[%s] outbound media messages sent last 1s: %d", st.call_id, sent_last_sec)
+                sent_last_sec = 0
+                sec_tick = time.time()
+
+    except asyncio.CancelledError:
+        return
+    except Exception as e:
+        log.exception("[%s] outbound sender error: %s", st.call_id, e)
+
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host=HOST, port=PORT)
web_demo/.gitignore ADDED
@@ -0,0 +1,24 @@
+# Logs
+logs
+*.log
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+pnpm-debug.log*
+lerna-debug.log*
+
+node_modules
+dist
+dist-ssr
+*.local
+.env
+# Editor directories and files
+.vscode/*
+!.vscode/extensions.json
+.idea
+.DS_Store
+*.suo
+*.ntvs*
+*.njsproj
+*.sln
+*.sw?
web_demo/README.md ADDED
@@ -0,0 +1,16 @@
+# React + Vite
+
+This template provides a minimal setup to get React working in Vite with HMR and some ESLint rules.
+
+Currently, two official plugins are available:
+
+- [@vitejs/plugin-react](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react) uses [Babel](https://babeljs.io/) (or [oxc](https://oxc.rs) when used in [rolldown-vite](https://vite.dev/guide/rolldown)) for Fast Refresh
+- [@vitejs/plugin-react-swc](https://github.com/vitejs/vite-plugin-react/blob/main/packages/plugin-react-swc) uses [SWC](https://swc.rs/) for Fast Refresh
+
+## React Compiler
+
+The React Compiler is not enabled on this template because of its impact on dev & build performance. To add it, see [this documentation](https://react.dev/learn/react-compiler/installation).
+
+## Expanding the ESLint configuration
+
+If you are developing a production application, we recommend using TypeScript with type-aware lint rules enabled. Check out the [TS template](https://github.com/vitejs/vite/tree/main/packages/create-vite/template-react-ts) for information on how to integrate TypeScript and [`typescript-eslint`](https://typescript-eslint.io) in your project.
web_demo/envdatavars.txt ADDED
@@ -0,0 +1 @@
+VITE_OPENAI_API_KEY=
web_demo/eslint.config.js ADDED
@@ -0,0 +1,29 @@
+import js from '@eslint/js'
+import globals from 'globals'
+import reactHooks from 'eslint-plugin-react-hooks'
+import reactRefresh from 'eslint-plugin-react-refresh'
+import { defineConfig, globalIgnores } from 'eslint/config'
+
+export default defineConfig([
+  globalIgnores(['dist']),
+  {
+    files: ['**/*.{js,jsx}'],
+    extends: [
+      js.configs.recommended,
+      reactHooks.configs.flat.recommended,
+      reactRefresh.configs.vite,
+    ],
+    languageOptions: {
+      ecmaVersion: 2020,
+      globals: globals.browser,
+      parserOptions: {
+        ecmaVersion: 'latest',
+        ecmaFeatures: { jsx: true },
+        sourceType: 'module',
+      },
+    },
+    rules: {
+      'no-unused-vars': ['error', { varsIgnorePattern: '^[A-Z_]' }],
+    },
+  },
+])
web_demo/index.html ADDED
@@ -0,0 +1,13 @@
+<!doctype html>
+<html lang="en">
+  <head>
+    <meta charset="UTF-8" />
+    <link rel="icon" type="image/svg+xml" href="/vite.svg" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <title>web_demo</title>
+  </head>
+  <body>
+    <div id="root"></div>
+    <script type="module" src="/src/main.jsx"></script>
+  </body>
+</html>
web_demo/package-lock.json ADDED
The diff for this file is too large to render.
web_demo/package.json ADDED
@@ -0,0 +1,28 @@
+{
+  "name": "web_demo",
+  "private": true,
+  "version": "0.0.0",
+  "type": "module",
+  "scripts": {
+    "dev": "vite",
+    "build": "vite build",
+    "lint": "eslint .",
+    "preview": "vite preview"
+  },
+  "dependencies": {
+    "lucide-react": "^0.562.0",
+    "react": "^19.2.0",
+    "react-dom": "^19.2.0"
+  },
+  "devDependencies": {
+    "@eslint/js": "^9.39.1",
+    "@types/react": "^19.2.5",
+    "@types/react-dom": "^19.2.3",
+    "@vitejs/plugin-react": "^5.1.1",
+    "eslint": "^9.39.1",
+    "eslint-plugin-react-hooks": "^7.0.1",
+    "eslint-plugin-react-refresh": "^0.4.24",
+    "globals": "^16.5.0",
+    "vite": "^7.2.4"
+  }
+}
web_demo/public/vite.svg ADDED
web_demo/src/App.css ADDED
@@ -0,0 +1,211 @@
+ :root {
+ --bg-color: #0f172a;
+ --card-bg: #1e293b;
+ --text-primary: #f8fafc;
+ --text-secondary: #94a3b8;
+ --accent-color: #3b82f6;
+ --user-bubble: #334155;
+ --assistant-bubble: #2563eb;
+ --error-color: #ef4444;
+ --success-color: #22c55e;
+ }
+
+ * {
+ box-sizing: border-box;
+ margin: 0;
+ padding: 0;
+ }
+
+ body {
+ font-family: 'Inter', system-ui, -apple-system, sans-serif;
+ background-color: var(--bg-color);
+ color: var(--text-primary);
+ height: 100vh;
+ overflow: hidden;
+ }
+
+ .app-container {
+ display: flex;
+ flex-direction: column;
+ height: 100vh;
+ width: 100%;
+ max-width: 100vw;
+ margin: 0;
+ padding: 1.5rem;
+ }
+
+ .header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ padding: 1rem;
+ background-color: var(--card-bg);
+ border-radius: 12px;
+ margin-bottom: 1rem;
+ box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1);
+ }
+
+ /* Tabs */
+ .tabs {
+ display: flex;
+ gap: 0.5rem;
+ margin-bottom: 1rem;
+ background-color: var(--card-bg);
+ padding: 0.5rem;
+ border-radius: 12px;
+ box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1);
+ }
+
+ .tab {
+ flex: 1;
+ display: flex;
+ align-items: center;
+ justify-content: center;
+ gap: 0.5rem;
+ padding: 0.75rem 1rem;
+ background: transparent;
+ border: none;
+ border-radius: 8px;
+ color: var(--text-secondary);
+ font-size: 0.875rem;
+ font-weight: 500;
+ cursor: pointer;
+ transition: all 0.2s ease;
+ font-family: inherit;
+ }
+
+ .tab:hover {
+ background-color: rgba(255, 255, 255, 0.05);
+ color: var(--text-primary);
+ }
+
+ .tab.active {
+ background-color: var(--accent-color);
+ color: white;
+ box-shadow: 0 2px 8px rgba(59, 130, 246, 0.3);
+ }
+
+ .tab svg {
+ flex-shrink: 0;
+ }
+
+ .logo {
+ display: flex;
+ align-items: center;
+ gap: 0.75rem;
+ }
+
+ .icon-logo {
+ color: var(--accent-color);
+ }
+
+ h1 {
+ font-size: 1.25rem;
+ font-weight: 600;
+ }
+
+ .status-badge {
+ display: flex;
+ align-items: center;
+ gap: 0.5rem;
+ padding: 0.5rem 1rem;
+ border-radius: 9999px;
+ font-size: 0.875rem;
+ background-color: rgba(255, 255, 255, 0.05);
+ }
+
+ .status-badge.connected {
+ color: var(--success-color);
+ background-color: rgba(34, 197, 94, 0.1);
+ }
+
+ .status-badge.disconnected {
+ color: var(--error-color);
+ background-color: rgba(239, 68, 68, 0.1);
+ }
+
+ .main-content {
+ flex: 1;
+ background-color: var(--card-bg);
+ border-radius: 12px;
+ padding: 1rem;
+ overflow-y: auto;
+ position: relative;
+ box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1);
+ }
+
+ .transcript-container {
+ display: flex;
+ flex-direction: column;
+ gap: 1rem;
+ }
+
+ .empty-state {
+ display: flex;
+ flex-direction: column;
+ align-items: center;
+ justify-content: center;
+ height: 100%;
+ color: var(--text-secondary);
+ gap: 1rem;
+ margin-top: 4rem;
+ }
+
+ .transcript-item {
+ display: flex;
+ flex-direction: column;
+ gap: 0.25rem;
+ max-width: 80%;
+ }
+
+ .transcript-item.user {
+ align-self: flex-end;
+ align-items: flex-end;
+ }
+
+ .transcript-item.assistant {
+ align-self: flex-start;
+ align-items: flex-start;
+ }
+
+ .message-header {
+ display: flex;
+ gap: 0.5rem;
+ font-size: 0.75rem;
+ color: var(--text-secondary);
+ }
+
+ .message-bubble {
+ padding: 0.75rem 1rem;
+ border-radius: 12px;
+ line-height: 1.5;
+ box-shadow: 0 1px 2px rgba(0, 0, 0, 0.1);
+ }
+
+ .transcript-item.user .message-bubble {
+ background-color: var(--user-bubble);
+ border-bottom-right-radius: 2px;
+ }
+
+ .transcript-item.assistant .message-bubble {
+ background-color: var(--assistant-bubble);
+ border-bottom-left-radius: 2px;
+ }
+
+ /* Scrollbar styling */
+ ::-webkit-scrollbar {
+ width: 8px;
+ }
+
+ ::-webkit-scrollbar-track {
+ background: transparent;
+ }
+
+ ::-webkit-scrollbar-thumb {
+ background: #475569;
+ border-radius: 4px;
+ }
+
+ ::-webkit-scrollbar-thumb:hover {
+ background: #64748b;
+ }
@@ -0,0 +1,162 @@
+ import React, { useState, useEffect, useRef, useCallback } from 'react';
+ import { Activity, Wifi, WifiOff, Terminal, MessageSquare, Mic, Volume2, Settings } from 'lucide-react';
+ import './App.css';
+ import MicrophoneTest from './components/MicrophoneTest.jsx';
+ import TextToSpeech from './components/TextToSpeech.jsx';
+ import SttLlmTts from './components/SttLlmTts.jsx';
+
+ function App() {
+ const [activeTab, setActiveTab] = useState('transcript');
+ const [isConnected, setIsConnected] = useState(false);
+ const [transcripts, setTranscripts] = useState([]);
+ const [status, setStatus] = useState('Disconnected');
+ const wsRef = useRef(null);
+ const transcriptsEndRef = useRef(null);
+
+ // Scroll to bottom of transcripts
+ useEffect(() => {
+ transcriptsEndRef.current?.scrollIntoView({ behavior: 'smooth' });
+ }, [transcripts]);
+
+ // Connect to WebSocket
+ const connectWebSocket = useCallback(() => {
+ setStatus('Connecting...');
+ // Connect to the backend's frontend-client endpoint
+ const wsUrl = `ws://localhost:8080/client-ws`;
+ const ws = new WebSocket(wsUrl);
+
+ ws.onopen = () => {
+ setIsConnected(true);
+ setStatus('Connected (Waiting for call data)');
+ };
+
+ ws.onclose = () => {
+ setIsConnected(false);
+ setStatus('Disconnected');
+ // Try reconnecting after 3 seconds
+ setTimeout(connectWebSocket, 3000);
+ };
+
+ ws.onerror = (error) => {
+ console.error('WebSocket error:', error);
+ setStatus('Error connecting');
+ };
+
+ ws.onmessage = (event) => {
+ try {
+ const data = JSON.parse(event.data);
+ handleServerMessage(data);
+ } catch (e) {
+ console.error('Error parsing message:', e);
+ }
+ };
+
+ wsRef.current = ws;
+ }, []);
+
+ useEffect(() => {
+ connectWebSocket();
+ return () => {
+ if (wsRef.current) {
+ wsRef.current.close();
+ }
+ };
+ }, [connectWebSocket]);
+
+ const handleServerMessage = (data) => {
+ if (data.type === 'transcript') {
+ setTranscripts(prev => [...prev, {
+ id: Date.now(),
+ role: data.role,
+ text: data.text,
+ timestamp: new Date(data.timestamp).toLocaleTimeString()
+ }]);
+ }
+ };
+
+ return (
+ <div className="app-container">
+ <header className="header">
+ <div className="logo">
+ <Activity className="icon-logo" />
+ <h1>NeuralVoice AI</h1>
+ </div>
+ <div className={`status-badge ${isConnected ? 'connected' : 'disconnected'}`}>
+ {isConnected ? <Wifi size={16} /> : <WifiOff size={16} />}
+ <span>{status}</span>
+ </div>
+ </header>
+
+ <div className="tabs">
+ <button
+ className={`tab ${activeTab === 'transcript' ? 'active' : ''}`}
+ onClick={() => setActiveTab('transcript')}
+ >
+ <MessageSquare size={18} />
+ <span>Live Call Transcript</span>
+ </button>
+ <button
+ className={`tab ${activeTab === 'microphone' ? 'active' : ''}`}
+ onClick={() => setActiveTab('microphone')}
+ >
+ <Mic size={18} />
+ <span>Microphone Test (STT)</span>
+ </button>
+ <button
+ className={`tab ${activeTab === 'tts' ? 'active' : ''}`}
+ onClick={() => setActiveTab('tts')}
+ >
+ <Volume2 size={18} />
+ <span>Text-to-Speech (TTS)</span>
+ </button>
+ <button
+ className={`tab ${activeTab === 'stt-llm-tts' ? 'active' : ''}`}
+ onClick={() => setActiveTab('stt-llm-tts')}
+ >
+ <Settings size={18} />
+ <span>STT-LLM-TTS</span>
+ </button>
+ </div>
+
+ <main className="main-content">
+ {activeTab === 'transcript' && (
+ <div className="transcript-container">
+ {transcripts.length === 0 && (
+ <div className="empty-state">
+ <Terminal size={48} />
+ <p>Waiting for call activity...</p>
+ </div>
+ )}
+
+ {transcripts.map((t) => (
+ <div key={t.id} className={`transcript-item ${t.role}`}>
+ <div className="message-header">
+ <span className="role">{t.role === 'user' ? 'Caller' : 'AI Assistant'}</span>
+ <span className="timestamp">{t.timestamp}</span>
+ </div>
+ <div className="message-bubble">
+ <p className="text">{t.text}</p>
+ </div>
+ </div>
+ ))}
+ <div ref={transcriptsEndRef} />
+ </div>
+ )}
+
+ {activeTab === 'microphone' && (
+ <MicrophoneTest />
+ )}
+
+ {activeTab === 'tts' && (
+ <TextToSpeech />
+ )}
+
+ {activeTab === 'stt-llm-tts' && (
+ <SttLlmTts />
+ )}
+ </main>
+ </div>
+ );
+ }
+
+ export default App;
web_demo/src/assets/react.svg ADDED
web_demo/src/components/MicrophoneTest.css ADDED
@@ -0,0 +1,315 @@
+ .microphone-test {
+ display: flex;
+ flex-direction: column;
+ gap: 1.5rem;
+ padding: 0;
+ height: 100%;
+ width: 100%;
+ }
+
+ .mic-header {
+ text-align: center;
+ }
+
+ .mic-header h2 {
+ font-size: 1.5rem;
+ font-weight: 600;
+ margin-bottom: 0.5rem;
+ color: var(--text-primary);
+ }
+
+ .language-selector-stt {
+ display: flex;
+ flex-direction: column;
+ align-items: center;
+ gap: 0.5rem;
+ background: rgba(255, 255, 255, 0.03);
+ padding: 1rem;
+ border-radius: 8px;
+ margin-bottom: 1rem;
+ }
+
+ .language-selector-stt label {
+ font-size: 0.875rem;
+ color: var(--text-secondary);
+ font-weight: 500;
+ }
+
+ .tiny-info {
+ font-size: 0.75rem;
+ color: var(--error-color);
+ margin-top: 0.25rem;
+ }
+
+ .subtitle {
+ color: var(--text-secondary);
+ font-size: 0.875rem;
+ }
+
+ .error-message {
+ background-color: rgba(239, 68, 68, 0.1);
+ border: 1px solid var(--error-color);
+ color: var(--error-color);
+ padding: 0.75rem 1rem;
+ border-radius: 8px;
+ font-size: 0.875rem;
+ }
+
+ /* Audio Visualizer */
+ .audio-visualizer {
+ background: rgba(255, 255, 255, 0.03);
+ border-radius: 12px;
+ padding: 1.5rem;
+ display: flex;
+ flex-direction: column;
+ gap: 1rem;
+ }
+
+ .visualizer-container {
+ display: flex;
+ align-items: center;
+ justify-content: center;
+ gap: 4px;
+ height: 80px;
+ background: rgba(0, 0, 0, 0.2);
+ border-radius: 8px;
+ padding: 1rem;
+ }
+
+ .visualizer-bar {
+ width: 6px;
+ min-height: 5px;
+ background: linear-gradient(to top, var(--accent-color), #60a5fa);
+ border-radius: 3px;
+ transition: height 0.1s ease;
+ }
+
+ .audio-level-indicator {
+ display: flex;
+ align-items: center;
+ gap: 0.75rem;
+ color: var(--text-secondary);
+ }
+
+ .level-bar {
+ flex: 1;
+ height: 8px;
+ background: rgba(255, 255, 255, 0.1);
+ border-radius: 4px;
+ overflow: hidden;
+ }
+
+ .level-fill {
+ height: 100%;
+ background: linear-gradient(to right, var(--success-color), #4ade80);
+ transition: width 0.1s ease;
+ border-radius: 4px;
+ }
+
+ /* Controls */
+ .controls {
+ display: flex;
+ gap: 0.75rem;
+ justify-content: center;
+ flex-wrap: wrap;
+ }
+
+ .btn {
+ display: flex;
+ align-items: center;
+ gap: 0.5rem;
+ padding: 0.75rem 1.5rem;
+ border: none;
+ border-radius: 8px;
+ font-size: 0.875rem;
+ font-weight: 500;
+ cursor: pointer;
+ transition: all 0.2s ease;
+ font-family: inherit;
+ }
+
+ .btn:hover {
+ transform: translateY(-2px);
+ box-shadow: 0 4px 12px rgba(0, 0, 0, 0.2);
+ }
+
+ .btn:active {
+ transform: translateY(0);
+ }
+
+ .btn-primary {
+ background: var(--accent-color);
+ color: white;
+ }
+
+ .btn-primary:hover {
+ background: #2563eb;
+ }
+
+ .btn-success {
+ background: var(--success-color);
+ color: white;
+ }
+
+ .btn-success:hover {
+ background: #16a34a;
+ }
+
+ .btn-warning {
+ background: #f59e0b;
+ color: white;
+ }
+
+ .btn-warning:hover {
+ background: #d97706;
+ }
+
+ .btn-danger {
+ background: var(--error-color);
+ color: white;
+ }
+
+ .btn-danger:hover {
+ background: #dc2626;
+ }
+
+ .btn-secondary {
+ background: rgba(255, 255, 255, 0.1);
+ color: var(--text-primary);
+ }
+
+ .btn-secondary:hover {
+ background: rgba(255, 255, 255, 0.15);
+ }
+
+ /* Transcript Box */
+ .transcript-box {
+ flex: 1;
+ background: rgba(255, 255, 255, 0.03);
+ border-radius: 12px;
+ padding: 1.5rem;
+ display: flex;
+ flex-direction: column;
+ min-height: 200px;
+ }
+
+ .transcript-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ margin-bottom: 1rem;
+ padding-bottom: 0.75rem;
+ border-bottom: 1px solid rgba(255, 255, 255, 0.1);
+ }
+
+ .transcript-header h3 {
+ font-size: 1.125rem;
+ font-weight: 600;
+ color: var(--text-primary);
+ }
+
+ .recording-indicator {
+ display: flex;
+ align-items: center;
+ gap: 0.5rem;
+ font-size: 0.875rem;
+ color: var(--error-color);
+ font-weight: 500;
+ }
+
+ .pulse-dot {
+ width: 8px;
+ height: 8px;
+ background: var(--error-color);
+ border-radius: 50%;
+ animation: pulse 1.5s ease-in-out infinite;
+ }
+
+ @keyframes pulse {
+
+ 0%,
+ 100% {
+ opacity: 1;
+ transform: scale(1);
+ }
+
+ 50% {
+ opacity: 0.5;
+ transform: scale(1.2);
+ }
+ }
+
+ .transcript-content {
+ flex: 1;
+ overflow-y: auto;
+ line-height: 1.6;
+ }
+
+ .placeholder {
+ color: var(--text-secondary);
+ font-style: italic;
+ text-align: center;
+ margin-top: 2rem;
+ }
+
+ .final-transcript {
+ color: var(--text-primary);
+ margin-bottom: 0.5rem;
+ }
+
+ .interim-transcript {
+ color: var(--text-secondary);
+ font-style: italic;
+ }
+
+ /* Info Box */
+ .info-box {
+ background: rgba(59, 130, 246, 0.1);
+ border: 1px solid rgba(59, 130, 246, 0.3);
+ border-radius: 8px;
+ padding: 1rem;
+ }
+
+ .info-box h4 {
+ color: var(--accent-color);
+ font-size: 0.875rem;
+ font-weight: 600;
+ margin-bottom: 0.5rem;
+ }
+
+ .info-box ul {
+ list-style: none;
+ padding: 0;
+ margin: 0;
+ }
+
+ .info-box li {
+ color: var(--text-secondary);
+ font-size: 0.8125rem;
+ padding: 0.25rem 0;
+ padding-left: 1.25rem;
+ position: relative;
+ }
+
+ .info-box li::before {
+ content: "•";
+ position: absolute;
+ left: 0.5rem;
+ color: var(--accent-color);
+ }
+
+ /* Responsive */
+ @media (max-width: 640px) {
+ .controls {
+ flex-direction: column;
+ }
+
+ .btn {
+ width: 100%;
+ justify-content: center;
+ }
+
+ .visualizer-container {
+ height: 60px;
+ }
+ }
web_demo/src/components/MicrophoneTest.jsx ADDED
@@ -0,0 +1,307 @@
+ import React, { useState, useRef, useEffect } from 'react';
+ import { Mic, MicOff, Play, Pause, Trash2, Volume2 } from 'lucide-react';
+ import './MicrophoneTest.css';
+
+ function MicrophoneTest() {
+ const [isRecording, setIsRecording] = useState(false);
+ const [isPaused, setIsPaused] = useState(false);
+ const [transcript, setTranscript] = useState('');
+ const [interimTranscript, setInterimTranscript] = useState('');
+ const [audioLevel, setAudioLevel] = useState(0);
+ const [error, setError] = useState('');
+ const [selectedLang, setSelectedLang] = useState('en-IN'); // Default to Indian English
+
+ const recognitionRef = useRef(null);
+ const audioContextRef = useRef(null);
+ const analyserRef = useRef(null);
+ const microphoneRef = useRef(null);
+ const animationFrameRef = useRef(null);
+
+ const languages = [
+ { code: 'en-IN', name: 'English (India)' },
+ { code: 'en-US', name: 'English (US)' },
+ { code: 'en-GB', name: 'English (UK)' },
+ { code: 'hi-IN', name: 'Hindi (India)' },
+ ];
+
+ // Initialize Web Speech API
+ useEffect(() => {
+ if ('webkitSpeechRecognition' in window || 'SpeechRecognition' in window) {
+ const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
+ recognitionRef.current = new SpeechRecognition();
+ recognitionRef.current.continuous = true;
+ recognitionRef.current.interimResults = true;
+ recognitionRef.current.lang = selectedLang;
+
+ recognitionRef.current.onresult = (event) => {
+ let interim = '';
+ let final = '';
+
+ for (let i = event.resultIndex; i < event.results.length; i++) {
+ const transcriptPiece = event.results[i][0].transcript;
+ if (event.results[i].isFinal) {
+ final += transcriptPiece + ' ';
+ } else {
+ interim += transcriptPiece;
+ }
+ }
+
+ if (final) {
+ setTranscript(prev => prev + final);
+ setInterimTranscript('');
+ } else {
+ setInterimTranscript(interim);
+ }
+ };
+
+ recognitionRef.current.onerror = (event) => {
+ console.error('Speech recognition error:', event.error);
+ if (event.error === 'no-speech') {
+ // Ignore no-speech errors to prevent UI flicker
+ return;
+ }
+ setError(`Error: ${event.error}`);
+ };
+
+ recognitionRef.current.onend = () => {
+ if (isRecording && !isPaused) {
+ try {
+ recognitionRef.current.start();
+ } catch (e) {
+ console.error("Failed to restart recognition:", e);
+ }
+ }
+ };
+ } else {
+ setError('Speech recognition is not supported in this browser. Please use Chrome or Edge.');
+ }
+
+ return () => {
+ if (recognitionRef.current) {
+ recognitionRef.current.stop();
+ }
+ stopAudioVisualization();
+ };
+ }, [selectedLang]); // Re-initialize when language changes
+
+ // Audio visualization
+ const startAudioVisualization = async () => {
+ try {
+ const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
+ audioContextRef.current = new (window.AudioContext || window.webkitAudioContext)();
+ analyserRef.current = audioContextRef.current.createAnalyser();
+ microphoneRef.current = audioContextRef.current.createMediaStreamSource(stream);
+
+ analyserRef.current.fftSize = 256;
+ microphoneRef.current.connect(analyserRef.current);
+
+ const dataArray = new Uint8Array(analyserRef.current.frequencyBinCount);
+
+ const updateLevel = () => {
+ if (!analyserRef.current) return;
+ analyserRef.current.getByteFrequencyData(dataArray);
+ const average = dataArray.reduce((a, b) => a + b) / dataArray.length;
+ setAudioLevel(average);
+ animationFrameRef.current = requestAnimationFrame(updateLevel);
+ };
+
+ updateLevel();
+ } catch (err) {
+ console.error('Error accessing microphone:', err);
+ setError('Could not access microphone. Please check permissions.');
+ }
+ };
+
+ const stopAudioVisualization = () => {
+ if (animationFrameRef.current) {
+ cancelAnimationFrame(animationFrameRef.current);
+ }
+ if (microphoneRef.current && microphoneRef.current.mediaStream) {
+ microphoneRef.current.mediaStream.getTracks().forEach(track => track.stop());
+ }
+ if (audioContextRef.current) {
+ audioContextRef.current.close();
+ audioContextRef.current = null;
+ analyserRef.current = null;
+ }
+ };
+
+ const startRecording = () => {
+ setError('');
+ setIsRecording(true);
+ setIsPaused(false);
+
+ if (recognitionRef.current) {
+ try {
+ recognitionRef.current.start();
+ } catch (e) {
+ console.error("Recognition already started:", e);
+ }
+ }
+ startAudioVisualization();
+ };
+
+ const pauseRecording = () => {
+ setIsPaused(true);
+ if (recognitionRef.current) {
+ recognitionRef.current.stop();
+ }
+ };
+
+ const resumeRecording = () => {
+ setIsPaused(false);
+ if (recognitionRef.current) {
+ try {
+ recognitionRef.current.start();
+ } catch (e) {
+ console.error("Failed to resume recognition:", e);
+ }
+ }
+ };
+
+ const stopRecording = () => {
+ setIsRecording(false);
+ setIsPaused(false);
+
+ if (recognitionRef.current) {
+ recognitionRef.current.stop();
+ }
+ stopAudioVisualization();
+ setAudioLevel(0);
+ };
+
+ const clearTranscript = () => {
+ setTranscript('');
+ setInterimTranscript('');
+ setError('');
+ };
+
+ return (
+ <div className="microphone-test">
+ <div className="mic-header">
+ <h2>Speech-to-Text Test</h2>
+ <p className="subtitle">Test your microphone with Indian English support</p>
+ </div>
+
+ <div className="language-selector-stt">
+ <label htmlFor="stt-lang">Recognition Language: </label>
+ <select
+ id="stt-lang"
+ value={selectedLang}
+ onChange={(e) => {
+ const wasRecording = isRecording;
+ if (wasRecording) stopRecording();
+ setSelectedLang(e.target.value);
+ }}
+ disabled={isRecording}
+ className="voice-select"
+ >
+ {languages.map(lang => (
+ <option key={lang.code} value={lang.code}>{lang.name}</option>
+ ))}
+ </select>
+ {isRecording && <p className="tiny-info">Stop recording to change language</p>}
+ </div>
+
+ {error && (
+ <div className="error-message">
+ <span>⚠️ {error}</span>
+ </div>
+ )}
+
+ <div className="audio-visualizer">
+ <div className="visualizer-container">
+ {[...Array(20)].map((_, i) => (
+ <div
+ key={i}
+ className="visualizer-bar"
+ style={{
+ height: `${isRecording && !isPaused ? Math.random() * audioLevel * 2 : 5}px`,
+ animationDelay: `${i * 0.05}s`
+ }}
+ />
+ ))}
+ </div>
+ <div className="audio-level-indicator">
+ <Volume2 size={20} />
+ <div className="level-bar">
+ <div
+ className="level-fill"
+ style={{ width: `${Math.min(audioLevel, 100)}%` }}
+ />
+ </div>
+ <span>{Math.round(audioLevel)}%</span>
+ </div>
+ </div>
+
+ <div className="controls">
+ {!isRecording ? (
+ <button className="btn btn-primary" onClick={startRecording}>
+ <Mic size={20} />
+ <span>Start Recording</span>
+ </button>
+ ) : (
+ <>
+ {!isPaused ? (
+ <button className="btn btn-warning" onClick={pauseRecording}>
+ <Pause size={20} />
+ <span>Pause</span>
+ </button>
+ ) : (
+ <button className="btn btn-success" onClick={resumeRecording}>
+ <Play size={20} />
+ <span>Resume</span>
+ </button>
+ )}
+ <button className="btn btn-danger" onClick={stopRecording}>
+ <MicOff size={20} />
+ <span>Stop</span>
+ </button>
+ </>
+ )}
+ {transcript && (
+ <button className="btn btn-secondary" onClick={clearTranscript}>
+ <Trash2 size={20} />
+ <span>Clear</span>
+ </button>
+ )}
+ </div>
+
+ <div className="transcript-box">
+ <div className="transcript-header">
+ <h3>Transcript ({languages.find(l => l.code === selectedLang)?.name})</h3>
+ {isRecording && (
+ <span className="recording-indicator">
+ <span className="pulse-dot"></span>
+ {isPaused ? 'Paused' : 'Recording...'}
+ </span>
+ )}
+ </div>
+ <div className="transcript-content">
+ {!transcript && !interimTranscript ? (
+ <p className="placeholder">Your transcription will appear here...</p>
+ ) : (
+ <>
+ <p className="final-transcript">{transcript}</p>
+ {interimTranscript && (
+ <p className="interim-transcript">{interimTranscript}</p>
+ )}
+ </>
+ )}
+ </div>
+ </div>
+
+ <div className="info-box">
+ <h4>💡 Tips:</h4>
+ <ul>
+ <li>Selecting <b>English (India)</b> will significantly improve recognition for Indian accents.</li>
+ <li>You can even try <b>Hindi</b> if you want to test multilingual support!</li>
+ <li>Make sure your browser has microphone permissions enabled</li>
+ <li>Works best in Chrome, Edge, or Safari</li>
+ </ul>
+ </div>
+ </div>
+ );
+ }
+
+ export default MicrophoneTest;
web_demo/src/components/SttLlmTts.css ADDED
@@ -0,0 +1,653 @@
1
+ .stt-llm-tts-test {
2
+ display: flex;
3
+ flex-direction: column;
4
+ gap: 2rem;
5
+ padding: 1rem 0;
6
+ height: calc(100vh - 180px);
7
+ /* Fill the vertical space */
8
+ width: 100%;
9
+ margin: 0;
10
+ }
11
+
12
+ .test-header {
13
+ display: flex;
14
+ justify-content: space-between;
15
+ align-items: center;
16
+ background: var(--card-bg);
17
+ padding: 1.25rem;
18
+ border-radius: 12px;
19
+ box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1);
20
+ }
21
+
22
+ .title-group h2 {
23
+ font-size: 1.25rem;
24
+ font-weight: 600;
25
+ margin-bottom: 0.25rem;
26
+ }
27
+
28
+ .title-group .subtitle {
29
+ font-size: 0.875rem;
30
+ color: var(--text-secondary);
31
+ }
32
+
33
+ .action-buttons {
34
+ display: flex;
35
+ gap: 0.75rem;
36
+ align-items: center;
37
+ }
38
+
39
+ .settings-btn {
40
+ display: flex;
41
+ align-items: center;
42
+ justify-content: center;
43
+ width: 42px;
44
+ height: 42px;
45
+ border: 1px solid rgba(255, 255, 255, 0.1);
46
+ background: rgba(255, 255, 255, 0.05);
47
+ color: var(--text-secondary);
48
+ border-radius: 8px;
49
+ cursor: pointer;
50
+ transition: all 0.2s ease;
51
+ }
52
+
53
+ .settings-btn:hover {
54
+ background: rgba(59, 130, 246, 0.1);
55
+ color: var(--accent-color);
56
+ border-color: var(--accent-color);
57
+ }
58
+
59
+ .record-toggle {
60
+ display: flex;
61
+ align-items: center;
62
+ gap: 0.75rem;
63
+ padding: 0.75rem 1.5rem;
64
+ border: none;
65
+ border-radius: 8px;
66
+ background: var(--accent-color);
67
+ color: white;
+ font-weight: 600;
+ cursor: pointer;
+ transition: all 0.2s ease;
+ }
+
+ .record-toggle.recording {
+ background: var(--error-color);
+ animation: pulse-red 2s infinite;
+ }
+
+ .reset-session {
+ display: flex;
+ align-items: center;
+ justify-content: center;
+ width: 42px;
+ height: 42px;
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ background: rgba(255, 255, 255, 0.05);
+ color: var(--text-secondary);
+ border-radius: 8px;
+ cursor: pointer;
+ transition: all 0.2s ease;
+ }
+
+ .reset-session:hover {
+ background: rgba(255, 255, 255, 0.1);
+ color: var(--text-primary);
+ }
+
+ /* Pipeline Columns */
+ .pipeline-columns {
+ display: grid;
+ grid-template-columns: repeat(3, 1fr);
+ gap: 1.5rem;
+ flex: 2;
+ /* Take more relative space */
+ min-height: 450px;
+ /* Force substantial height */
+ }
+
+ .pipeline-col {
+ background: var(--card-bg);
+ border-radius: 12px;
+ display: flex;
+ flex-direction: column;
+ overflow: hidden;
+ border: 1px solid rgba(255, 255, 255, 0.05);
+ box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1);
+ }
+
+ .col-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ gap: 0.75rem;
+ padding: 1rem 1.25rem;
+ background: rgba(255, 255, 255, 0.03);
+ border-bottom: 1px solid rgba(255, 255, 255, 0.05);
+ }
+
+ .title-with-model {
+ display: flex;
+ align-items: center;
+ gap: 0.75rem;
+ }
+
+ .model-tag {
+ font-size: 0.625rem;
+ font-weight: 700;
+ text-transform: uppercase;
+ padding: 0.25rem 0.625rem;
+ background: rgba(255, 255, 255, 0.06);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ border-radius: 999px;
+ color: var(--text-secondary);
+ letter-spacing: 0.05em;
+ }
+
+ .col-header.stt .model-tag {
+ color: #60a5fa;
+ border-color: rgba(96, 165, 250, 0.3);
+ background: rgba(96, 165, 250, 0.1);
+ }
+
+ .col-header.llm .model-tag {
+ color: #a78bfa;
+ border-color: rgba(167, 139, 250, 0.3);
+ background: rgba(167, 139, 250, 0.1);
+ }
+
+ .col-header.tts .model-tag {
+ color: #4ade80;
+ border-color: rgba(74, 222, 128, 0.3);
+ background: rgba(74, 222, 128, 0.1);
+ }
+
+ .col-header h3 {
+ font-size: 1rem;
+ font-weight: 700;
+ letter-spacing: 0.025em;
+ text-transform: uppercase;
+ }
+
+ .col-header.stt {
+ color: #60a5fa;
+ }
+
+ .col-header.llm {
+ color: #a78bfa;
+ }
+
+ .col-header.tts {
+ color: #4ade80;
+ }
+
+ .col-content {
+ flex: 1;
+ padding: 1.25rem;
+ display: flex;
+ flex-direction: column;
+ overflow-y: auto;
+ position: relative;
+ }
+
+ .text-display {
+ flex: 1;
+ padding: 1.5rem;
+ background: rgba(15, 23, 42, 0.4);
+ border: 1px solid rgba(255, 255, 255, 0.05);
+ border-radius: 12px;
+ font-size: 1rem;
+ line-height: 1.7;
+ backdrop-filter: blur(4px);
+ transition: all 0.3s ease;
+ }
+
+ .text-display:hover {
+ background: rgba(15, 23, 42, 0.6);
+ border-color: rgba(255, 255, 255, 0.1);
+ }
+
+ .empty-msg {
+ color: var(--text-secondary);
+ font-style: italic;
+ font-size: 0.875rem;
+ text-align: center;
+ margin-top: 2rem;
+ }
+
+ /* STT Column */
+ .final-text {
+ color: var(--text-primary);
+ }
+
+ .interim-text {
+ color: var(--text-secondary);
+ font-style: italic;
+ }
+
+ .recording-pulse {
+ margin-top: 1rem;
+ font-size: 0.75rem;
+ color: var(--error-color);
+ display: flex;
+ align-items: center;
+ gap: 0.5rem;
+ }
+
+ .recording-pulse::before {
+ content: '';
+ width: 8px;
+ height: 8px;
+ background: var(--error-color);
+ border-radius: 50%;
+ animation: pulse-red 1s infinite;
+ }
+
+ .mic-muted-status {
+ margin-top: 1rem;
+ font-size: 0.75rem;
+ color: var(--accent-color);
+ display: flex;
+ align-items: center;
+ gap: 0.5rem;
+ padding: 0.5rem;
+ background: rgba(59, 130, 246, 0.1);
+ border-radius: 6px;
+ animation: fade-in 0.3s ease-out;
+ }
+
+ /* LLM Column */
+ .loading-state {
+ display: flex;
+ flex-direction: column;
+ align-items: center;
+ justify-content: center;
+ height: 100%;
+ gap: 1rem;
+ color: #a78bfa;
+ }
+
+ .spinner {
+ animation: rotate 2s linear infinite;
+ }
+
+ .response-box {
+ animation: fade-in 0.3s ease-out;
+ }
+
+ .response-text {
+ color: #e9d5ff;
+ }
+
+ /* TTS Column */
+ .tts-status {
+ flex: 1;
+ display: flex;
+ flex-direction: column;
+ align-items: center;
+ justify-content: center;
+ gap: 1.5rem;
+ }
+
+ .status-indicator {
+ display: flex;
+ flex-direction: column;
+ align-items: center;
+ gap: 1rem;
+ color: var(--text-secondary);
+ transition: all 0.3s ease;
+ }
+
+ .status-indicator.playing {
+ color: #4ade80;
+ }
+
+ .bouncing {
+ animation: bounce 1s infinite ease-in-out;
+ }
+
+ .replay-btn {
+ display: flex;
+ align-items: center;
+ gap: 0.5rem;
+ padding: 0.5rem 1rem;
+ background: rgba(255, 255, 255, 0.1);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ border-radius: 6px;
+ color: var(--text-primary);
+ font-size: 0.8125rem;
+ cursor: pointer;
+ }
+
+ .voice-selection-compact {
+ display: flex;
+ flex-direction: column;
+ gap: 0.5rem;
+ margin-bottom: 1.5rem;
+ padding: 1rem;
+ background: rgba(255, 255, 255, 0.03);
+ border: 1px solid rgba(255, 255, 255, 0.05);
+ border-radius: 10px;
+ }
+
+ .voice-selection-compact label {
+ font-size: 0.75rem;
+ text-transform: uppercase;
+ font-weight: 700;
+ color: var(--text-secondary);
+ letter-spacing: 0.05em;
+ }
+
+ .voice-selection-compact select {
+ background: rgba(15, 23, 42, 0.6);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ color: white;
+ padding: 0.625rem;
+ border-radius: 6px;
+ font-size: 0.875rem;
+ outline: none;
+ cursor: pointer;
+ transition: all 0.2s ease;
+ }
+
+ .voice-selection-compact select:hover {
+ background: rgba(15, 23, 42, 0.8);
+ border-color: var(--accent-color);
+ }
+
+ .auto-toggle {
+ display: flex;
+ align-items: center;
+ gap: 0.75rem;
+ padding-top: 1.25rem;
+ border-top: 1px solid rgba(255, 255, 255, 0.05);
+ font-size: 0.8125rem;
+ color: var(--text-secondary);
+ }
+
+ /* History Tray */
+ .history-tray {
+ flex: 1;
+ /* Limit growth compared to pipeline */
+ min-height: 150px;
+ background: rgba(30, 41, 59, 0.5);
+ backdrop-filter: blur(8px);
+ border-radius: 16px;
+ padding: 1.5rem;
+ border: 1px solid rgba(255, 255, 255, 0.08);
+ display: flex;
+ flex-direction: column;
+ }
+
+ .history-tray h4 {
+ font-size: 0.875rem;
+ font-weight: 600;
+ margin-bottom: 1rem;
+ color: var(--text-secondary);
+ }
+
+ .history-list {
+ display: flex;
+ flex-direction: column;
+ gap: 0.75rem;
+ max-height: 300px;
+ overflow-y: auto;
+ }
+
+ .history-item {
+ font-size: 0.875rem;
+ display: flex;
+ gap: 0.5rem;
+ }
+
+ .h-role {
+ font-weight: 600;
+ min-width: 40px;
+ }
+
+ .user .h-role {
+ color: #60a5fa;
+ }
+
+ .assistant .h-role {
+ color: #a78bfa;
+ }
+
+ .no-history {
+ font-size: 0.8125rem;
+ color: var(--text-secondary);
+ font-style: italic;
+ }
+
+ /* Switch Toggle */
+ .switch {
+ position: relative;
+ display: inline-block;
+ width: 34px;
+ height: 20px;
+ }
+
+ .switch input {
+ opacity: 0;
+ width: 0;
+ height: 0;
+ }
+
+ .slider {
+ position: absolute;
+ cursor: pointer;
+ top: 0;
+ left: 0;
+ right: 0;
+ bottom: 0;
+ background-color: #334155;
+ transition: .4s;
+ }
+
+ .slider:before {
+ position: absolute;
+ content: "";
+ height: 14px;
+ width: 14px;
+ left: 3px;
+ bottom: 3px;
+ background-color: white;
+ transition: .4s;
+ }
+
+ input:checked+.slider {
+ background-color: var(--success-color);
+ }
+
+ input:checked+.slider:before {
+ transform: translateX(14px);
+ }
+
+ .slider.round {
+ border-radius: 34px;
+ }
+
+ .slider.round:before {
+ border-radius: 50%;
+ }
+
+ /* Animations */
+ @keyframes pulse-red {
+ 0% {
+ box-shadow: 0 0 0 0 rgba(239, 68, 68, 0.4);
+ }
+
+ 70% {
+ box-shadow: 0 0 0 10px rgba(239, 68, 68, 0);
+ }
+
+ 100% {
+ box-shadow: 0 0 0 0 rgba(239, 68, 68, 0);
+ }
+ }
+
+ @keyframes rotate {
+ from {
+ transform: rotate(0deg);
+ }
+
+ to {
+ transform: rotate(360deg);
+ }
+ }
+
+ @keyframes bounce {
+
+ 0%,
+ 100% {
+ transform: translateY(0);
+ }
+
+ 50% {
+ transform: translateY(-10px);
+ }
+ }
+
+ @keyframes fade-in {
+ from {
+ opacity: 0;
+ transform: translateY(5px);
+ }
+
+ to {
+ opacity: 1;
+ transform: translateY(0);
+ }
+ }
+
+ /* Settings Overlay */
+ .pipeline-settings-overlay {
+ position: fixed;
+ top: 0;
+ left: 0;
+ right: 0;
+ bottom: 0;
+ background: rgba(15, 23, 42, 0.8);
+ backdrop-filter: blur(8px);
+ display: flex;
+ align-items: center;
+ justify-content: center;
+ z-index: 1000;
+ animation: fade-in 0.2s ease-out;
+ }
+
+ .settings-card {
+ background: var(--card-bg);
+ padding: 2.5rem;
+ border-radius: 20px;
+ width: 100%;
+ max-width: 650px;
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ box-shadow: 0 25px 50px -12px rgba(0, 0, 0, 0.5);
+ }
+
+ .settings-card h3 {
+ margin-bottom: 1.5rem;
+ font-size: 1.25rem;
+ }
+
+ .setting-item {
+ display: flex;
+ flex-direction: column;
+ gap: 0.75rem;
+ margin-bottom: 2rem;
+ }
+
+ .setting-item label {
+ font-size: 0.875rem;
+ color: var(--text-secondary);
+ font-weight: 500;
+ }
+
+ .status-badge {
+ padding: 0.75rem 1rem;
+ background: rgba(0, 0, 0, 0.2);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ border-radius: 8px;
+ font-size: 0.875rem;
+ display: flex;
+ align-items: center;
+ gap: 0.5rem;
+ }
+
+ .setting-item textarea {
+ width: 100%;
+ padding: 0.75rem 1rem;
+ background: rgba(0, 0, 0, 0.3);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ border-radius: 8px;
+ color: white;
+ font-size: 0.9rem;
+ resize: none;
+ line-height: 1.5;
+ outline: none;
+ transition: border-color 0.2s;
+ }
+
+ .setting-item textarea:focus {
+ border-color: var(--accent-color);
+ }
+
+ .prompt-presets {
+ margin-top: 1.5rem;
+ margin-bottom: 1.5rem;
+ }
+
+ .prompt-presets label {
+ display: block;
+ font-size: 0.75rem;
+ text-transform: uppercase;
+ font-weight: 700;
+ color: var(--text-secondary);
+ margin-bottom: 0.75rem;
+ }
+
+ .preset-btns {
+ display: flex;
+ flex-wrap: wrap;
+ gap: 0.5rem;
+ }
+
+ .preset-btn {
+ background: rgba(255, 255, 255, 0.05);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ color: var(--text-secondary);
+ padding: 0.5rem 0.75rem;
+ border-radius: 6px;
+ font-size: 0.8rem;
+ cursor: pointer;
+ transition: all 0.2s;
+ }
+
+ .preset-btn:hover {
+ background: rgba(255, 255, 255, 0.1);
+ color: white;
+ border-color: var(--accent-color);
+ }
+
+ .hint {
+ font-size: 0.75rem;
+ color: var(--text-secondary);
+ font-style: italic;
+ }
+
+ .close-settings {
+ width: 100%;
+ padding: 1rem;
+ background: var(--accent-color);
+ border: none;
+ border-radius: 10px;
+ color: white;
+ font-weight: 600;
+ cursor: pointer;
+ transition: all 0.2s ease;
+ }
+
+ .close-settings:hover {
+ background: var(--accent-hover);
+ transform: translateY(-2px);
+ }
web_demo/src/components/SttLlmTts.jsx ADDED
@@ -0,0 +1,505 @@
+ import React, { useState, useEffect, useRef } from 'react';
+ import { Mic, MicOff, MessageSquare, Volume2, Loader2, Send, RotateCcw, Settings } from 'lucide-react';
+ import './SttLlmTts.css';
+
+ function SttLlmTts() {
+ const [isRecording, setIsRecording] = useState(false);
+ const [sttText, setSttText] = useState('');
+ const [interimStt, setInterimStt] = useState('');
+ const [llmResponse, setLlmResponse] = useState('');
+ const [isLlmLoading, setIsLlmLoading] = useState(false);
+ const [ttsStatus, setTtsStatus] = useState('Idle');
+ const [history, setHistory] = useState(() => {
+ const saved = localStorage.getItem('nv_history');
+ return saved ? JSON.parse(saved) : [];
+ });
+ const [error, setError] = useState('');
+ const [autoMode, setAutoMode] = useState(true);
+ const [apiKey, setApiKey] = useState(import.meta.env.VITE_OPENAI_API_KEY || ''); // Load from .env if available
+ const [showSettings, setShowSettings] = useState(false);
+ const [voices, setVoices] = useState([]);
+ const [selectedVoiceURI, setSelectedVoiceURI] = useState('');
+ const [systemPrompt, setSystemPrompt] = useState(() => {
+ return localStorage.getItem('nv_system_prompt') || 'You are a professional Health Insurance Seller. Start by greeting the user and asking if they want a plan for themselves or their family. Keep answers brief.';
+ });
+
+ const recognitionRef = useRef(null);
+ const synthRef = useRef(window.speechSynthesis);
+ const scrollRef = useRef(null);
+ const isBusyRef = useRef(false);
+ const autoModeRef = useRef(true);
+ const isRecordingRef = useRef(false); // New: Track recording state for handlers
+ const isMicActiveRef = useRef(false); // New: Track hardware status to prevent lock-up
+ const silenceTimerRef = useRef(null); // Ref for auto-processing on silence
+ const [isMicActuallyListening, setIsMicActuallyListening] = useState(false);
+
+ // Persistent Storage
+ useEffect(() => {
+ localStorage.setItem('nv_history', JSON.stringify(history));
+ }, [history]);
+
+ useEffect(() => {
+ localStorage.setItem('nv_system_prompt', systemPrompt);
+ }, [systemPrompt]);
+
+ // Auto-scroll to bottom
+ useEffect(() => {
+ if (scrollRef.current) {
+ scrollRef.current.scrollTop = scrollRef.current.scrollHeight;
+ }
+ }, [history, sttText, interimStt, llmResponse]);
+
+ // Load voices
+ useEffect(() => {
+ const loadVoices = () => {
+ const availableVoices = synthRef.current.getVoices();
+ setVoices(availableVoices);
+
+ // Default to Indian English if not already set
+ if (!selectedVoiceURI && availableVoices.length > 0) {
+ const indianVoice = availableVoices.find(v => v.lang === 'en-IN' || v.name.includes('India'));
+ const defaultVoice = indianVoice || availableVoices[0];
+ setSelectedVoiceURI(defaultVoice.voiceURI || defaultVoice.name);
+ }
+ };
+
+ loadVoices();
+ if (synthRef.current.onvoiceschanged !== undefined) {
+ synthRef.current.onvoiceschanged = loadVoices;
+ }
+ }, [selectedVoiceURI]);
+
+ // Initialize STT
+ useEffect(() => {
+ if ('webkitSpeechRecognition' in window || 'SpeechRecognition' in window) {
+ const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
+ recognitionRef.current = new SpeechRecognition();
+ recognitionRef.current.continuous = false; // Stop when the user pauses
+ recognitionRef.current.interimResults = true;
+ recognitionRef.current.lang = 'en-IN'; // Indian English as default
+
+ recognitionRef.current.onresult = (event) => {
+ let interim = '';
+ let final = '';
+
+ for (let i = event.resultIndex; i < event.results.length; i++) {
+ const piece = event.results[i][0].transcript;
+ if (event.results[i].isFinal) {
+ final += piece;
+ } else {
+ interim += piece;
+ }
+ }
+
+ if (final) {
+ clearTimeout(silenceTimerRef.current);
+ handleFinalStt(final);
+ } else {
+ setInterimStt(interim);
+
+ // AUTO-STOP: If we have interim text but no final for 1.5s, process it anyway
+ if (interim.trim()) {
+ clearTimeout(silenceTimerRef.current);
+ silenceTimerRef.current = setTimeout(() => {
+ handleFinalStt(interim);
+ if (recognitionRef.current) recognitionRef.current.stop();
+ }, 1500);
+ }
+ }
+ };
+
+ recognitionRef.current.onerror = (event) => {
+ if (event.error !== 'no-speech') {
+ setError(`STT Error: ${event.error}`);
+ }
+ };
+
+ recognitionRef.current.onstart = () => {
+ isMicActiveRef.current = true;
+ setIsMicActuallyListening(true);
+ };
+
+ recognitionRef.current.onend = () => {
+ isMicActiveRef.current = false;
+ setIsMicActuallyListening(false);
+
+ // Hardware cooldown: Wait 300ms before attempting to restart to avoid hardware lock
+ setTimeout(() => {
+ if (isRecordingRef.current && !isBusyRef.current && !isMicActiveRef.current) {
+ try {
+ recognitionRef.current.start();
+ } catch (e) {
+ console.log("Mic restart safe-check:", e.message);
+ }
+ }
+ }, 300);
+ };
+ } else {
+ setError('Speech recognition not supported in this browser.');
+ }
+
+ return () => {
+ if (recognitionRef.current) recognitionRef.current.stop();
+ synthRef.current.cancel();
+ };
+ }, [isRecording]);
+
+ const handleFinalStt = (text) => {
+ if (!text.trim() || isBusyRef.current) return;
+
+ clearTimeout(silenceTimerRef.current);
+ setSttText(text); // Clear previous and show new turn
+ setInterimStt('');
+
+ // STEP 1: Lock the turn
+ isBusyRef.current = true;
+
+ if (recognitionRef.current) {
+ try { recognitionRef.current.stop(); } catch (e) { }
+ }
+ processLlm(text);
+ };
+
+ const processLlm = async (text) => {
+ setIsLlmLoading(true);
+ setLlmResponse('Thinking...');
+
+ try {
+ let responseText = '';
+
+ if (apiKey) {
+ // REAL AI CALL
+ const response = await fetch('https://api.openai.com/v1/chat/completions', {
+ method: 'POST',
+ headers: {
+ 'Content-Type': 'application/json',
+ 'Authorization': `Bearer ${apiKey}`
+ },
+ body: JSON.stringify({
+ model: 'gpt-4o-mini',
+ messages: [
+ { role: 'system', content: systemPrompt },
+ ...history.slice(-12),
+ { role: 'user', content: text }
+ ],
+ temperature: 0.7,
+ max_tokens: 100
+ })
+ });
+
+ const data = await response.json();
+ if (data.error) throw new Error(data.error.message);
+ responseText = data.choices[0].message.content;
+ } else {
+ // FALLBACK SMART MOCK (for when no key is present)
+ await new Promise(r => setTimeout(r, 1000));
+ responseText = generateFallbackResponse(text);
+ }
+
+ setLlmResponse(responseText);
+ setHistory(prev => [...prev, { role: 'user', content: text }, { role: 'assistant', content: responseText }]);
+
+ if (autoModeRef.current) {
+ speakText(responseText);
+ } else {
+ isBusyRef.current = false;
+ // Re-awaken mic if auto-play is off
+ setTimeout(() => {
+ if (isRecordingRef.current && !isMicActiveRef.current) {
+ try { recognitionRef.current.start(); } catch (e) { }
+ }
+ }, 300);
+ }
+ } catch (err) {
+ setError(`LLM Error: ${err.message}`);
+ setIsLlmLoading(false);
+ isBusyRef.current = false;
+ // Re-awaken mic on error
+ setTimeout(() => {
+ if (isRecordingRef.current && !isMicActiveRef.current) {
+ try { recognitionRef.current.start(); } catch (e) { }
+ }
+ }, 300);
+ } finally {
+ setIsLlmLoading(false);
+ }
+ };
+
+ const generateFallbackResponse = (input) => {
+ const text = input.toLowerCase();
+ if (text.includes('prime minister') && text.includes('india')) return "As of January 2026, Narendra Modi is the Prime Minister of India.";
+ if (text.includes('hello') || text.includes('hi')) return "Hello! How can I help you today?";
+ if (text.includes('time')) return `The current time is ${new Date().toLocaleTimeString()}.`;
+ return `I processed your request: "${input}". For real answers, please add your OpenAI API Key in settings.`;
+ };
+
+ const speakText = (text) => {
+ synthRef.current.cancel();
+ const utterance = new SpeechSynthesisUtterance(text);
+
+ // ALWAYS find the fresh voice object from the system right before speaking
+ const currentVoices = synthRef.current.getVoices();
+ const voice = currentVoices.find(v => (v.voiceURI || v.name) === selectedVoiceURI);
+
+ if (voice) {
+ utterance.voice = voice;
+ }
+
+ utterance.onstart = () => {
+ setTtsStatus('Playing...');
+ // Mic is already stopped by handleFinalStt, but we ensure busy state remains
+ isBusyRef.current = true;
+ };
+
+ utterance.onend = () => {
+ setTtsStatus('Finished');
+ isBusyRef.current = false;
+ // STEP 3: Clear busy state and resume mic with cooldown
+ setTimeout(() => {
+ if (isRecordingRef.current && !isMicActiveRef.current) {
+ try { recognitionRef.current.start(); } catch (e) { }
+ }
+ }, 300);
+ };
+
+ utterance.onerror = () => {
+ setTtsStatus('Error');
+ isBusyRef.current = false;
+ setTimeout(() => {
+ if (isRecordingRef.current && !isMicActiveRef.current) {
+ try { recognitionRef.current.start(); } catch (e) { }
+ }
+ }, 300);
+ };
+
+ synthRef.current.speak(utterance);
+ };
+
+ const toggleRecording = () => {
+ if (isRecording) {
+ isRecordingRef.current = false;
+ isBusyRef.current = false;
+ try { recognitionRef.current.stop(); } catch (e) { }
+ setIsRecording(false);
+ } else {
+ setSttText('');
+ setInterimStt('');
+ setError('');
+ isRecordingRef.current = true;
+ isBusyRef.current = false;
+ try { recognitionRef.current.start(); } catch (e) { }
+ setIsRecording(true);
+ }
+ };
+
+ const resetAll = () => {
+ setHistory([]);
+ setSttText('');
+ setInterimStt('');
+ setLlmResponse('');
+ setTtsStatus('Idle');
+ synthRef.current.cancel();
+ };
+
+ return (
+ <div className="stt-llm-tts-test">
+ <div className="test-header">
+ <div className="title-group">
+ <h2>STT → LLM → TTS Pipeline</h2>
+ <p className="subtitle">Full Loop: Voice In, AI Processing, Voice Out</p>
+ </div>
+ <div className="action-buttons">
+ <button className="settings-btn" onClick={() => setShowSettings(!showSettings)} title="AI Configuration">
+ <Settings size={18} />
+ </button>
+ <button className={`record-toggle ${isRecording ? 'recording' : ''}`} onClick={toggleRecording}>
+ {isRecording ? <MicOff size={20} /> : <Mic size={20} />}
+ <span>{isRecording ? 'Stop Session' : 'Start Session'}</span>
+ </button>
+ <button className="reset-session" onClick={resetAll}>
+ <RotateCcw size={18} />
+ </button>
+ </div>
+ </div>
+
+ {showSettings && (
+ <div className="pipeline-settings-overlay">
+ <div className="settings-card">
+ <h3>AI Agent Configuration</h3>
+ <div className="setting-item">
+ <label>Agent Personality (System Prompt)</label>
+ <textarea
+ value={systemPrompt}
+ onChange={(e) => setSystemPrompt(e.target.value)}
+ rows={4}
+ placeholder="Example: You are a professional doctor assistant..."
+ />
+ <p className="hint">This tells the AI how to behave and what to ask.</p>
+ </div>
+
+ <div className="prompt-presets">
+ <label>Quick Presets</label>
+ <div className="preset-btns">
+ <button className="preset-btn" onClick={() => setSystemPrompt('You are a concise voice assistant. Give short answers (max 20 words).')}>
+ General Assistant
+ </button>
+ <button className="preset-btn" onClick={() => setSystemPrompt(`You are a highly professional Health Insurance Sales Agent.
+ Follow this EXACT conversation flow:
+ 1. Greet the user and ask if they are looking for a plan for themselves or their family.
+ 2. Once they answer, ask for the ages of the people to be insured.
+ 3. Next, ask if anyone has any pre-existing medical conditions (Yes/No).
+ 4. Finally, ask for their preferred annual budget.
+
+ Rules:
+ - Ask only ONE question at a time.
+ - Keep your responses under 20 words.
+ - Be polite, empathetic, and professional.
+ - If they say something unrelated, steer them back to the last question.`)}>
+ Health Insurance Agent (Structured)
+ </button>
+ <button className="preset-btn" onClick={() => setSystemPrompt('You are a helpful travel agent. Ask the user about their favorite destination. Keep answers short.')}>
+ Travel Agent
+ </button>
+ </div>
+ </div>
+
+ <button className="close-settings" onClick={() => setShowSettings(false)}>Save & Close</button>
+ </div>
+ </div>
+ )}
+
+ <div className="pipeline-columns">
+ {/* Column 1: STT */}
+ <div className="pipeline-col">
+ <div className="col-header stt">
+ <div className="title-with-model">
+ <Mic size={18} />
+ <h3>1. Speech-to-Text</h3>
+ </div>
+ <span className="model-tag">Web Speech API / Vosk</span>
+ </div>
+ <div className="col-content" ref={scrollRef}>
+ <div className="text-display stt-display">
+ {sttText && <p className="final-text">{sttText}</p>}
+ {interimStt && <p className="interim-text">{interimStt}</p>}
+ {!sttText && !interimStt && (
+ <p className="empty-msg">Speak into your mic to start...</p>
+ )}
+ </div>
+ {isMicActuallyListening && (
+ <div className="mic-muted-status listening">
+ <div className="pulse-dot"></div> Listening...
+ </div>
+ )}
+ {isRecording && isBusyRef.current && (
+ <div className="mic-muted-status processing">
+ <Volume2 size={14} /> AI is processing/speaking...
+ </div>
+ )}
+ </div>
+ </div>
+
+ {/* Column 2: LLM */}
+ <div className="pipeline-col">
+ <div className="col-header llm">
+ <div className="title-with-model">
+ <MessageSquare size={18} />
+ <h3>2. LLM Processing</h3>
+ </div>
+ <span className="model-tag">GPT-4o-mini</span>
+ </div>
+ <div className="col-content">
+ <div className="text-display llm-display">
+ {isLlmLoading ? (
+ <div className="loading-state">
+ <Loader2 className="spinner" size={32} />
+ <p>Processing...</p>
+ </div>
+ ) : llmResponse ? (
+ <div className="response-box">
+ <p className="response-text">{llmResponse}</p>
+ </div>
+ ) : (
+ <p className="empty-msg">Waiting for STT input...</p>
+ )}
+ </div>
+ </div>
+ </div>
+
+ {/* Column 3: TTS */}
+ <div className="pipeline-col">
+ <div className="col-header tts">
+ <div className="title-with-model">
+ <Volume2 size={18} />
+ <h3>3. Text-to-Speech</h3>
+ </div>
+ <span className="model-tag">Web Speech API / Piper</span>
+ </div>
+ <div className="col-content">
+ <div className="tts-status">
+ <div className={`status-indicator ${ttsStatus.toLowerCase().replace('...', '')}`}>
+ <Volume2 size={48} className={ttsStatus === 'Playing...' ? 'bouncing' : ''} />
+ <p>{ttsStatus}</p>
+ </div>
+ {llmResponse && !isLlmLoading && (
+ <button className="replay-btn" onClick={() => speakText(llmResponse)}>
+ <Volume2 size={16} /> Replay
+ </button>
+ )}
+ </div>
+
+ <div className="voice-selection-compact">
+ <label>Voice Output</label>
+ <select
+ value={selectedVoiceURI}
+ onChange={(e) => setSelectedVoiceURI(e.target.value)}
+ >
+ {voices.map(v => (
+ <option key={v.voiceURI || v.name} value={v.voiceURI || v.name}>
+ {v.name} ({v.lang})
+ </option>
+ ))}
+ </select>
+ </div>
+
+ <div className="auto-toggle">
+ <label className="switch">
+ <input
+ type="checkbox"
+ checked={autoMode}
+ onChange={(e) => {
+ const val = e.target.checked;
+ setAutoMode(val);
+ autoModeRef.current = val;
+ }}
+ />
+ <span className="slider round"></span>
+ </label>
+ <span>Auto-play TTS</span>
+ </div>
+ </div>
+ </div>
+ </div>
+
+ {error && <div className="pipeline-error">{error}</div>}
+
+ <div className="history-tray">
+ <h4>Recent Interactions</h4>
+ <div className="history-list">
+ {history.length === 0 ? (
+ <p className="no-history">No history yet</p>
+ ) : (
+ history.map((h, i) => (
+ <div key={i} className={`history-item ${h.role}`}>
+ <span className="h-role">{h.role === 'user' ? 'You' : 'AI'}:</span>
+ <span className="h-text">{h.content}</span>
+ </div>
+ ))
+ )}
+ </div>
+ </div>
+ </div>
+ );
+ }
+
+ export default SttLlmTts;
web_demo/src/components/TextToSpeech.css ADDED
@@ -0,0 +1,321 @@
+ .text-to-speech {
+ display: flex;
+ flex-direction: column;
+ gap: 1.5rem;
+ padding: 0;
+ height: 100%;
+ width: 100%;
+ }
+
+ .tts-header {
+ text-align: center;
+ }
+
+ .tts-header h2 {
+ font-size: 1.5rem;
+ font-weight: 600;
+ margin-bottom: 0.5rem;
+ color: var(--text-primary);
+ }
+
+ /* Text Input Section */
+ .text-input-section {
+ display: flex;
+ flex-direction: column;
+ gap: 0.5rem;
+ }
+
+ .input-header {
+ display: flex;
+ justify-content: space-between;
+ align-items: center;
+ }
+
+ .input-header label {
+ font-weight: 500;
+ color: var(--text-primary);
+ font-size: 0.875rem;
+ }
+
+ .char-count {
+ font-size: 0.75rem;
+ color: var(--text-secondary);
+ }
+
+ .text-input {
+ width: 100%;
+ padding: 1rem;
+ background: rgba(255, 255, 255, 0.05);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ border-radius: 8px;
+ color: var(--text-primary);
+ font-size: 0.875rem;
+ font-family: inherit;
+ resize: vertical;
+ transition: all 0.2s ease;
+ }
+
+ .text-input:focus {
+ outline: none;
+ border-color: var(--accent-color);
+ background: rgba(255, 255, 255, 0.08);
+ box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.1);
+ }
+
+ .text-input::placeholder {
+ color: var(--text-secondary);
+ }
+
+ /* Sample Texts */
+ .sample-texts {
+ background: rgba(255, 255, 255, 0.03);
+ border-radius: 8px;
+ padding: 1rem;
+ }
+
+ .sample-label {
+ font-size: 0.75rem;
+ color: var(--text-secondary);
+ margin-bottom: 0.5rem;
+ font-weight: 500;
+ }
+
+ .sample-buttons {
+ display: flex;
+ gap: 0.5rem;
+ flex-wrap: wrap;
+ }
+
+ .sample-btn {
+ padding: 0.5rem 0.75rem;
+ background: rgba(255, 255, 255, 0.1);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ border-radius: 6px;
+ color: var(--text-secondary);
+ font-size: 0.75rem;
+ cursor: pointer;
+ transition: all 0.2s ease;
+ font-family: inherit;
+ }
+
+ .sample-btn:hover {
+ background: rgba(255, 255, 255, 0.15);
+ color: var(--text-primary);
+ border-color: var(--accent-color);
+ }
+
+ /* Voice Selector */
+ .voice-selector {
+ display: flex;
+ flex-direction: column;
+ gap: 0.5rem;
+ }
+
+ .voice-selector label {
+ display: flex;
+ align-items: center;
+ gap: 0.5rem;
+ font-weight: 500;
+ color: var(--text-primary);
+ font-size: 0.875rem;
+ }
+
+ .voice-select {
+ width: 100%;
+ padding: 0.75rem;
+ background: rgba(255, 255, 255, 0.05);
+ border: 1px solid rgba(255, 255, 255, 0.1);
+ border-radius: 8px;
+ color: var(--text-primary);
+ font-size: 0.875rem;
+ cursor: pointer;
+ font-family: inherit;
+ transition: all 0.2s ease;
+ }
+
+ .voice-select:focus {
+ outline: none;
+ border-color: var(--accent-color);
+ box-shadow: 0 0 0 3px rgba(59, 130, 246, 0.1);
+ }
+
+ .voice-select option {
+ background: var(--card-bg);
+ color: var(--text-primary);
+ }
+
+ /* Settings Section */
+ .settings-section {
+ background: rgba(255, 255, 255, 0.03);
+ border-radius: 8px;
+ overflow: hidden;
+ }
+
+ .settings-toggle {
+ width: 100%;
+ display: flex;
+ align-items: center;
+ justify-content: center;
+ gap: 0.5rem;
+ padding: 0.75rem;
+ background: transparent;
+ border: none;
+ color: var(--text-secondary);
+ font-size: 0.875rem;
+ cursor: pointer;
+ transition: all 0.2s ease;
+ font-family: inherit;
+ }
+
+ .settings-toggle:hover {
+ background: rgba(255, 255, 255, 0.05);
+ color: var(--text-primary);
+ }
+
+ .settings-panel {
+ padding: 1rem;
+ display: flex;
+ flex-direction: column;
+ gap: 1.5rem;
+ border-top: 1px solid rgba(255, 255, 255, 0.1);
181
+ }
182
+
183
+ .setting-control {
184
+ display: flex;
185
+ flex-direction: column;
186
+ gap: 0.5rem;
187
+ }
188
+
189
+ .setting-control label {
190
+ font-size: 0.875rem;
191
+ font-weight: 500;
192
+ color: var(--text-primary);
193
+ }
194
+
195
+ .slider {
196
+ width: 100%;
197
+ height: 6px;
198
+ border-radius: 3px;
199
+ background: rgba(255, 255, 255, 0.1);
200
+ outline: none;
201
+ -webkit-appearance: none;
202
+ appearance: none;
203
+ }
204
+
205
+ .slider::-webkit-slider-thumb {
206
+ -webkit-appearance: none;
207
+ appearance: none;
208
+ width: 18px;
209
+ height: 18px;
210
+ border-radius: 50%;
211
+ background: var(--accent-color);
212
+ cursor: pointer;
213
+ transition: all 0.2s ease;
214
+ }
215
+
216
+ .slider::-webkit-slider-thumb:hover {
217
+ background: #2563eb;
218
+ transform: scale(1.1);
219
+ }
220
+
221
+ .slider::-moz-range-thumb {
222
+ width: 18px;
223
+ height: 18px;
224
+ border-radius: 50%;
225
+ background: var(--accent-color);
226
+ cursor: pointer;
227
+ border: none;
228
+ transition: all 0.2s ease;
229
+ }
230
+
231
+ .slider::-moz-range-thumb:hover {
232
+ background: #2563eb;
233
+ transform: scale(1.1);
234
+ }
235
+
236
+ .slider-labels {
237
+ display: flex;
238
+ justify-content: space-between;
239
+ font-size: 0.7rem;
240
+ color: var(--text-secondary);
241
+ }
242
+
243
+ .reset-btn {
244
+ display: flex;
245
+ align-items: center;
246
+ justify-content: center;
247
+ gap: 0.5rem;
248
+ padding: 0.5rem 1rem;
249
+ background: rgba(255, 255, 255, 0.1);
250
+ border: 1px solid rgba(255, 255, 255, 0.1);
251
+ border-radius: 6px;
252
+ color: var(--text-secondary);
253
+ font-size: 0.8125rem;
254
+ cursor: pointer;
255
+ transition: all 0.2s ease;
256
+ font-family: inherit;
257
+ align-self: flex-start;
258
+ }
259
+
260
+ .reset-btn:hover {
261
+ background: rgba(255, 255, 255, 0.15);
262
+ color: var(--text-primary);
263
+ border-color: var(--accent-color);
264
+ }
265
+
266
+ /* Speaking Indicator */
267
+ .speaking-indicator {
268
+ display: flex;
269
+ flex-direction: column;
270
+ align-items: center;
271
+ gap: 1rem;
272
+ padding: 1.5rem;
273
+ background: rgba(59, 130, 246, 0.1);
274
+ border: 1px solid rgba(59, 130, 246, 0.3);
275
+ border-radius: 8px;
276
+ }
277
+
278
+ .sound-wave {
279
+ display: flex;
280
+ align-items: center;
281
+ justify-content: center;
282
+ gap: 4px;
283
+ height: 40px;
284
+ }
285
+
286
+ .wave-bar {
287
+ width: 4px;
288
+ height: 10px;
289
+ background: var(--accent-color);
290
+ border-radius: 2px;
291
+ animation: wave 1s ease-in-out infinite;
292
+ }
293
+
294
+ @keyframes wave {
295
+
296
+ 0%,
297
+ 100% {
298
+ height: 10px;
299
+ }
300
+
301
+ 50% {
302
+ height: 30px;
303
+ }
304
+ }
305
+
306
+ .speaking-indicator span {
307
+ color: var(--accent-color);
308
+ font-weight: 500;
309
+ font-size: 0.875rem;
310
+ }
311
+
312
+ /* Responsive */
313
+ @media (max-width: 640px) {
314
+ .sample-buttons {
315
+ flex-direction: column;
316
+ }
317
+
318
+ .sample-btn {
319
+ width: 100%;
320
+ }
321
+ }
web_demo/src/components/TextToSpeech.jsx ADDED
@@ -0,0 +1,327 @@
+ import React, { useState, useEffect, useRef } from 'react';
+ import { Volume2, VolumeX, Play, Pause, RotateCcw, Settings } from 'lucide-react';
+ import './TextToSpeech.css';
+
+ function TextToSpeech() {
+   const [text, setText] = useState('');
+   const [isSpeaking, setIsSpeaking] = useState(false);
+   const [isPaused, setIsPaused] = useState(false);
+   const [voices, setVoices] = useState([]);
+   const [selectedVoice, setSelectedVoice] = useState(null);
+   const [rate, setRate] = useState(1);
+   const [pitch, setPitch] = useState(1);
+   const [volume, setVolume] = useState(1);
+   const [error, setError] = useState('');
+   const [showSettings, setShowSettings] = useState(false);
+
+   const synthRef = useRef(window.speechSynthesis);
+
+   // Load available voices
+   useEffect(() => {
+     const loadVoices = () => {
+       const availableVoices = synthRef.current.getVoices();
+       setVoices(availableVoices);
+
+       // Prioritize Indian English voices (en-IN)
+       const indianVoice = availableVoices.find(voice =>
+         voice.lang === 'en-IN' ||
+         voice.lang === 'en_IN' ||
+         voice.name.toLowerCase().includes('india')
+       );
+
+       const defaultVoice = indianVoice ||
+         availableVoices.find(voice => voice.lang.startsWith('en')) ||
+         availableVoices[0];
+
+       setSelectedVoice(defaultVoice);
+     };
+
+     loadVoices();
+
+     // Chrome loads voices asynchronously
+     if (synthRef.current.onvoiceschanged !== undefined) {
+       synthRef.current.onvoiceschanged = loadVoices;
+     }
+
+     return () => {
+       synthRef.current.cancel();
+     };
+   }, []);
+
+   const speak = () => {
+     if (!text.trim()) {
+       setError('Please enter some text to speak');
+       return;
+     }
+
+     setError('');
+     synthRef.current.cancel(); // Cancel any ongoing speech
+
+     const utterance = new SpeechSynthesisUtterance(text);
+
+     if (selectedVoice) {
+       utterance.voice = selectedVoice;
+     }
+
+     utterance.rate = rate;
+     utterance.pitch = pitch;
+     utterance.volume = volume;
+
+     utterance.onstart = () => {
+       setIsSpeaking(true);
+       setIsPaused(false);
+     };
+
+     utterance.onend = () => {
+       setIsSpeaking(false);
+       setIsPaused(false);
+     };
+
+     utterance.onerror = (event) => {
+       console.error('Speech synthesis error:', event);
+       setError(`Error: ${event.error}`);
+       setIsSpeaking(false);
+       setIsPaused(false);
+     };
+
+     synthRef.current.speak(utterance);
+   };
+
+   const pause = () => {
+     if (synthRef.current.speaking && !synthRef.current.paused) {
+       synthRef.current.pause();
+       setIsPaused(true);
+     }
+   };
+
+   const resume = () => {
+     if (synthRef.current.paused) {
+       synthRef.current.resume();
+       setIsPaused(false);
+     }
+   };
+
+   const stop = () => {
+     synthRef.current.cancel();
+     setIsSpeaking(false);
+     setIsPaused(false);
+   };
+
+   const reset = () => {
+     setRate(1);
+     setPitch(1);
+     setVolume(1);
+   };
+
+   const sampleTexts = [
+     "Namaste! This is an Indian English voice test. How can I help you today?",
+     "Hello! This is a test of the text-to-speech system.",
+     "The quick brown fox jumps over the lazy dog.",
+     "Welcome to NeuralVoice AI. We're testing the speech synthesis capabilities of your browser.",
+   ];
+
+   const loadSampleText = (sample) => {
+     setText(sample);
+     setError('');
+   };
+
+   return (
+     <div className="text-to-speech">
+       <div className="tts-header">
+         <h2>Text-to-Speech Test</h2>
+         <p className="subtitle">Enter text and hear it spoken aloud</p>
+       </div>
+
+       {error && (
+         <div className="error-message">
+           <span>⚠️ {error}</span>
+         </div>
+       )}
+
+       <div className="text-input-section">
+         <div className="input-header">
+           <label htmlFor="text-input">Enter Text</label>
+           <span className="char-count">{text.length} characters</span>
+         </div>
+         <textarea
+           id="text-input"
+           className="text-input"
+           value={text}
+           onChange={(e) => setText(e.target.value)}
+           placeholder="Type or paste text here to convert to speech..."
+           rows={6}
+         />
+       </div>
+
+       <div className="sample-texts">
+         <p className="sample-label">Quick samples:</p>
+         <div className="sample-buttons">
+           {sampleTexts.map((sample, index) => (
+             <button
+               key={index}
+               className="sample-btn"
+               onClick={() => loadSampleText(sample)}
+             >
+               Sample {index + 1}
+             </button>
+           ))}
+         </div>
+       </div>
+
+       <div className="voice-selector">
+         <label htmlFor="voice-select">
+           <Settings size={16} />
+           Voice
+         </label>
+         <select
+           id="voice-select"
+           value={selectedVoice?.name || ''}
+           onChange={(e) => {
+             const voice = voices.find(v => v.name === e.target.value);
+             setSelectedVoice(voice);
+           }}
+           className="voice-select"
+         >
+           {voices.map((voice) => (
+             <option key={voice.name} value={voice.name}>
+               {voice.name} ({voice.lang})
+             </option>
+           ))}
+         </select>
+       </div>
+
+       <div className="settings-section">
+         <button
+           className="settings-toggle"
+           onClick={() => setShowSettings(!showSettings)}
+         >
+           <Settings size={18} />
+           <span>{showSettings ? 'Hide' : 'Show'} Advanced Settings</span>
+         </button>
+
+         {showSettings && (
+           <div className="settings-panel">
+             <div className="setting-control">
+               <label>
+                 Speed: {rate.toFixed(1)}x
+               </label>
+               <input
+                 type="range"
+                 min="0.5"
+                 max="2"
+                 step="0.1"
+                 value={rate}
+                 onChange={(e) => setRate(parseFloat(e.target.value))}
+                 className="slider"
+               />
+               <div className="slider-labels">
+                 <span>Slow</span>
+                 <span>Normal</span>
+                 <span>Fast</span>
+               </div>
+             </div>
+
+             <div className="setting-control">
+               <label>
+                 Pitch: {pitch.toFixed(1)}
+               </label>
+               <input
+                 type="range"
+                 min="0.5"
+                 max="2"
+                 step="0.1"
+                 value={pitch}
+                 onChange={(e) => setPitch(parseFloat(e.target.value))}
+                 className="slider"
+               />
+               <div className="slider-labels">
+                 <span>Low</span>
+                 <span>Normal</span>
+                 <span>High</span>
+               </div>
+             </div>
+
+             <div className="setting-control">
+               <label>
+                 Volume: {Math.round(volume * 100)}%
+               </label>
+               <input
+                 type="range"
+                 min="0"
+                 max="1"
+                 step="0.1"
+                 value={volume}
+                 onChange={(e) => setVolume(parseFloat(e.target.value))}
+                 className="slider"
+               />
+               <div className="slider-labels">
+                 <span>Quiet</span>
+                 <span>Normal</span>
+                 <span>Loud</span>
+               </div>
+             </div>
+
+             <button className="reset-btn" onClick={reset}>
+               <RotateCcw size={16} />
+               Reset to Defaults
+             </button>
+           </div>
+         )}
+       </div>
+
+       <div className="controls">
+         {!isSpeaking ? (
+           <button className="btn btn-primary" onClick={speak}>
+             <Play size={20} />
+             <span>Speak</span>
+           </button>
+         ) : (
+           <>
+             {!isPaused ? (
+               <button className="btn btn-warning" onClick={pause}>
+                 <Pause size={20} />
+                 <span>Pause</span>
+               </button>
+             ) : (
+               <button className="btn btn-success" onClick={resume}>
+                 <Play size={20} />
+                 <span>Resume</span>
+               </button>
+             )}
+             <button className="btn btn-danger" onClick={stop}>
+               <VolumeX size={20} />
+               <span>Stop</span>
+             </button>
+           </>
+         )}
+       </div>
+
+       {isSpeaking && (
+         <div className="speaking-indicator">
+           <div className="sound-wave">
+             {[...Array(5)].map((_, i) => (
+               <div
+                 key={i}
+                 className="wave-bar"
+                 style={{ animationDelay: `${i * 0.1}s` }}
+               />
+             ))}
+           </div>
+           <span>{isPaused ? 'Paused' : 'Speaking...'}</span>
+         </div>
+       )}
+
+       <div className="info-box">
+         <h4>💡 Tips:</h4>
+         <ul>
+           <li>Choose different voices to hear various accents and styles</li>
+           <li>Adjust speed, pitch, and volume for customized speech</li>
+           <li>Works in all modern browsers (Chrome, Firefox, Safari, Edge)</li>
+           <li>Try longer texts to test natural speech flow</li>
+         </ul>
+       </div>
+     </div>
+   );
+ }
+
+ export default TextToSpeech;
web_demo/src/index.css ADDED
@@ -0,0 +1,47 @@
+ :root {
+   --bg-primary: #0f172a;
+   --bg-secondary: #1e293b;
+   --text-primary: #f8fafc;
+   --text-secondary: #94a3b8;
+   --accent-primary: #3b82f6;
+   --accent-hover: #2563eb;
+   --accent-glow: rgba(59, 130, 246, 0.5);
+   --success: #10b981;
+   --error: #ef4444;
+   --font-sans: 'Inter', system-ui, -apple-system, sans-serif;
+ }
+
+ body {
+   margin: 0;
+   padding: 0;
+   background-color: var(--bg-primary);
+   color: var(--text-primary);
+   font-family: var(--font-sans);
+   -webkit-font-smoothing: antialiased;
+   min-height: 100vh;
+ }
+
+ #root {
+   width: 100%;
+   min-height: 100vh;
+   display: flex;
+   flex-direction: column;
+ }
+
+ /* Scrollbar */
+ ::-webkit-scrollbar {
+   width: 8px;
+ }
+
+ ::-webkit-scrollbar-track {
+   background: var(--bg-primary);
+ }
+
+ ::-webkit-scrollbar-thumb {
+   background: var(--bg-secondary);
+   border-radius: 4px;
+ }
+
+ ::-webkit-scrollbar-thumb:hover {
+   background: #334155;
+ }
web_demo/src/main.jsx ADDED
@@ -0,0 +1,10 @@
+ import { StrictMode } from 'react'
+ import { createRoot } from 'react-dom/client'
+ import './index.css'
+ import App from './App.jsx'
+
+ createRoot(document.getElementById('root')).render(
+   <StrictMode>
+     <App />
+   </StrictMode>,
+ )
web_demo/vite.config.js ADDED
@@ -0,0 +1,7 @@
+ import { defineConfig } from 'vite'
+ import react from '@vitejs/plugin-react'
+
+ // https://vite.dev/config/
+ export default defineConfig({
+   plugins: [react()],
+ })
+ })