# WebSocket Protocol

## Overview

The Voice-to-Voice Translator uses a WebSocket-based protocol for real-time bidirectional communication between clients and the server. Messages are exchanged in JSON format (text) and raw audio data (binary).

## Connection Endpoint

```
ws://host:port/ws
```

### Connection Parameters

Query parameters (optional):
- `token`: JWT authentication token (if auth enabled)
- `client_id`: Unique client identifier

Example:
```
ws://localhost:8000/ws?token=eyJhbGc...&client_id=client123
```

## Message Types

All text messages follow this structure:

```json
{
  "type": "message_type",
  "payload": { ... },
  "timestamp": "2025-12-17T10:30:00Z",
  "message_id": "uuid-v4"
}
```

### Client → Server Messages

#### 1. JOIN_ROOM

Join or create a translation room.

```json
{
  "type": "join_room",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "username": "John Doe",
    "source_lang": "en",
    "target_lang": "hi"
  }
}
```

**Fields**:
- `room_id`: Unique room identifier
- `user_id`: Unique user identifier
- `username`: Display name
- `source_lang`: User's speaking language (ISO 639-1 code)
- `target_lang`: Desired translation language

**Response**: `ROOM_JOINED` or `ERROR`

#### 2. LEAVE_ROOM

Leave current room.

```json
{
  "type": "leave_room",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}
```

**Response**: `ROOM_LEFT` or `ERROR`

#### 3. AUDIO_START

Signal start of audio stream.

```json
{
  "type": "audio_start",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "audio_config": {
      "sample_rate": 16000,
      "channels": 1,
      "format": "PCM16",
      "chunk_size": 4096
    }
  }
}
```

**Response**: `AUDIO_START_ACK`

#### 4. AUDIO_STOP

Signal end of audio stream.

```json
{
  "type": "audio_stop",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}
```

**Response**: `AUDIO_STOP_ACK`

#### 5. TEXT_MESSAGE

Send text message (for chat or corrections).

```json
{
  "type": "text_message",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "text": "Hello, how are you?",
    "lang": "en"
  }
}
```

**Response**: `TEXT_MESSAGE` (broadcast to room)

#### 6. PING

Heartbeat message.

```json
{
  "type": "ping",
  "payload": {}
}
```

**Response**: `PONG`

#### 7. GET_ROOM_INFO

Request current room state.

```json
{
  "type": "get_room_info",
  "payload": {
    "room_id": "room123"
  }
}
```

**Response**: `ROOM_INFO`

### Server → Client Messages

#### 1. ROOM_JOINED

Confirmation of room join.

```json
{
  "type": "room_joined",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "users": [
      {
        "user_id": "user1",
        "username": "John Doe",
        "source_lang": "en",
        "target_lang": "hi"
      },
      {
        "user_id": "user2",
        "username": "Jane Smith",
        "source_lang": "hi",
        "target_lang": "en"
      }
    ]
  }
}
```

#### 2. ROOM_LEFT

Confirmation of room leave.

```json
{
  "type": "room_left",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}
```

#### 3. USER_JOINED

Broadcast when another user joins.

```json
{
  "type": "user_joined",
  "payload": {
    "room_id": "room123",
    "user": {
      "user_id": "user2",
      "username": "Jane Smith",
      "source_lang": "hi",
      "target_lang": "en"
    }
  }
}
```

#### 4. USER_LEFT

Broadcast when another user leaves.

```json
{
  "type": "user_left",
  "payload": {
    "room_id": "room123",
    "user_id": "user2",
    "username": "Jane Smith"
  }
}
```

#### 5. TRANSCRIPTION

Intermediate transcription result (from STT).

```json
{
  "type": "transcription",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "text": "Hello how are you",
    "lang": "en",
    "is_final": false,
    "confidence": 0.85
  }
}
```

#### 6. TRANSLATION

Translation result.

```json
{
  "type": "translation",
  "payload": {
    "room_id": "room123",
    "source_user_id": "user1",
    "target_user_id": "user2",
    "original_text": "Hello, how are you?",
    "translated_text": "नमस्ते, आप कैसे हैं?",
    "source_lang": "en",
    "target_lang": "hi",
    "confidence": 0.92
  }
}
```

#### 7. AUDIO_START_ACK

Acknowledgment of audio start.

```json
{
  "type": "audio_start_ack",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "ready": true
  }
}
```

#### 8. AUDIO_STOP_ACK

Acknowledgment of audio stop.

```json
{
  "type": "audio_stop_ack",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}
```

#### 9. PONG

Heartbeat response.

```json
{
  "type": "pong",
  "payload": {
    "timestamp": "2025-12-17T10:30:00Z"
  }
}
```

#### 10. ROOM_INFO

Room state information.

```json
{
  "type": "room_info",
  "payload": {
    "room_id": "room123",
    "created_at": "2025-12-17T10:00:00Z",
    "users": [ ... ],
    "active_speakers": ["user1"],
    "supported_languages": ["en", "hi"]
  }
}
```

#### 11. ERROR

Error notification.

```json
{
  "type": "error",
  "payload": {
    "code": "INVALID_ROOM",
    "message": "Room does not exist",
    "details": "Room 'room123' not found",
    "recoverable": true
  }
}
```

**Error Codes**:
- `INVALID_ROOM`: Room not found
- `ROOM_FULL`: Maximum users reached
- `INVALID_MESSAGE`: Malformed message
- `AUTH_FAILED`: Authentication failed
- `RATE_LIMIT`: Too many requests
- `INTERNAL_ERROR`: Server error
- `UNSUPPORTED_LANGUAGE`: Language not available
- `AUDIO_ERROR`: Audio processing error

## Binary Audio Messages

Audio data is sent as binary WebSocket frames.

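As a rough illustration of the client-to-server layout specified in the subsections that follow (a 16-byte header, then little-endian PCM16 samples), a frame could be assembled as sketched here. The function name `buildAudioFrame` is hypothetical. The sketch also makes three assumptions the specification does not state: the user ID is padded with zero bytes, the timestamp is written big-endian like the sequence number, and the timestamp is a millisecond counter that wraps at 32 bits.

```javascript
// Hypothetical helper: builds one client-to-server audio frame.
// Layout: [user ID: 8 bytes][sequence: uint32 BE][timestamp: uint32][PCM16 LE samples]
function buildAudioFrame(userId, sequence, timestampMs, samples) {
  const HEADER_BYTES = 16;
  const idBytes = new TextEncoder().encode(userId);
  if (idBytes.length > 8) {
    throw new RangeError('user ID does not fit in 8 UTF-8 bytes');
  }

  // A new Uint8Array is zero-filled, so unused ID bytes stay 0 (assumed padding).
  const frame = new Uint8Array(HEADER_BYTES + 2 * samples.length);
  const view = new DataView(frame.buffer);

  frame.set(idBytes, 0);                  // bytes 0-7: user ID
  view.setUint32(8, sequence, false);     // bytes 8-11: sequence number, big-endian
  view.setUint32(12, timestampMs, false); // bytes 12-15: timestamp (big-endian assumed)

  // Write each sample explicitly as little-endian, independent of platform byte order.
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(HEADER_BYTES + 2 * i, samples[i], true);
  }
  return frame;
}
```

With an open socket `ws` and an `Int16Array` of samples, a call such as `ws.send(buildAudioFrame('user1', seq, elapsedMs, samples))` would send one chunk. Note that a user ID longer than 8 UTF-8 bytes cannot be represented in this header, so the sketch rejects it rather than truncating.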
### Client → Server (Audio Input)

Binary message structure:

```
[Header (16 bytes)][Audio Data (variable)]
```

**Header Format**:
- Bytes 0-7: User ID (UTF-8, padded)
- Bytes 8-11: Sequence number (uint32, big-endian)
- Bytes 12-15: Timestamp (uint32, milliseconds)

**Audio Data**:
- Format: PCM16 (16-bit signed integer)
- Sample Rate: 16000 Hz (configurable)
- Channels: 1 (mono)
- Byte Order: Little-endian

### Server → Client (Translated Audio)

Binary message structure:

```
[Header (24 bytes)][Audio Data (variable)]
```

**Header Format**:
- Bytes 0-7: Source User ID (UTF-8, padded)
- Bytes 8-15: Target User ID (UTF-8, padded)
- Bytes 16-19: Sequence number (uint32, big-endian)
- Bytes 20-23: Timestamp (uint32, milliseconds)

**Audio Data**:
- Same format as input

## Connection Lifecycle

### 1. Connection Establishment

```
Client                        Server
│                               │
├─────── WebSocket Connect ────►│
│                               │
│◄────── Connection Open ───────┤
│                               │
```

### 2. Room Join

```
Client                        Server
│                               │
├────────── JOIN_ROOM ─────────►│
│                               │
│◄───────── ROOM_JOINED ────────┤
│                               │
│◄───────── USER_JOINED ────────┤ (broadcast to others)
│                               │
```

### 3. Audio Streaming

```
Client                        Server                   Other Client
│                               │                              │
├───── AUDIO_START ────────────►│                              │
│                               │                              │
│◄──── AUDIO_START_ACK ─────────┤                              │
│                               │                              │
├─── Binary Audio Chunk 1 ─────►│                              │
├─── Binary Audio Chunk 2 ─────►│                              │
│                               │                              │
│◄─────── TRANSCRIPTION ────────┤                              │
│                               │                              │
│◄─────── TRANSLATION ──────────┤                              │
│                               │                              │
│                               ├───► Binary Audio Chunk ─────►│
│                               │                              │
├───── AUDIO_STOP ─────────────►│                              │
│                               │                              │
│◄──── AUDIO_STOP_ACK ──────────┤                              │
│                               │                              │
```

### 4. Disconnection

```
Client                        Server
│                               │
├────────── LEAVE_ROOM ────────►│
│                               │
│◄───────── ROOM_LEFT ──────────┤
│                               │
├─────── Close Connection ─────►│
│                               │
│◄─────── Close Confirm ────────┤
│                               │
```

## Rate Limiting

Default limits:
- Join room: 10 requests per minute
- Audio streaming: Unlimited (quality-based throttling)
- Text messages: 30 per minute
- Room info requests: 60 per minute

## Reconnection Strategy

1. **Exponential Backoff**: 1s, 2s, 4s, 8s, 16s, 30s (max)
2. **Session Recovery**: Send previous `room_id` and `user_id` on reconnect
3. **State Sync**: Server sends current room state after reconnection

## Best Practices

### Client Implementation

1. **Always send AUDIO_START before binary audio**
2. **Buffer audio before sending** (minimum 100ms chunks)
3. **Include sequence numbers** for ordering
4. **Handle ERROR messages** gracefully
5. **Implement heartbeat** (PING every 30 seconds)
6. **Reconnect automatically** on disconnect

### Server Implementation

1. **Validate all messages** before processing
2. **Broadcast state changes** to all room members
3. **Clean up resources** on disconnect
4. **Log all errors** with context
5. **Rate limit** per connection and IP

## Example Client Flow (JavaScript)

```javascript
const ws = new WebSocket('ws://localhost:8000/ws');

// Connect
ws.onopen = () => {
  // Join room
  ws.send(JSON.stringify({
    type: 'join_room',
    payload: {
      room_id: 'room123',
      user_id: 'user1',
      username: 'John',
      source_lang: 'en',
      target_lang: 'hi'
    }
  }));
};

// Handle messages
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    // Binary audio data
    handleAudio(event.data);
  } else {
    // JSON message
    const msg = JSON.parse(event.data);
    handleMessage(msg);
  }
};

// Send audio
function sendAudio(audioBuffer) {
  ws.send(audioBuffer);
}
```

## Security Considerations

1. **Always use WSS (WebSocket Secure)** in production
2. **Validate JWT tokens** if authentication enabled
3. **Sanitize user inputs** (usernames, room IDs)
4. **Implement rate limiting** to prevent abuse
5. **Monitor connection count** to prevent DoS
6. **Encrypt sensitive data** in messages
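
## Example Reconnect Sketch (JavaScript)

Returning to the Reconnection Strategy section, the backoff schedule (1s, 2s, 4s, 8s, 16s, then capped at 30s) and the session recovery step might be combined roughly as follows. This is a minimal sketch, not the project's client: `backoffDelayMs`, `connectWithRetry`, `session`, and `onSocket` are hypothetical names, and resending `join_room` as the recovery message is an assumption, since the protocol only says that the previous `room_id` and `user_id` are sent on reconnect.

```javascript
// Delay in milliseconds before reconnect attempt `attempt` (0-based):
// doubles from 1s and is capped at 30s.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Hypothetical reconnect loop. `session` holds the join_room payload used
// before the disconnect; `onSocket` lets the caller attach its own message
// handlers to every new socket (it should leave onopen and onclose alone).
function connectWithRetry(url, session, onSocket, attempt = 0) {
  const ws = new WebSocket(url);
  let nextAttempt = attempt;

  ws.onopen = () => {
    nextAttempt = 0; // a successful connection resets the schedule
    // Assumed recovery message: rejoin with the previous identifiers.
    ws.send(JSON.stringify({ type: 'join_room', payload: session }));
  };

  ws.onclose = () => {
    // Retries on every close; a real client would skip this after a
    // deliberate LEAVE_ROOM and close.
    setTimeout(
      () => connectWithRetry(url, session, onSocket, nextAttempt + 1),
      backoffDelayMs(nextAttempt)
    );
  };

  onSocket(ws);
  return ws;
}
```

Keeping the delay calculation in its own pure function makes the schedule easy to check in isolation: attempts 0 through 5 yield 1000, 2000, 4000, 8000, 16000, and 30000 ms.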