# WebSocket Protocol
## Overview
The Voice-to-Voice Translator uses a WebSocket-based protocol for real-time, bidirectional communication between clients and the server. Control messages are sent as JSON in text frames; audio is sent as raw data in binary frames.
## Connection Endpoint
```
ws://host:port/ws
```
### Connection Parameters
Query parameters (optional):
- `token`: JWT authentication token (if authentication is enabled)
- `client_id`: Unique client identifier
Example:
```
ws://localhost:8000/ws?token=eyJhbGc...&client_id=client123
```
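As a minimal sketch, the URL above can be assembled with the standard `URL` API so that parameter values are percent-encoded correctly. The helper name `buildWsUrl` and its option names are hypothetical and not part of the protocol.

```javascript
// Hypothetical helper: appends the optional query parameters to the endpoint.
function buildWsUrl(base, { token, clientId } = {}) {
  const url = new URL(base);
  if (token) url.searchParams.set('token', token);          // only when auth is enabled
  if (clientId) url.searchParams.set('client_id', clientId);
  return url.toString();
}

const endpoint = buildWsUrl('ws://localhost:8000/ws', { clientId: 'client123' });
// const ws = new WebSocket(endpoint);
```

Both parameters are optional, so calling the helper with only the base URL returns the bare endpoint.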
## Message Types
All text messages follow this structure:
```json
{
  "type": "message_type",
  "payload": { ... },
  "timestamp": "2025-12-17T10:30:00Z",
  "message_id": "uuid-v4"
}
```
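The per-message examples in this document show only `type` and `payload`; whether the server requires `timestamp` and `message_id` on every message is not stated here. A minimal sketch of a helper that fills in the full envelope might look like the following. The name `buildMessage` and the injectable `newId` parameter are hypothetical, and the sketch assumes the server accepts ISO 8601 timestamps with fractional seconds.

```javascript
// Hypothetical helper: wraps a payload in the envelope structure shown above.
// `newId` defaults to crypto.randomUUID(), which returns a version 4 UUID in
// browsers (secure contexts) and recent Node.js releases.
function buildMessage(type, payload = {}, newId = () => crypto.randomUUID()) {
  return {
    type,
    payload,
    timestamp: new Date().toISOString(), // UTC, e.g. "2025-12-17T10:30:00.000Z"
    message_id: newId(),
  };
}

// Usage, assuming `ws` is an open WebSocket:
// ws.send(JSON.stringify(buildMessage('ping')));
```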
### Client → Server Messages
#### 1. JOIN_ROOM
Join or create a translation room.
```json
{
  "type": "join_room",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "username": "John Doe",
    "source_lang": "en",
    "target_lang": "hi"
  }
}
```
**Fields**:
- `room_id`: Unique room identifier
- `user_id`: Unique user identifier
- `username`: Display name
- `source_lang`: User's speaking language (ISO 639-1 code)
- `target_lang`: Language the user wants to receive translations in (ISO 639-1 code)
**Response**: `ROOM_JOINED` or `ERROR`
#### 2. LEAVE_ROOM
Leave current room.
```json
{
  "type": "leave_room",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}
```
**Response**: `ROOM_LEFT` or `ERROR`
#### 3. AUDIO_START
Signal start of audio stream.
```json
{
  "type": "audio_start",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "audio_config": {
      "sample_rate": 16000,
      "channels": 1,
      "format": "PCM16",
      "chunk_size": 4096
    }
  }
}
```
**Response**: `AUDIO_START_ACK`
#### 4. AUDIO_STOP
Signal end of audio stream.
```json
{
  "type": "audio_stop",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}
```
**Response**: `AUDIO_STOP_ACK`
#### 5. TEXT_MESSAGE
Send text message (for chat or corrections).
```json
{
  "type": "text_message",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "text": "Hello, how are you?",
    "lang": "en"
  }
}
```
**Response**: `TEXT_MESSAGE` (broadcast to room)
#### 6. PING
Heartbeat message.
```json
{
  "type": "ping",
  "payload": {}
}
```
**Response**: `PONG`
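A heartbeat along these lines could be driven by a timer on the client. The sketch below uses the 30-second interval recommended under Best Practices; the helper names `sendPing` and `startHeartbeat` are hypothetical, and `ws` is assumed to be a standard `WebSocket` (or any object with `readyState` and `send`).

```javascript
const WS_OPEN = 1; // numeric value of WebSocket.OPEN

// Sends a single PING if the socket is open; returns whether it was sent.
function sendPing(ws) {
  if (ws.readyState !== WS_OPEN) return false;
  ws.send(JSON.stringify({ type: 'ping', payload: {} }));
  return true;
}

// Sends a PING every `intervalMs` and returns a function that stops the timer.
function startHeartbeat(ws, intervalMs = 30000) {
  const timer = setInterval(() => sendPing(ws), intervalMs);
  return () => clearInterval(timer);
}

// Usage: const stop = startHeartbeat(ws); ws.onclose = stop;
```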
#### 7. GET_ROOM_INFO
Request current room state.
```json
{
  "type": "get_room_info",
  "payload": {
    "room_id": "room123"
  }
}
```
**Response**: `ROOM_INFO`
### Server → Client Messages
#### 1. ROOM_JOINED
Confirmation of room join.
```json
{
  "type": "room_joined",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "users": [
      {
        "user_id": "user1",
        "username": "John Doe",
        "source_lang": "en",
        "target_lang": "hi"
      },
      {
        "user_id": "user2",
        "username": "Jane Smith",
        "source_lang": "hi",
        "target_lang": "en"
      }
    ]
  }
}
```
#### 2. ROOM_LEFT
Confirmation of room leave.
```json
{
  "type": "room_left",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}
```
#### 3. USER_JOINED
Broadcast when another user joins.
```json
{
  "type": "user_joined",
  "payload": {
    "room_id": "room123",
    "user": {
      "user_id": "user2",
      "username": "Jane Smith",
      "source_lang": "hi",
      "target_lang": "en"
    }
  }
}
```
#### 4. USER_LEFT
Broadcast when another user leaves.
```json
{
  "type": "user_left",
  "payload": {
    "room_id": "room123",
    "user_id": "user2",
    "username": "Jane Smith"
  }
}
```
#### 5. TRANSCRIPTION
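Transcription result from speech-to-text (STT). Interim results have `is_final` set to `false`; the final result for an utterance has it set to `true`.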
```json
{
  "type": "transcription",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "text": "Hello how are you",
    "lang": "en",
    "is_final": false,
    "confidence": 0.85
  }
}
```
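One way a client might consume these messages is to keep only the latest interim text per speaker and commit a result once `is_final` is `true`. The sketch below assumes each interim result replaces the previous one rather than extending it, which this document does not specify; `createTranscriptStore` is a hypothetical name.

```javascript
// Hypothetical client-side store for TRANSCRIPTION payloads.
function createTranscriptStore() {
  const interim = new Map(); // user_id -> latest interim text
  const finals = [];         // committed final results, in arrival order
  return {
    apply(payload) {
      if (payload.is_final) {
        interim.delete(payload.user_id);
        finals.push({ user_id: payload.user_id, text: payload.text, lang: payload.lang });
      } else {
        interim.set(payload.user_id, payload.text); // assumed: replaces earlier interim text
      }
    },
    interimFor: (userId) => interim.get(userId) ?? null,
    finals: () => finals.slice(),
  };
}
```

A UI could render `interimFor(userId)` as provisional text and append entries from `finals()` to the conversation history.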
#### 6. TRANSLATION
Translation result.
```json
{
  "type": "translation",
  "payload": {
    "room_id": "room123",
    "source_user_id": "user1",
    "target_user_id": "user2",
    "original_text": "Hello, how are you?",
    "translated_text": "नमस्ते, आप कैसे हैं?",
    "source_lang": "en",
    "target_lang": "hi",
    "confidence": 0.92
  }
}
```
#### 7. AUDIO_START_ACK
Acknowledgment of audio start.
```json
{
  "type": "audio_start_ack",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "ready": true
  }
}
```
#### 8. AUDIO_STOP_ACK
Acknowledgment of audio stop.
```json
{
  "type": "audio_stop_ack",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}
```
#### 9. PONG
Heartbeat response.
```json
{
  "type": "pong",
  "payload": {
    "timestamp": "2025-12-17T10:30:00Z"
  }
}
```
#### 10. ROOM_INFO
Room state information.
```json
{
  "type": "room_info",
  "payload": {
    "room_id": "room123",
    "created_at": "2025-12-17T10:00:00Z",
    "users": [ ... ],
    "active_speakers": ["user1"],
    "supported_languages": ["en", "hi"]
  }
}
```
#### 11. ERROR
Error notification.
```json
{
  "type": "error",
  "payload": {
    "code": "INVALID_ROOM",
    "message": "Room does not exist",
    "details": "Room 'room123' not found",
    "recoverable": true
  }
}
```
**Error Codes**:
- `INVALID_ROOM`: Room not found
- `ROOM_FULL`: Maximum users reached
- `INVALID_MESSAGE`: Malformed message
- `AUTH_FAILED`: Authentication failed
- `RATE_LIMIT`: Too many requests
- `INTERNAL_ERROR`: Server error
- `UNSUPPORTED_LANGUAGE`: Language not available
- `AUDIO_ERROR`: Audio processing error
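A client could use `code` together with `recoverable` to decide how to react. The mapping below is only an illustrative policy; the action names and the choice of reaction per code are assumptions and are not defined by the protocol.

```javascript
// Hypothetical policy: maps an ERROR payload to a client-side action.
function actionForError(payload) {
  if (!payload.recoverable) return 'close';    // give up and close the connection
  switch (payload.code) {
    case 'RATE_LIMIT':
      return 'retry_later';                    // back off before resending
    case 'AUTH_FAILED':
      return 'reauthenticate';                 // obtain a fresh token
    case 'INVALID_ROOM':
    case 'ROOM_FULL':
    case 'UNSUPPORTED_LANGUAGE':
      return 'prompt_user';                    // needs a different room or language
    default:
      return 'notify';                         // surface the message to the user
  }
}
```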
## Binary Audio Messages
Audio data is sent as binary WebSocket frames.
### Client → Server (Audio Input)
Binary message structure:
```
[Header (16 bytes)][Audio Data (variable)]
```
**Header Format**:
- Bytes 0-7: User ID (UTF-8, padded)
- Bytes 8-11: Sequence number (uint32, big-endian)
- Bytes 12-15: Timestamp (uint32, milliseconds)
**Audio Data**:
- Format: PCM16 (16-bit signed integer)
- Sample Rate: 16000 Hz (configurable)
- Channels: 1 (mono)
- Byte Order: Little-endian
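The layout above can be sketched with `DataView`. Two details are not stated in this document and are assumptions here: the user ID is padded with zero bytes (and must fit in 8 UTF-8 bytes), and the timestamp uses the same big-endian byte order as the sequence number. The function name `packAudioFrame` is hypothetical.

```javascript
// Builds one client audio frame: 16-byte header followed by PCM16 samples.
// `pcm` is an Int16Array; typed arrays use the platform byte order, which is
// little-endian on practically every platform that runs JavaScript.
function packAudioFrame(userId, seq, timestampMs, pcm) {
  const idBytes = new TextEncoder().encode(userId);
  if (idBytes.length > 8) throw new RangeError('user ID exceeds 8 UTF-8 bytes');
  const frame = new Uint8Array(16 + pcm.byteLength);
  frame.set(idBytes, 0);                          // bytes 0-7 (assumed zero padding)
  const view = new DataView(frame.buffer);
  view.setUint32(8, seq >>> 0, false);            // bytes 8-11, big-endian
  view.setUint32(12, timestampMs >>> 0, false);   // bytes 12-15 (assumed big-endian)
  frame.set(new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength), 16);
  return frame;
}

// Usage: ws.send(packAudioFrame('user1', seq++, elapsedMs, samples));
```

Because the timestamp is a `uint32`, values are truncated modulo 2^32; a counter relative to the start of the stream avoids wrapping for roughly 49 days.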
### Server → Client (Translated Audio)
Binary message structure:
```
[Header (24 bytes)][Audio Data (variable)]
```
**Header Format**:
- Bytes 0-7: Source User ID (UTF-8, padded)
- Bytes 8-15: Target User ID (UTF-8, padded)
- Bytes 16-19: Sequence number (uint32, big-endian)
- Bytes 20-23: Timestamp (uint32, milliseconds)
**Audio Data**:
- Same format as the client input: PCM16, 16000 Hz (configurable), mono, little-endian
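On the receiving side, the 24-byte header could be decoded as sketched below. This assumes the socket delivers binary frames as `ArrayBuffer` (`ws.binaryType = 'arraybuffer'`), that user IDs are padded with zero bytes, and that the timestamp is big-endian; none of these is confirmed by this document. `parseTranslatedAudio` is a hypothetical name.

```javascript
// Parses one server audio frame into header fields and PCM16 samples.
function parseTranslatedAudio(buffer) {
  if (buffer.byteLength < 24) throw new RangeError('frame shorter than 24-byte header');
  const bytes = new Uint8Array(buffer);
  const view = new DataView(buffer);
  const decoder = new TextDecoder('utf-8');
  // Decode an 8-byte ID field and strip the assumed trailing zero padding
  const readId = (start) =>
    decoder.decode(bytes.subarray(start, start + 8)).replace(/\0+$/, '');
  // Ignore a trailing odd byte so the payload is a whole number of samples
  const payloadBytes = (buffer.byteLength - 24) & ~1;
  return {
    sourceUserId: readId(0),
    targetUserId: readId(8),
    seq: view.getUint32(16, false),          // big-endian
    timestampMs: view.getUint32(20, false),  // assumed big-endian
    // slice() copies the payload, so the Int16Array starts at an aligned offset
    samples: new Int16Array(buffer.slice(24, 24 + payloadBytes)),
  };
}
```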
## Connection Lifecycle
### 1. Connection Establishment
```
Client                          Server
  │                               │
  ├─────── WebSocket Connect ────►│
  │                               │
  │◄────── Connection Open ───────┤
  │                               │
```
### 2. Room Join
```
Client                          Server
  │                               │
  ├────────── JOIN_ROOM ─────────►│
  │                               │
  │◄───────── ROOM_JOINED ────────┤
  │                               │
  │◄───────── USER_JOINED ────────┤ (broadcast to others)
  │                               │
```
### 3. Audio Streaming
```
Client                          Server                      Other Client
  │                               │                              │
  ├───── AUDIO_START ────────────►│                              │
  │                               │                              │
  │◄──── AUDIO_START_ACK ─────────┤                              │
  │                               │                              │
  ├─── Binary Audio Chunk 1 ─────►│                              │
  ├─── Binary Audio Chunk 2 ─────►│                              │
  │                               │                              │
  │◄─────── TRANSCRIPTION ────────┤                              │
  │                               │                              │
  │◄─────── TRANSLATION ──────────┤                              │
  │                               │                              │
  │                               ├───► Binary Audio Chunk ─────►│
  │                               │                              │
  ├───── AUDIO_STOP ─────────────►│                              │
  │                               │                              │
  │◄──── AUDIO_STOP_ACK ──────────┤                              │
  │                               │                              │
```
### 4. Disconnection
```
Client                          Server
  │                               │
  ├────────── LEAVE_ROOM ────────►│
  │                               │
  │◄───────── ROOM_LEFT ──────────┤
  │                               │
  ├─────── Close Connection ─────►│
  │                               │
  │◄─────── Close Confirm ────────┤
  │                               │
```
## Rate Limiting
Default limits:
- Join room: 10 requests per minute
- Audio streaming: Unlimited (quality-based throttling)
- Text messages: 30 per minute
- Room info requests: 60 per minute
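A client can avoid `RATE_LIMIT` errors by throttling itself before sending. The sketch below is a simple sliding-window limiter; whether the server counts requests over a sliding or a fixed window is not stated, so this is an approximation. `createRateLimiter` and its injectable `now` clock are hypothetical.

```javascript
// Returns a function that reports whether another request may be sent now.
function createRateLimiter(limit, windowMs, now = () => Date.now()) {
  let sent = []; // timestamps of requests inside the current window
  return function tryAcquire() {
    const t = now();
    sent = sent.filter((s) => t - s < windowMs); // drop entries older than the window
    if (sent.length >= limit) return false;
    sent.push(t);
    return true;
  };
}

// Matches the default text message limit of 30 per minute
const canSendText = createRateLimiter(30, 60000);
// if (canSendText()) ws.send(JSON.stringify(textMessage));
```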
## Reconnection Strategy
1. **Exponential Backoff**: 1s, 2s, 4s, 8s, 16s, 30s (max)
2. **Session Recovery**: Send previous `room_id` and `user_id` on reconnect
3. **State Sync**: Server sends current room state after reconnection
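The backoff schedule in step 1 corresponds to doubling a 1-second base delay and capping it at 30 seconds. A minimal sketch, with the hypothetical helper name `backoffDelayMs` and a zero-based `attempt` counter:

```javascript
// Delay before reconnect attempt `attempt` (0-based): 1s, 2s, 4s, 8s, 16s, then 30s.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Usage sketch: schedule the next attempt after a close event
// ws.onclose = () => setTimeout(connect, backoffDelayMs(attempt++));
```

After a successful reconnect the client would reset `attempt` to 0 and resend `join_room` with its previous `room_id` and `user_id`, as described in step 2.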
## Best Practices
### Client Implementation
1. **Always send AUDIO_START before binary audio**
2. **Buffer audio before sending** (minimum 100ms chunks)
3. **Include sequence numbers** for ordering
4. **Handle ERROR messages** gracefully
5. **Implement heartbeat** (PING every 30 seconds)
6. **Reconnect automatically** on disconnect
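To make the 100 ms buffering guideline concrete: at 16000 Hz, mono, PCM16 (2 bytes per sample), 100 ms of audio is 1600 samples, or 3200 bytes. If `chunk_size` in `audio_config` is measured in bytes (its unit is not stated in this document), the example value of 4096 corresponds to 128 ms. The helper names below are hypothetical.

```javascript
// Bytes needed to hold `durationMs` of audio.
function minChunkBytes(sampleRate, channels, bytesPerSample, durationMs) {
  return Math.ceil((sampleRate * durationMs) / 1000) * channels * bytesPerSample;
}

// Duration in milliseconds of a chunk of `bytes` bytes.
function chunkDurationMs(bytes, sampleRate, channels, bytesPerSample) {
  return (bytes * 1000) / (sampleRate * channels * bytesPerSample);
}

console.log(minChunkBytes(16000, 1, 2, 100));    // 3200
console.log(chunkDurationMs(4096, 16000, 1, 2)); // 128
```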
### Server Implementation
1. **Validate all messages** before processing
2. **Broadcast state changes** to all room members
3. **Clean up resources** on disconnect
4. **Log all errors** with context
5. **Rate limit** per connection and IP
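Message validation (item 1) might start with a structural check like the one below, shown in JavaScript for consistency with the client example; the server's actual implementation language and validation rules are not described here. The function name and the shape of the result object are hypothetical, and the set of types is taken from the Client → Server list above.

```javascript
const CLIENT_MESSAGE_TYPES = new Set([
  'join_room', 'leave_room', 'audio_start', 'audio_stop',
  'text_message', 'ping', 'get_room_info',
]);

const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);

// Checks that a text frame is JSON with a known `type` and an object `payload`.
function validateClientMessage(raw) {
  const fail = (reason) => ({ ok: false, code: 'INVALID_MESSAGE', reason });
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return fail('not valid JSON');
  }
  if (!isPlainObject(msg)) return fail('message must be a JSON object');
  if (!CLIENT_MESSAGE_TYPES.has(msg.type)) return fail('unknown message type');
  if (!isPlainObject(msg.payload)) return fail('payload must be an object');
  return { ok: true, msg };
}
```

Per-type field checks (for example, requiring `room_id` in `join_room`) would follow this structural check.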
## Example Client Flow (JavaScript)
```javascript
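const ws = new WebSocket('ws://localhost:8000/ws');
ws.binaryType = 'arraybuffer'; // deliver binary frames as ArrayBuffer instead of Blob

// Join a room as soon as the connection opens
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'join_room',
    payload: {
      room_id: 'room123',
      user_id: 'user1',
      username: 'John',
      source_lang: 'en',
      target_lang: 'hi'
    }
  }));
};

// Placeholder handlers; replace with application logic
function handleAudio(buffer) {
  console.log(`Received ${buffer.byteLength} bytes of translated audio`);
}

function handleMessage(msg) {
  console.log(`Received ${msg.type}`, msg.payload);
}

// Route incoming frames: binary frames carry audio, text frames carry JSON
ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    handleAudio(event.data);
  } else {
    handleMessage(JSON.parse(event.data));
  }
};

// Send one binary audio frame (header + PCM16 data).
// An audio_start message must be sent before the first frame.
function sendAudio(audioBuffer) {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(audioBuffer);
  }
}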
```
## Security Considerations
1. **Always use WSS (WebSocket Secure)** in production
2. **Validate JWT tokens** if authentication is enabled
3. **Sanitize user inputs** (usernames, room IDs)
4. **Implement rate limiting** to prevent abuse
5. **Monitor connection count** to prevent DoS
6. **Encrypt sensitive data** in messages