WebSocket Protocol

Overview

The Voice-to-Voice Translator uses a WebSocket-based protocol for real-time bidirectional communication between clients and the server. Control messages are sent as JSON text frames, and audio is sent as raw binary frames.

Connection Endpoint

ws://host:port/ws

Connection Parameters

Query parameters (optional):

  • token: JWT authentication token (if auth enabled)
  • client_id: Unique client identifier

Example:

ws://localhost:8000/ws?token=eyJhbGc...&client_id=client123
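A minimal sketch of building the connection URL from the optional query parameters. The helper name `buildWsUrl` and its option names are assumptions for illustration, not part of the protocol; only `token` and `client_id` come from the list above.

```javascript
// Hypothetical helper: builds the endpoint URL with optional query parameters.
function buildWsUrl(base, { token, clientId } = {}) {
  const url = new URL(base);
  // Both parameters are optional, so only the provided ones are added.
  if (token) url.searchParams.set('token', token);
  if (clientId) url.searchParams.set('client_id', clientId);
  return url.toString();
}

const exampleUrl = buildWsUrl('ws://localhost:8000/ws', { clientId: 'client123' });
// exampleUrl is 'ws://localhost:8000/ws?client_id=client123'
```

Using `URL` and `searchParams` rather than string concatenation keeps token values correctly percent-encoded.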

Message Types

All text messages follow this structure:

{
  "type": "message_type",
  "payload": { ... },
  "timestamp": "2025-12-17T10:30:00Z",
  "message_id": "uuid-v4"
}
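The envelope above could be produced by a small helper such as the following sketch. The name `makeMessage` and the injectable `idFn` parameter are assumptions for illustration. Note that `toISOString()` includes milliseconds, unlike the sample timestamp; whether the server accepts both forms is not stated here.

```javascript
// Hypothetical helper: wraps a payload in the common message envelope.
// `idFn` supplies the message_id; the default relies on a global
// crypto.randomUUID(), which is available in modern browsers and recent Node.
function makeMessage(type, payload = {}, idFn = () => crypto.randomUUID()) {
  return JSON.stringify({
    type,
    payload,
    timestamp: new Date().toISOString(), // ISO 8601, UTC
    message_id: idFn(),
  });
}

// Usage (not executed here): ws.send(makeMessage('ping'));
```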

Client → Server Messages

1. JOIN_ROOM

Join or create a translation room.

{
  "type": "join_room",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "username": "John Doe",
    "source_lang": "en",
    "target_lang": "hi"
  }
}

Fields:

  • room_id: Unique room identifier
  • user_id: Unique user identifier
  • username: Display name
  • source_lang: User's speaking language (ISO 639-1 code)
  • target_lang: Desired translation language

Response: ROOM_JOINED or ERROR
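As a sketch, a client might check a JOIN_ROOM payload before sending it. The helper `validateJoinPayload` is hypothetical. It assumes all five fields are required and that `target_lang` uses the same two-letter code format as `source_lang`; neither is stated explicitly above.

```javascript
const JOIN_FIELDS = ['room_id', 'user_id', 'username', 'source_lang', 'target_lang'];

// Hypothetical client-side check; returns a list of problems (empty if none).
function validateJoinPayload(payload) {
  const errors = [];
  for (const field of JOIN_FIELDS) {
    if (typeof payload[field] !== 'string' || payload[field].length === 0) {
      errors.push(`${field} must be a non-empty string`);
    }
  }
  // ISO 639-1 codes are two letters; this checks the shape only,
  // not whether the server supports the language.
  for (const field of ['source_lang', 'target_lang']) {
    const value = payload[field];
    if (typeof value === 'string' && value.length > 0 && !/^[a-z]{2}$/.test(value)) {
      errors.push(`${field} should be a two-letter ISO 639-1 code`);
    }
  }
  return errors;
}
```

The server may still reject a payload that passes this check, for example with UNSUPPORTED_LANGUAGE.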

2. LEAVE_ROOM

Leave current room.

{
  "type": "leave_room",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}

Response: ROOM_LEFT or ERROR

3. AUDIO_START

Signal start of audio stream.

{
  "type": "audio_start",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "audio_config": {
      "sample_rate": 16000,
      "channels": 1,
      "format": "PCM16",
      "chunk_size": 4096
    }
  }
}

Response: AUDIO_START_ACK

4. AUDIO_STOP

Signal end of audio stream.

{
  "type": "audio_stop",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}

Response: AUDIO_STOP_ACK

5. TEXT_MESSAGE

Send text message (for chat or corrections).

{
  "type": "text_message",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "text": "Hello, how are you?",
    "lang": "en"
  }
}

Response: TEXT_MESSAGE (broadcast to room)

6. PING

Heartbeat message.

{
  "type": "ping",
  "payload": {}
}

Response: PONG
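A heartbeat loop could be sketched as follows. The helper names `sendPing` and `startHeartbeat` are assumptions; the 30-second default follows the interval recommended under Best Practices. The `socket` argument only needs a `send()` method, so a WebSocket instance works.

```javascript
// Sends a single PING message over the given socket.
function sendPing(socket) {
  socket.send(JSON.stringify({ type: 'ping', payload: {} }));
}

// Hypothetical helper: sends PING every `intervalMs` milliseconds.
// Returns a function that stops the heartbeat.
function startHeartbeat(socket, intervalMs = 30000) {
  const timer = setInterval(() => sendPing(socket), intervalMs);
  return () => clearInterval(timer);
}

// Usage (not executed here):
//   const stopHeartbeat = startHeartbeat(ws);
//   ws.onclose = () => stopHeartbeat();
```

A fuller client would probably also track the time of the last PONG and treat a long silence as a dropped connection; the timeout to use is not specified in this document.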

7. GET_ROOM_INFO

Request current room state.

{
  "type": "get_room_info",
  "payload": {
    "room_id": "room123"
  }
}

Response: ROOM_INFO

Server → Client Messages

1. ROOM_JOINED

Confirmation of room join.

{
  "type": "room_joined",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "users": [
      {
        "user_id": "user1",
        "username": "John Doe",
        "source_lang": "en",
        "target_lang": "hi"
      },
      {
        "user_id": "user2",
        "username": "Jane Smith",
        "source_lang": "hi",
        "target_lang": "en"
      }
    ]
  }
}

2. ROOM_LEFT

Confirmation of room leave.

{
  "type": "room_left",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}

3. USER_JOINED

Broadcast when another user joins.

{
  "type": "user_joined",
  "payload": {
    "room_id": "room123",
    "user": {
      "user_id": "user2",
      "username": "Jane Smith",
      "source_lang": "hi",
      "target_lang": "en"
    }
  }
}

4. USER_LEFT

Broadcast when another user leaves.

{
  "type": "user_left",
  "payload": {
    "room_id": "room123",
    "user_id": "user2",
    "username": "Jane Smith"
  }
}

5. TRANSCRIPTION

Transcription result from speech-to-text (STT). The is_final field marks whether the result is interim (false) or final.

{
  "type": "transcription",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "text": "Hello how are you",
    "lang": "en",
    "is_final": false,
    "confidence": 0.85
  }
}

6. TRANSLATION

Translation result.

{
  "type": "translation",
  "payload": {
    "room_id": "room123",
    "source_user_id": "user1",
    "target_user_id": "user2",
    "original_text": "Hello, how are you?",
    "translated_text": "नमस्ते, आप कैसे हैं?",
    "source_lang": "en",
    "target_lang": "hi",
    "confidence": 0.92
  }
}

7. AUDIO_START_ACK

Acknowledgment of audio start.

{
  "type": "audio_start_ack",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "ready": true
  }
}

8. AUDIO_STOP_ACK

Acknowledgment of audio stop.

{
  "type": "audio_stop_ack",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}

9. PONG

Heartbeat response.

{
  "type": "pong",
  "payload": {
    "timestamp": "2025-12-17T10:30:00Z"
  }
}

10. ROOM_INFO

Room state information.

{
  "type": "room_info",
  "payload": {
    "room_id": "room123",
    "created_at": "2025-12-17T10:00:00Z",
    "users": [ ... ],
    "active_speakers": ["user1"],
    "supported_languages": ["en", "hi"]
  }
}

11. ERROR

Error notification.

{
  "type": "error",
  "payload": {
    "code": "INVALID_ROOM",
    "message": "Room does not exist",
    "details": "Room 'room123' not found",
    "recoverable": true
  }
}

Error Codes:

  • INVALID_ROOM: Room not found
  • ROOM_FULL: Maximum users reached
  • INVALID_MESSAGE: Malformed message
  • AUTH_FAILED: Authentication failed
  • RATE_LIMIT: Too many requests
  • INTERNAL_ERROR: Server error
  • UNSUPPORTED_LANGUAGE: Language not available
  • AUDIO_ERROR: Audio processing error

Binary Audio Messages

Audio data is sent as binary WebSocket frames.

Client → Server (Audio Input)

Binary message structure:

[Header (16 bytes)][Audio Data (variable)]

Header Format:

  • Bytes 0-7: User ID (UTF-8, padded)
  • Bytes 8-11: Sequence number (uint32, big-endian)
  • Bytes 12-15: Timestamp (uint32, milliseconds)

Audio Data:

  • Format: PCM16 (16-bit signed integer)
  • Sample Rate: 16000 Hz (configurable)
  • Channels: 1 (mono)
  • Byte Order: Little-endian
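The 16-byte header could be packed as in the sketch below. The function name `encodeAudioFrame` is hypothetical. Two details are assumptions because the layout above does not state them: the user ID is padded with zero bytes, and the timestamp is big-endian like the sequence number.

```javascript
// Hypothetical encoder for a client-to-server audio frame.
// `pcmBytes` is a Uint8Array of little-endian PCM16 samples.
function encodeAudioFrame(userId, sequence, timestampMs, pcmBytes) {
  const idBytes = new TextEncoder().encode(userId);
  if (idBytes.length > 8) {
    throw new RangeError('user ID exceeds 8 bytes when encoded as UTF-8');
  }
  const frame = new Uint8Array(16 + pcmBytes.length);
  frame.set(idBytes, 0); // bytes 0-7; unused bytes stay zero (assumed padding)
  const view = new DataView(frame.buffer);
  view.setUint32(8, sequence >>> 0, false);     // bytes 8-11, big-endian
  view.setUint32(12, timestampMs >>> 0, false); // bytes 12-15, assumed big-endian
  frame.set(pcmBytes, 16);                      // audio data follows the header
  return frame;
}
```

The `>>> 0` conversion wraps values modulo 2^32, so a wall-clock millisecond timestamp would be truncated; a timestamp relative to the start of the stream fits more naturally in a uint32.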

Server → Client (Translated Audio)

Binary message structure:

[Header (24 bytes)][Audio Data (variable)]

Header Format:

  • Bytes 0-7: Source User ID (UTF-8, padded)
  • Bytes 8-15: Target User ID (UTF-8, padded)
  • Bytes 16-19: Sequence number (uint32, big-endian)
  • Bytes 20-23: Timestamp (uint32, milliseconds)

Audio Data:

  • Same format as input
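A receiving client could unpack the 24-byte header along these lines. The function name `decodeAudioFrame` and the returned property names are hypothetical. As with the input frame, zero-byte padding of the IDs and a big-endian timestamp are assumptions.

```javascript
// Hypothetical decoder for a server-to-client audio frame.
// `buffer` is an ArrayBuffer, e.g. from `await blob.arrayBuffer()`.
function decodeAudioFrame(buffer) {
  if (buffer.byteLength < 24) {
    throw new RangeError('frame is shorter than the 24-byte header');
  }
  const bytes = new Uint8Array(buffer);
  const view = new DataView(buffer);
  // Reads an 8-byte ID field and drops the assumed zero padding.
  const readId = (start) => {
    const raw = bytes.subarray(start, start + 8);
    const end = raw.indexOf(0);
    return new TextDecoder().decode(end === -1 ? raw : raw.subarray(0, end));
  };
  return {
    sourceUserId: readId(0),                // bytes 0-7
    targetUserId: readId(8),                // bytes 8-15
    sequence: view.getUint32(16, false),    // bytes 16-19, big-endian
    timestampMs: view.getUint32(20, false), // bytes 20-23, assumed big-endian
    audio: bytes.subarray(24),              // PCM16 audio data
  };
}
```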

Connection Lifecycle

1. Connection Establishment

Client                          Server
  │                               │
  ├─────── WebSocket Connect ────►│
  │                               │
  │◄────── Connection Open ───────┤
  │                               │

2. Room Join

Client                          Server
  │                               │
  ├────────── JOIN_ROOM ─────────►│
  │                               │
  │◄───────── ROOM_JOINED ────────┤
  │                               │
  │◄───────── USER_JOINED ────────┤ (broadcast to others)
  │                               │

3. Audio Streaming

Client                          Server                     Other Client
  │                               │                              │
  ├───── AUDIO_START ────────────►│                              │
  │                               │                              │
  │◄──── AUDIO_START_ACK ─────────┤                              │
  │                               │                              │
  ├─── Binary Audio Chunk 1 ─────►│                              │
  ├─── Binary Audio Chunk 2 ─────►│                              │
  │                               │                              │
  │◄─────── TRANSCRIPTION ────────┤                              │
  │                               │                              │
  │◄─────── TRANSLATION ──────────┤                              │
  │                               │                              │
  │                               ├───► Binary Audio Chunk ─────►│
  │                               │                              │
  ├───── AUDIO_STOP ─────────────►│                              │
  │                               │                              │
  │◄──── AUDIO_STOP_ACK ──────────┤                              │
  │                               │                              │

4. Disconnection

Client                          Server
  │                               │
  ├────────── LEAVE_ROOM ────────►│
  │                               │
  │◄───────── ROOM_LEFT ──────────┤
  │                               │
  ├─────── Close Connection ─────►│
  │                               │
  │◄─────── Close Confirm ────────┤
  │                               │

Rate Limiting

Default limits:

  • Join room: 10 requests per minute
  • Audio streaming: Unlimited (quality-based throttling)
  • Text messages: 30 per minute
  • Room info requests: 60 per minute
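To stay under these limits, a client might pace its own requests. The sketch below uses a sliding one-minute window; whether the server counts requests in sliding or fixed windows is not stated, so this is an assumption, and `createRateLimiter` is a hypothetical helper.

```javascript
// Hypothetical client-side limiter: allows at most `limit` calls per window.
function createRateLimiter(limit, windowMs = 60000) {
  const stamps = []; // send times still inside the window, oldest first
  return function allow(now = Date.now()) {
    // Drop send times that have aged out of the window.
    while (stamps.length > 0 && now - stamps[0] >= windowMs) stamps.shift();
    if (stamps.length >= limit) return false;
    stamps.push(now);
    return true;
  };
}

// Text messages: 30 per minute, per the defaults above.
const allowTextMessage = createRateLimiter(30);
// Usage (not executed here): if (allowTextMessage()) ws.send(textMessage);
```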

Reconnection Strategy

  1. Exponential Backoff: 1s, 2s, 4s, 8s, 16s, 30s (max)
  2. Session Recovery: Send previous room_id and user_id on reconnect
  3. State Sync: Server sends current room state after reconnection
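The backoff schedule in step 1 can be computed by doubling from one second and capping at 30 seconds. This is a minimal sketch; the function name `backoffDelayMs` is an assumption.

```javascript
// Delay before reconnect attempt number `attempt` (0-based), in milliseconds.
function backoffDelayMs(attempt) {
  return Math.min(1000 * 2 ** attempt, 30000);
}

const schedule = [0, 1, 2, 3, 4, 5, 6].map(backoffDelayMs);
// schedule is [1000, 2000, 4000, 8000, 16000, 30000, 30000]
```

Resetting the attempt counter after a successful connection keeps later disconnects from starting at the maximum delay.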

Best Practices

Client Implementation

  1. Always send AUDIO_START before binary audio
  2. Buffer audio before sending (minimum 100ms chunks)
  3. Include sequence numbers for ordering
  4. Handle ERROR messages gracefully
  5. Implement heartbeat (PING every 30 seconds)
  6. Reconnect automatically on disconnect
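The 100 ms minimum in item 2 translates into a byte count that depends on the audio configuration. The sketch below works through the arithmetic for the default PCM16 format; `minChunkBytes` is a hypothetical helper.

```javascript
// Smallest chunk, in bytes, that holds at least `minMs` of audio.
function minChunkBytes(sampleRate, channels, bytesPerSample = 2, minMs = 100) {
  const samplesPerChannel = Math.ceil((sampleRate * minMs) / 1000);
  return samplesPerChannel * channels * bytesPerSample;
}

const minBytes = minChunkBytes(16000, 1);
// 16000 Hz * 0.1 s = 1600 samples; 1600 * 1 channel * 2 bytes = 3200 bytes
```

The `chunk_size` of 4096 in the AUDIO_START example satisfies this minimum whether it is measured in bytes (128 ms of audio) or in samples (256 ms); its unit is not stated in this document.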

Server Implementation

  1. Validate all messages before processing
  2. Broadcast state changes to all room members
  3. Clean up resources on disconnect
  4. Log all errors with context
  5. Rate limit per connection and IP

Example Client Flow (JavaScript)

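const ws = new WebSocket('ws://localhost:8000/ws');

// Placeholder handlers: replace with application-specific logic
function handleAudio(blob) {
  console.log(`Received ${blob.size} bytes of translated audio`);
}

function handleMessage(msg) {
  console.log(`Received ${msg.type} message`, msg.payload);
}

// Join the room once the connection is open
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'join_room',
    payload: {
      room_id: 'room123',
      user_id: 'user1',
      username: 'John',
      source_lang: 'en',
      target_lang: 'hi'
    }
  }));
};

// Route incoming frames: binary frames carry audio, text frames carry JSON
ws.onmessage = (event) => {
  if (event.data instanceof Blob) {
    handleAudio(event.data);
  } else {
    handleMessage(JSON.parse(event.data));
  }
};

// Send one binary audio frame (header + PCM16 data).
// Send AUDIO_START first, and skip the frame if the socket is not open.
function sendAudio(audioBuffer) {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(audioBuffer);
  }
}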
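The example above calls a `handleMessage` function for JSON messages. One possible shape for it is a dispatcher keyed on the message `type`, sketched below. `createDispatcher` and the handler table are hypothetical, not part of the protocol.

```javascript
// Hypothetical dispatcher: routes a parsed message to a handler by type.
// Returns true if a handler ran, false for unknown types.
function createDispatcher(handlers, onUnknown = () => {}) {
  return function handleMessage(msg) {
    // Own-property check avoids matching inherited names like "constructor".
    const known = Object.prototype.hasOwnProperty.call(handlers, msg.type);
    if (known && typeof handlers[msg.type] === 'function') {
      handlers[msg.type](msg.payload);
      return true;
    }
    onUnknown(msg);
    return false;
  };
}

// Usage (not executed here):
//   const handleMessage = createDispatcher({
//     room_joined: (p) => console.log('joined', p.room_id),
//     translation: (p) => console.log(p.translated_text),
//     error: (p) => console.warn(p.code, p.message),
//   });
```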

Security Considerations

  1. Always use WSS (WebSocket Secure) in production
  2. Validate JWT tokens if authentication enabled
  3. Sanitize user inputs (usernames, room IDs)
  4. Implement rate limiting to prevent abuse
  5. Monitor connection count to prevent DoS
  6. Encrypt sensitive data in messages