WebSocket Protocol

Overview

The Voice-to-Voice Translator uses a WebSocket-based protocol for real-time bidirectional communication between clients and the server. Control messages are sent as JSON text frames, and audio is sent as raw binary frames.

Connection Endpoint

ws://host:port/ws

Connection Parameters

Query parameters (optional):

token: JWT authentication token (only needed when authentication is enabled)
client_id: Unique client identifier

Example:

ws://localhost:8000/ws?token=eyJhbGc...&client_id=client123
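
A minimal sketch of building this URL in a client, using the standard `URL` API so that parameter values are percent-encoded. The helper name `buildWsUrl` and its option names are assumptions made for illustration; only the `token` and `client_id` query parameters come from this document.

```javascript
// Hypothetical helper: appends the optional query parameters to the endpoint.
function buildWsUrl(base, { token, clientId } = {}) {
  const url = new URL(base);
  // Only add parameters that were actually provided.
  if (token) url.searchParams.set('token', token);
  if (clientId) url.searchParams.set('client_id', clientId);
  return url.toString();
}

const wsUrl = buildWsUrl('ws://localhost:8000/ws', { clientId: 'client123' });
```

The result can be passed directly to `new WebSocket(wsUrl)`.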

Message Types

All text messages follow this structure:

{
  "type": "message_type",
  "payload": { ... },
  "timestamp": "2025-12-17T10:30:00Z",
  "message_id": "uuid-v4"
}
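
The envelope can be produced and checked with two small helpers. This is a sketch only: `buildMessage` and `parseMessage` are hypothetical names, the caller is assumed to supply the UUID v4, and the check covers just `type` and `payload`, since those are the two fields that appear in every example in this document.

```javascript
// Hypothetical helper: wrap a payload in the envelope described above.
function buildMessage(type, payload, messageId, now = new Date()) {
  return JSON.stringify({
    type,
    payload,
    // toISOString() includes milliseconds; drop them to match the example format.
    timestamp: now.toISOString().replace(/\.\d{3}Z$/, 'Z'),
    message_id: messageId,
  });
}

// Hypothetical helper: parse a text frame and reject anything without
// a string `type` and an object `payload`.
function parseMessage(text) {
  const msg = JSON.parse(text);
  const ok = msg !== null && typeof msg === 'object' &&
    typeof msg.type === 'string' &&
    msg.payload !== null && typeof msg.payload === 'object';
  if (!ok) throw new Error('not a valid protocol message');
  return msg;
}
```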

Client → Server Messages

1. JOIN_ROOM

Join or create a translation room.

{
  "type": "join_room",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "username": "John Doe",
    "source_lang": "en",
    "target_lang": "hi"
  }
}

Fields:

room_id: Unique room identifier
user_id: Unique user identifier
username: Display name
source_lang: Language the user speaks (ISO 639-1 code, e.g. "en")
target_lang: Language the user wants translations in (ISO 639-1 code, e.g. "hi")

Response: ROOM_JOINED or ERROR
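
A client might check the payload before sending it. The sketch below assumes all five fields are required, which this document does not state explicitly, and only checks that language codes have the two-letter ISO 639-1 shape rather than checking them against the registry. `validateJoinPayload` is a hypothetical name.

```javascript
const JOIN_FIELDS = ['room_id', 'user_id', 'username', 'source_lang', 'target_lang'];
const ISO_639_1_SHAPE = /^[a-z]{2}$/;

// Returns a list of problems; an empty list means the payload looks sendable.
function validateJoinPayload(payload) {
  const errors = [];
  for (const field of JOIN_FIELDS) {
    if (typeof payload[field] !== 'string' || payload[field] === '') {
      errors.push(`${field} is required`);
    }
  }
  for (const field of ['source_lang', 'target_lang']) {
    const value = payload[field];
    if (typeof value === 'string' && value !== '' && !ISO_639_1_SHAPE.test(value)) {
      errors.push(`${field} must be a two-letter ISO 639-1 code`);
    }
  }
  return errors;
}
```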

2. LEAVE_ROOM

Leave current room.

{
  "type": "leave_room",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}

Response: ROOM_LEFT or ERROR

3. AUDIO_START

Signal start of audio stream.

{
  "type": "audio_start",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "audio_config": {
      "sample_rate": 16000,
      "channels": 1,
      "format": "PCM16",
      "chunk_size": 4096
    }
  }
}

Response: AUDIO_START_ACK

4. AUDIO_STOP

Signal end of audio stream.

{
  "type": "audio_stop",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}

Response: AUDIO_STOP_ACK

5. TEXT_MESSAGE

Send text message (for chat or corrections).

{
  "type": "text_message",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "text": "Hello, how are you?",
    "lang": "en"
  }
}

Response: TEXT_MESSAGE (broadcast to room)

6. PING

Heartbeat message.

{
  "type": "ping",
  "payload": {}
}

Response: PONG

7. GET_ROOM_INFO

Request current room state.

{
  "type": "get_room_info",
  "payload": {
    "room_id": "room123"
  }
}

Response: ROOM_INFO

Server → Client Messages

1. ROOM_JOINED

Confirmation of room join.

{
  "type": "room_joined",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "users": [
      {
        "user_id": "user1",
        "username": "John Doe",
        "source_lang": "en",
        "target_lang": "hi"
      },
      {
        "user_id": "user2",
        "username": "Jane Smith",
        "source_lang": "hi",
        "target_lang": "en"
      }
    ]
  }
}

2. ROOM_LEFT

Confirmation of room leave.

{
  "type": "room_left",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}

3. USER_JOINED

Broadcast when another user joins.

{
  "type": "user_joined",
  "payload": {
    "room_id": "room123",
    "user": {
      "user_id": "user2",
      "username": "Jane Smith",
      "source_lang": "hi",
      "target_lang": "en"
    }
  }
}

4. USER_LEFT

Broadcast when another user leaves.

{
  "type": "user_left",
  "payload": {
    "room_id": "room123",
    "user_id": "user2",
    "username": "Jane Smith"
  }
}

5. TRANSCRIPTION

Transcription result from speech-to-text (STT). The is_final flag is false for intermediate results.

{
  "type": "transcription",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "text": "Hello how are you",
    "lang": "en",
    "is_final": false,
    "confidence": 0.85
  }
}

6. TRANSLATION

Translation result.

{
  "type": "translation",
  "payload": {
    "room_id": "room123",
    "source_user_id": "user1",
    "target_user_id": "user2",
    "original_text": "Hello, how are you?",
    "translated_text": "नमस्ते, आप कैसे हैं?",
    "source_lang": "en",
    "target_lang": "hi",
    "confidence": 0.92
  }
}

7. AUDIO_START_ACK

Acknowledgment of audio start.

{
  "type": "audio_start_ack",
  "payload": {
    "room_id": "room123",
    "user_id": "user1",
    "ready": true
  }
}

8. AUDIO_STOP_ACK

Acknowledgment of audio stop.

{
  "type": "audio_stop_ack",
  "payload": {
    "room_id": "room123",
    "user_id": "user1"
  }
}

9. PONG

Heartbeat response.

{
  "type": "pong",
  "payload": {
    "timestamp": "2025-12-17T10:30:00Z"
  }
}

10. ROOM_INFO

Room state information.

{
  "type": "room_info",
  "payload": {
    "room_id": "room123",
    "created_at": "2025-12-17T10:00:00Z",
    "users": [ ... ],
    "active_speakers": ["user1"],
    "supported_languages": ["en", "hi"]
  }
}

11. ERROR

Error notification.

{
  "type": "error",
  "payload": {
    "code": "INVALID_ROOM",
    "message": "Room does not exist",
    "details": "Room 'room123' not found",
    "recoverable": true
  }
}

Error Codes:

INVALID_ROOM: Room not found
ROOM_FULL: Maximum users reached
INVALID_MESSAGE: Malformed message
AUTH_FAILED: Authentication failed
RATE_LIMIT: Too many requests
INTERNAL_ERROR: Server error
UNSUPPORTED_LANGUAGE: Language not available
AUDIO_ERROR: Audio processing error
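
One way a client could act on ERROR payloads is sketched below. The error codes are the ones listed above; the mapping from code to action is an assumed client policy, not something the protocol prescribes, and `classifyError` is a hypothetical name.

```javascript
const KNOWN_ERROR_CODES = new Set([
  'INVALID_ROOM', 'ROOM_FULL', 'INVALID_MESSAGE', 'AUTH_FAILED',
  'RATE_LIMIT', 'INTERNAL_ERROR', 'UNSUPPORTED_LANGUAGE', 'AUDIO_ERROR',
]);

// Assumed policy: map an ERROR payload to a coarse client action.
function classifyError(payload) {
  if (!KNOWN_ERROR_CODES.has(payload.code)) return 'unknown';
  if (payload.code === 'RATE_LIMIT') return 'retry_later';
  if (payload.code === 'AUTH_FAILED') return 'reauthenticate';
  // For everything else, trust the server's `recoverable` flag.
  return payload.recoverable ? 'continue' : 'disconnect';
}
```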

Binary Audio Messages

Audio data is sent as binary WebSocket frames.

Client → Server (Audio Input)

Binary message structure:

[Header (16 bytes)][Audio Data (variable)]

Header Format:

Bytes 0-7: User ID (UTF-8, padded)
Bytes 8-11: Sequence number (uint32, big-endian)
Bytes 12-15: Timestamp (uint32, milliseconds)

Audio Data:

Format: PCM16 (16-bit signed integer)
Sample Rate: 16000 Hz (configurable)
Channels: 1 (mono)
Byte Order: Little-endian
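
The input frame layout can be sketched with `DataView`. Several details here are assumptions, because this document does not specify them: the user ID is padded with zero bytes, IDs longer than 8 UTF-8 bytes are rejected, and the timestamp is written big-endian like the sequence number. `encodeAudioFrame` is a hypothetical name.

```javascript
const CLIENT_HEADER_BYTES = 16;

// Build one client → server binary frame from an Int16Array of PCM samples.
function encodeAudioFrame(userId, sequence, timestampMs, pcm16) {
  const idBytes = new TextEncoder().encode(userId);
  if (idBytes.length > 8) throw new RangeError('user ID exceeds 8 UTF-8 bytes');

  const frame = new Uint8Array(CLIENT_HEADER_BYTES + pcm16.byteLength);
  frame.set(idBytes, 0); // bytes 0-7; unused bytes stay 0 (assumed padding)

  const view = new DataView(frame.buffer);
  view.setUint32(8, sequence >>> 0, false);     // bytes 8-11, big-endian
  view.setUint32(12, timestampMs >>> 0, false); // bytes 12-15, byte order assumed

  // Int16Array uses platform byte order, which is little-endian on
  // effectively all current browsers and Node.js targets.
  const audioBytes = new Uint8Array(pcm16.buffer, pcm16.byteOffset, pcm16.byteLength);
  frame.set(audioBytes, CLIENT_HEADER_BYTES);
  return frame.buffer;
}
```

The returned `ArrayBuffer` can be passed to `ws.send()` after AUDIO_START has been sent.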

Server → Client (Translated Audio)

Binary message structure:

[Header (24 bytes)][Audio Data (variable)]

Header Format:

Bytes 0-7: Source User ID (UTF-8, padded)
Bytes 8-15: Target User ID (UTF-8, padded)
Bytes 16-19: Sequence number (uint32, big-endian)
Bytes 20-23: Timestamp (uint32, milliseconds)

Audio Data:

Same format as input
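
A receiving client could parse the 24-byte header as sketched below. As with the input frame, zero-byte padding of the IDs and a big-endian timestamp are assumptions; `decodeAudioFrame` and `readPaddedId` are hypothetical names. The frame is expected as an `ArrayBuffer`, which in a browser requires `ws.binaryType = 'arraybuffer'`.

```javascript
const SERVER_HEADER_BYTES = 24;

// Decode an 8-byte ID field, dropping trailing zero bytes (assumed padding).
function readPaddedId(bytes) {
  let end = bytes.length;
  while (end > 0 && bytes[end - 1] === 0) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}

// Split one server → client binary frame into header fields and samples.
function decodeAudioFrame(buffer) {
  if (buffer.byteLength < SERVER_HEADER_BYTES) {
    throw new RangeError('frame shorter than header');
  }
  const bytes = new Uint8Array(buffer);
  const view = new DataView(buffer);
  return {
    sourceUserId: readPaddedId(bytes.subarray(0, 8)),
    targetUserId: readPaddedId(bytes.subarray(8, 16)),
    sequence: view.getUint32(16, false),    // big-endian
    timestampMs: view.getUint32(20, false), // byte order assumed
    // slice() copies the audio bytes; Int16Array throws if the byte count is odd.
    pcm16: new Int16Array(buffer.slice(SERVER_HEADER_BYTES)),
  };
}
```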

Connection Lifecycle

1. Connection Establishment

Client                          Server
  │                               │
  ├─────── WebSocket Connect ────►│
  │                               │
  │◄────── Connection Open ───────┤
  │                               │

2. Room Join

Client                          Server
  │                               │
  ├────────── JOIN_ROOM ─────────►│
  │                               │
  │◄───────── ROOM_JOINED ────────┤
  │                               │
  │◄───────── USER_JOINED ────────┤ (broadcast to others)
  │                               │

3. Audio Streaming

Client                          Server                     Other Client
  │                               │                              │
  ├───── AUDIO_START ────────────►│                              │
  │                               │                              │
  │◄──── AUDIO_START_ACK ─────────┤                              │
  │                               │                              │
  ├─── Binary Audio Chunk 1 ─────►│                              │
  ├─── Binary Audio Chunk 2 ─────►│                              │
  │                               │                              │
  │◄─────── TRANSCRIPTION ────────┤                              │
  │                               │                              │
  │◄─────── TRANSLATION ──────────┤                              │
  │                               │                              │
  │                               ├───► Binary Audio Chunk ─────►│
  │                               │                              │
  ├───── AUDIO_STOP ─────────────►│                              │
  │                               │                              │
  │◄──── AUDIO_STOP_ACK ──────────┤                              │
  │                               │                              │
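
The ordering in this diagram can be enforced on the client with a small state table. This sketch assumes the client waits for AUDIO_START_ACK before sending binary chunks, as the diagram suggests; the state names and the `audio_chunk` event name are hypothetical labels introduced for illustration.

```javascript
// Allowed events per state, following the audio streaming diagram.
const TRANSITIONS = {
  idle:      { audio_start: 'starting' },
  starting:  { audio_start_ack: 'streaming' },
  streaming: { audio_chunk: 'streaming', audio_stop: 'stopping' },
  stopping:  { audio_stop_ack: 'idle' },
};

// Returns the next state, or throws if the event is out of order.
function nextState(state, event) {
  const next = TRANSITIONS[state] && TRANSITIONS[state][event];
  if (!next) throw new Error(`event "${event}" not allowed in state "${state}"`);
  return next;
}
```

Calling `nextState` before each send and after each acknowledgment surfaces mistakes such as sending a binary chunk while still idle.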

4. Disconnection

Client                          Server
  │                               │
  ├────────── LEAVE_ROOM ────────►│
  │                               │
  │◄───────── ROOM_LEFT ──────────┤
  │                               │
  ├─────── Close Connection ─────►│
  │                               │
  │◄─────── Close Confirm ────────┤
  │                               │

Rate Limiting

Default limits:

Join room: 10 requests per minute
Audio streaming: Unlimited (quality-based throttling)
Text messages: 30 per minute
Room info requests: 60 per minute
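
A client can stay under these limits by counting its own requests. The sketch below uses a sliding one-minute window, which is an assumption: this document does not say how the server measures a minute. `SlidingWindowLimiter` and `LIMITS_PER_MINUTE` are hypothetical names.

```javascript
// Default limits from the list above, keyed by message type.
const LIMITS_PER_MINUTE = { join_room: 10, text_message: 30, get_room_info: 60 };

class SlidingWindowLimiter {
  constructor(limit, windowMs = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.stamps = []; // send times inside the current window, oldest first
  }

  // Returns true and records the send if it fits in the window.
  tryAcquire(nowMs = Date.now()) {
    while (this.stamps.length > 0 && nowMs - this.stamps[0] >= this.windowMs) {
      this.stamps.shift();
    }
    if (this.stamps.length >= this.limit) return false;
    this.stamps.push(nowMs);
    return true;
  }
}

const textLimiter = new SlidingWindowLimiter(LIMITS_PER_MINUTE.text_message);
```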

Reconnection Strategy

Exponential Backoff: 1s, 2s, 4s, 8s, 16s, 30s (max)
Session Recovery: Send previous room_id and user_id on reconnect
State Sync: Server sends current room state after reconnection
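
The backoff schedule above is a doubling delay capped at 30 seconds. A minimal sketch, with `backoffDelayMs` as a hypothetical name and the attempt counter starting at zero:

```javascript
// Delay before reconnect attempt number `attempt` (0-based), in milliseconds.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// First six delays: 1s, 2s, 4s, 8s, 16s, 30s
const schedule = [0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n));
```

Resetting the attempt counter after a successful connection keeps later reconnects starting from one second again.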

Best Practices

Client Implementation

Always send AUDIO_START before binary audio
Buffer audio before sending (minimum 100 ms chunks)
Include sequence numbers for ordering
Handle ERROR messages gracefully
Implement heartbeat (PING every 30 seconds)
Reconnect automatically on disconnect
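
For the buffering guideline, 100 ms at the default 16000 Hz mono PCM16 configuration is 1600 samples, or 3200 bytes. The sketch below computes that figure and accumulates captured samples until the minimum is reached. `minChunkBytes` and `ChunkBuffer` are hypothetical names, and the capture source is assumed to deliver `Int16Array` blocks.

```javascript
// Smallest chunk size in bytes for a given duration of PCM audio.
function minChunkBytes(sampleRate, channels, durationMs, bytesPerSample = 2) {
  return Math.ceil((sampleRate * durationMs) / 1000) * channels * bytesPerSample;
}

// Collects sample blocks and releases them once enough audio is buffered.
class ChunkBuffer {
  constructor(minSamples) {
    this.minSamples = minSamples;
    this.parts = [];
    this.length = 0;
  }

  // Returns a combined Int16Array when the minimum is reached, else null.
  push(samples) {
    this.parts.push(samples);
    this.length += samples.length;
    if (this.length < this.minSamples) return null;
    const out = new Int16Array(this.length);
    let offset = 0;
    for (const part of this.parts) {
      out.set(part, offset);
      offset += part.length;
    }
    this.parts = [];
    this.length = 0;
    return out;
  }
}

const bytesPer100Ms = minChunkBytes(16000, 1, 100);
```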

Server Implementation

Validate all messages before processing
Broadcast state changes to all room members
Clean up resources on disconnect
Log all errors with context
Rate limit per connection and IP

Example Client Flow (JavaScript)

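const ws = new WebSocket('ws://localhost:8000/ws');
// Receive binary frames as ArrayBuffer so headers can be read with DataView
ws.binaryType = 'arraybuffer';

// Connect
ws.onopen = () => {
  // Join room
  ws.send(JSON.stringify({
    type: 'join_room',
    payload: {
      room_id: 'room123',
      user_id: 'user1',
      username: 'John',
      source_lang: 'en',
      target_lang: 'hi'
    }
  }));
};

// Handle messages
ws.onmessage = (event) => {
  if (typeof event.data === 'string') {
    // JSON message (text frame)
    const msg = JSON.parse(event.data);
    handleMessage(msg);
  } else {
    // Binary audio data: 24-byte header followed by PCM16 samples
    handleAudio(event.data);
  }
};

// Placeholder handlers; replace with application logic
function handleMessage(msg) {
  if (msg.type === 'error') {
    console.error(msg.payload.code, msg.payload.message);
  }
}

function handleAudio(frame) {
  const pcmBytes = frame.slice(24); // strip the server header
  // Decode and play pcmBytes here
}

// Send audio. audioBuffer must already include the 16-byte client header,
// and AUDIO_START must have been sent first.
function sendAudio(audioBuffer) {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(audioBuffer);
  }
}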

Security Considerations

Always use WSS (WebSocket Secure) in production
Validate JWT tokens if authentication is enabled
Sanitize user inputs (usernames, room IDs)
Implement rate limiting to prevent abuse
Monitor connection count to prevent DoS
Encrypt sensitive data in messages