Parakeet TDT Streaming ASR Server

Low-latency WebSocket server for NVIDIA Parakeet TDT speech recognition with sliding buffer context.

Features

  • Sliding buffer streaming: 3-second context window for accurate transcription
  • Real-time updates: 320ms chunk processing (~3 updates/second)
  • Direct tensor inference: Bypasses file I/O for ~80-100ms latency
  • Pipecat compatible: WebSocket protocol for voice AI pipelines
  • LCS-based incremental output: Shows only new text in each update

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Download a Parakeet model

# Option A: Official NVIDIA v3 model (25 languages)
huggingface-cli download nvidia/parakeet-tdt-0.6b-v3 --local-dir ./parakeet-tdt-0.6b-v3

# Option B: Hindi/English bilingual model
huggingface-cli download ketav/parakeet-tdt-0.6b-hindi --local-dir ./parakeet-tdt-0.6b-hindi

3. Start the server

# With v3 model
python parakeet_server_streaming.py --port 7001 --model ./parakeet-tdt-0.6b-v3/parakeet-tdt-0.6b-v3.nemo

# With Hindi model
python parakeet_server_streaming.py --port 7001 --model ./parakeet-tdt-0.6b-hindi/parakeet-tdt-0.6b-hindi.nemo

WebSocket Protocol

Client sends:

  1. Config message (optional):

{"type": "config", "language": "auto"}

  2. Audio data: raw float32 PCM bytes at 16 kHz

  3. Finalize:

{"type": "finalize"}

Server responds:

  1. Partial results (streaming):
{
  "type": "partial",
  "text": "what is going on",
  "new_text": "going on",
  "is_final": false,
  "latency_ms": 85
}
  2. Final result:
{
  "type": "transcript",
  "text": "what is going on today",
  "is_final": true,
  "latency_ms": 0
}
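
A minimal client exercising this protocol could look like the sketch below. The message schema and defaults come from this README; the `websockets` package, the helper names, and the chunking strategy are illustrative assumptions, not the server's reference client.

```python
import asyncio
import json
from array import array

SAMPLE_RATE = 16000
CHUNK_MS = 320
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 5120 samples per 320 ms chunk

def pcm_chunks(samples):
    """Yield raw float32 PCM bytes in 320 ms chunks (assumes 4-byte native floats)."""
    buf = array("f", samples)
    for start in range(0, len(buf), CHUNK_SAMPLES):
        yield buf[start:start + CHUNK_SAMPLES].tobytes()

async def transcribe(samples, url="ws://localhost:7001"):
    import websockets  # third-party: pip install websockets

    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({"type": "config", "language": "auto"}))
        for chunk in pcm_chunks(samples):
            await ws.send(chunk)  # binary frame: raw float32 PCM at 16 kHz
        await ws.send(json.dumps({"type": "finalize"}))
        async for message in ws:
            result = json.loads(message)
            if result["type"] == "partial":
                print("partial:", result["new_text"])
            if result.get("is_final"):
                return result["text"]
```

Run it with `asyncio.run(transcribe(samples))` once the server is up.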

Configuration Options

Option         Default                           Description
------         -------                           -----------
--port         7001                              WebSocket server port
--host         0.0.0.0                           Server host
--model        ./parakeet-tdt-0.6b-v3/*.nemo     Path to the .nemo model file
--buffer-sec   3.0                               Sliding buffer duration in seconds (context)
--chunk-ms     320                               Processing chunk size in milliseconds
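
--buffer-sec and --chunk-ms interact: each 320 ms chunk is appended to a window that is trimmed to the most recent 3 seconds before transcription. A minimal sketch of that bookkeeping (class name and byte handling are illustrative, not the server's internals):

```python
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 4  # float32

class SlidingBuffer:
    """Keep only the most recent buffer_sec seconds of float32 PCM bytes."""

    def __init__(self, buffer_sec=3.0):
        self.max_bytes = int(buffer_sec * SAMPLE_RATE) * BYTES_PER_SAMPLE
        self.data = b""

    def push(self, chunk: bytes) -> bytes:
        # Append the new chunk, drop audio older than buffer_sec.
        self.data = (self.data + chunk)[-self.max_bytes:]
        return self.data  # the context window handed to the model
```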

Pipecat Integration

Use pipecat_parakeet_stt.py for integration with Pipecat voice AI pipelines:

from pipecat_parakeet_stt import ParakeetSTTService

stt = ParakeetSTTService(
    ws_url="ws://localhost:7001",
    sample_rate=16000
)

Architecture

Audio Stream
    β”‚
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Incoming Buffer (accumulate)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚ every 320ms
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Sliding Buffer (3 sec context) β”‚
β”‚  [=========================]    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚ transcribe full buffer
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  LCS Merge (find new text)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚
    β–Ό
  Output: "new_text" + "text"
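
The LCS merge step can be sketched as follows: align the previous transcript with the new buffer transcript at the word level and emit only the trailing words not already seen. This uses `difflib.SequenceMatcher` for the alignment; the server's actual algorithm may differ.

```python
from difflib import SequenceMatcher

def lcs_merge(prev_text: str, new_text: str) -> str:
    """Return the words in new_text that extend prev_text."""
    prev_words = prev_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(None, prev_words, new_words, autojunk=False)
    match = matcher.find_longest_match(0, len(prev_words), 0, len(new_words))
    if match.size == 0:
        return new_text  # no overlap: everything is new
    # Words after the matched region in the new transcript are new.
    return " ".join(new_words[match.b + match.size:])
```

For example, `lcs_merge("what is going", "is going on")` yields `"on"`, which becomes the `new_text` field of a partial result.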

Performance

  • Latency: ~80-100ms per chunk (GPU)
  • RTF: ~0.25 (4x real-time)
  • Memory: ~5GB VRAM for 0.6B model
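
The RTF figure follows directly from the chunk timings above:

```python
# Real-time factor: processing time divided by audio duration per chunk.
chunk_ms = 320   # audio per inference call
latency_ms = 80  # lower end of the measured GPU latency
rtf = latency_ms / chunk_ms
print(rtf)  # 0.25 → the server processes audio 4x faster than real time
```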

License

Apache 2.0 (same as NeMo)