Parakeet TDT Streaming ASR Server
Low-latency WebSocket server for NVIDIA Parakeet TDT speech recognition with sliding buffer context.
Features
- Sliding buffer streaming: 3-second context window for accurate transcription
- Real-time updates: 320ms chunk processing (~3 updates/second)
- Direct tensor inference: Bypasses file I/O for ~80-100ms latency
- Pipecat compatible: WebSocket protocol for voice AI pipelines
- LCS-based incremental output: Shows only new text in each update
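The README does not say which LCS variant the server uses, so the following is only a minimal sketch of the idea. It assumes a word-level longest common contiguous run between the text emitted so far and the transcript of the current window. The names `longest_common_run` and `merge_transcripts` are hypothetical and are not taken from `parakeet_server_streaming.py`.

```python
def longest_common_run(a, b):
    """Return (i, j, n) such that a[i:i+n] == b[j:j+n] is the longest
    common contiguous run of items. Returns (0, 0, 0) if nothing matches."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of items.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def merge_transcripts(committed, window):
    """Merge the transcript of the current window into the committed text.

    Returns (merged_text, new_text). Assumption: words after the matched
    run in the committed text are replaced by the window's version.
    """
    old, new = committed.split(), window.split()
    i, j, n = longest_common_run(old, new)
    if n == 0:
        # No overlap found: treat the whole window as new text.
        return (committed + " " + window).strip(), window
    new_words = new[j + n:]
    merged = old[:i + n] + new_words
    return " ".join(merged), " ".join(new_words)
```

With this sketch, `merge_transcripts("what is going", "is going on today")` aligns on "is going" and reports "on today" as the new text. Matching on words rather than characters keeps the comparison small for a 3-second window.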
Quick Start
1. Install dependencies
pip install -r requirements.txt
2. Download a Parakeet model
# Option A: Official NVIDIA v3 model (25 languages)
huggingface-cli download nvidia/parakeet-tdt-0.6b-v3 --local-dir ./parakeet-tdt-0.6b-v3
# Option B: Hindi/English bilingual model
huggingface-cli download ketav/parakeet-tdt-0.6b-hindi --local-dir ./parakeet-tdt-0.6b-hindi
3. Start the server
# With v3 model
python parakeet_server_streaming.py --port 7001 --model ./parakeet-tdt-0.6b-v3/parakeet-tdt-0.6b-v3.nemo
# With Hindi model
python parakeet_server_streaming.py --port 7001 --model ./parakeet-tdt-0.6b-hindi/parakeet-tdt-0.6b-hindi.nemo
WebSocket Protocol
Client sends:
- Config message (optional):
{"type": "config", "language": "auto"}
- Audio data: Raw float32 PCM bytes at 16kHz
- Finalize:
{"type": "finalize"}
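As a rough illustration of the client side, the sketch below builds the two JSON control messages and packs audio samples as float32 bytes using only the standard library. Little-endian byte order is an assumption (the protocol description does not state it), and the helper names are hypothetical.

```python
import json
import struct

SAMPLE_RATE = 16000  # Hz, as stated in the protocol description


def config_message(language="auto"):
    """Optional config message, serialized as JSON text."""
    return json.dumps({"type": "config", "language": language})


def finalize_message():
    """Message asking the server for the final transcript."""
    return json.dumps({"type": "finalize"})


def int16_to_float(samples):
    """Scale 16-bit integer PCM samples into the [-1.0, 1.0) float range."""
    return [s / 32768.0 for s in samples]


def encode_audio(samples):
    """Pack float samples as raw float32 PCM bytes.

    Assumption: little-endian byte order, which is what common x86 and ARM
    hosts produce natively.
    """
    return struct.pack(f"<{len(samples)}f", *samples)
```

A client would presumably send the JSON strings as WebSocket text frames and the output of `encode_audio` as binary frames. That framing is an assumption, so check it against the server script before relying on it.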
Server responds:
- Partial results (streaming):
{
"type": "partial",
"text": "what is going on",
"new_text": "going on",
"is_final": false,
"latency_ms": 85
}
- Final result:
{
"type": "transcript",
"text": "what is going on today",
"is_final": true,
"latency_ms": 0
}
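One way a client might consume these responses is sketched below. It relies only on the fields shown in the two examples above. The `Update` type and `parse_server_message` function are hypothetical names, and returning `None` for unrecognized message types is a design assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Update:
    text: str                    # full transcript so far
    new_text: str                # increment since the last partial ("" for finals)
    is_final: bool
    latency_ms: Optional[float]  # as reported by the server, if present


def parse_server_message(raw):
    """Turn one JSON response into an Update, or None if the type is unknown."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind not in ("partial", "transcript"):
        return None
    return Update(
        text=msg.get("text", ""),
        new_text=msg.get("new_text", ""),
        is_final=bool(msg.get("is_final", kind == "transcript")),
        latency_ms=msg.get("latency_ms"),
    )
```

A UI can append `new_text` on each partial and replace its display with `text` once `is_final` is true.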
Configuration Options
| Option | Default | Description |
|---|---|---|
| --port | 7001 | WebSocket server port |
| --host | 0.0.0.0 | Server host |
| --model | ./parakeet-tdt-0.6b-v3/*.nemo | Path to .nemo model |
| --buffer-sec | 3.0 | Sliding buffer duration in seconds (context) |
| --chunk-ms | 320 | Processing chunk size in milliseconds |
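To make the defaults concrete, the arithmetic below converts the buffer and chunk durations into sample and byte counts at 16 kHz float32. The helper name `samples_for` is hypothetical.

```python
SAMPLE_RATE = 16000      # Hz
BYTES_PER_SAMPLE = 4     # float32


def samples_for(duration_ms, sample_rate=SAMPLE_RATE):
    """Number of samples covering duration_ms milliseconds of audio."""
    return int(round(sample_rate * duration_ms / 1000))


chunk_samples = samples_for(320)                   # default --chunk-ms
buffer_samples = samples_for(3.0 * 1000)           # default --buffer-sec
chunk_bytes = chunk_samples * BYTES_PER_SAMPLE     # bytes of audio per chunk
updates_per_second = 1000 / 320                    # one update per chunk
```

With the defaults, one chunk is 5120 samples (20480 bytes), the sliding buffer holds 48000 samples, and the server produces about 3.1 updates per second.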
Pipecat Integration
Use pipecat_parakeet_stt.py for integration with Pipecat voice AI pipelines:
from pipecat_parakeet_stt import ParakeetSTTService
stt = ParakeetSTTService(
ws_url="ws://localhost:7001",
sample_rate=16000
)
Architecture
Audio Stream
                │
                ▼
┌─────────────────────────────────┐
│  Incoming Buffer (accumulate)   │
└─────────────────────────────────┘
                │ every 320ms
                ▼
┌─────────────────────────────────┐
│  Sliding Buffer (3 sec context) │
│  [=========================]    │
└─────────────────────────────────┘
                │ transcribe full buffer
                ▼
┌─────────────────────────────────┐
│  LCS Merge (find new text)      │
└─────────────────────────────────┘
                │
                ▼
Output: "new_text" + "text"
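The first two stages of the diagram can be sketched as follows. This is not the server's actual code: the class name `SlidingBuffer` and its method are hypothetical, and plain Python lists stand in for whatever tensor type the server uses.

```python
from collections import deque


class SlidingBuffer:
    """Accumulate incoming audio and expose a fixed-length context window."""

    def __init__(self, sample_rate=16000, buffer_sec=3.0, chunk_ms=320):
        self.chunk_samples = int(sample_rate * chunk_ms / 1000)
        # deque with maxlen drops the oldest samples once the window is full.
        self.window = deque(maxlen=int(sample_rate * buffer_sec))
        self.incoming = []

    def push(self, samples):
        """Add samples; return one context window per completed chunk."""
        self.incoming.extend(samples)
        windows = []
        while len(self.incoming) >= self.chunk_samples:
            chunk = self.incoming[:self.chunk_samples]
            del self.incoming[:self.chunk_samples]
            self.window.extend(chunk)
            # Each window is what would be handed to the model for transcription.
            windows.append(list(self.window))
        return windows
```

Each returned window would be transcribed in full, and the resulting text compared with the previous output to find the new words. Until 3 seconds of audio have arrived, the window is simply shorter than its maximum length.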
Performance
- Latency: ~80-100ms per chunk (GPU)
- RTF (real-time factor): ~0.25 (about 4x faster than real time)
- Memory: ~5GB VRAM for 0.6B model
License
Apache 2.0 (same as NeMo)