
Streaming Speech Translation Pipeline

Real-time English → Russian speech translation: Audio In → ASR → NMT → TTS → Audio Out

Translates spoken English into spoken Russian with streaming output over WebSocket.

Input is currently limited to English because of the NeMo ASR model. The output language depends on what TranslateGemma (NMT) and Qwen3-TTS/XTTSv2 (TTS) support, so you can change the target language by adjusting those components.

Updates

As of 2026/05/05, you can use Qwen3-TTS, which is based on our Qwen3-TTS-Streaming-ONNX. It is integrated through the modular TTS implementation, so you can plug in your preferred streaming TTS module.
As of 2026/05/14, the NMT stage is also modular, so you can plug in your preferred streaming NMT module in the same way as for TTS.
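
To illustrate what a pluggable NMT or TTS stage can look like, here is a minimal sketch of a base class plus a factory registry. Every name in it (`StreamingNMTBase.translate`, `register_nmt`, `create_nmt`, the `"backend"` config key, the toy `UpperEchoNMT` backend) is hypothetical and chosen for illustration only; the repository's actual interfaces live under src/nmt/base/ and src/nmt/factory/ and may differ.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class StreamingNMTBase(ABC):
    """Hypothetical base interface, not the repository's actual class."""

    @abstractmethod
    def translate(self, text: str) -> str:
        """Translate one segment of source text."""


# Registry mapping a backend name to its constructor (illustrative only).
_NMT_REGISTRY: Dict[str, Callable[[dict], StreamingNMTBase]] = {}


def register_nmt(name: str):
    """Class decorator that adds a backend to the registry."""
    def decorator(cls):
        _NMT_REGISTRY[name] = cls
        return cls
    return decorator


def create_nmt(config: dict) -> StreamingNMTBase:
    """Build the backend named by config["backend"] (a hypothetical key)."""
    name = config["backend"]
    if name not in _NMT_REGISTRY:
        raise ValueError(f"unknown NMT backend: {name!r}")
    return _NMT_REGISTRY[name](config)


@register_nmt("upper_echo")
class UpperEchoNMT(StreamingNMTBase):
    """Toy backend standing in for a real translation model."""

    def __init__(self, config: dict):
        self.prefix = config.get("prefix", "")

    def translate(self, text: str) -> str:
        return self.prefix + text.upper()


nmt = create_nmt({"backend": "upper_echo", "prefix": "[ru] "})
print(nmt.translate("hello"))
```

With this pattern, adding a new backend means writing one class and registering it under a name; the code that calls `create_nmt` does not change.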

Architecture

Audio Input → ASR (ONNX)                           → NMT (GGUF/vLLM)        → TTS (ONNX)       → Audio Output
(PCM16)       NeMo-Cache-Aware-FastConformer-RNN-T   TranslateGemma/Gemma-4   Qwen3-TTS/XTTSv2   (PCM16)

ASR: NVIDIA NeMo FastConformer RNN-T (cache-aware streaming, ONNX)
NMT: TranslateGemma 4B (GGUF Q8_0, llama-cpp-python) with streaming segmentation and translation merging, or Gemma-4-E4B-it (vLLM with MTP)
TTS: Qwen3-TTS-Streaming or XTTSv2-Streaming (ONNX), 24kHz output

See ARCHITECTURE.md for detailed design documentation.
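
As a rough picture of the ASR → NMT → TTS chain, the sketch below connects three stub stages with bounded asyncio queues. It is not the project's orchestrator: the `stage` and `run_pipeline` helpers, the sentinel, and the string-wrapping lambdas are all assumptions made for illustration, and the real models are replaced by trivial functions.

```python
import asyncio

_DONE = object()  # sentinel marking end of stream


async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Apply fn to every item from inbox and forward the result to outbox."""
    while True:
        item = await inbox.get()
        if item is _DONE:
            await outbox.put(_DONE)
            return
        await outbox.put(fn(item))


async def run_pipeline(chunks, queue_max: int = 4):
    # Bounded queues give backpressure: a slow stage blocks its producer.
    q_audio, q_text, q_trans, q_out = (
        asyncio.Queue(maxsize=queue_max) for _ in range(4)
    )
    # Stub stages standing in for the ASR, NMT, and TTS models.
    tasks = [
        asyncio.create_task(stage(lambda a: f"text({a})", q_audio, q_text)),
        asyncio.create_task(stage(lambda t: f"ru({t})", q_text, q_trans)),
        asyncio.create_task(stage(lambda t: f"audio({t})", q_trans, q_out)),
    ]

    async def feed():
        for chunk in chunks:
            await q_audio.put(chunk)
        await q_audio.put(_DONE)

    feeder = asyncio.create_task(feed())
    results = []
    while True:
        item = await q_out.get()
        if item is _DONE:
            break
        results.append(item)
    await asyncio.gather(feeder, *tasks)
    return results


results = asyncio.run(run_pipeline(["c0", "c1", "c2"]))
print(results)
```

Each stage runs concurrently, so a later chunk can be in ASR while an earlier one is already being synthesized, which is the property a streaming pipeline relies on.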

Requirements

Python 3.12
CUDA 12.8, cuDNN 9
llama-cpp-python with GPU
onnxruntime with GPU
Model files:
- ASR: NeMo Cache-Aware FastConformer RNN-T ONNX model directory
- NMT: TranslateGemma 4B GGUF file or vLLM server for Gemma-4-E4B-it with MTP
- TTS: Qwen3-TTS/XTTSv2 ONNX model and configs directory

Installation

For CUDA and cuDNN:

# Uninstall and purge any previous CUDA and cuDNN installation if it is not the correct version
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-8 cudnn9-cuda-12
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}' >> ~/.bashrc
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

For llama-cpp-python (GGUF):

echo 'export CMAKE_ARGS="-DGGML_CUDA=on"' >> ~/.bashrc
echo 'export CUDACXX=/usr/local/cuda/bin/nvcc' >> ~/.bashrc
echo 'export FORCE_CMAKE=1' >> ~/.bashrc
source ~/.bashrc
sudo apt-get install -y build-essential cmake ninja-build python3-dev
pip uninstall -y llama-cpp-python
pip install --no-cache-dir --force-reinstall llama-cpp-python

For this model package:

pip install -r requirements.txt
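
After installation, a quick check can confirm that both runtimes appear to have GPU support. The sketch below is one possible check, not part of this repository. It assumes `onnxruntime.get_available_providers()` and, on recent llama-cpp-python releases, `llama_cpp.llama_supports_gpu_offload()`; the `check_gpu_backends` helper name is made up here. A listed `CUDAExecutionProvider` only shows that the build includes CUDA support, not that the CUDA libraries load correctly at runtime.

```python
import importlib


def check_gpu_backends() -> dict:
    """Report whether each runtime appears to have GPU support.

    Values are True/False when the library is importable and None otherwise.
    """
    report = {"onnxruntime_cuda": None, "llama_cpp_gpu_offload": None}

    try:
        ort = importlib.import_module("onnxruntime")
        providers = ort.get_available_providers()
        report["onnxruntime_cuda"] = "CUDAExecutionProvider" in providers
    except Exception:
        pass  # not installed, or its native libraries failed to load

    try:
        llama_cpp = importlib.import_module("llama_cpp")
        # Assumed to exist on recent releases; getattr keeps the sketch
        # safe on builds that do not expose it.
        probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
        if probe is not None:
            report["llama_cpp_gpu_offload"] = bool(probe())
    except Exception:
        pass

    return report


print(check_gpu_backends())
```

If either value is False, the usual cause is a CPU-only wheel; for llama-cpp-python, rebuilding with the CMAKE_ARGS shown above is the first thing to try.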

System Dependencies

# Ubuntu/Debian
sudo apt-get install -y libsndfile1 libportaudio2

Usage

Start the Server

A g6.xlarge instance (L4 GPU, 4 vCPUs, 16 GB RAM) or better is recommended. GPU memory usage for all three models is around 17 GB with Qwen3-TTS and around 9 GB with XTTSv2.
Total model latency is around 350 ms with Qwen3-TTS and 300 ms with XTTSv2 (ASR 18 ms, NMT 150 ms, TTS 200 ms for Qwen3-TTS or 150 ms for XTTSv2). A g5.xlarge with an A10G processes about 1.5x faster.
The pipeline is still bounded by the 560 ms ASR lookahead (NeMo Cache-Aware Streaming ASR FastConformer-RNN-T).
Hence, the effective delay is about 560 + 350 = 910 ms with Qwen3-TTS or 560 + 300 = 860 ms with XTTSv2, which is comfortable for simultaneous/streaming speech translation.
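
The delay figures can be reproduced with a few lines of arithmetic. The sketch below uses only the per-stage numbers quoted above; the constant and function names are made up for illustration. The exact sums come out slightly higher than the rounded totals in the text (368/318 ms of model latency rather than 350/300 ms).

```python
ASR_LOOKAHEAD_MS = 560  # NeMo cache-aware streaming lookahead
ASR_MS = 18
NMT_MS = 150
TTS_MS = {"Qwen3-TTS": 200, "XTTSv2": 150}


def latency_budget(tts_name: str) -> dict:
    """Sum per-stage latencies and add the ASR lookahead."""
    model_ms = ASR_MS + NMT_MS + TTS_MS[tts_name]
    return {"model_ms": model_ms, "effective_ms": ASR_LOOKAHEAD_MS + model_ms}


for name in TTS_MS:
    print(name, latency_budget(name))
```

The lookahead is the largest single term, so switching TTS backends changes the end-to-end delay by only about 50 ms.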

python app.py \
  --asr-onnx-path models/asr/nemo-cache-aware-streaming-560ms-onnx/ \
  --streaming-nmt-config-path configs/nmt/translategemma/translategemma_streaming_nmt_config.json \
  --streaming-nmt-pipeline-config-path configs/nmt/translategemma/translategemma_streaming_nmt_pipeline_config.json \
  --streaming-tts-config-path configs/tts/qwen3_tts/qwen3_tts_streaming_tts_config.json \
  --streaming-tts-pipeline-config-path configs/tts/qwen3_tts/qwen3_tts_streaming_tts_pipeline_config.json \
  --tts-ref-audio-path audio_ref/male_stewie.mp3 \
  --host 0.0.0.0 \
  --port 8765

CLI Options

TTS model-specific configurations are in the configs/tts/qwen3_tts/ and configs/tts/xtts/ directories; these files are passed to --streaming-tts-config-path and --streaming-tts-pipeline-config-path.
They set the TTS model location, queue sizes, chunk sizes, number of threads, and GPU usage.

Flag	Default	Description
`--asr-onnx-path`	(required)	ASR ONNX model directory
`--asr-chunk-ms`	10	ASR audio chunk duration (ms)
`--asr-sample-rate`	16000	ASR expected sample rate
`--streaming-nmt-config-path`	(required)	Config file for StreamingNMT instantiation
`--streaming-nmt-pipeline-config-path`	(required)	Config file for StreamingNMTPipeline instantiation
`--streaming-tts-config-path`	(required)	Config file for StreamingTTS instantiation
`--streaming-tts-pipeline-config-path`	(required)	Config file for StreamingTTSPipeline instantiation
`--tts-ref-audio-path`	(required)	TTS reference speaker audio
`--tts-language`	ru	TTS target language code
`--audio-queue-max`	256	Audio input queue max size
`--audio-out-queue-max`	32	Audio output queue max size
`--host`	0.0.0.0	Server bind host
`--port`	8765	Server port
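
For readers who want to script against these flags, the table can be mirrored with `argparse`. This is a hypothetical reconstruction from the table only, not the parser used in app.py; the `build_parser` name and the integer types for numeric flags are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(
        description="Streaming speech translation server (sketch)"
    )
    required = [
        ("--asr-onnx-path", "ASR ONNX model directory"),
        ("--streaming-nmt-config-path", "Config for StreamingNMT"),
        ("--streaming-nmt-pipeline-config-path", "Config for StreamingNMTPipeline"),
        ("--streaming-tts-config-path", "Config for StreamingTTS"),
        ("--streaming-tts-pipeline-config-path", "Config for StreamingTTSPipeline"),
        ("--tts-ref-audio-path", "TTS reference speaker audio"),
    ]
    for flag, text in required:
        p.add_argument(flag, required=True, help=text)
    # Optional flags; integer types are an assumption based on the defaults.
    p.add_argument("--asr-chunk-ms", type=int, default=10)
    p.add_argument("--asr-sample-rate", type=int, default=16000)
    p.add_argument("--tts-language", default="ru")
    p.add_argument("--audio-queue-max", type=int, default=256)
    p.add_argument("--audio-out-queue-max", type=int, default=32)
    p.add_argument("--host", default="0.0.0.0")
    p.add_argument("--port", type=int, default=8765)
    return p


args = build_parser().parse_args([
    "--asr-onnx-path", "models/asr/nemo-cache-aware-streaming-560ms-onnx/",
    "--streaming-nmt-config-path", "nmt.json",
    "--streaming-nmt-pipeline-config-path", "nmt_pipeline.json",
    "--streaming-tts-config-path", "tts.json",
    "--streaming-tts-pipeline-config-path", "tts_pipeline.json",
    "--tts-ref-audio-path", "audio_ref/male_stewie.mp3",
])
print(args.host, args.port, args.tts_language)
```

The short JSON file names in the usage example are placeholders; the real config paths are the ones shown in the server command above.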

Python Client

Captures microphone audio and plays back translated speech:

pip install -r requirements_client.txt
python clients/python_client.py --uri ws://<server_ip_address/localhost>:8765

Web Client

TBD

WebSocket Protocol

Direction	Type	Format	Description
Client→	Binary	PCM16	Raw audio at declared sample rate
Client→	Text	JSON	`{"action": "start", "sample_rate": 16000}`
Client→	Text	JSON	`{"action": "stop"}`
→Client	Binary	PCM16	Synthesized audio at 24kHz
→Client	Text	JSON	`{"type": "transcript", "text": "..."}`
→Client	Text	JSON	`{"type": "translation", "text": "..."}`
→Client	Text	JSON	`{"type": "status", "status": "started"}`
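
The messages in the table can be built and parsed with the standard library alone. The sketch below covers only message construction and classification, with no network code. The helper names are made up for illustration, and the 10 ms client chunk size and mono audio are assumptions; the protocol table does not specify a frame size.

```python
import json
from typing import Iterator, Union


def start_message(sample_rate: int = 16000) -> str:
    """Text frame that opens a session and declares the input sample rate."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})


def stop_message() -> str:
    """Text frame that ends the session."""
    return json.dumps({"action": "stop"})


def pcm16_chunks(pcm: bytes, sample_rate: int = 16000,
                 chunk_ms: int = 10) -> Iterator[bytes]:
    """Split mono PCM16 audio into fixed-duration binary frames."""
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2  # 2 bytes per sample
    for i in range(0, len(pcm), bytes_per_chunk):
        yield pcm[i:i + bytes_per_chunk]


def handle_server_message(msg: Union[bytes, str]) -> tuple:
    """Classify a server frame as audio, transcript, translation, or status."""
    if isinstance(msg, (bytes, bytearray)):
        return ("audio", bytes(msg))  # synthesized PCM16 at 24 kHz
    data = json.loads(msg)
    kind = data["type"]
    if kind == "status":
        return (kind, data["status"])
    return (kind, data["text"])


print(start_message())
print([len(c) for c in pcm16_chunks(bytes(1000))])
print(handle_server_message('{"type": "translation", "text": "привет"}'))
```

A client would send `start_message()` first, then the binary chunks, then `stop_message()`, while passing every received frame to `handle_server_message`.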

Docker

TBD

Project Structure

streaming_speech_translation/
├── app.py                              # Main entry point
├── requirements.txt
├── README.md
├── ARCHITECTURE.md
├── Dockerfile
├── models/
│   ├── asr/
│   │   └── nemo-cache-aware-streaming-560ms-onnx/
│   ├── nmt/
│   │   ├── translategemma-4b-it-q8_0-gguf/
│   │   └── translategemma-4b-it-q4_k_m-gguf/
│   └── tts/
│       ├── qwen3-tts-onnx/
│       └── xttsv2-onnx/
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py            # StreamingASR wrapper
│   │   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                  # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                    # Audio utilities
│   ├── nmt/
│   │   ├── base
│   │   │   ├── streaming_nmt_base.py               # Base class for StreamingNMT wrapper
│   │   │   └── streaming_nmt_pipeline_base.py      # Base class for StreamingNMTPipeline wrapper
│   │   ├── factory
│   │   │   ├── streaming_nmt_factory.py            # Factory class for StreamingNMT wrapper
│   │   │   └── streaming_nmt_pipeline_factory.py   # Factory class for StreamingNMTPipeline wrapper
│   │   ├── pipeline
│   │   │   └── translategemma/
│   │   │        └── translategemma_nmt_streaming_pipeline.py # TranslateGemma-specific StreamingNMTPipeline class
│   │   ├── streaming
│   │   │   └── translategemma/
│   │   │        └── translategemma_streaming_nmt.py          # TranslateGemma-specific StreamingNMT class
│   │   └── utils/
│   │       ├── text_utils.py                    
│   │       └── translategemma/
│   │            ├── streaming_segmenter.py             # ASR output segmenter for NMT input
│   │            └── streaming_translation_merger.py    # Handler to stabilize NMT output revision
│   ├── tts/
│   │   ├── base
│   │   │   ├── streaming_tts_base.py               # Base class for StreamingTTS wrapper
│   │   │   └── streaming_tts_pipeline_base.py      # Base class for StreamingTTSPipeline wrapper
│   │   ├── factory
│   │   │   ├── streaming_tts_factory.py            # Factory class for StreamingTTS wrapper
│   │   │   └── streaming_tts_pipeline_factory.py   # Factory class for StreamingTTSPipeline wrapper
│   │   ├── pipeline
│   │   │   ├── qwen3_tts/
│   │   │   │    └── qwen3_tts_streaming_pipeline.py # Qwen3-TTS-specific StreamingTTSPipeline class
│   │   │   └── xtts/
│   │   │        └── xtts_streaming_pipeline.py     # XTTSv2-specific StreamingTTSPipeline class
│   │   ├── streaming
│   │   │   ├── qwen3_tts/
│   │   │   │    └── qwen3_tts_streaming_tts.py     # Qwen3-TTS-specific StreamingTTS class
│   │   │   └── xtts/
│   │   │        └── xtts_streaming_tts.py          # XTTSv2-specific StreamingTTS class
│   │   └── utils/
│   │       ├── tts_segmenter.py                    # Handler for TTS text input segmenter
│   │       ├── qwen3_tts/
│   │       │    ├── audio_utils.py                 # Audio utility functions for Qwen3-TTS
│   │       │    ├── qwen3_tts_config.py            # Model and sampling configs dataclass for Qwen3-TTS
│   │       │    ├── qwen3_tts_onnx_session_manager.py # ONNX sessions handler for Qwen3-TTS modules
│   │       │    ├── qwen3_tts_text_processor.py    # Qwen3-TTS-specific text front-end
│   │       │    └── utils.py                       # Qwen3-TTS-specific utilities functions (sampling, RoPE)
│   │       └── xtts/
│   │            ├── xtts_onnx_orchestrator.py      # ONNX sessions handler for XTTSv2 modules
│   │            ├── xtts_tokenizer.py              # XTTSv2 BPE tokenizer
│   │            ├── xtts_tts_warmup.py             # Warmup handler for XTTSv2 ONNX sessions
│   │            └── zh_num2words.py                # XTTSv2 Chinese text normalization
│   ├── pipeline/
│   │   ├── orchestrator.py             # PipelineOrchestrator
│   │   └── config.py                   # PipelineConfig
│   └── server/
│       └── websocket_server.py         # WebSocket server
└── clients/
    ├── python_client.py                # Python CLI client
    └── web_client.html                 # Browser client

See ARCHITECTURE.md for the full concurrency diagram and queue map.

LICENSE and COPYRIGHT

This repository is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:

✅ Research and academic use
✅ Personal experimentation
✅ Open-source contributions
❌ Commercial applications
❌ Production deployment
❌ Monetized services

By: Patrick Lumbantobing

Copyright@VertoX-AI

Citation

If you use this system in your research, please cite:

@misc{vertoxai2026streamingspeechtranslation,
  title={Streaming Speech Translation — VertoX-AI},
  author={Tobing, P. L. and {VertoX-AI}},
  year={2026},
  publisher={HuggingFace},
}

Acknowledgments

NVIDIA for the NeMo cache-aware streaming ASR model
istupakov for the ONNX reference
Google for the TranslateGemma NMT model and Gemma-4
Coqui for XTTSv2
Qwen for Qwen3-TTS
