---
language:
- en
- es
- fr
- de
- it
- pt
- pl
- tr
- ru
- nl
- cs
- ar
- zh
- ja
- hu
- ko
- hi
tags:
- speech-translation
- streaming-speech-translation
- speech
- audio
- speech-recognition
- automatic-speech-recognition
- streaming-asr
- ASR
- NeMo
- ONNX
- cache-aware ASR
- FastConformer
- RNNT
- Parakeet
- neural-machine-translation
- NMT
- Translation
- gemma3
- llama-cpp
- GGUF
- conversational
- text-to-speech
- TTS
- xtts
- xttsv2
- voice-clone
- gpt2
- hifigan
- multilingual
- vq
- perceiver-encoder
- websocket
pipeline_tag: audio-to-audio
license: cc-by-nc-4.0
base_model: nvidia/nemotron-speech-streaming-en-0.6b
extra_gated_heading: "Access Streaming Speech Translation — Vertox-AI"
extra_gated_prompt: >-
  To access Streaming Speech Translation — Vertox-AI, you must review and agree
  to the CC BY-NC 4.0 license. By submitting this form, you confirm that you
  have read the license and will only use the model under its terms. Requests
  are processed immediately.
extra_gated_fields:
  Full name: text
  Affiliation: text
  Type of affiliation:
    type: select
    options:
      - Academia
      - Industry
      - label: Other
        value: other
  Institutional email (ideally matches your primary Hugging Face email): text
  Please briefly describe your intended research use: text
  I agree to the license and terms of use described above: checkbox
extra_gated_button_content: "Submit access request"
---
# Streaming Speech Translation Pipeline

Real-time English → Russian speech translation: **Audio In → ASR → NMT → TTS → Audio Out**

Translates spoken English into spoken Russian with streaming output over WebSocket.

Input is currently limited to English because of the NeMo ASR model. The output language depends on what TranslateGemma (NMT) and XTTSv2 (TTS) support, so you can change the target language by reconfiguring those two stages.
## Architecture

```
Audio Input → ASR (ONNX)      → NMT (GGUF)     → TTS (ONNX) → Audio Output
  (PCM16)     Conformer RNN-T   TranslateGemma   XTTSv2       (PCM16)
```

- **ASR**: NVIDIA NeMo FastConformer RNN-T (cache-aware streaming, ONNX)
- **NMT**: TranslateGemma 4B (GGUF Q8_0, llama-cpp-python) with streaming segmentation and translation merging
- **TTS**: XTTSv2 with GPT-2 AR model + HiFi-GAN vocoder (ONNX), 24 kHz output

See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed design documentation.
## Requirements

- Python 3.10+
- Model files:
  - ASR: NeMo FastConformer RNN-T ONNX model directory
  - NMT: TranslateGemma 4B GGUF file
  - TTS: XTTSv2 ONNX model directory, BPE vocab, mel normalization stats, reference audio

## Installation

```bash
pip install -r requirements.txt
```

### System Dependencies

```bash
# Ubuntu/Debian
apt-get install libsndfile1 libportaudio2
```
## Usage

### Start the Server

- On CPU, pass `--tts-int8-gpt` to use the INT8-quantized GPT model.
- Use at least an 8-core CPU (e.g., m8a.2xlarge) with the defaults `--nmt-n-threads 2` and `--tts-threads-gpt 1`.
- With a 16-core CPU (e.g., m8a.4xlarge), raise `--nmt-n-threads` to 4 and `--tts-threads-gpt` to 2 for smoother processing.

```bash
python app.py \
    --asr-onnx-path models/asr/nemo-cache-aware-streaming-560ms-onnx/ \
    --nmt-gguf-path models/nmt/translategemma-4b-it-q8_0-gguf/translategemma-4b-it-q8_0.gguf \
    --tts-model-dir models/tts/xttsv2-onnx/ \
    --tts-vocab-path models/tts/xttsv2-onnx/vocab.json \
    --tts-mel-norms-path models/tts/xttsv2-onnx/mel_stats.npy \
    --tts-ref-audio-path audio_ref/male_stewie.mp3 \
    --tts-int8-gpt \
    --host 0.0.0.0 \
    --port 8765
```
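The thread recommendations above can be turned into a small launcher helper. This is only a sketch: `suggest_thread_flags` and `to_cli_args` are hypothetical names that are not part of this repository, and the 16-core cutoff simply mirrors the two data points given above.

```python
import os


def suggest_thread_flags(cpu_cores: int) -> dict:
    """Map a core count to the thread flags recommended in this README.

    Hypothetical helper: only the 8-core and 16-core settings come from
    the README; everything below 16 cores falls back to the defaults.
    """
    if cpu_cores >= 16:
        return {"--nmt-n-threads": 4, "--tts-threads-gpt": 2}
    return {"--nmt-n-threads": 2, "--tts-threads-gpt": 1}


def to_cli_args(flags: dict) -> list:
    """Flatten {flag: value} into an argv-style list of strings."""
    args = []
    for name, value in flags.items():
        args += [name, str(value)]
    return args


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    # Prints the extra arguments to append to the `python app.py ...` command.
    print(" ".join(to_cli_args(suggest_thread_flags(cores))))
```

Machines with fewer than 8 cores still get the defaults here, although the README recommends 8 cores as the minimum.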
### CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--asr-onnx-path` | *(required)* | ASR ONNX model directory |
| `--asr-chunk-ms` | 10 | ASR audio chunk duration (ms) |
| `--asr-sample-rate` | 16000 | ASR expected sample rate (Hz) |
| `--nmt-gguf-path` | *(required)* | NMT GGUF model file |
| `--nmt-n-threads` | 2 | NMT CPU threads |
| `--tts-model-dir` | *(required)* | TTS ONNX model directory |
| `--tts-vocab-path` | *(required)* | TTS BPE vocab.json |
| `--tts-mel-norms-path` | *(required)* | TTS mel_stats.npy |
| `--tts-ref-audio-path` | *(required)* | TTS reference speaker audio |
| `--tts-language` | ru | TTS target language code |
| `--tts-int8-gpt` | False | Use INT8 quantized GPT |
| `--tts-threads-gpt` | 1 | TTS GPT ONNX threads |
| `--tts-chunk-size` | 20 | TTS AR tokens per vocoder chunk |
| `--audio-queue-max` | 256 | Audio input queue max size |
| `--text-queue-max` | 64 | Text queue max size |
| `--tts-queue-max` | 16 | NMT→TTS text queue max size |
| `--audio-out-queue-max` | 32 | Audio output queue max size |
| `--host` | 0.0.0.0 | Server bind host |
| `--port` | 8765 | Server port |
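The four `*-queue-max` flags bound the hand-off queues between stages. The sketch below shows how bounded queues can chain stages and apply backpressure, assuming an asyncio-style pipeline. `stage`, `run_demo`, and the string-tagging lambdas are illustrative stand-ins rather than the repository's orchestrator or models, and treating the text queue as the ASR→NMT hand-off is an assumption.

```python
import asyncio

# Default bounds copied from the CLI table above.
QUEUE_BOUNDS = {"audio": 256, "text": 64, "tts": 16, "audio_out": 32}


async def stage(inbox, outbox, work):
    """Pull items from inbox, transform them, and push the results downstream."""
    while True:
        item = await inbox.get()
        if item is None:  # sentinel: pass shutdown along and exit
            await outbox.put(None)
            return
        # put() suspends while outbox is full, which slows the upstream stage.
        await outbox.put(work(item))


async def run_demo(chunks):
    audio_q = asyncio.Queue(maxsize=QUEUE_BOUNDS["audio"])
    text_q = asyncio.Queue(maxsize=QUEUE_BOUNDS["text"])  # assumed ASR -> NMT
    tts_q = asyncio.Queue(maxsize=QUEUE_BOUNDS["tts"])    # NMT -> TTS
    out_q = asyncio.Queue(maxsize=QUEUE_BOUNDS["audio_out"])

    async def feed():
        for chunk in chunks:
            await audio_q.put(chunk)
        await audio_q.put(None)

    tasks = [
        asyncio.create_task(feed()),
        asyncio.create_task(stage(audio_q, text_q, lambda a: f"asr({a})")),
        asyncio.create_task(stage(text_q, tts_q, lambda t: f"nmt({t})")),
        asyncio.create_task(stage(tts_q, out_q, lambda t: f"tts({t})")),
    ]

    results = []
    while (item := await out_q.get()) is not None:
        results.append(item)
    await asyncio.gather(*tasks)
    return results


if __name__ == "__main__":
    print(asyncio.run(run_demo(["chunk0", "chunk1"])))
```

Smaller bounds trade throughput for lower end-to-end latency, because a slow stage stalls its producer sooner instead of letting stale items pile up.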
### Python Client

Captures microphone audio and plays back translated speech:

```bash
pip install -r requirements_client.txt
python clients/python_client.py --uri ws://localhost:8765
```

### Web Client

TBD
## WebSocket Protocol

| Direction | Type | Format | Description |
|-----------|------|--------|-------------|
| Client→ | Binary | PCM16 | Raw audio at declared sample rate |
| Client→ | Text | JSON | `{"action": "start", "sample_rate": 16000}` |
| Client→ | Text | JSON | `{"action": "stop"}` |
| →Client | Binary | PCM16 | Synthesized audio at 24 kHz |
| →Client | Text | JSON | `{"type": "transcript", "text": "..."}` |
| →Client | Text | JSON | `{"type": "translation", "text": "..."}` |
| →Client | Text | JSON | `{"type": "status", "status": "started"}` |
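A minimal client-side sketch of the message table above, for readers writing their own client. The helper names are hypothetical and this is not the code in `clients/python_client.py`. It makes three assumptions that the table does not state: PCM16 is little-endian, 10 ms frames are an acceptable send size, and the server closes the connection once it has finished after `stop`. The `websockets` package is a third-party dependency and is imported only inside `translate`.

```python
import json
import struct


def start_message(sample_rate: int = 16000) -> str:
    """Control message that declares the input sample rate."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})


def stop_message() -> str:
    return json.dumps({"action": "stop"})


def pcm16_chunks(samples, sample_rate=16000, chunk_ms=10):
    """Yield PCM16 byte frames of chunk_ms each from a list of int16 samples.

    Little-endian byte order is an assumption, not something the README states.
    """
    per_chunk = sample_rate * chunk_ms // 1000  # 160 samples at 16 kHz / 10 ms
    for i in range(0, len(samples), per_chunk):
        block = samples[i:i + per_chunk]
        yield struct.pack(f"<{len(block)}h", *block)


def classify(frame):
    """Return (kind, payload): binary frames are audio, text frames are JSON."""
    if isinstance(frame, (bytes, bytearray)):
        return "audio", bytes(frame)
    message = json.loads(frame)
    return message.get("type", "unknown"), message


async def translate(uri, samples):
    import websockets  # third-party: pip install websockets

    async with websockets.connect(uri) as ws:
        await ws.send(start_message())
        for chunk in pcm16_chunks(samples):
            await ws.send(chunk)
        await ws.send(stop_message())
        # Assumes the server closes the socket when done; otherwise add a timeout.
        async for frame in ws:
            kind, payload = classify(frame)
            if kind == "audio":
                print(f"received {len(payload)} bytes of 24 kHz PCM16")
            else:
                print(kind, payload)
```

To try it against a running server, call `asyncio.run(translate("ws://localhost:8765", samples))` with a list of 16-bit integer samples at 16 kHz.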
## Docker

```bash
docker build -t streaming-translation .

docker run -p 8765:8765 \
    -v /path/to/models:/models \
    streaming-translation \
    --asr-onnx-path /models/asr/ \
    --nmt-gguf-path /models/translategemma-4b-it-q8_0.gguf \
    --tts-model-dir /models/xtts/ \
    --tts-vocab-path /models/xtts/vocab.json \
    --tts-mel-norms-path /models/xtts/mel_stats.npy \
    --tts-ref-audio-path /models/reference.wav
```
## Project Structure

```
streaming_speech_translation/
├── app.py                              # Main entry point
├── requirements.txt
├── README.md
├── ARCHITECTURE.md
├── Dockerfile
├── models/
│   ├── asr/
│   │   └── nemo-cache-aware-streaming-560ms-onnx/
│   ├── nmt/
│   │   ├── translategemma-4b-it-q8_0-gguf/
│   │   └── translategemma-4b-it-q4_k_m-gguf/
│   └── tts/
│       └── xttsv2-onnx/
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py            # StreamingASR wrapper
│   │   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                  # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                    # Audio utilities
│   ├── nmt/
│   │   ├── streaming_nmt.py            # StreamingNMT wrapper
│   │   ├── streaming_segmenter.py      # Word-group segmentation
│   │   ├── streaming_translation_merger.py
│   │   └── translator_module.py        # TranslateGemma via llama-cpp
│   ├── tts/
│   │   ├── streaming_tts.py            # StreamingTTS wrapper
│   │   ├── xtts_streaming_pipeline.py  # Full TTS pipeline
│   │   ├── xtts_onnx_orchestrator.py   # GPT-2 AR + vocoder
│   │   ├── xtts_tokenizer.py           # BPE tokenizer
│   │   └── zh_num2words.py             # Chinese text normalization
│   ├── pipeline/
│   │   ├── orchestrator.py             # PipelineOrchestrator
│   │   └── config.py                   # PipelineConfig
│   └── server/
│       └── websocket_server.py         # WebSocket server
└── clients/
    ├── python_client.py                # Python CLI client
    └── web_client.html                 # Browser client
```
## TTS Threading Update (v2 Refactor)

The TTS integration has been revised to match the 3-thread ASR model.

### Previous design

Both GPT-2 AR generation and HiFi-GAN vocoding ran inside a single
`synthesize_stream()` call that was dispatched to the shared
`ThreadPoolExecutor`:

```
[orchestrator asyncio loop]
    └─ run_in_executor ──► synthesize_stream()
                               ├─ GPT-2 AR loop  (blocking)
                               └─ HiFi-GAN       (blocking)
```

This meant the executor slot was held for the entire TTS inference duration,
blocking NMT dispatches and delivering audio only after full-segment synthesis.

### New design

Two dedicated daemon threads decouple GPT generation from vocoding:

```
text ──► [TTS-GPT Thread] ──latent batches──► [TTS-Vocoder Thread] ──► audio
          BPE + AR loop                        HiFi-GAN + crossfade
```

The vocoder starts producing audio as soon as the first `gpt_chunk_size`
(default 20) AR tokens are generated, rather than waiting for the full segment.
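The two-thread hand-off can be sketched with the standard library alone. This is an illustration of the producer/consumer pattern under stated assumptions, not the code in `src/tts/`: `fake_ar_tokens` emits one placeholder token per character, the "vocoder" only reports batch sizes, and crossfading is omitted. The function and queue names are hypothetical; the default bounds mirror `--tts-text-queue-max` and `--tts-latent-queue-max`.

```python
import queue
import threading

_STOP = object()  # sentinel that shuts the threads down in order


def fake_ar_tokens(text):
    """Placeholder for BPE + GPT-2 AR decoding: one token per character."""
    return list(range(len(text)))


def gpt_worker(text_q, latent_q, chunk_size):
    """Stand-in for the TTS-GPT thread: emit batches of chunk_size tokens."""
    while (text := text_q.get()) is not _STOP:
        tokens = fake_ar_tokens(text)
        for i in range(0, len(tokens), chunk_size):
            # Blocks when latent_q is full, i.e. when the vocoder falls behind.
            latent_q.put(tokens[i:i + chunk_size])
    latent_q.put(_STOP)


def vocoder_worker(latent_q, audio_q):
    """Stand-in for the TTS-Vocoder thread: one audio piece per batch."""
    while (batch := latent_q.get()) is not _STOP:
        audio_q.put(f"audio for {len(batch)} tokens")
    audio_q.put(_STOP)


def synthesize(segments, text_queue_max=8, latent_queue_max=4, chunk_size=20):
    text_q = queue.Queue(maxsize=text_queue_max)
    latent_q = queue.Queue(maxsize=latent_queue_max)
    audio_q = queue.Queue()  # unbounded so this demo cannot deadlock
    threads = [
        threading.Thread(target=gpt_worker, args=(text_q, latent_q, chunk_size), daemon=True),
        threading.Thread(target=vocoder_worker, args=(latent_q, audio_q), daemon=True),
    ]
    for thread in threads:
        thread.start()
    for segment in segments:
        text_q.put(segment)
    text_q.put(_STOP)

    pieces = []
    while (piece := audio_q.get()) is not _STOP:
        pieces.append(piece)
    for thread in threads:
        thread.join()
    return pieces


if __name__ == "__main__":
    print(synthesize(["x" * 45]))
```

Because the first batch reaches the vocoder after `chunk_size` tokens, the first audio piece is available before the rest of the segment has been decoded.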
### New CLI flags

| Flag | Default | Description |
|------|---------|-------------|
| `--tts-text-queue-max` | 8 | Max segments in TTS text input queue |
| `--tts-latent-queue-max` | 4 | Max latent batches in TTS-GPT→Vocoder queue |

See [ARCHITECTURE.md](ARCHITECTURE.md) for the full concurrency diagram and queue map.
## License and Copyright

This repository is released under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). This means:

- ✅ Research and academic use
- ✅ Personal experimentation
- ✅ Open-source contributions
- ❌ Commercial applications
- ❌ Production deployment
- ❌ Monetized services

#### By: [Patrick Lumbantobing](https://www.linkedin.com/in/patrick-lumban-tobing)

#### Copyright © [VertoX-AI](https://www.linkedin.com/company/vertoxai/)
### Citation

If you use this system in your research, please cite:

```bibtex
@misc{vertoxai2026streamingspeechtranslation,
  title={Streaming Speech Translation --- VertoX-AI},
  author={Tobing, P. L. and {VertoX-AI}},
  year={2026},
  publisher={Hugging Face},
}
```
### Acknowledgments

- [NVIDIA](https://huggingface.co/nvidia) for the cache-aware streaming NeMo ASR model
- [istupakov](https://huggingface.co/istupakov) for the ONNX reference
- [Google](https://huggingface.co/google) for the TranslateGemma NMT model
- [Coqui](https://huggingface.co/coqui) for XTTSv2