	---
	language:
	- en
	- es
	- fr
	- de
	- it
	- pt
	- pl
	- tr
	- ru
	- nl
	- cs
	- ar
	- zh
	- ja
	- hu
	- ko
	- hi
	tags:
	- speech-translation
	- streaming-speech-translation
	- speech
	- audio
	- speech-recognition
	- automatic-speech-recognition
	- streaming-asr
	- ASR
	- NeMo
	- ONNX
	- cache-aware ASR
	- FastConformer
	- RNNT
	- Parakeet
	- neural-machine-translation
	- NMT
	- Translation
	- gemma3
	- llama-cpp
	- GGUF
	- conversational
	- text-to-speech
	- TTS
	- xtts
	- xttsv2
	- voice-clone
	- gpt2
	- hifigan
	- multilingual
	- vq
	- perceiver-encoder
	- websocket
	pipeline_tag: audio-to-audio
	license: cc-by-nc-4.0
	base_model: nvidia/nemotron-speech-streaming-en-0.6b
	extra_gated_heading: "Access Streaming Speech Translation — Vertox-AI"
	extra_gated_prompt: >-
	To access Streaming Speech Translation — Vertox-AI, you must review and agree
	to the CC BY-NC 4.0 license. By submitting this form, you confirm that you
	have read the license and will only use the model under its terms. Requests
	are processed immediately.
	extra_gated_fields:
	  Full name: text
	  Affiliation: text
	  Type of affiliation:
	    type: select
	    options:
	      - Academia
	      - Industry
	      - label: Other
	        value: other
	  Institutional email (ideally matches your primary Hugging Face email): text
	  Please briefly describe your intended research use: text
	  I agree to the license and terms of use described above: checkbox
	extra_gated_button_content: "Submit access request"
	---
	# Streaming Speech Translation Pipeline

	Real-time English → Russian speech translation: Audio In → ASR → NMT → TTS → Audio Out

	Translates spoken English into spoken Russian with streaming output over WebSocket.

	Input is currently limited to English, a constraint of the NeMo ASR model. The output language depends on TranslateGemma (NMT) and XTTSv2 (TTS); to target a different language, change the configuration of those two stages accordingly.

	## Architecture

	```
	Audio Input → ASR (ONNX)       → NMT (GGUF)     → TTS (ONNX) → Audio Output
	(PCM16)       Conformer RNN-T    TranslateGemma   XTTSv2       (PCM16)
	```

	- ASR: NVIDIA NeMo FastConformer RNN-T (cache-aware streaming, ONNX)
	- NMT: TranslateGemma 4B (GGUF Q8_0, llama-cpp-python) with streaming segmentation and translation merging
	- TTS: XTTSv2 with GPT-2 AR model + HiFi-GAN vocoder (ONNX), 24kHz output

	See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed design documentation.
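
The stage chain above can be pictured as workers connected by bounded queues. The following is only a minimal sketch of that idea, not the repository's orchestrator: `fake_asr`, `fake_nmt`, `fake_tts`, `stage`, and `run_pipeline` are hypothetical names, the stages are trivial stand-ins for the real models, and the queue sizes simply reuse the defaults listed under CLI Options.

```python
import asyncio

# Hypothetical stand-ins for the real ASR, NMT, and TTS stages.
def fake_asr(chunk: bytes) -> str:
    return f"word{len(chunk)}"

def fake_nmt(text: str) -> str:
    return text.upper()

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

_STOP = object()  # sentinel that tells each stage to shut down


async def stage(fn, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Pull items from inbox, transform them, and push results to outbox."""
    while True:
        item = await inbox.get()
        if item is _STOP:
            await outbox.put(_STOP)  # propagate shutdown downstream
            return
        # put() waits while outbox is full, which gives backpressure.
        await outbox.put(fn(item))


async def feed(chunks, audio_q: asyncio.Queue) -> None:
    """Push all input chunks, then the shutdown sentinel."""
    for chunk in chunks:
        await audio_q.put(chunk)
    await audio_q.put(_STOP)


async def run_pipeline(chunks: list[bytes]) -> list[bytes]:
    # Sizes mirror the documented queue defaults (assumed mapping).
    audio_q = asyncio.Queue(maxsize=256)
    text_q = asyncio.Queue(maxsize=64)
    tts_q = asyncio.Queue(maxsize=16)
    out_q = asyncio.Queue(maxsize=32)
    tasks = [
        asyncio.create_task(feed(chunks, audio_q)),
        asyncio.create_task(stage(fake_asr, audio_q, text_q)),
        asyncio.create_task(stage(fake_nmt, text_q, tts_q)),
        asyncio.create_task(stage(fake_tts, tts_q, out_q)),
    ]
    results = []
    while True:
        item = await out_q.get()
        if item is _STOP:
            break
        results.append(item)
    await asyncio.gather(*tasks)
    return results
```

Running the feeder as its own task keeps the sketch from stalling when the input is larger than the combined queue capacity.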

	## Requirements

	- Python 3.10+
	- Model files:
	  - ASR: NeMo FastConformer RNN-T ONNX model directory
	  - NMT: TranslateGemma 4B GGUF file
	  - TTS: XTTSv2 ONNX model directory, BPE vocab, mel normalization stats, reference audio

	## Installation

	```bash
	pip install -r requirements.txt
	```

	### System Dependencies

	```bash
	# Ubuntu/Debian
	apt-get install libsndfile1 libportaudio2
	```

	## Usage

	### Start the Server

	- On CPU, pass `--tts-int8-gpt` to use the INT8-quantized GPT model.
	- Use at least 8 CPU cores (e.g., m8a.2xlarge) with the defaults `--nmt-n-threads 2` and `--tts-threads-gpt 1`.
	- With 16 CPU cores (e.g., m8a.4xlarge), raise `--nmt-n-threads` to 4 and `--tts-threads-gpt` to 2 for smoother processing.

	```bash
	python app.py \
	  --asr-onnx-path models/asr/nemo-cache-aware-streaming-560ms-onnx/ \
	  --nmt-gguf-path models/nmt/translategemma-4b-it-q8_0-gguf/translategemma-4b-it-q8_0.gguf \
	  --tts-model-dir models/tts/xttsv2-onnx/ \
	  --tts-vocab-path models/tts/xttsv2-onnx/vocab.json \
	  --tts-mel-norms-path models/tts/xttsv2-onnx/mel_stats.npy \
	  --tts-ref-audio-path audio_ref/male_stewie.mp3 \
	  --tts-int8-gpt \
	  --host 0.0.0.0 \
	  --port 8765
	```
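
The core-count recommendations above could be scripted when launching the server on machines of different sizes. This is a small sketch under assumptions: `suggest_threads` and `to_cli_args` are hypothetical helpers, and the behavior for core counts other than 8 and 16 (anything below 16 falls back to the defaults) is a guess rather than something this README states.

```python
import os


def suggest_threads(cpu_cores: int) -> dict[str, int]:
    """Map a core count to the thread settings recommended in the README."""
    if cpu_cores >= 16:
        return {"nmt_n_threads": 4, "tts_threads_gpt": 2}
    # Assumption: below 16 cores, stay on the documented defaults.
    return {"nmt_n_threads": 2, "tts_threads_gpt": 1}


def to_cli_args(cpu_cores: int) -> list[str]:
    """Render the suggestion as flags that can be appended to the app.py command."""
    s = suggest_threads(cpu_cores)
    return [
        "--nmt-n-threads", str(s["nmt_n_threads"]),
        "--tts-threads-gpt", str(s["tts_threads_gpt"]),
    ]


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # cpu_count() may return None
    print(" ".join(to_cli_args(cores)))
```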

	### CLI Options

	| Flag | Default | Description |
	|------|---------|-------------|
	| `--asr-onnx-path` | (required) | ASR ONNX model directory |
	| `--asr-chunk-ms` | 10 | ASR audio chunk duration (ms) |
	| `--asr-sample-rate` | 16000 | ASR expected sample rate (Hz) |
	| `--nmt-gguf-path` | (required) | NMT GGUF model file |
	| `--nmt-n-threads` | 2 | NMT CPU threads |
	| `--tts-model-dir` | (required) | TTS ONNX model directory |
	| `--tts-vocab-path` | (required) | TTS BPE vocab.json |
	| `--tts-mel-norms-path` | (required) | TTS mel_stats.npy |
	| `--tts-ref-audio-path` | (required) | TTS reference speaker audio |
	| `--tts-language` | ru | TTS target language code |
	| `--tts-int8-gpt` | False | Use INT8 quantized GPT |
	| `--tts-threads-gpt` | 1 | TTS GPT ONNX threads |
	| `--tts-chunk-size` | 20 | TTS AR tokens per vocoder chunk |
	| `--audio-queue-max` | 256 | Audio input queue max size |
	| `--text-queue-max` | 64 | Text queue max size |
	| `--tts-queue-max` | 16 | NMT→TTS text queue max size |
	| `--audio-out-queue-max` | 32 | Audio output queue max size |
	| `--host` | 0.0.0.0 | Server bind host |
	| `--port` | 8765 | Server port |
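
To make the defaults concrete, here is a sketch of how a subset of these flags might be declared with `argparse`, followed by the chunk-size arithmetic they imply. It is not the parser in `app.py`: `build_parser` is a hypothetical name, treating `--tts-int8-gpt` as a boolean switch is inferred from its `False` default, and the byte count assumes mono audio.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented flags with their documented defaults."""
    p = argparse.ArgumentParser(description="Sketch of a subset of the server flags")
    p.add_argument("--asr-onnx-path", required=True)
    p.add_argument("--asr-chunk-ms", type=int, default=10)
    p.add_argument("--asr-sample-rate", type=int, default=16000)
    p.add_argument("--nmt-n-threads", type=int, default=2)
    p.add_argument("--tts-language", default="ru")
    p.add_argument("--tts-int8-gpt", action="store_true")  # assumed to be a switch
    p.add_argument("--host", default="0.0.0.0")
    p.add_argument("--port", type=int, default=8765)
    return p


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--asr-onnx-path", "models/asr/"])

# With the defaults, one ASR chunk is 10 ms of 16 kHz audio.
samples_per_chunk = args.asr_sample_rate * args.asr_chunk_ms // 1000
bytes_per_chunk = samples_per_chunk * 2  # PCM16 is 2 bytes per sample (mono assumed)
```

With the default values this works out to 160 samples, or 320 bytes, per chunk.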

	### Python Client

	Captures microphone audio and plays back translated speech:

	```bash
	pip install -r requirements_client.txt
	python clients/python_client.py --uri ws://localhost:8765
	```

	### Web Client

	TBD

	## WebSocket Protocol

	| Direction | Type | Format | Description |
	|-----------|------|--------|-------------|
	| Client→ | Binary | PCM16 | Raw audio at declared sample rate |
	| Client→ | Text | JSON | `{"action": "start", "sample_rate": 16000}` |
	| Client→ | Text | JSON | `{"action": "stop"}` |
	| →Client | Binary | PCM16 | Synthesized audio at 24kHz |
	| →Client | Text | JSON | `{"type": "transcript", "text": "..."}` |
	| →Client | Text | JSON | `{"type": "translation", "text": "..."}` |
	| →Client | Text | JSON | `{"type": "status", "status": "started"}` |
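
The message shapes in the table can be built and parsed with the standard library alone. The helpers below are a hedged sketch, not code from `clients/python_client.py`: all function names are hypothetical, and little-endian mono PCM16 is an assumption, since the table does not state byte order or channel count.

```python
import json
import struct


def start_message(sample_rate: int = 16000) -> str:
    """Text frame that opens a session and declares the input sample rate."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})


def stop_message() -> str:
    """Text frame that ends the session."""
    return json.dumps({"action": "stop"})


def floats_to_pcm16(samples: list[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM (little-endian assumed)."""
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def parse_server_frame(frame):
    """Classify a frame received from the server as (kind, payload)."""
    if isinstance(frame, (bytes, bytearray)):
        # Binary frames carry synthesized PCM16 audio at 24 kHz.
        return ("audio", bytes(frame))
    msg = json.loads(frame)
    kind = msg.get("type")
    if kind in ("transcript", "translation"):
        return (kind, msg["text"])
    if kind == "status":
        return ("status", msg["status"])
    return ("unknown", msg)


def pcm16_duration_seconds(data: bytes, sample_rate: int = 24000) -> float:
    """Duration of a mono PCM16 buffer: 2 bytes per sample."""
    return len(data) / 2 / sample_rate
```

A client would send `start_message()` as a text frame, stream `floats_to_pcm16(...)` buffers as binary frames, and pass every received frame through `parse_server_frame` using whichever WebSocket library it prefers.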

	## Docker

	```bash
	docker build -t streaming-translation .
	docker run -p 8765:8765 \
	  -v /path/to/models:/models \
	  streaming-translation \
	  --asr-onnx-path /models/asr/ \
	  --nmt-gguf-path /models/translategemma-4b-it-q8_0.gguf \
	  --tts-model-dir /models/xtts/ \
	  --tts-vocab-path /models/xtts/vocab.json \
	  --tts-mel-norms-path /models/xtts/mel_stats.npy \
	  --tts-ref-audio-path /models/reference.wav
	```

	## Project Structure

	```
	streaming_speech_translation/
	├── app.py                                  # Main entry point
	├── requirements.txt
	├── README.md
	├── ARCHITECTURE.md
	├── Dockerfile
	├── models/
	│   ├── asr/
	│   │   └── nemo-cache-aware-streaming-560ms-onnx/
	│   ├── nmt/
	│   │   ├── translategemma-4b-it-q8_0-gguf/
	│   │   └── translategemma-4b-it-q4_k_m-gguf/
	│   └── tts/
	│       └── xttsv2-onnx/
	├── src/
	│   ├── asr/
	│   │   ├── streaming_asr.py                # StreamingASR wrapper
	│   │   ├── cache_aware_modules.py          # Audio buffer + streaming ASR
	│   │   ├── cache_aware_modules_config.py
	│   │   ├── modules.py                      # ONNX model loading
	│   │   ├── modules_config.py
	│   │   ├── onnx_utils.py
	│   │   └── utils.py                        # Audio utilities
	│   ├── nmt/
	│   │   ├── streaming_nmt.py                # StreamingNMT wrapper
	│   │   ├── streaming_segmenter.py          # Word-group segmentation
	│   │   ├── streaming_translation_merger.py
	│   │   └── translator_module.py            # TranslateGemma via llama-cpp
	│   ├── tts/
	│   │   ├── streaming_tts.py                # StreamingTTS wrapper
	│   │   ├── xtts_streaming_pipeline.py      # Full TTS pipeline
	│   │   ├── xtts_onnx_orchestrator.py       # GPT-2 AR + vocoder
	│   │   ├── xtts_tokenizer.py               # BPE tokenizer
	│   │   └── zh_num2words.py                 # Chinese text normalization
	│   ├── pipeline/
	│   │   ├── orchestrator.py                 # PipelineOrchestrator
	│   │   └── config.py                       # PipelineConfig
	│   └── server/
	│       └── websocket_server.py             # WebSocket server
	└── clients/
	    ├── python_client.py                    # Python CLI client
	    └── web_client.html                     # Browser client
	```

	## TTS Threading Update (v2 Refactor)

	The TTS integration has been revised to match the 3-thread ASR model.

	### Previous design

	Both GPT-2 AR generation and HiFi-GAN vocoding ran inside a single
	`synthesize_stream()` call that was dispatched to the shared
	`ThreadPoolExecutor`:

	```
	[orchestrator asyncio loop]
	    └─ run_in_executor ──► synthesize_stream()
	                               ├─ GPT-2 AR loop  (blocking)
	                               └─ HiFi-GAN       (blocking)
	```

	This meant the executor slot was held for the entire TTS inference duration,
	blocking NMT dispatches and delivering audio only after full-segment synthesis.

	### New design

	Two dedicated daemon threads decouple GPT generation from vocoding:

	```
	text ──► [TTS-GPT Thread] ──latent batches──► [TTS-Vocoder Thread] ──► audio
	          BPE + AR loop                        HiFi-GAN + crossfade
	```

	The vocoder starts producing audio as soon as the first `gpt_chunk_size`
	(default 20) AR tokens are generated, rather than waiting for the full segment.
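
The hand-off between the two threads can be sketched with the standard `queue` and `threading` modules. This is an illustration under assumptions rather than the repository's implementation: `fake_ar_tokens`, `gpt_worker`, `vocoder_worker`, and `synthesize` are hypothetical names, the token generator and the "vocoder" are trivial stand-ins, and the queue sizes reuse the documented defaults of `--tts-text-queue-max` (8) and `--tts-latent-queue-max` (4).

```python
import queue
import threading

GPT_CHUNK_SIZE = 20  # AR tokens per vocoder chunk (documented default)
_DONE = object()     # end-of-stream sentinel


def fake_ar_tokens(text: str):
    """Hypothetical stand-in for the GPT-2 AR loop: yields one token per step."""
    for i in range(len(text) * 5):
        yield i


def gpt_worker(text_q: queue.Queue, latent_q: queue.Queue) -> None:
    while True:
        text = text_q.get()
        if text is _DONE:
            latent_q.put(_DONE)  # tell the vocoder thread to finish
            return
        batch = []
        for tok in fake_ar_tokens(text):
            batch.append(tok)
            if len(batch) == GPT_CHUNK_SIZE:
                latent_q.put(batch)  # hand off as soon as a chunk is ready
                batch = []
        if batch:
            latent_q.put(batch)  # flush the final partial chunk


def vocoder_worker(latent_q: queue.Queue, audio_out: list) -> None:
    while True:
        batch = latent_q.get()
        if batch is _DONE:
            return
        # Stand-in for HiFi-GAN + crossfade: two zero bytes per token.
        audio_out.append(b"\x00\x00" * len(batch))


def synthesize(texts: list[str]) -> list[bytes]:
    text_q = queue.Queue(maxsize=8)
    latent_q = queue.Queue(maxsize=4)
    audio_out: list[bytes] = []
    threads = [
        threading.Thread(target=gpt_worker, args=(text_q, latent_q), daemon=True),
        threading.Thread(target=vocoder_worker, args=(latent_q, audio_out), daemon=True),
    ]
    for t in threads:
        t.start()
    for text in texts:
        text_q.put(text)
    text_q.put(_DONE)
    for t in threads:
        t.join()
    return audio_out
```

In this sketch a 25-token segment reaches the vocoder as one full chunk of 20 tokens followed by a partial chunk of 5, so the first audio is available before generation of the segment has finished.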

	### New CLI flags

	| Flag | Default | Description |
	|------|---------|-------------|
	| `--tts-text-queue-max` | 8 | Max segments in TTS text input queue |
	| `--tts-latent-queue-max` | 4 | Max latent batches in TTS-GPT→Vocoder queue |

	See [ARCHITECTURE.md](ARCHITECTURE.md) for the full concurrency diagram and queue map.

	### LICENSE and COPYRIGHT

	This repository is released under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0). The permitted (✅) and prohibited (❌) uses are:
	- ✅ Research and academic use
	- ✅ Personal experimentation
	- ✅ Open-source contributions
	- ❌ Commercial applications
	- ❌ Production deployment
	- ❌ Monetized services

	#### By: [Patrick Lumbantobing](https://www.linkedin.com/in/patrick-lumban-tobing)

	#### Copyright@[VertoX-AI](https://www.linkedin.com/company/vertoxai/)

	### Citation

	If you use this system in your research, please cite:

	```bibtex
	@misc{vertoxai2026streamingspeechtranslation,
	title={Streaming Speech Translation — VertoX-AI},
	author={Tobing, P. L. and {VertoX-AI}},
	year={2026},
	publisher={Hugging Face},
	}
	```

	### Acknowledgments

	- [NVIDIA](https://huggingface.co/nvidia) for the NeMo cache-aware streaming ASR model
	- [istupakov](https://huggingface.co/istupakov) for the ONNX reference implementation
	- [Google](https://huggingface.co/google) for the TranslateGemma NMT model
	- [Coqui](https://huggingface.co/coqui) for the XTTSv2 TTS model