NVIDIA NeMo Toolkit Developer Docs
===================================

`NVIDIA NeMo Toolkit `_ is an open-source toolkit for speech, audio, and multimodal language model research, with a clear path from experimentation to production deployment.

Models
------

- **ASR:** `Parakeet `_, `Canary `_, FastConformer -- with CTC, Transducer, TDT, and hybrid decoders
- **TTS:** `MagpieTTS `_, `FastPitch `_ + `HiFi-GAN `_ -- multi-language, multi-speaker
- **Speaker:** `Sortformer `_ streaming diarization, `TitaNet `_ speaker recognition, `MarbleNet `_ VAD
- **Audio:** `Speech enhancement `_, source separation, neural audio codecs
- **SpeechLM2:** `Canary-Qwen 2.5B `_ (SALM), Duplex Speech-to-Speech -- HuggingFace Transformers backbone integration

Inference & Deployment
----------------------

- Streaming and real-time ASR with cache-aware Conformer
- GPU-accelerated decoding with `NGPU-LM `_ language model fusion
- Export to ONNX

Voice Agent
-----------

- Open-source conversational agent framework built on `Pipecat `_
- Streaming STT + LLM + TTS pipeline with natural turn-taking
- Live speaker diarization and tool calling support

----

NeMo is built for researchers and engineers. Each collection provides prebuilt, modular components that can be customized, extended, and composed -- from rapid prototyping to multi-node training to production inference.

`NVIDIA NeMo Toolkit `_ has separate collections for:

* :doc:`Automatic Speech Recognition (ASR) `
* :doc:`Text-to-Speech (TTS) `
* :doc:`Audio Processing