---
license: cc-by-4.0
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- parakeet
- mlx
- german
- medical
- neurology
- apple-silicon
- multilingual
datasets:
- NeurologyAI/neuro-whisper-v1
language:
- de
- multilingual
metrics:
- wer
- cer
model-index:
- name: neuro-parakeet-mlx
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: NeurologyAI/neuro-whisper-v1
      type: NeurologyAI/neuro-whisper-v1
    metrics:
    - name: WER
      type: wer
      value: 0.0104
---

# 🧠🦜 neuro-parakeet-mlx

> Fine-tuned Parakeet TDT 0.6B model optimized for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.

A specialized automatic speech recognition (ASR) model based on the multilingual Parakeet TDT 0.6B base model, fine-tuned specifically for German medical terminology, particularly neurology and neuro-oncology. While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech recognition.

This model is optimized for Apple Silicon devices using the MLX framework, providing fast and efficient inference on Mac hardware.
## 📋 Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| **Architecture** | FastConformer-TDT (RNNT) |
| **Parameters** | 600 million (base model) |
| **Model Size** | 2.34 GB |
| **Base Model Languages** | Multilingual (25 languages) |
| **Fine-tuned Language** | German (de), medical domain |
| **Domain** | Medical/Neurological (German) |
| **Framework** | MLX (Apple Silicon optimized) |
| **Tokenizer** | SentencePiece BPE (8,192 tokens, multilingual) |
| **Fine-tuned on** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **License** | [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) |

## ✨ Key Features

- 🌍 **Multilingual Base**: Built on a multilingual model (25 languages) fine-tuned for German medical speech
- 🚀 **Apple Silicon Optimized**: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
- ⚡ **Ultra-Fast Inference**: Real-time factor of 0.042 (~24x faster than real-time) on M4
- 🏥 **Medical Domain Specialized**: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
- 🧠 **Neurology Focus**: Optimized for neurology and neuro-oncology terminology
- 🔌 **OpenAI-Compatible API**: Drop-in replacement for OpenAI Whisper API endpoints
- 📝 **Automatic Formatting**: Built-in punctuation and capitalization
- ⏱️ **Timestamps**: Word-level and segment-level timing information
- 🎙️ **Long Audio Support**: Handles extended recordings (up to 24 minutes with full attention)
- 🌐 **Multiple Formats**: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more

## 🚀 Quick Start

### Installation

```bash
pip install parakeet-mlx
```

### System Requirements

- **Hardware**: Apple Silicon (M1/M2/M3/M4) supported; **tested and benchmarked on M4 only**, so performance on M1/M2/M3 may vary
- **RAM**: At least 4 GB available (8 GB+ recommended)
- **Python**: 3.8 or higher
- **macOS**: 12.0 or later

### Basic Usage

#### Using parakeet-mlx CLI

```bash
# Install
pip install -U parakeet-mlx

# Transcribe audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx
```

#### Using mlx-audio

```bash
# Install
pip install -U mlx-audio

# Transcribe audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav
```

#### Using Python API

```python
from parakeet_mlx import from_pretrained

# Load model (first run will download ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, 'segments'):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
```

## 🌐 OpenAI-Compatible API Server

This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like [open-webui](https://github.com/open-webui/open-webui) and other OpenAI API clients.
### Starting the Server

You can use the [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server) repository for a ready-to-use OpenAI-compatible API server:

```bash
# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx
```

Or run directly:

```bash
python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002
```

### API Usage Examples

#### Using cURL

```bash
curl -X POST http://localhost:8002/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=parakeet-tdt-0.6b-v3" \
  -F "language=de"
```

#### Using Python

```python
import requests

url = "http://localhost:8002/v1/audio/transcriptions"
files = {"file": open("audio.wav", "rb")}
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json"  # or "text" for plain text
}

response = requests.post(url, files=files, data=data)
result = response.json()
print(result["text"])

# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")
```

#### Response Format

The API returns JSON with the following structure:

```json
{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
```

## 📁 Model Files

- `model.safetensors`: Model weights (2.34 GB)
- `config.json`: Model configuration (MLX-compatible)
- `model_config.yaml`: Original NeMo training configuration (reference)

## 🎵 Supported Audio Formats

The model supports various audio formats via librosa.
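To illustrate what this conversion amounts to (the model expects 16 kHz mono input), here is a hedged numpy sketch; the helper name is hypothetical, and the naive linear-interpolation resampling stands in for librosa's higher-quality filtered resampler:

```python
import numpy as np

TARGET_SR = 16_000  # sample rate the model expects

def to_model_input(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz via linear interpolation.

    Illustration only: librosa performs this with proper anti-aliasing
    filters; this sketch just shows the shape of the transformation.
    """
    if audio.ndim == 2:  # (samples, channels) -> mono by averaging channels
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        duration = audio.shape[0] / sr
        n_out = int(round(duration * TARGET_SR))
        old_t = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        audio = np.interp(new_t, old_t, audio)
    return audio.astype(np.float32)

# Example: one second of 44.1 kHz stereo becomes 16,000 mono samples
stereo = np.random.default_rng(0).standard_normal((44_100, 2))
mono16k = to_model_input(stereo, 44_100)
print(mono16k.shape)  # (16000,)
```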
Audio is automatically converted to the required format:

**Supported Formats:**

- WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA

**Input Requirements:**

- **Sample Rate**: Automatically resampled to 16 kHz
- **Channels**: Automatically converted to mono
- **Format**: Any of the supported formats above

**Recommended:**

- Clear audio with minimal background noise
- 16 kHz sample rate (or higher; will be downsampled)
- Mono channel audio

## 🎓 Training Details

### Fine-tuning Configuration

This model was fine-tuned from [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) using the NVIDIA NeMo toolkit with the following parameters:

| Parameter | Value |
|-----------|-------|
| **Training Data** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **Training Samples** | 47,596 (90% of dataset) |
| **Validation Samples** | 5,289 (10% of dataset) |
| **Total Dataset Size** | 52,885 samples |
| **Total Audio Duration** | 114.59 hours |
| **Average Sample Duration** | 7.80 seconds |
| **Max Epochs** | 7 |
| **Learning Rate** | 1e-4 |
| **Weight Decay** | 0.001 |
| **Batch Size** | 64 (per device) |
| **Gradient Accumulation** | 1 |
| **Precision** | BF16 mixed |
| **Optimizer** | AdamW |
| **Scheduler** | Cosine annealing with warmup |
| **Warmup Ratio** | 0.07 |
| **Gradient Clipping** | 1.0 |
| **Validation Check Interval** | Every 50 steps |
| **Logging Frequency** | Every 10 steps |
| **Audio Duration Range** | 0.1 - 20.0 seconds |
| **Data Loader Workers** | 8 |
| **Pin Memory** | True |

### Training Dataset: NeurologyAI/neuro-whisper-v1

The [neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) dataset is a comprehensive German medical speech recognition dataset specifically designed for neurology and neuro-oncology terminology.
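NeMo's ASR fine-tuning recipes consume JSON-lines manifests with `audio_filepath`, `duration`, and `text` fields. As a hedged sketch of producing a shuffled 90/10 train/validation split like the one used here (helper name, file names, and dummy entries are all illustrative, not the actual training code):

```python
import json
import random

def write_manifests(samples, train_path, val_path, val_fraction=0.10, seed=0):
    """Shuffle samples and write 90/10 NeMo-style JSON-lines manifests.

    Each sample is a dict with the fields NeMo's ASR recipes expect:
    audio_filepath, duration (seconds), and text.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_val = round(len(samples) * val_fraction)
    splits = {val_path: samples[:n_val], train_path: samples[n_val:]}
    for path, subset in splits.items():
        with open(path, "w", encoding="utf-8") as f:
            for s in subset:
                f.write(json.dumps(s, ensure_ascii=False) + "\n")
    return len(samples) - n_val, n_val

# Illustrative dummy data; real entries point at the dataset's 16 kHz WAV files
demo = [{"audio_filepath": f"clip_{i}.wav", "duration": 7.8,
         "text": "Beispieltext"} for i in range(100)]
n_train, n_val = write_manifests(demo, "train_manifest.json", "val_manifest.json")
print(n_train, n_val)  # 90 10
```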
**Dataset Characteristics:**

- **Language**: German (de)
- **Domain**: Neuro-oncology, neurology, medical terminology
- **Total Samples**: 52,885 audio-text pairs
- **Total Duration**: 114.59 hours of audio
- **Audio Format**: 16 kHz mono WAV
- **Split**: 90% training (47,596 samples) / 10% validation (5,289 samples)

**Data Generation:**

- **Voice Data**: Synthetically generated using [Resemble AI Chatterbox TTS](https://github.com/resemble-ai/chatterbox)
- **Text Data**: Medical text generated with [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Purpose**: ASR training for medical transcription in neurology/neuro-oncology domains

**Dataset Features:**

- Consistent audio quality (uniform TTS generation)
- Comprehensive coverage of specialized medical terminology
- Privacy-safe (no real patient data)
- Optimized for German medical speech recognition

### Base Model Training

The base model ([nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)) was trained on:

- **Primary Dataset**: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
- **Fine-tuning Dataset**: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
  - Includes: LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
- **Training**: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
- **Tokenizer**: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages

### Training Infrastructure

- **Framework**: NVIDIA NeMo 2.4
- **Training Hardware**: NVIDIA A100 GPUs (Google Colab)
- **Inference Hardware**: Apple Silicon via MLX framework (tested on M4 only)
- **Evaluation Hardware**: Mac Mini M4 with 32 GB unified memory; **only M4 was tested, results on M1/M2/M3 may vary**
- **Training Script**: Based on NeMo ASR training examples with TDT configuration
- **Checkpointing**: Best model selected by validation WER (lowest WER), with top-2 checkpoints saved
- **CUDA Graphs**: Disabled for compatibility (`use_cuda_graph_decoder = False`)
- **Decoding Strategy**: Greedy batch decoding

## 🌍 CO2 Emission Related to Experiments

Fine-tuning was conducted on Google Cloud Platform in region `europe-west3-a`, which has a carbon efficiency of 0.61 kgCO₂eq/kWh. A cumulative 5 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250 W). Total emissions are estimated at **0.76 kgCO₂eq**, 100 percent of which was directly offset by the cloud provider. Estimates were produced with the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

**Reference:**

```bibtex
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```

## 🏥 Medical Terminology Coverage

This model is specifically optimized for German medical terminology, with enhanced accuracy for:

- **Neuro-oncology**: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
- **Molecular Markers**: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
- **Treatment Modalities**: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
- **Diagnostic Imaging**: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
- **Clinical Assessments**: Neurological examinations, cognitive assessments, motor function tests
- **Medical Procedures**: Biopsies, resections, craniotomies, ventriculostomies

## 📊 Performance Characteristics

### Evaluation Results

The model was evaluated on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1):

| Metric | Value |
|--------|-------|
| **Word Error Rate (WER)** | **1.04%** (0.0104) |
| **Real-Time Factor (RTF)** | **0.042** (~24x faster than real-time) |
| **Evaluation Dataset** | NeurologyAI/neuro-whisper-v1 (validation) |
| **Evaluation Samples** | 5,289 samples |
| **Total Audio Duration** | 22,786.68 seconds (~6.3 hours) |
| **Average Inference Time** | 0.18 seconds per sample |
| **Samples per Second** | 5.46 samples/second |
| **Evaluation Hardware** | Mac Mini M4 with 32 GB unified memory |

**Hardware Details:**

- **Device**: Mac Mini M4 (only M4 was tested)
- **Memory**: 32 GB unified memory
- **Framework**: MLX (Apple Silicon optimized)
- **Note**: Performance metrics are specific to M4. Results on M1, M2, and M3 may vary.

This demonstrates excellent performance on German medical/neurological speech recognition, with a WER below 1.1% on the validation set. The real-time factor of 0.042 means the model processes audio approximately **24x faster than real-time** on M4 hardware, making it well suited to batch processing and real-time applications.

**Note:** All inference and validation were conducted exclusively on a Mac Mini M4 with 32 GB unified memory. Performance on M1, M2, and M3 chips may vary and has not been tested.

**Evaluation Tool:** These results were obtained using the [🎤🏥🎯 Medical ASR Evaluator](https://github.com/riedemannai/Medical_ASR_Evaluator), a standalone tool for evaluating ASR models with the Word Error Rate (WER) metric, optimized for medical/clinical speech recognition.
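For reference, WER is the word-level Levenshtein edit distance divided by the number of reference words, and RTF is inference time divided by audio duration. A minimal self-contained sketch (my own implementation, not the Medical ASR Evaluator's code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

print(wer("der Patient zeigt eine Hemiparese",
          "der Patient zeigt keine Hemiparese"))  # 0.2 (1 substitution / 5 words)

# The reported real-time factor follows from the evaluation numbers above:
avg_inference_s = 0.18
avg_audio_s = 22_786.68 / 5_289        # ~4.31 s of audio per validation sample
print(round(avg_inference_s / avg_audio_s, 3))  # 0.042
```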
### Model Comparison

The following table compares WER results on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (5,289 samples, ~6.3 hours of audio):

| Model | WER | Notes |
|-------|-----|-------|
| **NeurologyAI/neuro-parakeet-mlx** (this model) | **1.04%** | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |

### Performance by Domain

- ✅ **Best Performance**: German medical/neurological speech, clinical dictations
- ✅ **Strong Performance**: Medical reports, patient documentation, case presentations
- ✅ **Good Performance**: General German speech with medical context
- ⚠️ **Limited Performance**: Non-neurology domains, other languages (the base model supports 25 languages, but this fine-tuned version is optimized for German), heavily accented speech

## ⚠️ Limitations

- **Hardware**: This model was tested and benchmarked exclusively on Apple Silicon M4. While it should work on M1/M2/M3 chips, performance metrics (RTF, speed) may vary and have not been validated on these earlier chips.
- **Language**: While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. Other languages may have reduced accuracy compared to the base model.
- **Domain**: Best results on German medical/neurological content; general speech may be less accurate
- **Audio Quality**: Requires clear audio (16 kHz, mono) for optimal performance
- **Length**: Very long audio files may have reduced accuracy
- **Accents**: Performance may vary with regional accents or non-standard pronunciation
- **Background Noise**: Best results with minimal background noise and clear speech

## 📄 License

**GOVERNING TERMS**: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license, inherited from the base model [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).

This model is ready for **commercial and non-commercial use** under the terms of the CC-BY-4.0 license.

## 📚 Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{neuro-parakeet-mlx,
  title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  note={Based on nvidia/parakeet-tdt-0.6b-v3},
  howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}
```

**Base Model Citation:**

```bibtex
@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
  title={Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
  author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
  year={2025},
  eprint={2509.14128},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.14128},
}
```

**Training Dataset Citation:**

```bibtex
@dataset{neuro-whisper-v1,
  title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
  language={de}
}
```

## 🙏 Acknowledgments

This model builds upon the excellent work of:

- **NVIDIA**: For the [Parakeet TDT 0.6B v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) base model (CC-BY-4.0), a 600-million-parameter multilingual ASR model with FastConformer-TDT architecture
- **Granary Dataset**: For providing the multilingual training corpus used in base model training
- **NeMo ASR Set 3.0**: For the high-quality, human-transcribed training data used in base model training
- **Fine-tuning Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (MIT License), 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
- **Qwen Team**: For [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507), used for medical text generation in the training dataset
- **ResembleAI**: For [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) (MIT License), used for synthetic audio generation in the training dataset
- **Apple MLX**: For the [MLX framework](https://github.com/ml-explore/mlx) enabling efficient Apple Silicon optimization
- **NVIDIA NeMo**: For the [NeMo toolkit](https://github.com/NVIDIA/NeMo) version 2.4 used in model development and training
- **Training Infrastructure**: Google Colab with A100 GPUs
- **Training Notebook**: See [Parakeet_Training_MLX.ipynb](colab_notebooks/Parakeet_Training_MLX.ipynb) for the complete training pipeline

## 🔗 Related Resources

- **API Server**: [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server), an OpenAI-compatible FastAPI server for this model
- **Base Model**: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- **Training Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1), a German medical ASR dataset with 52,885 samples (114.59 hours, 90/10 train/val split) for neurology/neuro-oncology
- **Data Generation**: [Resemble AI Chatterbox](https://github.com/resemble-ai/chatterbox) and [Qwen3](https://github.com/QwenLM/Qwen3)
- **MLX Framework**: [Apple MLX](https://github.com/ml-explore/mlx)
- **NeMo Toolkit**: [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- **Parakeet Collection**: [Parakeet Models on Hugging Face](https://huggingface.co/collections/nvidia/parakeet)