🧠🦜 neuro-parakeet-mlx
Fine-tuned Parakeet TDT 0.6B model optimized for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.
A specialized automatic speech recognition (ASR) model based on the multilingual Parakeet TDT 0.6B base model and fine-tuned for German medical terminology, particularly neurology and neuro-oncology. The base model supports 25 languages; this fine-tuned version targets German medical speech. It runs on Apple Silicon through the MLX framework for fast, efficient inference on Mac hardware.
📋 Model Details
| Property | Value |
|---|---|
| Base Model | nvidia/parakeet-tdt-0.6b-v3 |
| Architecture | FastConformer-TDT (RNNT) |
| Parameters | 600 million (base model) |
| Model Size | 2.34 GB |
| Base Model Languages | Multilingual (25 languages) |
| Fine-tuned Language | German (de) - medical domain |
| Domain | Medical/Neurological (German) |
| Framework | MLX (Apple Silicon optimized) |
| Tokenizer | SentencePiece BPE (8,192 tokens, multilingual) |
| Fine-tuned on | NeurologyAI/neuro-whisper-v1 |
| License | CC-BY-4.0 |
✨ Key Features
- 🌍 Multilingual Base: Built on a multilingual model (25 languages) fine-tuned for German medical speech
- 🚀 Apple Silicon Optimized: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
- ⚡ Ultra-Fast Inference: Real-time factor of 0.042 (~24x faster than real-time) on M4
- 🏥 Medical Domain Specialized: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
- 🧠 Neurology Focus: Optimized for neurology and neuro-oncology terminology
- 🔌 OpenAI-Compatible API: Drop-in replacement for OpenAI Whisper API endpoints
- 📝 Automatic Formatting: Built-in punctuation and capitalization
- ⏱️ Timestamps: Word-level and segment-level timing information
- 🎙️ Long Audio Support: Handles extended recordings (up to 24 minutes with full attention)
- 🌐 Multiple Formats: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more
🚀 Quick Start
Installation
pip install parakeet-mlx
System Requirements
- Hardware: Apple Silicon (M1/M2/M3/M4) supported; tested and benchmarked on M4 only - performance on M1/M2/M3 may vary
- RAM: At least 4GB available (8GB+ recommended)
- Python: 3.8 or higher
- macOS: 12.0 or later
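The requirements above can be checked before installing anything. The following is a minimal sketch using only the standard library; the function name `check_requirements` is hypothetical, and the RAM requirement is not checked because the standard library has no portable way to query it.

```python
import platform
import sys


def check_requirements(min_python=(3, 8), min_macos=(12, 0)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {platform.python_version()}"
        )

    if platform.system() != "Darwin":
        problems.append(f"macOS required, found {platform.system() or 'unknown OS'}")
        return problems

    # mac_ver() returns a release string such as "14.5"; pad it to (major, minor).
    parts = [int(p) for p in platform.mac_ver()[0].split(".")[:2] if p.isdigit()]
    if parts:
        version = tuple(parts + [0])[:2]
        if version < min_macos:
            problems.append(
                f"macOS {min_macos[0]}.{min_macos[1]}+ required, "
                f"found {platform.mac_ver()[0]}"
            )

    # A Python build running under Rosetta reports "x86_64" even on Apple Silicon.
    if platform.machine() != "arm64":
        problems.append(f"Apple Silicon (arm64) required, found {platform.machine()}")

    return problems


for problem in check_requirements():
    print("WARNING:", problem)
```

On a supported machine the script prints nothing; each unmet requirement produces one warning line.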
Basic Usage
Using parakeet-mlx CLI
# Install
pip install -U parakeet-mlx
# Transcribe audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx
Using mlx-audio
# Install
pip install -U mlx-audio
# Transcribe audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav
Using Python API
from parakeet_mlx import from_pretrained

# Load model (first run will download ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, 'segments'):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
🌐 OpenAI-Compatible API Server
This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like open-webui and other OpenAI API clients.
Starting the Server
You can use the parakeet-mlx-server repository for a ready-to-use OpenAI-compatible API server:
# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx
Or run directly:
python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002
API Usage Examples
Using cURL
curl -X POST http://localhost:8002/v1/audio/transcriptions \
-H "Content-Type: multipart/form-data" \
-F "file=@audio.wav" \
-F "model=parakeet-tdt-0.6b-v3" \
-F "language=de"
Using Python
import requests

url = "http://localhost:8002/v1/audio/transcriptions"
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json"  # or "text" for plain text
}

# Open the file in a context manager so the handle is closed after the upload
with open("audio.wav", "rb") as audio_file:
    response = requests.post(url, files={"file": audio_file}, data=data)
response.raise_for_status()  # fail early on HTTP errors
result = response.json()
print(result["text"])

# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")
Response Format
The API returns JSON with the following structure:
{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
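Because each segment carries `start` and `end` times in seconds, a response of this shape can be turned into subtitles. The sketch below converts the `segments` list into SRT text; the helper names `format_srt_time` and `segments_to_srt` are hypothetical and not part of the server, and the example dictionary simply mirrors the structure shown above.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(response):
    """Build SRT text from the "segments" list of a transcription response."""
    blocks = []
    for index, seg in enumerate(response.get("segments", []), start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Blank lines between blocks separate the individual subtitles.
    return "\n".join(blocks)


example = {
    "text": "Segment text",
    "segments": [{"text": "Segment text", "start": 0.0, "end": 5.2}],
}
print(segments_to_srt(example))
```

A response without a `segments` key yields an empty string, so the helper is safe to call on plain-text style results as well.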
📁 Model Files
- model.safetensors: Model weights (2.34 GB)
- config.json: Model configuration (MLX-compatible)
- model_config.yaml: Original NeMo training configuration (reference)
🎵 Supported Audio Formats
The model supports various audio formats via librosa. Audio is automatically converted to the required format:
Supported Formats:
- WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA
Input Requirements:
- Sample Rate: Automatically resampled to 16 kHz
- Channels: Automatically converted to mono
- Format: Any of the supported formats above
Recommended:
- Clear audio with minimal background noise
- 16 kHz sample rate (or higher, will be downsampled)
- Mono channel audio
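To see whether a recording already matches the recommended 16 kHz mono layout, the header of a WAV file can be inspected with the standard library. This is a sketch under stated assumptions: `wav_matches_target` is a hypothetical helper, it only handles uncompressed PCM WAV files, and other formats would still go through the automatic conversion described above.

```python
import os
import tempfile
import wave


def wav_matches_target(path, target_rate=16_000, target_channels=1):
    """Return (matches, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return (rate == target_rate and channels == target_channels, rate, channels)


# Demo: write one second of 44.1 kHz stereo silence and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(2)  # 16-bit samples
        out.setframerate(44_100)
        out.writeframes(b"\x00" * 2 * 2 * 44_100)  # channels * bytes * frames
    print(wav_matches_target(demo_path))  # (False, 44100, 2)
```

A `False` result does not mean the file is unusable; it only indicates that resampling or downmixing will happen at load time.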
🎓 Training Details
Fine-tuning Configuration
This model was fine-tuned from nvidia/parakeet-tdt-0.6b-v3 using NVIDIA NeMo toolkit with the following parameters:
| Parameter | Value |
|---|---|
| Training Data | NeurologyAI/neuro-whisper-v1 |
| Training Samples | 47,596 (90% of dataset) |
| Validation Samples | 5,289 (10% of dataset) |
| Total Dataset Size | 52,885 samples |
| Total Audio Duration | 114.59 hours |
| Average Sample Duration | 7.80 seconds |
| Max Epochs | 7 |
| Learning Rate | 1e-4 |
| Weight Decay | 0.001 |
| Batch Size | 64 (per device) |
| Gradient Accumulation | 1 |
| Precision | BF16 mixed |
| Optimizer | AdamW |
| Scheduler | Cosine annealing with warmup |
| Warmup Ratio | 0.07 |
| Gradient Clipping | 1.0 |
| Validation Check Interval | Every 50 steps |
| Logging Frequency | Every 10 steps |
| Audio Duration Range | 0.1 - 20.0 seconds |
| Data Loader Workers | 8 |
| Pin Memory | True |
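The table values imply a rough step budget and learning-rate curve. The sketch below derives them under assumptions that the table does not state: a single device, the last partial batch kept, linear warmup, and a cosine decay to a minimum learning rate of 0. NeMo's own scheduler may differ in these details, so the numbers are illustrative only.

```python
import math

TRAIN_SAMPLES = 47_596
BATCH_SIZE = 64
MAX_EPOCHS = 7
PEAK_LR = 1e-4
WARMUP_RATIO = 0.07

# Assumption: one device, gradient accumulation of 1, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * MAX_EPOCHS
warmup_steps = round(WARMUP_RATIO * total_steps)


def learning_rate(step, min_lr=0.0):
    """Linear warmup to PEAK_LR, then cosine decay to min_lr (assumed to be 0)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))


print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup steps:    {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:5d}: lr = {learning_rate(step):.2e}")
```

With more than one device the per-epoch step count shrinks proportionally, which also shortens the warmup measured in steps.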
Training Dataset: NeurologyAI/neuro-whisper-v1
The neuro-whisper-v1 dataset is a comprehensive German medical speech recognition dataset specifically designed for neurology and neuro-oncology terminology.
Dataset Characteristics:
- Language: German (de)
- Domain: Neuro-oncology, neurology, medical terminology
- Total Samples: 52,885 audio-text pairs
- Total Duration: 114.59 hours of audio
- Audio Format: 16 kHz mono WAV
- Split: 90% training (47,596 samples) / 10% validation (5,289 samples)
Data Generation:
- Voice Data: Synthetically generated using Resemble AI Chatterbox TTS
- Text Data: Medical text generated with Qwen/Qwen3-30B-A3B-Instruct-2507
- Purpose: ASR training for medical transcription in neurology/neuro-oncology domains
Dataset Features:
- Consistent audio quality (uniform TTS generation)
- Comprehensive coverage of specialized medical terminology
- Privacy-safe (no real patient data)
- Optimized for German medical speech recognition
Base Model Training
The base model (nvidia/parakeet-tdt-0.6b-v3) was trained on:
- Primary Dataset: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
- Fine-tuning Dataset: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
- Includes: LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
- Training: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
- Tokenizer: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages
Training Infrastructure
- Framework: NVIDIA NeMo 2.4
- Training Hardware: NVIDIA A100 GPUs (Google Colab)
- Inference Hardware: Apple Silicon via MLX framework (tested on M4 only)
- Evaluation Hardware: Mac Mini M4 with 32 GB unified memory (VRAM) - only M4 was tested; results on M1/M2/M3 may vary
- Training Script: Based on NeMo ASR training examples with TDT configuration
- Checkpointing: Best model selected by validation WER (lowest WER), with top-2 checkpoints saved
- CUDA Graphs: Disabled for compatibility (use_cuda_graph_decoder = False)
- Decoding Strategy: Greedy batch decoding
🌍 CO2 Emission Related to Experiments
Fine-tuning was conducted using Google Cloud Platform in region europe-west3-a, which has a carbon efficiency of 0.61 kgCO₂eq/kWh. A cumulative 5 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250W).
Total emissions are estimated to be 0.76 kgCO₂eq, of which 100 percent was directly offset by the cloud provider.
Estimations were conducted using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
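The estimate can be reproduced from the three figures given above. This sketch assumes a single GPU drawing its full TDP for the whole run and ignores data-center overhead; both are simplifications rather than measured values.

```python
POWER_KW = 0.250          # A100 PCIe TDP of 250 W
HOURS = 5                 # cumulative computation time
CARBON_INTENSITY = 0.61   # kgCO2eq per kWh for the region

energy_kwh = POWER_KW * HOURS
emissions_kg = energy_kwh * CARBON_INTENSITY

print(f"{energy_kwh:.2f} kWh -> {emissions_kg:.2f} kgCO2eq")
```

The result, 1.25 kWh and about 0.76 kgCO₂eq, matches the reported total.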
Reference:
@article{lacoste2019quantifying,
title={Quantifying the Carbon Emissions of Machine Learning},
author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
journal={arXiv preprint arXiv:1910.09700},
year={2019}
}
🏥 Medical Terminology Coverage
This model is specifically optimized for German medical terminology, with enhanced accuracy for:
- Neuro-oncology: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
- Molecular Markers: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
- Treatment Modalities: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
- Diagnostic Imaging: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
- Clinical Assessments: Neurological examinations, cognitive assessments, motor function tests
- Medical Procedures: Biopsies, resections, craniotomies, ventriculostomies
📊 Performance Characteristics
Evaluation Results
The model was evaluated on the validation split of NeurologyAI/neuro-whisper-v1:
| Metric | Value |
|---|---|
| Word Error Rate (WER) | 1.04% (0.0104) |
| Real-Time Factor (RTF) | 0.042 (~24x faster than real-time) |
| Evaluation Dataset | NeurologyAI/neuro-whisper-v1 (validation) |
| Evaluation Samples | 5,289 samples |
| Total Audio Duration | 22,786.68 seconds (~6.3 hours) |
| Average Inference Time | 0.18 seconds per sample |
| Samples per Second | 5.46 samples/second |
| Evaluation Hardware | Mac Mini M4 with 32 GB unified memory (VRAM) |
Hardware Details:
- Device: Mac Mini M4 (only M4 was tested)
- Memory: 32 GB unified memory (VRAM)
- Framework: MLX (Apple Silicon optimized)
- Note: Performance metrics are specific to M4. Results on M1, M2, and M3 may vary.
With a WER of 1.04% on the validation set, the model performs well on German medical/neurological speech recognition. The real-time factor of 0.042 means that one second of audio takes about 0.042 seconds to transcribe on M4 hardware, roughly 24x faster than real-time, which suits both batch processing and real-time applications.
Note: All inference and validation were conducted exclusively on a Mac Mini M4 with 32 GB unified memory. Performance on M1, M2, and M3 chips may vary and has not been tested.
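The real-time factor is processing time divided by audio duration. As a consistency check, the sketch below recomputes it from the sample count, throughput, and total audio duration in the table; the small gap to the reported 0.042 comes from rounding in the published figures.

```python
TOTAL_AUDIO_SECONDS = 22_786.68
NUM_SAMPLES = 5_289
SAMPLES_PER_SECOND = 5.46

# Total wall-clock time implied by the reported throughput.
total_inference_seconds = NUM_SAMPLES / SAMPLES_PER_SECOND

rtf = total_inference_seconds / TOTAL_AUDIO_SECONDS
speedup = 1 / rtf

print(f"inference time: {total_inference_seconds:.1f} s")
print(f"RTF:            {rtf:.4f}")
print(f"speedup:        {speedup:.1f}x real-time")
```

An RTF below 1 means transcription finishes faster than the audio plays; the lower the value, the more headroom for batch jobs.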
Evaluation Tool: These results were obtained using the 🎤🏥🎯 Medical ASR Evaluator, a standalone tool for evaluating ASR models using the Word Error Rate (WER) metric, optimized for medical/clinical speech recognition.
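For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The following is a generic textbook sketch, not the code of the evaluation tool; any text normalization the tool applies (casing, punctuation) is unknown and not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "Das Glioblastom wurde reseziert"
hypothesis = "Das Glioblastom wurde resiziert"
print(word_error_rate(reference, hypothesis))  # 0.25 (one substitution in four words)
```

Corpus-level WER is usually computed by summing errors and reference words over all samples rather than averaging per-sample rates.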
Model Comparison
The following table compares WER results on the validation split of NeurologyAI/neuro-whisper-v1 (5,289 samples, ~6.3 hours of audio):
| Model | WER | Notes |
|---|---|---|
| NeurologyAI/neuro-parakeet-mlx (this model) | 1.04% | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |
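The gap between the models is easier to read as a relative error reduction. This short sketch computes it from the WER values in the table; the helper name `relative_reduction` is hypothetical.

```python
WER_PERCENT = {
    "NeurologyAI/neuro-parakeet-mlx": 1.04,
    "mlx-community/parakeet-tdt-0.6b-v3": 18.31,
    "mlx-community/whisper-large-v3-mlx": 13.96,
}


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved model."""
    return (baseline - improved) / baseline


fine_tuned = WER_PERCENT["NeurologyAI/neuro-parakeet-mlx"]
for name, wer in WER_PERCENT.items():
    if name == "NeurologyAI/neuro-parakeet-mlx":
        continue
    reduction = relative_reduction(wer, fine_tuned)
    print(f"vs {name}: {reduction:.1%} relative WER reduction")
```

Both comparisons come out above 90%, measured on the same validation split as the table.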
Performance by Domain
- ✅ Best Performance: German medical/neurological speech, clinical dictations
- ✅ Strong Performance: Medical reports, patient documentation, case presentations
- ✅ Good Performance: General German speech with medical context
- ⚠️ Limited Performance: Non-neurology domains, other languages (base model supports 25 languages, but this fine-tuned version is optimized for German), heavily accented speech
⚠️ Limitations
- Hardware: This model was tested and benchmarked exclusively on Apple Silicon M4. While it should work on M1/M2/M3 chips, performance metrics (RTF, speed) may vary and have not been validated on these earlier chips.
- Language: While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. Other languages may have reduced accuracy compared to the base model.
- Domain: Best results on German medical/neurological content; general speech may be less accurate
- Audio Quality: Works best with clear recordings; input is resampled to 16 kHz mono automatically, so supplying audio in that layout avoids conversion
- Length: Very long audio files may have reduced accuracy
- Accents: Performance may vary with regional accents or non-standard pronunciation
- Background Noise: Best results with minimal background noise and clear speech
📄 License
GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license, inherited from the base model nvidia/parakeet-tdt-0.6b-v3.
This model is ready for commercial and non-commercial use under the terms of the CC-BY-4.0 license.
📚 Citation
If you use this model in your research or applications, please cite:
@misc{neuro-parakeet-mlx,
title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
author={Riedemann, Lars},
year={2025},
publisher={Hugging Face},
note={Based on nvidia/parakeet-tdt-0.6b-v3},
howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}
Base Model Citation:
@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
title={Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
year={2025},
eprint={2509.14128},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.14128},
}
Training Dataset Citation:
@dataset{neuro-whisper-v1,
title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
author={Riedemann, Lars},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
language={de}
}
🙏 Acknowledgments
This model builds upon the excellent work of:
- NVIDIA: For the Parakeet TDT 0.6B v3 base model (CC-BY-4.0) - a 600-million-parameter multilingual ASR model with FastConformer-TDT architecture
- Granary Dataset: For providing the multilingual training corpus used in base model training
- NeMo ASR Set 3.0: For the high-quality, human-transcribed training data (10,000+ hours) used in base model training
- Fine-tuning Dataset: NeurologyAI/neuro-whisper-v1 (MIT License) - 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
- Qwen Team: For Qwen/Qwen3-30B-A3B-Instruct-2507 used for medical text generation in the training dataset
- ResembleAI: For Chatterbox TTS (MIT License) used for synthetic audio generation in the training dataset
- Apple MLX: For the MLX framework enabling efficient Apple Silicon optimization
- NVIDIA NeMo: For the NeMo toolkit version 2.4 used in model development and training
- Training Infrastructure: Google Colab with A100 GPUs for training
- Training Notebook: See Parakeet_Training_MLX.ipynb for the complete training pipeline
🔗 Related Resources
- API Server: parakeet-mlx-server - OpenAI-compatible FastAPI server for this model
- Base Model: nvidia/parakeet-tdt-0.6b-v3
- Training Dataset: NeurologyAI/neuro-whisper-v1 - German medical ASR dataset with 52,885 samples (114.59 hours, 90/10 train/val split) for neurology/neuro-oncology
- Data Generation: Resemble AI Chatterbox and Qwen3
- MLX Framework: Apple MLX
- NeMo Toolkit: NVIDIA NeMo
- Parakeet Collection: Parakeet Models on Hugging Face