# neuro-parakeet-mlx
Fine-tuned Parakeet TDT 0.6B model for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.

A specialized automatic speech recognition (ASR) model based on the multilingual Parakeet TDT 0.6B base model, fine-tuned for German medical terminology, particularly neurology and neuro-oncology. While the base model supports 25 languages, this fine-tuned version targets German medical speech recognition. It runs on Apple Silicon devices via the MLX framework, providing fast and efficient inference on Mac hardware.
## Model Details
| Property | Value |
|---|---|
| Base Model | nvidia/parakeet-tdt-0.6b-v3 |
| Architecture | FastConformer-TDT (RNNT) |
| Parameters | 600 million (base model) |
| Model Size | 2.34 GB |
| Base Model Languages | Multilingual (25 languages) |
| Fine-tuned Language | German (de) - medical domain |
| Domain | Medical/Neurological (German) |
| Framework | MLX (Apple Silicon optimized) |
| Tokenizer | SentencePiece BPE (8,192 tokens, multilingual) |
| Fine-tuned on | NeurologyAI/neuro-whisper-v1 |
| License | CC-BY-4.0 |
## Key Features
- Multilingual Base: Built on a multilingual model (25 languages), fine-tuned for German medical speech
- Apple Silicon Optimized: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
- Ultra-Fast Inference: Real-time factor of 0.042 (~24x faster than real time) on M4
- Medical Domain Specialized: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
- Neurology Focus: Optimized for neurology and neuro-oncology terminology
- OpenAI-Compatible API: Drop-in replacement for OpenAI Whisper API endpoints
- Automatic Formatting: Built-in punctuation and capitalization
- Timestamps: Word-level and segment-level timing information
- Long Audio Support: Handles extended recordings (up to 24 minutes with full attention)
- Multiple Formats: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more
## Quick Start
### Installation

```bash
pip install parakeet-mlx
```
### System Requirements
- Hardware: Apple Silicon (M1/M2/M3/M4) supported; tested and benchmarked on M4 only, so performance on M1/M2/M3 may vary
- RAM: At least 4GB available (8GB+ recommended)
- Python: 3.8 or higher
- macOS: 12.0 or later
### Basic Usage
#### Using the parakeet-mlx CLI

```bash
# Install
pip install -U parakeet-mlx

# Transcribe an audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx
```
#### Using mlx-audio

```bash
# Install
pip install -U mlx-audio

# Transcribe an audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav
```
#### Using the Python API

```python
from parakeet_mlx import from_pretrained

# Load the model (first run downloads ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe an audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, "segments"):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
```
## OpenAI-Compatible API Server
This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like open-webui and other OpenAI API clients.
### Starting the Server
You can use the parakeet-mlx-server repository for a ready-to-use OpenAI-compatible API server:

```bash
# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx
```

Or run directly:

```bash
python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002
```
### API Usage Examples
#### Using cURL

```bash
# Note: do not set the Content-Type header manually; curl generates the
# correct multipart/form-data header (including the boundary) for -F fields.
curl -X POST http://localhost:8002/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=parakeet-tdt-0.6b-v3" \
  -F "language=de"
```
#### Using Python

```python
import requests

url = "http://localhost:8002/v1/audio/transcriptions"
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json",  # or "text" for plain text
}

# Open the file in a context manager so the handle is closed after upload
with open("audio.wav", "rb") as f:
    response = requests.post(url, files={"file": f}, data=data)
response.raise_for_status()

result = response.json()
print(result["text"])

# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")
```
### Response Format

The API returns JSON with the following structure:

```json
{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
```
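The `segments` array maps directly onto subtitle formats. As a quick illustration (the helpers `srt_timestamp` and `segments_to_srt` are hypothetical, not part of the server), a sketch that renders a response's segments as SubRip (SRT) text:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms_total = int(round(seconds * 1000))
    h, rem = divmod(ms_total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of {'text', 'start', 'end'} dicts as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

print(segments_to_srt([{"text": "Segment text", "start": 0.0, "end": 5.2}]))
```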
## Model Files
- `model.safetensors`: Model weights (2.34 GB)
- `config.json`: Model configuration (MLX-compatible)
- `model_config.yaml`: Original NeMo training configuration (reference)
## Supported Audio Formats
The model supports various audio formats via librosa. Audio is automatically converted to the required format:
Supported Formats:
- WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA
Input Requirements:
- Sample Rate: Automatically resampled to 16 kHz
- Channels: Automatically converted to mono
- Format: Any of the supported formats above
Recommended:
- Clear audio with minimal background noise
- 16 kHz sample rate (or higher, will be downsampled)
- Mono channel audio
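In practice librosa performs this conversion automatically, but the two steps involved (downmix to mono, resample to 16 kHz) can be illustrated with a stdlib-only sketch. `downmix_to_mono` and `resample_linear` are illustrative helpers, not part of the model's tooling; a real pipeline should use librosa or a proper polyphase resampler to avoid aliasing:

```python
def downmix_to_mono(samples, n_channels):
    """Average interleaved integer samples across channels."""
    if n_channels == 1:
        return list(samples)
    return [sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)]

def resample_linear(samples, sr_in, sr_out):
    """Naive linear-interpolation resampler (illustration only)."""
    if sr_in == sr_out:
        return list(samples)
    n_out = int(len(samples) * sr_out / sr_in)
    out = []
    for i in range(n_out):
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(samples[lo] * (1 - frac) + samples[hi] * frac))
    return out

# Interleaved stereo frames [L, R, L, R] -> mono, then 44.1 kHz -> 16 kHz
mono = downmix_to_mono([100, 200, 300, 400], n_channels=2)
resampled = resample_linear(list(range(441)), 44_100, 16_000)
```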
## Training Details
### Fine-tuning Configuration
This model was fine-tuned from nvidia/parakeet-tdt-0.6b-v3 using the NVIDIA NeMo toolkit with the following parameters:
| Parameter | Value |
|---|---|
| Training Data | NeurologyAI/neuro-whisper-v1 |
| Training Samples | 47,596 (90% of dataset) |
| Validation Samples | 5,289 (10% of dataset) |
| Total Dataset Size | 52,885 samples |
| Total Audio Duration | 114.59 hours |
| Average Sample Duration | 7.80 seconds |
| Max Epochs | 7 |
| Learning Rate | 1e-4 |
| Weight Decay | 0.001 |
| Batch Size | 64 (per device) |
| Gradient Accumulation | 1 |
| Precision | BF16 mixed |
| Optimizer | AdamW |
| Scheduler | Cosine annealing with warmup |
| Warmup Ratio | 0.07 |
| Gradient Clipping | 1.0 |
| Validation Check Interval | Every 50 steps |
| Logging Frequency | Every 10 steps |
| Audio Duration Range | 0.1 - 20.0 seconds |
| Data Loader Workers | 8 |
| Pin Memory | True |
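The scheduler's warmup length implied by this table can be back-calculated. This is a sketch under assumptions the table does not state (a single device and a kept partial final batch):

```python
# Figures from the fine-tuning table above.
train_samples = 47_596
batch_size = 64
max_epochs = 7
warmup_ratio = 0.07

# Assumption: single device, ceil division (partial last batch kept).
steps_per_epoch = -(-train_samples // batch_size)  # ceil(47596 / 64)
total_steps = steps_per_epoch * max_epochs
warmup_steps = round(warmup_ratio * total_steps)

print(steps_per_epoch, total_steps, warmup_steps)
```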
### Training Dataset: NeurologyAI/neuro-whisper-v1
The neuro-whisper-v1 dataset is a comprehensive German medical speech recognition dataset specifically designed for neurology and neuro-oncology terminology.
Dataset Characteristics:
- Language: German (de)
- Domain: Neuro-oncology, neurology, medical terminology
- Total Samples: 52,885 audio-text pairs
- Total Duration: 114.59 hours of audio
- Audio Format: 16 kHz mono WAV
- Split: 90% training (47,596 samples) / 10% validation (5,289 samples)
Data Generation:
- Voice Data: Synthetically generated using Resemble AI Chatterbox TTS
- Text Data: Medical text generated with Qwen/Qwen3-30B-A3B-Instruct-2507
- Purpose: ASR training for medical transcription in neurology/neuro-oncology domains
Dataset Features:
- Consistent audio quality (uniform TTS generation)
- Comprehensive coverage of specialized medical terminology
- Privacy-safe (no real patient data)
- Optimized for German medical speech recognition
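The 90/10 split quoted above can be sketched as a deterministic shuffle-split. The `split_manifest` helper and the fixed seed are assumptions for illustration, not the dataset's actual preparation script:

```python
import random

def split_manifest(entries, train_frac=0.9, seed=42):
    """Shuffle manifest entries deterministically and split into train/validation."""
    rng = random.Random(seed)
    entries = list(entries)
    rng.shuffle(entries)
    n_train = int(train_frac * len(entries))
    return entries[:n_train], entries[n_train:]

# With the dataset's 52,885 samples this reproduces the 47,596 / 5,289 split sizes.
train, val = split_manifest(range(52_885))
print(len(train), len(val))
```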
### Base Model Training
The base model (nvidia/parakeet-tdt-0.6b-v3) was trained on:
- Primary Dataset: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
- Fine-tuning Dataset: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
- Includes: LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
- Training: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
- Tokenizer: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages
### Training Infrastructure
- Framework: NVIDIA NeMo 2.4
- Training Hardware: NVIDIA A100 GPUs (Google Colab)
- Inference Hardware: Apple Silicon via MLX framework (tested on M4 only)
- Evaluation Hardware: Mac Mini M4 with 32 GB unified memory; only the M4 was tested, and results on M1/M2/M3 may vary
- Training Script: Based on NeMo ASR training examples with TDT configuration
- Checkpointing: Best model selected by validation WER (lowest WER), with top-2 checkpoints saved
- CUDA Graphs: Disabled for compatibility (use_cuda_graph_decoder = False)
- Decoding Strategy: Greedy batch decoding
## Medical Terminology Coverage
This model is specifically optimized for German medical terminology, with enhanced accuracy for:
- Neuro-oncology: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
- Molecular Markers: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
- Treatment Modalities: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
- Diagnostic Imaging: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
- Clinical Assessments: Neurological examinations, cognitive assessments, motor function tests
- Medical Procedures: Biopsies, resections, craniotomies, ventriculostomies
## Performance Characteristics
### Evaluation Results
The model was evaluated on the validation split of NeurologyAI/neuro-whisper-v1:
| Metric | Value |
|---|---|
| Word Error Rate (WER) | 1.04% (0.0104) |
| Real-Time Factor (RTF) | 0.042 (~24x faster than real-time) |
| Evaluation Dataset | NeurologyAI/neuro-whisper-v1 (validation) |
| Evaluation Samples | 5,289 samples |
| Total Audio Duration | 22,786.68 seconds (~6.3 hours) |
| Average Inference Time | 0.18 seconds per sample |
| Samples per Second | 5.46 samples/second |
| Evaluation Hardware | Mac Mini M4 with 32 GB unified memory |
Hardware Details:
- Device: Mac Mini M4 (only the M4 was tested)
- Memory: 32 GB unified memory
- Framework: MLX (Apple Silicon optimized)
- Note: Performance metrics are specific to the M4; results on M1, M2, and M3 may vary.

This demonstrates strong performance on German medical/neurological speech recognition, with a WER below 1.1% on the validation set. The real-time factor of 0.042 means the model processes audio roughly 24x faster than real time on M4 hardware, making it well suited to both batch processing and real-time applications. All inference and validation were conducted exclusively on the Mac Mini M4; performance on M1, M2, and M3 chips has not been tested.
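The throughput figures in the table are mutually consistent; a quick back-calculation of the real-time factor from the reported sample count, audio duration, and samples per second agrees with the stated 0.042 to rounding:

```python
# Figures from the evaluation table above.
eval_samples = 5_289
total_audio_s = 22_786.68
samples_per_second = 5.46

total_inference_s = eval_samples / samples_per_second  # total compute time
rtf = total_inference_s / total_audio_s                # real-time factor (~0.042)
speedup = 1 / rtf                                      # times faster than real time

print(f"RTF ~= {rtf:.4f}, ~{speedup:.1f}x faster than real time")
```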
### Model Comparison
The following table compares WER results on the validation split of NeurologyAI/neuro-whisper-v1 (5,289 samples, ~6.3 hours of audio):
| Model | WER | Notes |
|---|---|---|
| NeurologyAI/neuro-parakeet-mlx (this model) | 1.04% | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |
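For reference, WER as used in this comparison is the word-level edit distance divided by the reference length. A minimal generic implementation (this is not the evaluation script actually used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference and first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / len(r)

# One substituted word out of five -> WER 0.2
print(wer("der Patient zeigt eine Hemiparese",
          "der Patient zeigt keine Hemiparese"))
```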
### Performance by Domain
- Best Performance: German medical/neurological speech, clinical dictations
- Strong Performance: Medical reports, patient documentation, case presentations
- Good Performance: General German speech with medical context
- Limited Performance: Non-neurology domains, other languages (the base model supports 25 languages, but this fine-tuned version is optimized for German), heavily accented speech
## Limitations
- Hardware: This model was tested and benchmarked exclusively on Apple Silicon M4. While it should work on M1/M2/M3 chips, performance metrics (RTF, speed) may vary and have not been validated on these earlier chips.
- Language: While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. Other languages may have reduced accuracy compared to the base model.
- Domain: Best results on German medical/neurological content; general speech may be less accurate
- Audio Quality: Requires clear audio (16 kHz, mono) for optimal performance
- Length: Very long audio files may have reduced accuracy
- Accents: Performance may vary with regional accents or non-standard pronunciation
- Background Noise: Best results with minimal background noise and clear speech
## License
GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license, inherited from the base model nvidia/parakeet-tdt-0.6b-v3.
This model is ready for commercial and non-commercial use under the terms of the CC-BY-4.0 license.
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{neuro-parakeet-mlx,
  title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  note={Based on nvidia/parakeet-tdt-0.6b-v3},
  howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}
```
Base Model Citation:
```bibtex
@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
  title={Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
  author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
  year={2025},
  eprint={2509.14128},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.14128},
}
```
Training Dataset Citation:
```bibtex
@dataset{neuro-whisper-v1,
  title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
  language={de}
}
```
## Acknowledgments
This model builds upon the excellent work of:
- NVIDIA: For the Parakeet TDT 0.6B v3 base model (CC-BY-4.0) - a 600-million-parameter multilingual ASR model with FastConformer-TDT architecture
- Granary Dataset: For providing the multilingual training corpus used in base model training
- NeMo ASR Set 3.0: For the high-quality, human-transcribed training data (10,000+ hours) used in base model training
- Fine-tuning Dataset: NeurologyAI/neuro-whisper-v1 (MIT License) - 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
- Qwen Team: For Qwen/Qwen3-30B-A3B-Instruct-2507 used for medical text generation in the training dataset
- ResembleAI: For Chatterbox TTS (MIT License) used for synthetic audio generation in the training dataset
- Apple MLX: For the MLX framework enabling efficient Apple Silicon optimization
- NVIDIA NeMo: For the NeMo toolkit version 2.4 used in model development and training
- Training Infrastructure: Google Colab with A100 GPUs for training
- Training Notebook: See Parakeet_Training_MLX.ipynb for the complete training pipeline
## Related Resources
- API Server: parakeet-mlx-server - OpenAI-compatible FastAPI server for this model
- Base Model: nvidia/parakeet-tdt-0.6b-v3
- Training Dataset: NeurologyAI/neuro-whisper-v1 - German medical ASR dataset with 52,885 samples (114.59 hours, 90/10 train/val split) for neurology/neuro-oncology
- Data Generation: Resemble AI Chatterbox and Qwen3
- MLX Framework: Apple MLX
- NeMo Toolkit: NVIDIA NeMo
- Parakeet Collection: Parakeet Models on Hugging Face