# neuro-parakeet-mlx
Fine-tuned Parakeet TDT 0.6B model for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.

A specialized automatic speech recognition (ASR) model based on the multilingual Parakeet TDT 0.6B base model, fine-tuned for German medical terminology, particularly neurology and neuro-oncology. While the base model supports 25 languages, this fine-tuned version targets German medical speech recognition. It runs on Apple Silicon devices via the MLX framework, providing fast and efficient inference on Mac hardware.
## Model Details
| Property | Value |
|---|---|
| Base Model | nvidia/parakeet-tdt-0.6b-v3 |
| Architecture | FastConformer-TDT (RNNT) |
| Parameters | 600 million (base model) |
| Model Size | 2.34 GB |
| Base Model Languages | Multilingual (25 languages) |
| Fine-tuned Language | German (de) - medical domain |
| Domain | Medical/Neurological (German) |
| Framework | MLX (Apple Silicon optimized) |
| Tokenizer | SentencePiece BPE (8,192 tokens, multilingual) |
| Fine-tuned on | NeurologyAI/neuro-whisper-v1 |
| License | CC-BY-4.0 |
## Key Features
- Multilingual Base: Built on a multilingual model (25 languages), fine-tuned for German medical speech
- Apple Silicon Optimized: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
- Ultra-Fast Inference: Real-time factor of 0.042 (~24x faster than real time) on M4
- Medical Domain Specialized: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
- Neurology Focus: Optimized for neurology and neuro-oncology terminology
- OpenAI-Compatible API: Drop-in replacement for OpenAI Whisper API endpoints
- Automatic Formatting: Built-in punctuation and capitalization
- Timestamps: Word-level and segment-level timing information
- Long Audio Support: Handles extended recordings (up to 24 minutes with full attention)
- Multiple Formats: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more
## Quick Start
### Installation

```bash
pip install parakeet-mlx
```
### System Requirements
- Hardware: Apple Silicon (M1/M2/M3/M4) supported; tested and benchmarked on M4 only, so performance on M1/M2/M3 may vary
- RAM: At least 4GB available (8GB+ recommended)
- Python: 3.8 or higher
- macOS: 12.0 or later
### Basic Usage
#### Using the parakeet-mlx CLI

```bash
# Install
pip install -U parakeet-mlx

# Transcribe an audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx
```
#### Using mlx-audio

```bash
# Install
pip install -U mlx-audio

# Transcribe an audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav
```
#### Using the Python API

```python
from parakeet_mlx import from_pretrained

# Load the model (first run downloads ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe an audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, "segments"):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
```
## OpenAI-Compatible API Server
This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like open-webui and other OpenAI API clients.
### Starting the Server
You can use the parakeet-mlx-server repository for a ready-to-use OpenAI-compatible API server:

```bash
# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx
```

Or run directly:

```bash
python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002
```
### API Usage Examples
#### Using cURL

```bash
# Note: do not set the Content-Type header manually; curl generates the
# correct multipart/form-data header (including the boundary) for -F fields.
curl -X POST http://localhost:8002/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=parakeet-tdt-0.6b-v3" \
  -F "language=de"
```
#### Using Python

```python
import requests

url = "http://localhost:8002/v1/audio/transcriptions"
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json",  # or "text" for plain text
}

# Open the file in a context manager so the handle is closed after upload
with open("audio.wav", "rb") as f:
    response = requests.post(url, files={"file": f}, data=data)
response.raise_for_status()

result = response.json()
print(result["text"])

# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")
```
### Response Format

The API returns JSON with the following structure:

```json
{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
```
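The `segments` array maps directly onto subtitle formats. As a quick illustration (the helpers `srt_timestamp` and `segments_to_srt` are hypothetical, not part of the server), a sketch that renders a response's segments as SubRip (SRT) text:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms_total = int(round(seconds * 1000))
    h, rem = divmod(ms_total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of {'text', 'start', 'end'} dicts as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

print(segments_to_srt([{"text": "Segment text", "start": 0.0, "end": 5.2}]))
```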
## Model Files
- `model.safetensors`: Model weights (2.34 GB)
- `config.json`: Model configuration (MLX-compatible)
- `model_config.yaml`: Original NeMo training configuration (reference)
## Supported Audio Formats
The model supports various audio formats via librosa. Audio is automatically converted to the required format:
Supported Formats:
- WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA
Input Requirements:
- Sample Rate: Automatically resampled to 16 kHz
- Channels: Automatically converted to mono
- Format: Any of the supported formats above
Recommended:
- Clear audio with minimal background noise
- 16 kHz sample rate (or higher, will be downsampled)
- Mono channel audio
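In practice librosa performs this conversion automatically, but the two steps involved (downmix to mono, resample to 16 kHz) can be illustrated with a stdlib-only sketch. `downmix_to_mono` and `resample_linear` are illustrative helpers, not part of the model's tooling; a real pipeline should use librosa or a proper polyphase resampler to avoid aliasing:

```python
def downmix_to_mono(samples, n_channels):
    """Average interleaved integer samples across channels."""
    if n_channels == 1:
        return list(samples)
    return [sum(samples[i:i + n_channels]) // n_channels
            for i in range(0, len(samples), n_channels)]

def resample_linear(samples, sr_in, sr_out):
    """Naive linear-interpolation resampler (illustration only)."""
    if sr_in == sr_out:
        return list(samples)
    n_out = int(len(samples) * sr_out / sr_in)
    out = []
    for i in range(n_out):
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(int(samples[lo] * (1 - frac) + samples[hi] * frac))
    return out

# Interleaved stereo frames [L, R, L, R] -> mono, then 44.1 kHz -> 16 kHz
mono = downmix_to_mono([100, 200, 300, 400], n_channels=2)
resampled = resample_linear(list(range(441)), 44_100, 16_000)
```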
## Training Details
### Fine-tuning Configuration
This model was fine-tuned from nvidia/parakeet-tdt-0.6b-v3 using the NVIDIA NeMo toolkit with the following parameters:
| Parameter | Value |
|---|---|
| Training Data | NeurologyAI/neuro-whisper-v1 |
| Training Samples | 47,596 (90% of dataset) |
| Validation Samples | 5,289 (10% of dataset) |
| Total Dataset Size | 52,885 samples |
| Total Audio Duration | 114.59 hours |
| Average Sample Duration | 7.80 seconds |
| Max Epochs | 7 |
| Learning Rate | 1e-4 |
| Weight Decay | 0.001 |
| Batch Size | 64 (per device) |
| Gradient Accumulation | 1 |
| Precision | BF16 mixed |
| Optimizer | AdamW |
| Scheduler | Cosine annealing with warmup |
| Warmup Ratio | 0.07 |
| Gradient Clipping | 1.0 |
| Validation Check Interval | Every 50 steps |
| Logging Frequency | Every 10 steps |
| Audio Duration Range | 0.1 - 20.0 seconds |
| Data Loader Workers | 8 |
| Pin Memory | True |
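The scheduler's warmup length implied by this table can be back-calculated. This is a sketch under assumptions the table does not state (a single device and a kept partial final batch):

```python
# Figures from the fine-tuning table above.
train_samples = 47_596
batch_size = 64
max_epochs = 7
warmup_ratio = 0.07

# Assumption: single device, ceil division (partial last batch kept).
steps_per_epoch = -(-train_samples // batch_size)  # ceil(47596 / 64)
total_steps = steps_per_epoch * max_epochs
warmup_steps = round(warmup_ratio * total_steps)

print(steps_per_epoch, total_steps, warmup_steps)
```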
### Training Dataset: NeurologyAI/neuro-whisper-v1
The neuro-whisper-v1 dataset is a comprehensive German medical speech recognition dataset specifically designed for neurology and neuro-oncology terminology.
Dataset Characteristics:
- Language: German (de)
- Domain: Neuro-oncology, neurology, medical terminology
- Total Samples: 52,885 audio-text pairs
- Total Duration: 114.59 hours of audio
- Audio Format: 16 kHz mono WAV
- Split: 90% training (47,596 samples) / 10% validation (5,289 samples)
Data Generation:
- Voice Data: Synthetically generated using Resemble AI Chatterbox TTS
- Text Data: Medical text generated with Qwen/Qwen3-30B-A3B-Instruct-2507
- Purpose: ASR training for medical transcription in neurology/neuro-oncology domains
Dataset Features:
- Consistent audio quality (uniform TTS generation)
- Comprehensive coverage of specialized medical terminology
- Privacy-safe (no real patient data)
- Optimized for German medical speech recognition
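The 90/10 split quoted above can be sketched as a deterministic shuffle-split. The `split_manifest` helper and the fixed seed are assumptions for illustration, not the dataset's actual preparation script:

```python
import random

def split_manifest(entries, train_frac=0.9, seed=42):
    """Shuffle manifest entries deterministically and split into train/validation."""
    rng = random.Random(seed)
    entries = list(entries)
    rng.shuffle(entries)
    n_train = int(train_frac * len(entries))
    return entries[:n_train], entries[n_train:]

# With the dataset's 52,885 samples this reproduces the 47,596 / 5,289 split sizes.
train, val = split_manifest(range(52_885))
print(len(train), len(val))
```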
### Base Model Training
The base model (nvidia/parakeet-tdt-0.6b-v3) was trained on:
- Primary Dataset: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
- Fine-tuning Dataset: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
- Includes: LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
- Training: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
- Tokenizer: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages
### Training Infrastructure
- Framework: NVIDIA NeMo 2.4
- Training Hardware: NVIDIA A100 GPUs (Google Colab)
- Inference Hardware: Apple Silicon via MLX framework (tested on M4 only)
- Evaluation Hardware: Mac Mini M4 with 32 GB unified memory; only the M4 was tested, and results on M1/M2/M3 may vary
- Training Script: Based on NeMo ASR training examples with TDT configuration
- Checkpointing: Best model selected by validation WER (lowest WER), with top-2 checkpoints saved
- CUDA Graphs: Disabled for compatibility (use_cuda_graph_decoder = False)
- Decoding Strategy: Greedy batch decoding
## Medical Terminology Coverage
This model is specifically optimized for German medical terminology, with enhanced accuracy for:
- Neuro-oncology: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
- Molecular Markers: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
- Treatment Modalities: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
- Diagnostic Imaging: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
- Clinical Assessments: Neurological examinations, cognitive assessments, motor function tests
- Medical Procedures: Biopsies, resections, craniotomies, ventriculostomies
## Performance Characteristics
### Evaluation Results
The model was evaluated on the validation split of NeurologyAI/neuro-whisper-v1:
| Metric | Value |
|---|---|
| Word Error Rate (WER) | 1.04% (0.0104) |
| Real-Time Factor (RTF) | 0.042 (~24x faster than real-time) |
| Evaluation Dataset | NeurologyAI/neuro-whisper-v1 (validation) |
| Evaluation Samples | 5,289 samples |
| Total Audio Duration | 22,786.68 seconds (~6.3 hours) |
| Average Inference Time | 0.18 seconds per sample |
| Samples per Second | 5.46 samples/second |
| Evaluation Hardware | Mac Mini M4 with 32 GB unified memory |
Hardware Details:
- Device: Mac Mini M4 (only the M4 was tested)
- Memory: 32 GB unified memory
- Framework: MLX (Apple Silicon optimized)
- Note: Performance metrics are specific to the M4; results on M1, M2, and M3 may vary.

This demonstrates strong performance on German medical/neurological speech recognition, with a WER below 1.1% on the validation set. The real-time factor of 0.042 means the model processes audio roughly 24x faster than real time on M4 hardware, making it well suited to both batch processing and real-time applications. All inference and validation were conducted exclusively on the Mac Mini M4; performance on M1, M2, and M3 chips has not been tested.
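The throughput figures in the table are mutually consistent; a quick back-calculation of the real-time factor from the reported sample count, audio duration, and samples per second agrees with the stated 0.042 to rounding:

```python
# Figures from the evaluation table above.
eval_samples = 5_289
total_audio_s = 22_786.68
samples_per_second = 5.46

total_inference_s = eval_samples / samples_per_second  # total compute time
rtf = total_inference_s / total_audio_s                # real-time factor (~0.042)
speedup = 1 / rtf                                      # times faster than real time

print(f"RTF ~= {rtf:.4f}, ~{speedup:.1f}x faster than real time")
```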
### Model Comparison
The following table compares WER results on the validation split of NeurologyAI/neuro-whisper-v1 (5,289 samples, ~6.3 hours of audio):
| Model | WER | Notes |
|---|---|---|
| NeurologyAI/neuro-parakeet-mlx (this model) | 1.04% | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |
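For reference, WER as used in this comparison is the word-level edit distance divided by the reference length. A minimal generic implementation (this is not the evaluation script actually used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference and first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / len(r)

# One substituted word out of five -> WER 0.2
print(wer("der Patient zeigt eine Hemiparese",
          "der Patient zeigt keine Hemiparese"))
```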
### Performance by Domain
- Best Performance: German medical/neurological speech, clinical dictations
- Strong Performance: Medical reports, patient documentation, case presentations
- Good Performance: General German speech with medical context
- Limited Performance: Non-neurology domains, other languages (the base model supports 25 languages, but this fine-tuned version is optimized for German), heavily accented speech
## Limitations
- Hardware: This model was tested and benchmarked exclusively on Apple Silicon M4. While it should work on M1/M2/M3 chips, performance metrics (RTF, speed) may vary and have not been validated on these earlier chips.
- Language: While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. Other languages may have reduced accuracy compared to the base model.
- Domain: Best results on German medical/neurological content; general speech may be less accurate
- Audio Quality: Requires clear audio (16 kHz, mono) for optimal performance
- Length: Very long audio files may have reduced accuracy
- Accents: Performance may vary with regional accents or non-standard pronunciation
- Background Noise: Best results with minimal background noise and clear speech
## License
GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license, inherited from the base model nvidia/parakeet-tdt-0.6b-v3.
This model is ready for commercial and non-commercial use under the terms of the CC-BY-4.0 license.
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{neuro-parakeet-mlx,
  title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  note={Based on nvidia/parakeet-tdt-0.6b-v3},
  howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}
```
Base Model Citation:
```bibtex
@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
  title={Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
  author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
  year={2025},
  eprint={2509.14128},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.14128},
}
```
Training Dataset Citation:
```bibtex
@dataset{neuro-whisper-v1,
  title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
  language={de}
}
```
## Acknowledgments
This model builds upon the excellent work of:
- NVIDIA: For the Parakeet TDT 0.6B v3 base model (CC-BY-4.0) - a 600-million-parameter multilingual ASR model with FastConformer-TDT architecture
- Granary Dataset: For providing the multilingual training corpus used in base model training
- NeMo ASR Set 3.0: For the high-quality, human-transcribed training data (10,000+ hours) used in base model training
- Fine-tuning Dataset: NeurologyAI/neuro-whisper-v1 (MIT License) - 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
- Qwen Team: For Qwen/Qwen3-30B-A3B-Instruct-2507 used for medical text generation in the training dataset
- ResembleAI: For Chatterbox TTS (MIT License) used for synthetic audio generation in the training dataset
- Apple MLX: For the MLX framework enabling efficient Apple Silicon optimization
- NVIDIA NeMo: For the NeMo toolkit version 2.4 used in model development and training
- Training Infrastructure: Google Colab with A100 GPUs for training
- Training Notebook: See Parakeet_Training_MLX.ipynb for the complete training pipeline
## Related Resources
- API Server: parakeet-mlx-server - OpenAI-compatible FastAPI server for this model
- Base Model: nvidia/parakeet-tdt-0.6b-v3
- Training Dataset: NeurologyAI/neuro-whisper-v1 - German medical ASR dataset with 52,885 samples (114.59 hours, 90/10 train/val split) for neurology/neuro-oncology
- Data Generation: Resemble AI Chatterbox and Qwen3
- MLX Framework: Apple MLX
- NeMo Toolkit: NVIDIA NeMo
- Parakeet Collection: Parakeet Models on Hugging Face