---
license: cc-by-4.0
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- parakeet
- mlx
- german
- medical
- neurology
- apple-silicon
- multilingual
datasets:
- NeurologyAI/neuro-whisper-v1
language:
- de
- multilingual
metrics:
- wer
- cer
model-index:
- name: neuro-parakeet-mlx
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: NeurologyAI/neuro-whisper-v1
type: NeurologyAI/neuro-whisper-v1
metrics:
- name: WER
type: wer
value: 0.0104
---
# 🧠🦜 neuro-parakeet-mlx
> Fine-tuned Parakeet TDT 0.6B model optimized for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.
A specialized automatic speech recognition (ASR) model based on the multilingual Parakeet TDT 0.6B base model, fine-tuned for German medical terminology with a focus on neurology and neuro-oncology. While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. The model is converted to MLX format for fast, efficient inference on Apple Silicon.
## πŸ“‹ Model Details
| Property | Value |
|----------|-------|
| **Base Model** | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| **Architecture** | FastConformer-TDT (RNNT) |
| **Parameters** | 600 million (base model) |
| **Model Size** | 2.34 GB |
| **Base Model Languages** | Multilingual (25 languages) |
| **Fine-tuned Language** | German (de) - medical domain |
| **Domain** | Medical/Neurological (German) |
| **Framework** | MLX (Apple Silicon optimized) |
| **Tokenizer** | SentencePiece BPE (8,192 tokens, multilingual) |
| **Fine-tuned on** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **License** | [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) |
## ✨ Key Features
- 🌍 **Multilingual Base**: Built on a multilingual model (25 languages) fine-tuned for German medical speech
- πŸš€ **Apple Silicon Optimized**: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
- ⚑ **Ultra-Fast Inference**: Real-time factor of 0.042 (~24x faster than real-time) on M4
- πŸ₯ **Medical Domain Specialized**: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
- 🧠 **Neurology Focus**: Optimized for neurology and neuro-oncology terminology
- πŸ”Œ **OpenAI-Compatible API**: Drop-in replacement for OpenAI Whisper API endpoints
- πŸ“ **Automatic Formatting**: Built-in punctuation and capitalization
- ⏱️ **Timestamps**: Word-level and segment-level timing information
- πŸŽ™οΈ **Long Audio Support**: Handles extended recordings (up to 24 minutes with full attention)
- 🌐 **Multiple Formats**: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more
## πŸš€ Quick Start
### Installation
```bash
pip install parakeet-mlx
```
### System Requirements
- **Hardware**: Apple Silicon (M1/M2/M3/M4) supported; **tested and benchmarked on M4 only** - performance on M1/M2/M3 may vary
- **RAM**: At least 4GB available (8GB+ recommended)
- **Python**: 3.8 or higher
- **macOS**: 12.0 or later
### Basic Usage
#### Using parakeet-mlx CLI
```bash
# Install
pip install -U parakeet-mlx
# Transcribe audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx
```
#### Using mlx-audio
```bash
# Install
pip install -U mlx-audio
# Transcribe audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav
```
#### Using Python API
```python
from parakeet_mlx import from_pretrained

# Load model (first run will download ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, "segments"):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
```
## 🌐 OpenAI-Compatible API Server
This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like [open-webui](https://github.com/open-webui/open-webui) and other OpenAI API clients.
### Starting the Server
You can use the [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server) repository for a ready-to-use OpenAI-compatible API server:
```bash
# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx
```
Or run directly:
```bash
python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002
```
### API Usage Examples
#### Using cURL
```bash
curl -X POST http://localhost:8002/v1/audio/transcriptions \
-H "Content-Type: multipart/form-data" \
-F "file=@audio.wav" \
-F "model=parakeet-tdt-0.6b-v3" \
-F "language=de"
```
#### Using Python
```python
import requests

url = "http://localhost:8002/v1/audio/transcriptions"
files = {"file": open("audio.wav", "rb")}
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json",  # or "text" for plain text
}

response = requests.post(url, files=files, data=data)
result = response.json()
print(result["text"])

# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")
```
#### Response Format
The API returns JSON with the following structure:
```json
{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
```
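The segment list is convenient for client-side post-processing, for example building an SRT-style subtitle transcript. A minimal sketch (the payload below is an illustrative sample matching the structure above, not real server output):

```python
import json

# Illustrative payload matching the response structure above
payload = json.loads("""
{
  "text": "Der Patient zeigt eine Hemiparese.",
  "segments": [
    {"text": "Der Patient zeigt eine Hemiparese.", "start": 0.0, "end": 5.2}
  ]
}
""")

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Convert API segments into numbered SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
            f"{seg['text']}"
        )
    return "\n\n".join(blocks)

print(segments_to_srt(payload["segments"]))
```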
## πŸ“ Model Files
- `model.safetensors`: Model weights (2.34 GB)
- `config.json`: Model configuration (MLX-compatible)
- `model_config.yaml`: Original NeMo training configuration (reference)
## 🎡 Supported Audio Formats
The model supports various audio formats via librosa. Audio is automatically converted to the required format:
**Supported Formats:**
- WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA
**Input Requirements:**
- **Sample Rate**: Automatically resampled to 16 kHz
- **Channels**: Automatically converted to mono
- **Format**: Any of the supported formats above
**Recommended:**
- Clear audio with minimal background noise
- 16 kHz sample rate (or higher, will be downsampled)
- Mono channel audio
## πŸŽ“ Training Details
### Fine-tuning Configuration
This model was fine-tuned from [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) using NVIDIA NeMo toolkit with the following parameters:
| Parameter | Value |
|-----------|-------|
| **Training Data** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **Training Samples** | 47,596 (90% of dataset) |
| **Validation Samples** | 5,289 (10% of dataset) |
| **Total Dataset Size** | 52,885 samples |
| **Total Audio Duration** | 114.59 hours |
| **Average Sample Duration** | 7.80 seconds |
| **Max Epochs** | 7 |
| **Learning Rate** | 1e-4 |
| **Weight Decay** | 0.001 |
| **Batch Size** | 64 (per device) |
| **Gradient Accumulation** | 1 |
| **Precision** | BF16 mixed |
| **Optimizer** | AdamW |
| **Scheduler** | Cosine annealing with warmup |
| **Warmup Ratio** | 0.07 |
| **Gradient Clipping** | 1.0 |
| **Validation Check Interval** | Every 50 steps |
| **Logging Frequency** | Every 10 steps |
| **Audio Duration Range** | 0.1 - 20.0 seconds |
| **Data Loader Workers** | 8 |
| **Pin Memory** | True |
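The scheduler row above (cosine annealing with a 0.07 warmup ratio, peaking at the 1e-4 learning rate) can be sketched as a plain function. This is a generic illustration of the schedule shape, not the exact NeMo implementation, and the total step count used here is hypothetical:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-4,
               warmup_ratio: float = 0.07, min_lr: float = 0.0) -> float:
    """Cosine-annealing schedule with linear warmup, as in the table above."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 5_000  # hypothetical step count for illustration
print(lr_at_step(0, total))      # 0.0 (start of warmup)
print(lr_at_step(350, total))    # 1e-4 (peak, end of warmup)
print(lr_at_step(total, total))  # annealed down to min_lr = 0.0
```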
### Training Dataset: NeurologyAI/neuro-whisper-v1
The [neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) dataset is a comprehensive German medical speech recognition dataset specifically designed for neurology and neuro-oncology terminology.
**Dataset Characteristics:**
- **Language**: German (de)
- **Domain**: Neuro-oncology, neurology, medical terminology
- **Total Samples**: 52,885 audio-text pairs
- **Total Duration**: 114.59 hours of audio
- **Audio Format**: 16 kHz mono WAV
- **Split**: 90% training (47,596 samples) / 10% validation (5,289 samples)
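The split sizes follow directly from the sample count (the training side is floored, so the validation side picks up the remainder):

```python
total_samples = 52_885
train = int(total_samples * 0.9)  # floor of 47,596.5
val = total_samples - train
print(train, val)  # 47596 5289
```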
**Data Generation:**
- **Voice Data**: Synthetically generated using [Resemble AI Chatterbox TTS](https://github.com/resemble-ai/chatterbox)
- **Text Data**: Medical text generated with [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Purpose**: ASR training for medical transcription in neurology/neuro-oncology domains
**Dataset Features:**
- Consistent audio quality (uniform TTS generation)
- Comprehensive coverage of specialized medical terminology
- Privacy-safe (no real patient data)
- Optimized for German medical speech recognition
### Base Model Training
The base model ([nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)) was trained on:
- **Primary Dataset**: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
- **Fine-tuning Dataset**: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
- Includes: LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
- **Training**: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
- **Tokenizer**: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages
### Training Infrastructure
- **Framework**: NVIDIA NeMo 2.4
- **Training Hardware**: NVIDIA A100 GPUs (Google Colab)
- **Inference Hardware**: Apple Silicon via MLX framework (tested on M4 only)
- **Evaluation Hardware**: Mac Mini M4 with 32 GB unified memory (VRAM) - **only M4 was tested; results on M1/M2/M3 may vary**
- **Training Script**: Based on NeMo ASR training examples with TDT configuration
- **Checkpointing**: Best model selected by validation WER (lowest WER), with top-2 checkpoints saved
- **CUDA Graphs**: Disabled for compatibility (use_cuda_graph_decoder = False)
- **Decoding Strategy**: Greedy batch decoding
## 🌍 CO2 Emission Related to Experiments
Fine-tuning was conducted using Google Cloud Platform in region `europe-west3-a`, which has a carbon efficiency of 0.61 kgCO₂eq/kWh. A cumulative 5 hours of computation was performed on A100 PCIe 40/80GB hardware (TDP of 250 W).
Total emissions are estimated at **0.76 kgCO₂eq**, 100 percent of which was directly offset by the cloud provider.
Estimations were conducted using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
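The emissions figure reproduces directly from the stated inputs:

```python
hours = 5
tdp_kw = 0.250           # A100 PCIe TDP: 250 W
carbon_intensity = 0.61  # kgCO2eq per kWh in europe-west3

energy_kwh = hours * tdp_kw               # 1.25 kWh
emissions = energy_kwh * carbon_intensity
print(round(emissions, 2))  # 0.76
```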
**Reference:**
```bibtex
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```
## πŸ₯ Medical Terminology Coverage
This model is specifically optimized for German medical terminology, with enhanced accuracy for:
- **Neuro-oncology**: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
- **Molecular Markers**: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
- **Treatment Modalities**: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
- **Diagnostic Imaging**: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
- **Clinical Assessments**: Neurological examinations, cognitive assessments, motor function tests
- **Medical Procedures**: Biopsies, resections, craniotomies, ventriculostomies
## πŸ“Š Performance Characteristics
### Evaluation Results
The model was evaluated on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1):
| Metric | Value |
|--------|-------|
| **Word Error Rate (WER)** | **1.04%** (0.0104) |
| **Real-Time Factor (RTF)** | **0.042** (~24x faster than real-time) |
| **Evaluation Dataset** | NeurologyAI/neuro-whisper-v1 (validation) |
| **Evaluation Samples** | 5,289 samples |
| **Total Audio Duration** | 22,786.68 seconds (~6.3 hours) |
| **Average Inference Time** | 0.18 seconds per sample |
| **Samples per Second** | 5.46 samples/second |
| **Evaluation Hardware** | Mac Mini M4 with 32 GB unified memory (VRAM) |
**Hardware Details:**
- **Device**: Mac Mini M4 (only M4 was tested)
- **Memory**: 32 GB unified memory (VRAM)
- **Framework**: MLX (Apple Silicon optimized)
- **Note**: Performance metrics are specific to M4. Results on M1, M2, and M3 may vary.
This demonstrates excellent performance on German medical/neurological speech recognition, with a WER of 1.04% on the validation set. The real-time factor of 0.042 means the model processes audio approximately **24x faster than real-time** on M4 hardware, making it well suited to batch processing and real-time applications.
**Note:** All inference and validation were conducted exclusively on a Mac Mini M4 with 32 GB unified memory. Performance on M1, M2, and M3 chips may vary and has not been tested.
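The throughput numbers above are internally consistent and can be re-derived from the per-sample inference time:

```python
audio_seconds = 22_786.68  # total evaluated audio (~6.3 hours)
samples = 5_289
avg_inference_s = 0.18     # average inference time per sample on M4

compute_seconds = samples * avg_inference_s  # ~952 s of compute
rtf = compute_seconds / audio_seconds        # real-time factor
speedup = 1 / rtf                            # times faster than real-time
print(round(rtf, 3), round(speedup, 1))  # 0.042 23.9
```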
**Evaluation Tool:** These results were obtained using the [🎀πŸ₯🎯 Medical ASR Evaluator](https://github.com/riedemannai/Medical_ASR_Evaluator) - a standalone tool for evaluating ASR models using Word Error Rate (WER) metric, optimized for medical/clinical speech recognition.
### Model Comparison
The following table compares WER results on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (5,289 samples, ~6.3 hours of audio):
| Model | WER | Notes |
|-------|-----|-------|
| **NeurologyAI/neuro-parakeet-mlx** (this model) | **1.04%** | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |
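WER, as reported in both tables above, is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal reference implementation, independent of the evaluator tool linked above (the example sentences are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, single-row dynamic programming
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

print(wer("der Patient zeigt eine Hemiparese",
          "der Patient zeigt Hemiparese"))  # 0.2 (one deletion in five words)
```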
### Performance by Domain
- βœ… **Best Performance**: German medical/neurological speech, clinical dictations
- βœ… **Strong Performance**: Medical reports, patient documentation, case presentations
- βœ… **Good Performance**: General German speech with medical context
- ⚠️ **Limited Performance**: Non-neurology domains, other languages (base model supports 25 languages, but this fine-tuned version is optimized for German), heavily accented speech
## ⚠️ Limitations
- **Hardware**: This model was tested and benchmarked exclusively on Apple Silicon M4. While it should work on M1/M2/M3 chips, performance metrics (RTF, speed) may vary and have not been validated on these earlier chips.
- **Language**: While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. Other languages may have reduced accuracy compared to the base model.
- **Domain**: Best results on German medical/neurological content; general speech may be less accurate
- **Audio Quality**: Requires clear audio (16 kHz, mono) for optimal performance
- **Length**: Very long audio files may have reduced accuracy
- **Accents**: Performance may vary with regional accents or non-standard pronunciation
- **Background Noise**: Best results with minimal background noise and clear speech
## πŸ“„ License
**GOVERNING TERMS**: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license, inherited from the base model [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).
This model is ready for **commercial and non-commercial use** under the terms of the CC-BY-4.0 license.
## πŸ“š Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{neuro-parakeet-mlx,
  title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  note={Based on nvidia/parakeet-tdt-0.6b-v3},
  howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}
```
**Base Model Citation:**
```bibtex
@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
  title={Canary-1B-v2 \& Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
  author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
  year={2025},
  eprint={2509.14128},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.14128},
}
```
**Training Dataset Citation:**
```bibtex
@dataset{neuro-whisper-v1,
  title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
  language={de}
}
```
## πŸ™ Acknowledgments
This model builds upon the excellent work of:
- **NVIDIA**: For the [Parakeet TDT 0.6B v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) base model (CC-BY-4.0) - a 600-million-parameter multilingual ASR model with FastConformer-TDT architecture
- **Granary Dataset**: For providing the multilingual training corpus used in base model training
- **NeMo ASR Set 3.0**: For the high-quality, human-transcribed training data (~7,500 hours) used in base model training
- **Fine-tuning Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (MIT License) - 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
- **Qwen Team**: For [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) used for medical text generation in the training dataset
- **ResembleAI**: For [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) (MIT License) used for synthetic audio generation in the training dataset
- **Apple MLX**: For the [MLX framework](https://github.com/ml-explore/mlx) enabling efficient Apple Silicon optimization
- **NVIDIA NeMo**: For the [NeMo toolkit](https://github.com/NVIDIA/NeMo) version 2.4 used in model development and training
- **Training Infrastructure**: Google Colab with A100 GPUs for training
- **Training Notebook**: See [Parakeet_Training_MLX.ipynb](colab_notebooks/Parakeet_Training_MLX.ipynb) for the complete training pipeline
## πŸ”— Related Resources
- **API Server**: [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server) - OpenAI-compatible FastAPI server for this model
- **Base Model**: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- **Training Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) - German medical ASR dataset with 52,885 samples (114.59 hours, 90/10 train/val split) for neurology/neuro-oncology
- **Data Generation**: [Resemble AI Chatterbox](https://github.com/resemble-ai/chatterbox) and [Qwen3](https://github.com/QwenLM/Qwen3)
- **MLX Framework**: [Apple MLX](https://github.com/ml-explore/mlx)
- **NeMo Toolkit**: [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- **Parakeet Collection**: [Parakeet Models on Hugging Face](https://huggingface.co/collections/nvidia/parakeet)