---
license: cc-by-4.0
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- parakeet
- mlx
- german
- medical
- neurology
- apple-silicon
- multilingual
datasets:
- NeurologyAI/neuro-whisper-v1
language:
- de
- multilingual
metrics:
- wer
- cer
model-index:
- name: neuro-parakeet-mlx
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: NeurologyAI/neuro-whisper-v1
type: NeurologyAI/neuro-whisper-v1
metrics:
- name: WER
type: wer
value: 0.0104
---
# 🧠🦜 neuro-parakeet-mlx
> Fine-tuned Parakeet TDT 0.6B model optimized for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.
A specialized automatic speech recognition (ASR) model based on the multilingual Parakeet TDT 0.6B base model, fine-tuned for German medical terminology with a focus on neurology and neuro-oncology. While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. The model is converted to MLX format for fast, efficient inference on Apple Silicon.
## πŸ“‹ Model Details
| Property | Value |
|----------|-------|
| **Base Model** | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| **Architecture** | FastConformer-TDT (RNNT) |
| **Parameters** | 600 million (base model) |
| **Model Size** | 2.34 GB |
| **Base Model Languages** | Multilingual (25 languages) |
| **Fine-tuned Language** | German (de) - medical domain |
| **Domain** | Medical/Neurological (German) |
| **Framework** | MLX (Apple Silicon optimized) |
| **Tokenizer** | SentencePiece BPE (8,192 tokens, multilingual) |
| **Fine-tuned on** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **License** | [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) |
## ✨ Key Features
- 🌍 **Multilingual Base**: Built on a multilingual model (25 languages) fine-tuned for German medical speech
- πŸš€ **Apple Silicon Optimized**: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
- ⚑ **Ultra-Fast Inference**: Real-time factor of 0.042 (~24x faster than real-time) on M4
- πŸ₯ **Medical Domain Specialized**: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
- 🧠 **Neurology Focus**: Optimized for neurology and neuro-oncology terminology
- πŸ”Œ **OpenAI-Compatible API**: Drop-in replacement for OpenAI Whisper API endpoints
- πŸ“ **Automatic Formatting**: Built-in punctuation and capitalization
- ⏱️ **Timestamps**: Word-level and segment-level timing information
- πŸŽ™οΈ **Long Audio Support**: Handles extended recordings (up to 24 minutes with full attention)
- 🌐 **Multiple Formats**: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more
## πŸš€ Quick Start
### Installation
```bash
pip install parakeet-mlx
```
### System Requirements
- **Hardware**: Apple Silicon (M1/M2/M3/M4) supported; **tested and benchmarked on M4 only** - performance on M1/M2/M3 may vary
- **RAM**: At least 4GB available (8GB+ recommended)
- **Python**: 3.8 or higher
- **macOS**: 12.0 or later
### Basic Usage
#### Using parakeet-mlx CLI
```bash
# Install
pip install -U parakeet-mlx
# Transcribe audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx
```
#### Using mlx-audio
```bash
# Install
pip install -U mlx-audio
# Transcribe audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav
```
#### Using Python API
```python
from parakeet_mlx import from_pretrained

# Load model (first run will download ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, "segments"):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
```
## 🌐 OpenAI-Compatible API Server
This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like [open-webui](https://github.com/open-webui/open-webui) and other OpenAI API clients.
### Starting the Server
You can use the [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server) repository for a ready-to-use OpenAI-compatible API server:
```bash
# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx
```
Or run directly:
```bash
python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002
```
### API Usage Examples
#### Using cURL
```bash
curl -X POST http://localhost:8002/v1/audio/transcriptions \
-H "Content-Type: multipart/form-data" \
-F "file=@audio.wav" \
-F "model=parakeet-tdt-0.6b-v3" \
-F "language=de"
```
#### Using Python
```python
import requests

url = "http://localhost:8002/v1/audio/transcriptions"
files = {"file": open("audio.wav", "rb")}
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json",  # or "text" for plain text
}

response = requests.post(url, files=files, data=data)
result = response.json()
print(result["text"])

# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")
```
#### Response Format
The API returns JSON with the following structure:
```json
{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
```
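The segment list is convenient for client-side post-processing, for example building an SRT-style subtitle transcript. A minimal sketch (the payload below is an illustrative sample matching the structure above, not real server output):

```python
import json

# Illustrative payload matching the response structure above
payload = json.loads("""
{
  "text": "Der Patient zeigt eine Hemiparese.",
  "segments": [
    {"text": "Der Patient zeigt eine Hemiparese.", "start": 0.0, "end": 5.2}
  ]
}
""")

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Convert API segments into numbered SRT subtitle blocks."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
            f"{seg['text']}"
        )
    return "\n\n".join(blocks)

print(segments_to_srt(payload["segments"]))
```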
## πŸ“ Model Files
- `model.safetensors`: Model weights (2.34 GB)
- `config.json`: Model configuration (MLX-compatible)
- `model_config.yaml`: Original NeMo training configuration (reference)
## 🎡 Supported Audio Formats
The model supports various audio formats via librosa. Audio is automatically converted to the required format:
**Supported Formats:**
- WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA
**Input Requirements:**
- **Sample Rate**: Automatically resampled to 16 kHz
- **Channels**: Automatically converted to mono
- **Format**: Any of the supported formats above
**Recommended:**
- Clear audio with minimal background noise
- 16 kHz sample rate (or higher, will be downsampled)
- Mono channel audio
## πŸŽ“ Training Details
### Fine-tuning Configuration
This model was fine-tuned from [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) using NVIDIA NeMo toolkit with the following parameters:
| Parameter | Value |
|-----------|-------|
| **Training Data** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **Training Samples** | 47,596 (90% of dataset) |
| **Validation Samples** | 5,289 (10% of dataset) |
| **Total Dataset Size** | 52,885 samples |
| **Total Audio Duration** | 114.59 hours |
| **Average Sample Duration** | 7.80 seconds |
| **Max Epochs** | 7 |
| **Learning Rate** | 1e-4 |
| **Weight Decay** | 0.001 |
| **Batch Size** | 64 (per device) |
| **Gradient Accumulation** | 1 |
| **Precision** | BF16 mixed |
| **Optimizer** | AdamW |
| **Scheduler** | Cosine annealing with warmup |
| **Warmup Ratio** | 0.07 |
| **Gradient Clipping** | 1.0 |
| **Validation Check Interval** | Every 50 steps |
| **Logging Frequency** | Every 10 steps |
| **Audio Duration Range** | 0.1 - 20.0 seconds |
| **Data Loader Workers** | 8 |
| **Pin Memory** | True |
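The scheduler row above (cosine annealing with a 0.07 warmup ratio, peaking at the 1e-4 learning rate) can be sketched as a plain function. This is a generic illustration of the schedule shape, not the exact NeMo implementation, and the total step count used here is hypothetical:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-4,
               warmup_ratio: float = 0.07, min_lr: float = 0.0) -> float:
    """Cosine-annealing schedule with linear warmup, as in the table above."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 5_000  # hypothetical step count for illustration
print(lr_at_step(0, total))      # 0.0 (start of warmup)
print(lr_at_step(350, total))    # 1e-4 (peak, end of warmup)
print(lr_at_step(total, total))  # annealed down to min_lr = 0.0
```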
### Training Dataset: NeurologyAI/neuro-whisper-v1
The [neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) dataset is a comprehensive German medical speech recognition dataset specifically designed for neurology and neuro-oncology terminology.
**Dataset Characteristics:**
- **Language**: German (de)
- **Domain**: Neuro-oncology, neurology, medical terminology
- **Total Samples**: 52,885 audio-text pairs
- **Total Duration**: 114.59 hours of audio
- **Audio Format**: 16 kHz mono WAV
- **Split**: 90% training (47,596 samples) / 10% validation (5,289 samples)
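The split sizes follow directly from the sample count (the training side is floored, so the validation side picks up the remainder):

```python
total_samples = 52_885
train = int(total_samples * 0.9)  # floor of 47,596.5
val = total_samples - train
print(train, val)  # 47596 5289
```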
**Data Generation:**
- **Voice Data**: Synthetically generated using [Resemble AI Chatterbox TTS](https://github.com/resemble-ai/chatterbox)
- **Text Data**: Medical text generated with [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Purpose**: ASR training for medical transcription in neurology/neuro-oncology domains
**Dataset Features:**
- Consistent audio quality (uniform TTS generation)
- Comprehensive coverage of specialized medical terminology
- Privacy-safe (no real patient data)
- Optimized for German medical speech recognition
### Base Model Training
The base model ([nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)) was trained on:
- **Primary Dataset**: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
- **Fine-tuning Dataset**: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
- Includes: LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
- **Training**: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
- **Tokenizer**: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages
### Training Infrastructure
- **Framework**: NVIDIA NeMo 2.4
- **Training Hardware**: NVIDIA A100 GPUs (Google Colab)
- **Inference Hardware**: Apple Silicon via MLX framework (tested on M4 only)
- **Evaluation Hardware**: Mac Mini M4 with 32 GB unified memory (VRAM) - **only M4 was tested; results on M1/M2/M3 may vary**
- **Training Script**: Based on NeMo ASR training examples with TDT configuration
- **Checkpointing**: Best model selected by validation WER (lowest WER), with top-2 checkpoints saved
- **CUDA Graphs**: Disabled for compatibility (use_cuda_graph_decoder = False)
- **Decoding Strategy**: Greedy batch decoding
## 🌍 CO2 Emission Related to Experiments
Fine-tuning was conducted using Google Cloud Platform in region `europe-west3-a`, which has a carbon efficiency of 0.61 kgCO₂eq/kWh. A cumulative 5 hours of computation was performed on A100 PCIe 40/80GB hardware (TDP of 250 W).
Total emissions are estimated at **0.76 kgCO₂eq**, 100 percent of which was directly offset by the cloud provider.
Estimations were conducted using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
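The emissions figure reproduces directly from the stated inputs:

```python
hours = 5
tdp_kw = 0.250           # A100 PCIe TDP: 250 W
carbon_intensity = 0.61  # kgCO2eq per kWh in europe-west3

energy_kwh = hours * tdp_kw               # 1.25 kWh
emissions = energy_kwh * carbon_intensity
print(round(emissions, 2))  # 0.76
```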
**Reference:**
```bibtex
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```
## πŸ₯ Medical Terminology Coverage
This model is specifically optimized for German medical terminology, with enhanced accuracy for:
- **Neuro-oncology**: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
- **Molecular Markers**: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
- **Treatment Modalities**: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
- **Diagnostic Imaging**: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
- **Clinical Assessments**: Neurological examinations, cognitive assessments, motor function tests
- **Medical Procedures**: Biopsies, resections, craniotomies, ventriculostomies
## πŸ“Š Performance Characteristics
### Evaluation Results
The model was evaluated on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1):
| Metric | Value |
|--------|-------|
| **Word Error Rate (WER)** | **1.04%** (0.0104) |
| **Real-Time Factor (RTF)** | **0.042** (~24x faster than real-time) |
| **Evaluation Dataset** | NeurologyAI/neuro-whisper-v1 (validation) |
| **Evaluation Samples** | 5,289 samples |
| **Total Audio Duration** | 22,786.68 seconds (~6.3 hours) |
| **Average Inference Time** | 0.18 seconds per sample |
| **Samples per Second** | 5.46 samples/second |
| **Evaluation Hardware** | Mac Mini M4 with 32 GB unified memory (VRAM) |
**Hardware Details:**
- **Device**: Mac Mini M4 (only M4 was tested)
- **Memory**: 32 GB unified memory (VRAM)
- **Framework**: MLX (Apple Silicon optimized)
- **Note**: Performance metrics are specific to M4. Results on M1, M2, and M3 may vary.
This demonstrates excellent performance on German medical/neurological speech recognition, with a WER of 1.04% on the validation set. The real-time factor of 0.042 means the model processes audio approximately **24x faster than real-time** on M4 hardware, making it well suited to batch processing and real-time applications.
**Note:** All inference and validation were conducted exclusively on a Mac Mini M4 with 32 GB unified memory. Performance on M1, M2, and M3 chips may vary and has not been tested.
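The throughput numbers above are internally consistent and can be re-derived from the per-sample inference time:

```python
audio_seconds = 22_786.68  # total evaluated audio (~6.3 hours)
samples = 5_289
avg_inference_s = 0.18     # average inference time per sample on M4

compute_seconds = samples * avg_inference_s  # ~952 s of compute
rtf = compute_seconds / audio_seconds        # real-time factor
speedup = 1 / rtf                            # times faster than real-time
print(round(rtf, 3), round(speedup, 1))  # 0.042 23.9
```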
**Evaluation Tool:** These results were obtained using the [🎀πŸ₯🎯 Medical ASR Evaluator](https://github.com/riedemannai/Medical_ASR_Evaluator) - a standalone tool for evaluating ASR models using Word Error Rate (WER) metric, optimized for medical/clinical speech recognition.
### Model Comparison
The following table compares WER results on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (5,289 samples, ~6.3 hours of audio):
| Model | WER | Notes |
|-------|-----|-------|
| **NeurologyAI/neuro-parakeet-mlx** (this model) | **1.04%** | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |
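WER, as reported in both tables above, is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal reference implementation, independent of the evaluator tool linked above (the example sentences are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, single-row dynamic programming
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1] / len(ref)

print(wer("der Patient zeigt eine Hemiparese",
          "der Patient zeigt Hemiparese"))  # 0.2 (one deletion in five words)
```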
### Performance by Domain
- βœ… **Best Performance**: German medical/neurological speech, clinical dictations
- βœ… **Strong Performance**: Medical reports, patient documentation, case presentations
- βœ… **Good Performance**: General German speech with medical context
- ⚠️ **Limited Performance**: Non-neurology domains, other languages (base model supports 25 languages, but this fine-tuned version is optimized for German), heavily accented speech
## ⚠️ Limitations
- **Hardware**: This model was tested and benchmarked exclusively on Apple Silicon M4. While it should work on M1/M2/M3 chips, performance metrics (RTF, speed) may vary and have not been validated on these earlier chips.
- **Language**: While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. Other languages may have reduced accuracy compared to the base model.
- **Domain**: Best results on German medical/neurological content; general speech may be less accurate
- **Audio Quality**: Requires clear audio (16 kHz, mono) for optimal performance
- **Length**: Very long audio files may have reduced accuracy
- **Accents**: Performance may vary with regional accents or non-standard pronunciation
- **Background Noise**: Best results with minimal background noise and clear speech
## πŸ“„ License
**GOVERNING TERMS**: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license, inherited from the base model [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).
This model is ready for **commercial and non-commercial use** under the terms of the CC-BY-4.0 license.
## πŸ“š Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{neuro-parakeet-mlx,
  title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  note={Based on nvidia/parakeet-tdt-0.6b-v3},
  howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}
```
**Base Model Citation:**
```bibtex
@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
  title={Canary-1B-v2 \& Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
  author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
  year={2025},
  eprint={2509.14128},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.14128},
}
```
**Training Dataset Citation:**
```bibtex
@dataset{neuro-whisper-v1,
  title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
  language={de}
}
```
## πŸ™ Acknowledgments
This model builds upon the excellent work of:
- **NVIDIA**: For the [Parakeet TDT 0.6B v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) base model (CC-BY-4.0) - a 600-million-parameter multilingual ASR model with FastConformer-TDT architecture
- **Granary Dataset**: For providing the multilingual training corpus used in base model training
- **NeMo ASR Set 3.0**: For the high-quality, human-transcribed training data (~7,500 hours) used in base model training
- **Fine-tuning Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (MIT License) - 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
- **Qwen Team**: For [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) used for medical text generation in the training dataset
- **ResembleAI**: For [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) (MIT License) used for synthetic audio generation in the training dataset
- **Apple MLX**: For the [MLX framework](https://github.com/ml-explore/mlx) enabling efficient Apple Silicon optimization
- **NVIDIA NeMo**: For the [NeMo toolkit](https://github.com/NVIDIA/NeMo) version 2.4 used in model development and training
- **Training Infrastructure**: Google Colab with A100 GPUs for training
- **Training Notebook**: See [Parakeet_Training_MLX.ipynb](colab_notebooks/Parakeet_Training_MLX.ipynb) for the complete training pipeline
## πŸ”— Related Resources
- **API Server**: [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server) - OpenAI-compatible FastAPI server for this model
- **Base Model**: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- **Training Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) - German medical ASR dataset with 52,885 samples (114.59 hours, 90/10 train/val split) for neurology/neuro-oncology
- **Data Generation**: [Resemble AI Chatterbox](https://github.com/resemble-ai/chatterbox) and [Qwen3](https://github.com/QwenLM/Qwen3)
- **MLX Framework**: [Apple MLX](https://github.com/ml-explore/mlx)
- **NeMo Toolkit**: [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- **Parakeet Collection**: [Parakeet Models on Hugging Face](https://huggingface.co/collections/nvidia/parakeet)