🧠🦜 neuro-parakeet-mlx

Fine-tuned Parakeet TDT 0.6B model optimized for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.

A specialized automatic speech recognition (ASR) model built on the multilingual Parakeet TDT 0.6B base model and fine-tuned for German medical terminology, with a focus on neurology and neuro-oncology. While the base model supports 25 languages, this fine-tuned version is specialized for German medical speech. The weights are converted to the MLX framework for fast, efficient inference on Apple Silicon Macs.

📋 Model Details

| Property | Value |
|---|---|
| Base Model | nvidia/parakeet-tdt-0.6b-v3 |
| Architecture | FastConformer-TDT (RNNT) |
| Parameters | 600 million (base model) |
| Model Size | 2.34 GB |
| Base Model Languages | Multilingual (25 languages) |
| Fine-tuned Language | German (de) - medical domain |
| Domain | Medical/Neurological (German) |
| Framework | MLX (Apple Silicon optimized) |
| Tokenizer | SentencePiece BPE (8,192 tokens, multilingual) |
| Fine-tuned on | NeurologyAI/neuro-whisper-v1 |
| License | CC-BY-4.0 |

✨ Key Features

  • 🌍 Multilingual Base: Built on a multilingual model (25 languages) fine-tuned for German medical speech
  • 🚀 Apple Silicon Optimized: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
  • ⚡ Ultra-Fast Inference: Real-time factor of 0.042 (~24x faster than real-time) on M4
  • 🏥 Medical Domain Specialized: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
  • 🧠 Neurology Focus: Optimized for neurology and neuro-oncology terminology
  • 🔌 OpenAI-Compatible API: Drop-in replacement for OpenAI Whisper API endpoints
  • 📝 Automatic Formatting: Built-in punctuation and capitalization
  • ⏱️ Timestamps: Word-level and segment-level timing information
  • 🎙️ Long Audio Support: Handles extended recordings (up to 24 minutes with full attention)
  • 🌐 Multiple Formats: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more

🚀 Quick Start

Installation

pip install parakeet-mlx

System Requirements

  • Hardware: Apple Silicon (M1/M2/M3/M4) supported; tested and benchmarked on M4 only - performance on M1/M2/M3 may vary
  • RAM: At least 4GB available (8GB+ recommended)
  • Python: 3.8 or higher
  • macOS: 12.0 or later

Basic Usage

Using parakeet-mlx CLI

# Install
pip install -U parakeet-mlx

# Transcribe audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx

Using mlx-audio

# Install
pip install -U mlx-audio

# Transcribe audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav

Using Python API

from parakeet_mlx import from_pretrained

# Load model (first run will download ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, 'segments'):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
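If you want to measure throughput yourself, the real-time factor (RTF) is simply processing time divided by audio duration (RTF below 1 means faster than real time). A minimal sketch building on the usage above; reading the audio duration from the last segment's end time is an assumption about the result object:

```python
import time

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means faster than real time; e.g. 0.042 is ~24x real time."""
    return processing_seconds / audio_seconds

if __name__ == "__main__":
    # Requires parakeet-mlx on an Apple Silicon Mac; first run downloads ~2.34 GB
    from parakeet_mlx import from_pretrained

    model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")
    start = time.perf_counter()
    result = model.transcribe("path/to/audio.wav", language="de")
    elapsed = time.perf_counter() - start

    # Approximate the audio duration from the last segment end, if available
    segments = getattr(result, "segments", None)
    if segments:
        print(f"RTF: {real_time_factor(elapsed, segments[-1].end):.3f}")
```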

🌐 OpenAI-Compatible API Server

This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like open-webui and other OpenAI API clients.

Starting the Server

You can use the parakeet-mlx-server repository for a ready-to-use OpenAI-compatible API server:

# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx

Or run directly:

python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002

API Usage Examples

Using cURL

# Note: -F sets the multipart Content-Type (including the boundary) automatically;
# do not set the header manually, or the boundary parameter is lost.
curl -X POST http://localhost:8002/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=parakeet-tdt-0.6b-v3" \
  -F "language=de"

Using Python

import requests

url = "http://localhost:8002/v1/audio/transcriptions"
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json",  # or "text" for plain text
}

# Use a context manager so the file handle is closed after the request
with open("audio.wav", "rb") as f:
    response = requests.post(url, files={"file": f}, data=data)
response.raise_for_status()
result = response.json()

print(result["text"])
# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")

Response Format

The API returns JSON with the following structure:

{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
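Because each segment carries start and end times, the response is easy to turn into subtitle files. A minimal sketch that converts the JSON structure above into SRT format; the field names follow the response shown, and error handling is omitted:

```python
import json

def segments_to_srt(response_json: str) -> str:
    """Convert the transcription API response into SRT subtitle text."""
    def ts(seconds: float) -> str:
        # SRT timestamps use HH:MM:SS,mmm
        ms = round(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    result = json.loads(response_json)
    blocks = []
    for i, seg in enumerate(result.get("segments", []), 1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text']}")
    return "\n\n".join(blocks)
```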

πŸ“ Model Files

  • model.safetensors: Model weights (2.34 GB)
  • config.json: Model configuration (MLX-compatible)
  • model_config.yaml: Original NeMo training configuration (reference)

🎵 Supported Audio Formats

The model supports various audio formats via librosa. Audio is automatically converted to the required format:

Supported Formats:

  • WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA

Input Requirements:

  • Sample Rate: Automatically resampled to 16 kHz
  • Channels: Automatically converted to mono
  • Format: Any of the supported formats above

Recommended:

  • Clear audio with minimal background noise
  • 16 kHz sample rate (higher rates are automatically downsampled)
  • Mono channel audio
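librosa performs this conversion automatically at load time, so manual preprocessing is usually unnecessary. If you do want to pre-convert files yourself, here is a stdlib-only sketch for 16-bit PCM WAV input; it uses naive linear-interpolation resampling, so for production quality prefer a proper resampler such as librosa or ffmpeg:

```python
import struct
import wave

def to_mono_16k(in_path: str, out_path: str, target_sr: int = 16000) -> None:
    """Convert a 16-bit PCM WAV file to 16 kHz mono (naive linear resampling)."""
    with wave.open(in_path, "rb") as w:
        assert w.getsampwidth() == 2, "sketch handles 16-bit PCM only"
        sr, n_ch, n_fr = w.getframerate(), w.getnchannels(), w.getnframes()
        raw = w.readframes(n_fr)
    samples = struct.unpack(f"<{n_fr * n_ch}h", raw)
    # Downmix interleaved channels to mono by averaging each frame
    mono = [sum(samples[i:i + n_ch]) // n_ch for i in range(0, len(samples), n_ch)]
    # Linear-interpolation resampling to the target sample rate
    out_len = int(len(mono) * target_sr / sr)
    out = []
    for k in range(out_len):
        pos = k * sr / target_sr
        i, frac = int(pos), pos - int(pos)
        nxt = mono[min(i + 1, len(mono) - 1)]
        out.append(int(mono[i] + (nxt - mono[i]) * frac))
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(target_sr)
        w.writeframes(struct.pack(f"<{len(out)}h", *out))
```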

🎓 Training Details

Fine-tuning Configuration

This model was fine-tuned from nvidia/parakeet-tdt-0.6b-v3 using NVIDIA NeMo toolkit with the following parameters:

| Parameter | Value |
|---|---|
| Training Data | NeurologyAI/neuro-whisper-v1 |
| Training Samples | 47,596 (90% of dataset) |
| Validation Samples | 5,289 (10% of dataset) |
| Total Dataset Size | 52,885 samples |
| Total Audio Duration | 114.59 hours |
| Average Sample Duration | 7.80 seconds |
| Max Epochs | 7 |
| Learning Rate | 1e-4 |
| Weight Decay | 0.001 |
| Batch Size | 64 (per device) |
| Gradient Accumulation | 1 |
| Precision | BF16 mixed |
| Optimizer | AdamW |
| Scheduler | Cosine annealing with warmup |
| Warmup Ratio | 0.07 |
| Gradient Clipping | 1.0 |
| Validation Check Interval | Every 50 steps |
| Logging Frequency | Every 10 steps |
| Audio Duration Range | 0.1 - 20.0 seconds |
| Data Loader Workers | 8 |
| Pin Memory | True |
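The warmup-plus-cosine schedule above can be sketched as a function of training step. This is a generic implementation of the schedule shape using the table's peak learning rate and warmup ratio; `total_steps` and `min_lr` are illustrative parameters, not values from the training run:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-4,
               warmup_ratio: float = 0.07, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine annealing down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup phase
    # Cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```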

Training Dataset: NeurologyAI/neuro-whisper-v1

The neuro-whisper-v1 dataset is a comprehensive German medical speech recognition dataset specifically designed for neurology and neuro-oncology terminology.

Dataset Characteristics:

  • Language: German (de)
  • Domain: Neuro-oncology, neurology, medical terminology
  • Total Samples: 52,885 audio-text pairs
  • Total Duration: 114.59 hours of audio
  • Audio Format: 16 kHz mono WAV
  • Split: 90% training (47,596 samples) / 10% validation (5,289 samples)

Data Generation:

The dataset was generated synthetically: German medical texts were produced with Qwen/Qwen3-30B-A3B-Instruct-2507 and rendered to speech with ResembleAI's Chatterbox TTS (see Acknowledgments).

Dataset Features:

  • Consistent audio quality (uniform TTS generation)
  • Comprehensive coverage of specialized medical terminology
  • Privacy-safe (no real patient data)
  • Optimized for German medical speech recognition

Base Model Training

The base model (nvidia/parakeet-tdt-0.6b-v3) was trained on:

  • Primary Dataset: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
  • Fine-tuning Dataset: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
    • Includes: LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
  • Training: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
  • Tokenizer: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages

Training Infrastructure

  • Framework: NVIDIA NeMo 2.4
  • Training Hardware: NVIDIA A100 GPUs (Google Colab)
  • Inference Hardware: Apple Silicon via MLX framework (tested on M4 only)
  • Evaluation Hardware: Mac Mini M4 with 32 GB unified memory - only M4 was tested; results on M1/M2/M3 may vary
  • Training Script: Based on NeMo ASR training examples with TDT configuration
  • Checkpointing: Best model selected by validation WER (lowest WER), with top-2 checkpoints saved
  • CUDA Graphs: Disabled for compatibility (use_cuda_graph_decoder = False)
  • Decoding Strategy: Greedy batch decoding

πŸ₯ Medical Terminology Coverage

This model is specifically optimized for German medical terminology, with enhanced accuracy for:

  • Neuro-oncology: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
  • Molecular Markers: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
  • Treatment Modalities: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
  • Diagnostic Imaging: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
  • Clinical Assessments: Neurological examinations, cognitive assessments, motor function tests
  • Medical Procedures: Biopsies, resections, craniotomies, ventriculostomies

📊 Performance Characteristics

Evaluation Results

The model was evaluated on the validation split of NeurologyAI/neuro-whisper-v1:

| Metric | Value |
|---|---|
| Word Error Rate (WER) | 1.04% (0.0104) |
| Real-Time Factor (RTF) | 0.042 (~24x faster than real-time) |
| Evaluation Dataset | NeurologyAI/neuro-whisper-v1 (validation) |
| Evaluation Samples | 5,289 samples |
| Total Audio Duration | 22,786.68 seconds (~6.3 hours) |
| Average Inference Time | 0.18 seconds per sample |
| Samples per Second | 5.46 samples/second |
| Evaluation Hardware | Mac Mini M4 with 32 GB unified memory |

Hardware Details:

  • Device: Mac Mini M4 (only M4 was tested)
  • Memory: 32 GB unified memory
  • Framework: MLX (Apple Silicon optimized)
  • Note: Performance metrics are specific to M4. Results on M1, M2, and M3 may vary.

This demonstrates excellent performance on German medical/neurological speech recognition, with a WER of 1.04% on the validation set. The real-time factor of 0.042 means the model processes audio approximately 24x faster than real time on M4 hardware, making it well suited for both batch processing and real-time applications.

Note: All inference and validation were conducted exclusively on a Mac Mini M4 with 32 GB unified memory. Performance on M1, M2, and M3 chips may vary and has not been tested.
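WER as reported here is the standard word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal reference implementation for checking transcripts yourself; it applies no text normalization, which scoring pipelines typically perform first:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # distance row for zero reference words
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,          # deletion
                      d[j - 1] + 1,      # insertion
                      prev + (r != h))   # substitution (or match)
            prev, d[j] = d[j], cur
    return d[-1] / max(1, len(ref))
```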

Model Comparison

The following table compares WER results on the validation split of NeurologyAI/neuro-whisper-v1 (5,289 samples, ~6.3 hours of audio):

| Model | WER | Notes |
|---|---|---|
| NeurologyAI/neuro-parakeet-mlx (this model) | 1.04% | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |

Performance by Domain

  • βœ… Best Performance: German medical/neurological speech, clinical dictations
  • βœ… Strong Performance: Medical reports, patient documentation, case presentations
  • βœ… Good Performance: General German speech with medical context
  • ⚠️ Limited Performance: Non-neurology domains, other languages (base model supports 25 languages, but this fine-tuned version is optimized for German), heavily accented speech

⚠️ Limitations

  • Hardware: This model was tested and benchmarked exclusively on Apple Silicon M4. While it should work on M1/M2/M3 chips, performance metrics (RTF, speed) may vary and have not been validated on these earlier chips.
  • Language: While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. Other languages may have reduced accuracy compared to the base model.
  • Domain: Best results on German medical/neurological content; general speech may be less accurate
  • Audio Quality: Requires clear audio (16 kHz, mono) for optimal performance
  • Length: Very long audio files may have reduced accuracy
  • Accents: Performance may vary with regional accents or non-standard pronunciation
  • Background Noise: Best results with minimal background noise and clear speech

📄 License

GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license, inherited from the base model nvidia/parakeet-tdt-0.6b-v3.

This model is ready for commercial and non-commercial use under the terms of the CC-BY-4.0 license.

📚 Citation

If you use this model in your research or applications, please cite:

@misc{neuro-parakeet-mlx,
  title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  note={Based on nvidia/parakeet-tdt-0.6b-v3},
  howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}

Base Model Citation:

@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
      title={Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST}, 
      author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
      year={2025},
      eprint={2509.14128},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.14128}, 
}

Training Dataset Citation:

@dataset{neuro-whisper-v1,
  title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
  language={de}
}

πŸ™ Acknowledgments

This model builds upon the excellent work of:

  • NVIDIA: For the Parakeet TDT 0.6B v3 base model (CC-BY-4.0) - a 600-million-parameter multilingual ASR model with FastConformer-TDT architecture
  • Granary Dataset: For providing the multilingual training corpus used in base model training
  • NeMo ASR Set 3.0: For the high-quality, human-transcribed training data (~7,500 hours) used in base model training
  • Fine-tuning Dataset: NeurologyAI/neuro-whisper-v1 (MIT License) - 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
  • Qwen Team: For Qwen/Qwen3-30B-A3B-Instruct-2507 used for medical text generation in the training dataset
  • ResembleAI: For Chatterbox TTS (MIT License) used for synthetic audio generation in the training dataset
  • Apple MLX: For the MLX framework enabling efficient Apple Silicon optimization
  • NVIDIA NeMo: For the NeMo toolkit version 2.4 used in model development and training
  • Training Infrastructure: Google Colab with A100 GPUs for training
  • Training Notebook: See Parakeet_Training_MLX.ipynb for the complete training pipeline
