---
license: cc-by-4.0
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- parakeet
- mlx
- german
- medical
- neurology
- apple-silicon
- multilingual
datasets:
- NeurologyAI/neuro-whisper-v1
language:
- de
- multilingual
metrics:
- wer
- cer
model-index:
- name: neuro-parakeet-mlx
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: NeurologyAI/neuro-whisper-v1
      type: NeurologyAI/neuro-whisper-v1
    metrics:
    - name: WER
      type: wer
      value: 0.0104
---

# 🧠🦜 neuro-parakeet-mlx

> Fine-tuned Parakeet TDT 0.6B model optimized for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.

A specialized automatic speech recognition (ASR) model based on the multilingual Parakeet TDT 0.6B base model, fine-tuned specifically for German medical terminology, particularly neurology and neuro-oncology. While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech recognition.

This model is optimized for Apple Silicon devices using the MLX framework, providing fast and efficient inference on Mac hardware.
## 📋 Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| **Architecture** | FastConformer-TDT (RNNT) |
| **Parameters** | 600 million (base model) |
| **Model Size** | 2.34 GB |
| **Base Model Languages** | Multilingual (25 languages) |
| **Fine-tuned Language** | German (de), medical domain |
| **Domain** | Medical/Neurological (German) |
| **Framework** | MLX (Apple Silicon optimized) |
| **Tokenizer** | SentencePiece BPE (8,192 tokens, multilingual) |
| **Fine-tuned on** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **License** | [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) |

## ✨ Key Features

- 🌍 **Multilingual Base**: Built on a multilingual model (25 languages) fine-tuned for German medical speech
- 🚀 **Apple Silicon Optimized**: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
- ⚡ **Ultra-Fast Inference**: Real-time factor of 0.042 (~24x faster than real-time) on M4
- 🏥 **Medical Domain Specialized**: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
- 🧠 **Neurology Focus**: Optimized for neurology and neuro-oncology terminology
- 🔌 **OpenAI-Compatible API**: Drop-in replacement for OpenAI Whisper API endpoints
- 📝 **Automatic Formatting**: Built-in punctuation and capitalization
- ⏱️ **Timestamps**: Word-level and segment-level timing information
- 🎙️ **Long Audio Support**: Handles extended recordings (up to 24 minutes with full attention)
- 🌐 **Multiple Formats**: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more

## 🚀 Quick Start

### Installation

```bash
pip install parakeet-mlx
```

### System Requirements

- **Hardware**: Apple Silicon (M1/M2/M3/M4) supported; **tested and benchmarked on M4 only**, so performance on M1/M2/M3 may vary
- **RAM**: At least 4 GB available (8 GB+ recommended)
- **Python**: 3.8 or higher
- **macOS**: 12.0 or later

### Basic Usage

#### Using parakeet-mlx CLI

```bash
# Install
pip install -U parakeet-mlx

# Transcribe audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx
```

#### Using mlx-audio

```bash
# Install
pip install -U mlx-audio

# Transcribe audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav
```

#### Using Python API

```python
from parakeet_mlx import from_pretrained

# Load model (first run will download ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, 'segments'):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
```

## 🌐 OpenAI-Compatible API Server

This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like [open-webui](https://github.com/open-webui/open-webui) and other OpenAI API clients.
### Starting the Server

You can use the [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server) repository for a ready-to-use OpenAI-compatible API server:

```bash
# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx
```

Or run directly:

```bash
python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002
```

### API Usage Examples

#### Using cURL

```bash
curl -X POST http://localhost:8002/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=parakeet-tdt-0.6b-v3" \
  -F "language=de"
```

#### Using Python

```python
import requests

url = "http://localhost:8002/v1/audio/transcriptions"
files = {"file": open("audio.wav", "rb")}
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json"  # or "text" for plain text
}

response = requests.post(url, files=files, data=data)
result = response.json()
print(result["text"])

# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")
```

#### Response Format

The API returns JSON with the following structure:

```json
{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
```

## 📁 Model Files

- `model.safetensors`: Model weights (2.34 GB)
- `config.json`: Model configuration (MLX-compatible)
- `model_config.yaml`: Original NeMo training configuration (reference)

## 🎵 Supported Audio Formats

The model supports various audio formats via librosa.
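To illustrate what this conversion amounts to (the model expects 16 kHz mono input), here is a hedged numpy sketch; the helper name is hypothetical, and the naive linear-interpolation resampling stands in for librosa's higher-quality filtered resampler:

```python
import numpy as np

TARGET_SR = 16_000  # sample rate the model expects

def to_model_input(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz via linear interpolation.

    Illustration only: librosa performs this with proper anti-aliasing
    filters; this sketch just shows the shape of the transformation.
    """
    if audio.ndim == 2:  # (samples, channels) -> mono by averaging channels
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        duration = audio.shape[0] / sr
        n_out = int(round(duration * TARGET_SR))
        old_t = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
        new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
        audio = np.interp(new_t, old_t, audio)
    return audio.astype(np.float32)

# Example: one second of 44.1 kHz stereo becomes 16,000 mono samples
stereo = np.random.default_rng(0).standard_normal((44_100, 2))
mono16k = to_model_input(stereo, 44_100)
print(mono16k.shape)  # (16000,)
```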
Audio is automatically converted to the required format:

**Supported Formats:**

- WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA

**Input Requirements:**

- **Sample Rate**: Automatically resampled to 16 kHz
- **Channels**: Automatically converted to mono
- **Format**: Any of the supported formats above

**Recommended:**

- Clear audio with minimal background noise
- 16 kHz sample rate (or higher; will be downsampled)
- Mono channel audio

## 🎓 Training Details

### Fine-tuning Configuration

This model was fine-tuned from [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) using the NVIDIA NeMo toolkit with the following parameters:

| Parameter | Value |
|-----------|-------|
| **Training Data** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **Training Samples** | 47,596 (90% of dataset) |
| **Validation Samples** | 5,289 (10% of dataset) |
| **Total Dataset Size** | 52,885 samples |
| **Total Audio Duration** | 114.59 hours |
| **Average Sample Duration** | 7.80 seconds |
| **Max Epochs** | 7 |
| **Learning Rate** | 1e-4 |
| **Weight Decay** | 0.001 |
| **Batch Size** | 64 (per device) |
| **Gradient Accumulation** | 1 |
| **Precision** | BF16 mixed |
| **Optimizer** | AdamW |
| **Scheduler** | Cosine annealing with warmup |
| **Warmup Ratio** | 0.07 |
| **Gradient Clipping** | 1.0 |
| **Validation Check Interval** | Every 50 steps |
| **Logging Frequency** | Every 10 steps |
| **Audio Duration Range** | 0.1 - 20.0 seconds |
| **Data Loader Workers** | 8 |
| **Pin Memory** | True |

### Training Dataset: NeurologyAI/neuro-whisper-v1

The [neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) dataset is a comprehensive German medical speech recognition dataset specifically designed for neurology and neuro-oncology terminology.
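NeMo's ASR fine-tuning recipes consume JSON-lines manifests with `audio_filepath`, `duration`, and `text` fields. As a hedged sketch of producing a shuffled 90/10 train/validation split like the one used here (helper name, file names, and dummy entries are all illustrative, not the actual training code):

```python
import json
import random

def write_manifests(samples, train_path, val_path, val_fraction=0.10, seed=0):
    """Shuffle samples and write 90/10 NeMo-style JSON-lines manifests.

    Each sample is a dict with the fields NeMo's ASR recipes expect:
    audio_filepath, duration (seconds), and text.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_val = round(len(samples) * val_fraction)
    splits = {val_path: samples[:n_val], train_path: samples[n_val:]}
    for path, subset in splits.items():
        with open(path, "w", encoding="utf-8") as f:
            for s in subset:
                f.write(json.dumps(s, ensure_ascii=False) + "\n")
    return len(samples) - n_val, n_val

# Illustrative dummy data; real entries point at the dataset's 16 kHz WAV files
demo = [{"audio_filepath": f"clip_{i}.wav", "duration": 7.8,
         "text": "Beispieltext"} for i in range(100)]
n_train, n_val = write_manifests(demo, "train_manifest.json", "val_manifest.json")
print(n_train, n_val)  # 90 10
```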
**Dataset Characteristics:**

- **Language**: German (de)
- **Domain**: Neuro-oncology, neurology, medical terminology
- **Total Samples**: 52,885 audio-text pairs
- **Total Duration**: 114.59 hours of audio
- **Audio Format**: 16 kHz mono WAV
- **Split**: 90% training (47,596 samples) / 10% validation (5,289 samples)

**Data Generation:**

- **Voice Data**: Synthetically generated using [Resemble AI Chatterbox TTS](https://github.com/resemble-ai/chatterbox)
- **Text Data**: Medical text generated with [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Purpose**: ASR training for medical transcription in neurology/neuro-oncology domains

**Dataset Features:**

- Consistent audio quality (uniform TTS generation)
- Comprehensive coverage of specialized medical terminology
- Privacy-safe (no real patient data)
- Optimized for German medical speech recognition

### Base Model Training

The base model ([nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)) was trained on:

- **Primary Dataset**: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
- **Fine-tuning Dataset**: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
  - Includes: LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
- **Training**: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
- **Tokenizer**: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages

### Training Infrastructure

- **Framework**: NVIDIA NeMo 2.4
- **Training Hardware**: NVIDIA A100 GPUs (Google Colab)
- **Inference Hardware**: Apple Silicon via MLX framework (tested on M4 only)
- **Evaluation Hardware**: Mac Mini M4 with 32 GB unified memory; **only M4 was tested, results on M1/M2/M3 may vary**
- **Training Script**: Based on NeMo ASR training examples with TDT configuration
- **Checkpointing**: Best model selected by validation WER (lowest WER), with top-2 checkpoints saved
- **CUDA Graphs**: Disabled for compatibility (`use_cuda_graph_decoder = False`)
- **Decoding Strategy**: Greedy batch decoding

## 🌍 CO2 Emission Related to Experiments

Fine-tuning was conducted on Google Cloud Platform in region `europe-west3-a`, which has a carbon efficiency of 0.61 kgCO₂eq/kWh. A cumulative 5 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250 W). Total emissions are estimated at **0.76 kgCO₂eq**, 100 percent of which was directly offset by the cloud provider. Estimates were produced with the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

**Reference:**

```bibtex
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```

## 🏥 Medical Terminology Coverage

This model is specifically optimized for German medical terminology, with enhanced accuracy for:

- **Neuro-oncology**: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
- **Molecular Markers**: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
- **Treatment Modalities**: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
- **Diagnostic Imaging**: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
- **Clinical Assessments**: Neurological examinations, cognitive assessments, motor function tests
- **Medical Procedures**: Biopsies, resections, craniotomies, ventriculostomies

## 📊 Performance Characteristics

### Evaluation Results

The model was evaluated on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1):

| Metric | Value |
|--------|-------|
| **Word Error Rate (WER)** | **1.04%** (0.0104) |
| **Real-Time Factor (RTF)** | **0.042** (~24x faster than real-time) |
| **Evaluation Dataset** | NeurologyAI/neuro-whisper-v1 (validation) |
| **Evaluation Samples** | 5,289 samples |
| **Total Audio Duration** | 22,786.68 seconds (~6.3 hours) |
| **Average Inference Time** | 0.18 seconds per sample |
| **Samples per Second** | 5.46 samples/second |
| **Evaluation Hardware** | Mac Mini M4 with 32 GB unified memory |

**Hardware Details:**

- **Device**: Mac Mini M4 (only M4 was tested)
- **Memory**: 32 GB unified memory
- **Framework**: MLX (Apple Silicon optimized)
- **Note**: Performance metrics are specific to M4. Results on M1, M2, and M3 may vary.

This demonstrates excellent performance on German medical/neurological speech recognition, with a WER below 1.1% on the validation set. The real-time factor of 0.042 means the model processes audio approximately **24x faster than real-time** on M4 hardware, making it well suited to batch processing and real-time applications.

**Note:** All inference and validation were conducted exclusively on a Mac Mini M4 with 32 GB unified memory. Performance on M1, M2, and M3 chips may vary and has not been tested.

**Evaluation Tool:** These results were obtained using the [🎤🏥🎯 Medical ASR Evaluator](https://github.com/riedemannai/Medical_ASR_Evaluator), a standalone tool for evaluating ASR models with the Word Error Rate (WER) metric, optimized for medical/clinical speech recognition.
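For reference, WER is the word-level Levenshtein edit distance divided by the number of reference words, and RTF is inference time divided by audio duration. A minimal self-contained sketch (my own implementation, not the Medical ASR Evaluator's code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

print(wer("der Patient zeigt eine Hemiparese",
          "der Patient zeigt keine Hemiparese"))  # 0.2 (1 substitution / 5 words)

# The reported real-time factor follows from the evaluation numbers above:
avg_inference_s = 0.18
avg_audio_s = 22_786.68 / 5_289        # ~4.31 s of audio per validation sample
print(round(avg_inference_s / avg_audio_s, 3))  # 0.042
```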
### Model Comparison

The following table compares WER results on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (5,289 samples, ~6.3 hours of audio):

| Model | WER | Notes |
|-------|-----|-------|
| **NeurologyAI/neuro-parakeet-mlx** (this model) | **1.04%** | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |

### Performance by Domain

- ✅ **Best Performance**: German medical/neurological speech, clinical dictations
- ✅ **Strong Performance**: Medical reports, patient documentation, case presentations
- ✅ **Good Performance**: General German speech with medical context
- ⚠️ **Limited Performance**: Non-neurology domains, other languages (the base model supports 25 languages, but this fine-tuned version is optimized for German), heavily accented speech

## ⚠️ Limitations

- **Hardware**: This model was tested and benchmarked exclusively on Apple Silicon M4. While it should work on M1/M2/M3 chips, performance metrics (RTF, speed) may vary and have not been validated on these earlier chips.
- **Language**: While the base model supports 25 languages, this fine-tuned version is optimized for German medical speech. Other languages may have reduced accuracy compared to the base model.
- **Domain**: Best results on German medical/neurological content; general speech may be less accurate
- **Audio Quality**: Requires clear audio (16 kHz, mono) for optimal performance
- **Length**: Very long audio files may have reduced accuracy
- **Accents**: Performance may vary with regional accents or non-standard pronunciation
- **Background Noise**: Best results with minimal background noise and clear speech

## 📄 License

**GOVERNING TERMS**: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license, inherited from the base model [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).

This model is ready for **commercial and non-commercial use** under the terms of the CC-BY-4.0 license.

## 📚 Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{neuro-parakeet-mlx,
  title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  note={Based on nvidia/parakeet-tdt-0.6b-v3},
  howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}
```

**Base Model Citation:**

```bibtex
@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
  title={Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
  author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
  year={2025},
  eprint={2509.14128},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.14128},
}
```

**Training Dataset Citation:**

```bibtex
@dataset{neuro-whisper-v1,
  title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
  language={de}
}
```

## 🙏 Acknowledgments

This model builds upon the excellent work of:

- **NVIDIA**: For the [Parakeet TDT 0.6B v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) base model (CC-BY-4.0), a 600-million-parameter multilingual ASR model with FastConformer-TDT architecture
- **Granary Dataset**: For providing the multilingual training corpus used in base model training
- **NeMo ASR Set 3.0**: For the high-quality, human-transcribed training data used in base model training
- **Fine-tuning Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (MIT License), 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
- **Qwen Team**: For [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507), used for medical text generation in the training dataset
- **ResembleAI**: For [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) (MIT License), used for synthetic audio generation in the training dataset
- **Apple MLX**: For the [MLX framework](https://github.com/ml-explore/mlx) enabling efficient Apple Silicon optimization
- **NVIDIA NeMo**: For the [NeMo toolkit](https://github.com/NVIDIA/NeMo) version 2.4 used in model development and training
- **Training Infrastructure**: Google Colab with A100 GPUs
- **Training Notebook**: See [Parakeet_Training_MLX.ipynb](colab_notebooks/Parakeet_Training_MLX.ipynb) for the complete training pipeline

## 🔗 Related Resources

- **API Server**: [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server), an OpenAI-compatible FastAPI server for this model
- **Base Model**: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- **Training Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1), a German medical ASR dataset with 52,885 samples (114.59 hours, 90/10 train/val split) for neurology/neuro-oncology
- **Data Generation**: [Resemble AI Chatterbox](https://github.com/resemble-ai/chatterbox) and [Qwen3](https://github.com/QwenLM/Qwen3)
- **MLX Framework**: [Apple MLX](https://github.com/ml-explore/mlx)
- **NeMo Toolkit**: [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- **Parakeet Collection**: [Parakeet Models on Hugging Face](https://huggingface.co/collections/nvidia/parakeet)