---
license: cc-by-4.0
base_model: nvidia/parakeet-tdt-0.6b-v3
tags:
- automatic-speech-recognition
- asr
- parakeet
- mlx
- german
- medical
- neurology
- apple-silicon
- multilingual
datasets:
- NeurologyAI/neuro-whisper-v1
language:
- de
- multilingual
metrics:
- wer
- cer
model-index:
- name: neuro-parakeet-mlx
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: NeurologyAI/neuro-whisper-v1
      type: NeurologyAI/neuro-whisper-v1
    metrics:
    - name: WER
      type: wer
      value: 0.0104
---

# 🧠🦜 neuro-parakeet-mlx

> Fine-tuned Parakeet TDT 0.6B optimized for German medical/neurological speech recognition, converted to MLX format for Apple Silicon.

A specialized automatic speech recognition (ASR) model based on the multilingual Parakeet TDT 0.6B, fine-tuned for German medical terminology with a focus on neurology and neuro-oncology. While the base model supports 25 languages, this fine-tuned version targets German medical speech. Converted to the MLX framework, it provides fast, efficient inference on Apple Silicon Macs.

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) |
| **Architecture** | FastConformer-TDT (RNNT) |
| **Parameters** | 600 million (base model) |
| **Model Size** | 2.34 GB |
| **Base Model Languages** | Multilingual (25 languages) |
| **Fine-tuned Language** | German (de), medical domain |
| **Domain** | Medical/neurological (German) |
| **Framework** | MLX (Apple Silicon optimized) |
| **Tokenizer** | SentencePiece BPE (8,192 tokens, multilingual) |
| **Fine-tuned on** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **License** | [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) |

## Key Features

- **Multilingual Base**: Built on a 25-language base model, fine-tuned for German medical speech
- **Apple Silicon Optimized**: Native MLX framework for fast inference on Mac (tested on M4; M1/M2/M3 results may vary)
- **Ultra-Fast Inference**: Real-time factor of 0.042 (~24x faster than real-time) on M4
- **Medical Domain Specialized**: Fine-tuned on 52,885 German medical audio samples (114.59 hours)
- **Neurology Focus**: Optimized for neurology and neuro-oncology terminology
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI Whisper API endpoints
- **Automatic Formatting**: Built-in punctuation and capitalization
- **Timestamps**: Word-level and segment-level timing information
- **Long Audio Support**: Handles extended recordings (up to 24 minutes with full attention)
- **Multiple Formats**: Supports WAV, MP3, FLAC, OGG, M4A, AAC, and more

## Quick Start

### Installation

```bash
pip install parakeet-mlx
```

### System Requirements

- **Hardware**: Apple Silicon (M1/M2/M3/M4) supported; tested and benchmarked on M4 only, so performance on M1/M2/M3 may vary
- **RAM**: At least 4 GB available (8 GB+ recommended)
- **Python**: 3.8 or higher
- **macOS**: 12.0 or later

### Basic Usage

#### Using the parakeet-mlx CLI

```bash
# Install
pip install -U parakeet-mlx

# Transcribe an audio file
parakeet-mlx audio.wav --model NeurologyAI/neuro-parakeet-mlx
```

#### Using mlx-audio

```bash
# Install
pip install -U mlx-audio

# Transcribe an audio file
python -m mlx_audio.stt.generate --model NeurologyAI/neuro-parakeet-mlx --audio audio.wav
```

#### Using the Python API

```python
from parakeet_mlx import from_pretrained

# Load the model (first run downloads ~2.34 GB)
model = from_pretrained("NeurologyAI/neuro-parakeet-mlx")

# Transcribe an audio file
result = model.transcribe("path/to/audio.wav", language="de")
print(result.text)

# Access timestamps if needed
if hasattr(result, "segments"):
    for segment in result.segments:
        print(f"{segment.start:.2f}s - {segment.end:.2f}s: {segment.text}")
```

## OpenAI-Compatible API Server

This model can be served as a drop-in replacement for OpenAI's Whisper API, compatible with tools like [open-webui](https://github.com/open-webui/open-webui) and other OpenAI API clients.

### Starting the Server

You can use the [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server) repository for a ready-to-use OpenAI-compatible API server:

```bash
# Clone and start the server
git clone git@github.com:riedemannai/parakeet-mlx-server.git
cd parakeet-mlx-server
pip install -r requirements.txt
./start_server.sh --model NeurologyAI/neuro-parakeet-mlx
```

Or run directly:

```bash
python parakeet_server.py --model NeurologyAI/neuro-parakeet-mlx --port 8002
```

### API Usage Examples

#### Using cURL

```bash
curl -X POST http://localhost:8002/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav" \
  -F "model=parakeet-tdt-0.6b-v3" \
  -F "language=de"
```

#### Using Python

```python
import requests

url = "http://localhost:8002/v1/audio/transcriptions"
files = {"file": open("audio.wav", "rb")}
data = {
    "model": "parakeet-tdt-0.6b-v3",
    "language": "de",
    "response_format": "json",  # or "text" for plain text
}

response = requests.post(url, files=files, data=data)
result = response.json()

print(result["text"])

# Access segments with timestamps if available
if "segments" in result:
    for seg in result["segments"]:
        print(f"{seg.get('start', 0):.2f}s: {seg.get('text', '')}")
```

#### Response Format

The API returns JSON with the following structure:

```json
{
  "text": "Transcribed text here...",
  "recording_timestamp": "optional timestamp",
  "segments": [
    {
      "text": "Segment text",
      "start": 0.0,
      "end": 5.2
    }
  ]
}
```

## Model Files

- `model.safetensors`: Model weights (2.34 GB)
- `config.json`: Model configuration (MLX-compatible)
- `model_config.yaml`: Original NeMo training configuration (for reference)

## Supported Audio Formats

The model accepts a wide range of audio formats via librosa; audio is converted automatically to the required input format.

**Supported Formats:**
- WAV, MP3, FLAC, OGG, M4A, AAC, AIFF, AU, WMA

**Input Requirements:**
- **Sample Rate**: Automatically resampled to 16 kHz
- **Channels**: Automatically converted to mono
- **Format**: Any of the supported formats above

**Recommended:**
- Clear audio with minimal background noise
- 16 kHz sample rate (or higher; will be downsampled)
- Mono audio
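
Although conversion is automatic, it can help to see the target format concretely. A minimal sketch using only Python's standard-library `wave` module (the file name `tone.wav` is illustrative) that writes and then inspects a 16 kHz mono, 16-bit PCM file, i.e. the format the model consumes natively:

```python
import math
import struct
import wave

# Write one second of a 440 Hz tone as 16 kHz mono, 16-bit PCM.
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)      # mono
    f.setsampwidth(2)      # 16-bit samples
    f.setframerate(16000)  # 16 kHz
    samples = (int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / 16000))
               for n in range(16000))
    f.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Inspect any WAV file before transcription.
with wave.open("tone.wav", "rb") as f:
    print(f.getframerate(), f.getnchannels())  # 16000 1
```

Files in other sample rates or channel layouts work too; they are simply resampled and downmixed on load.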

## Training Details

### Fine-tuning Configuration

This model was fine-tuned from [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) using the NVIDIA NeMo toolkit with the following parameters:

| Parameter | Value |
|-----------|-------|
| **Training Data** | [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) |
| **Training Samples** | 47,596 (90% of dataset) |
| **Validation Samples** | 5,289 (10% of dataset) |
| **Total Dataset Size** | 52,885 samples |
| **Total Audio Duration** | 114.59 hours |
| **Average Sample Duration** | 7.80 seconds |
| **Max Epochs** | 7 |
| **Learning Rate** | 1e-4 |
| **Weight Decay** | 0.001 |
| **Batch Size** | 64 (per device) |
| **Gradient Accumulation** | 1 |
| **Precision** | BF16 mixed |
| **Optimizer** | AdamW |
| **Scheduler** | Cosine annealing with warmup |
| **Warmup Ratio** | 0.07 |
| **Gradient Clipping** | 1.0 |
| **Validation Check Interval** | Every 50 steps |
| **Logging Frequency** | Every 10 steps |
| **Audio Duration Range** | 0.1-20.0 seconds |
| **Data Loader Workers** | 8 |
| **Pin Memory** | True |
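
As a sanity check on the schedule, the warmup length implied by the table can be derived. This assumes a single device and gradient accumulation of 1 (so an effective batch size of 64); the step counts below are derived, not logged values:

```python
import math

train_samples = 47_596
batch_size = 64       # per device; gradient accumulation = 1
max_epochs = 7
warmup_ratio = 0.07

steps_per_epoch = math.ceil(train_samples / batch_size)  # 744
total_steps = steps_per_epoch * max_epochs               # 5208
warmup_steps = round(warmup_ratio * total_steps)         # 365
print(steps_per_epoch, total_steps, warmup_steps)
```

With a cosine schedule, the learning rate therefore ramps up over roughly the first 365 steps and decays over the remaining ~4,800.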

### Training Dataset: NeurologyAI/neuro-whisper-v1

The [neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) dataset is a comprehensive German medical speech recognition dataset designed specifically for neurology and neuro-oncology terminology.

**Dataset Characteristics:**
- **Language**: German (de)
- **Domain**: Neuro-oncology, neurology, medical terminology
- **Total Samples**: 52,885 audio-text pairs
- **Total Duration**: 114.59 hours of audio
- **Audio Format**: 16 kHz mono WAV
- **Split**: 90% training (47,596 samples) / 10% validation (5,289 samples)
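
The split figures above are internally consistent, as a quick check shows:

```python
total, train, val = 52_885, 47_596, 5_289

# Train and validation partitions account for every sample.
assert train + val == total
print(f"val fraction: {val / total:.1%}")  # val fraction: 10.0%
```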

**Data Generation:**
- **Voice Data**: Synthetically generated using [Resemble AI Chatterbox TTS](https://github.com/resemble-ai/chatterbox)
- **Text Data**: Medical text generated with [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
- **Purpose**: ASR training for medical transcription in neurology/neuro-oncology domains

**Dataset Features:**
- Consistent audio quality (uniform TTS generation)
- Comprehensive coverage of specialized medical terminology
- Privacy-safe (no real patient data)
- Optimized for German medical speech recognition

### Base Model Training

The base model ([nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)) was trained on:

- **Primary Dataset**: Granary multilingual corpus (660,000+ hours of pseudo-labeled data)
- **Fine-tuning Dataset**: NeMo ASR Set 3.0 (~7,500 hours of high-quality, human-transcribed data)
  - Includes LibriSpeech, Fisher Corpus, VCTK, Europarl-ASR, Multilingual LibriSpeech, Mozilla Common Voice, and more
- **Training**: 150,000 steps on 128 A100 GPUs, then 5,000 steps on 4 A100 GPUs
- **Tokenizer**: Unified SentencePiece BPE with 8,192 tokens, optimized across 25 languages

### Training Infrastructure

- **Framework**: NVIDIA NeMo 2.4
- **Training Hardware**: NVIDIA A100 GPUs (Google Colab)
- **Inference Hardware**: Apple Silicon via the MLX framework (tested on M4 only)
- **Evaluation Hardware**: Mac Mini M4 with 32 GB unified memory; only the M4 was tested, so results on M1/M2/M3 may vary
- **Training Script**: Based on NeMo ASR training examples with TDT configuration
- **Checkpointing**: Best model selected by lowest validation WER, with the top 2 checkpoints saved
- **CUDA Graphs**: Disabled for compatibility (`use_cuda_graph_decoder = False`)
- **Decoding Strategy**: Greedy batch decoding

## CO2 Emissions Related to Experiments

Fine-tuning was conducted on Google Cloud Platform in region `europe-west3-a`, which has a carbon efficiency of 0.61 kgCO₂eq/kWh. A cumulative 5 hours of computation was performed on hardware of type A100 PCIe 40/80GB (TDP of 250W).

Total emissions are estimated at **0.76 kgCO₂eq**, 100% of which was directly offset by the cloud provider.

Estimates were produced with the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
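
Following the calculator's methodology, the estimate is simply hours of compute times power draw times regional carbon efficiency:

```python
hours = 5
tdp_kw = 0.250         # A100 PCIe TDP, 250 W
kg_co2_per_kwh = 0.61  # europe-west3-a carbon efficiency

emissions = hours * tdp_kw * kg_co2_per_kwh
print(f"{emissions:.2f} kgCO2eq")  # 0.76 kgCO2eq
```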

**Reference:**

```bibtex
@article{lacoste2019quantifying,
  title={Quantifying the Carbon Emissions of Machine Learning},
  author={Lacoste, Alexandre and Luccioni, Alexandra and Schmidt, Victor and Dandres, Thomas},
  journal={arXiv preprint arXiv:1910.09700},
  year={2019}
}
```

## Medical Terminology Coverage

This model is specifically optimized for German medical terminology, with enhanced accuracy for:

- **Neuro-oncology**: Gliomas, glioblastomas, astrocytomas, oligodendrogliomas, meningiomas
- **Molecular Markers**: IDH mutations, MGMT methylation, 1p/19q codeletion, EGFR amplification
- **Treatment Modalities**: Radiation therapy, chemotherapy, immunotherapy, targeted therapy
- **Diagnostic Imaging**: MRI, CT, PET scans, contrast enhancement, diffusion-weighted imaging
- **Clinical Assessments**: Neurological examinations, cognitive assessments, motor function tests
- **Medical Procedures**: Biopsies, resections, craniotomies, ventriculostomies
## π Performance Characteristics |
|
|
|
|
|
### Evaluation Results |
|
|
|
|
|
The model was evaluated on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1): |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| **Word Error Rate (WER)** | **1.04%** (0.0104) | |
|
|
| **Real-Time Factor (RTF)** | **0.042** (~24x faster than real-time) | |
|
|
| **Evaluation Dataset** | NeurologyAI/neuro-whisper-v1 (validation) | |
|
|
| **Evaluation Samples** | 5,289 samples | |
|
|
| **Total Audio Duration** | 22,786.68 seconds (~6.3 hours) | |
|
|
| **Average Inference Time** | 0.18 seconds per sample | |
|
|
| **Samples per Second** | 5.46 samples/second | |
|
|
| **Evaluation Hardware** | Mac Mini M4 with 32 GB unified memory (VRAM) | |
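
The reported RTF follows from the table's own figures: average inference time divided by average clip duration. A quick consistency check:

```python
total_audio_s = 22_786.68
n_samples = 5_289
avg_infer_s = 0.18

avg_audio_s = total_audio_s / n_samples  # ~4.31 s of audio per sample
rtf = avg_infer_s / avg_audio_s          # inference time / audio time
print(f"RTF = {rtf:.3f}, speedup = {1 / rtf:.0f}x")  # RTF = 0.042, speedup = 24x
```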

**Hardware Details:**
- **Device**: Mac Mini M4 (the only chip tested)
- **Memory**: 32 GB unified memory
- **Framework**: MLX (Apple Silicon optimized)

This demonstrates excellent performance on German medical/neurological speech recognition, with a WER below 1.1% on the validation set. The real-time factor of 0.042 means the model processes audio approximately **24x faster than real-time** on M4 hardware, making it well suited to both batch processing and real-time applications. All inference and validation were conducted exclusively on a Mac Mini M4 with 32 GB unified memory; performance on M1, M2, and M3 chips has not been tested and may vary.

**Evaluation Tool:** These results were obtained with the [Medical ASR Evaluator](https://github.com/riedemannai/Medical_ASR_Evaluator), a standalone tool for evaluating ASR models by Word Error Rate (WER), optimized for medical/clinical speech recognition.
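
For reference, WER is the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words. A minimal pure-Python sketch (not the evaluator's actual code; example sentences are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1] / len(ref)

print(wer("der Patient hat ein Gliom", "der Patient hat ein Gliom"))    # 0.0
print(wer("der Patient hat ein Gliom", "der Patient hat einen Gliom"))  # 0.2
```

Real evaluators typically also normalize text (casing, punctuation) before scoring, which this sketch omits.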

### Model Comparison

The following table compares WER on the validation split of [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (5,289 samples, ~6.3 hours of audio):

| Model | WER | Notes |
|-------|-----|-------|
| **NeurologyAI/neuro-parakeet-mlx** (this model) | **1.04%** | Fine-tuned Parakeet TDT 0.6B for German medical |
| mlx-community/parakeet-tdt-0.6b-v3 | 18.31% | Base Parakeet TDT 0.6B (no fine-tuning) |
| mlx-community/whisper-large-v3-mlx | 13.96% | Base Whisper Large v3 (no fine-tuning) |

### Performance by Domain

- ✅ **Best Performance**: German medical/neurological speech, clinical dictations
- ✅ **Strong Performance**: Medical reports, patient documentation, case presentations
- ✅ **Good Performance**: General German speech with medical context
- ⚠️ **Limited Performance**: Non-neurology domains, other languages (the base model supports 25 languages, but this fine-tune is optimized for German), heavily accented speech

## Limitations

- **Hardware**: Tested and benchmarked exclusively on Apple Silicon M4. It should work on M1/M2/M3 chips, but performance metrics (RTF, speed) have not been validated on those chips.
- **Language**: Although the base model supports 25 languages, this fine-tune is optimized for German medical speech; other languages may be less accurate than with the base model.
- **Domain**: Best results on German medical/neurological content; general speech may be less accurate.
- **Audio Quality**: Requires clear audio (16 kHz, mono) for optimal performance.
- **Length**: Very long audio files may have reduced accuracy.
- **Accents**: Performance may vary with regional accents or non-standard pronunciation.
- **Background Noise**: Best results with minimal background noise and clear speech.

## License

**Governing terms**: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license, inherited from the base model [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3).

This model is ready for **commercial and non-commercial use** under the terms of the CC-BY-4.0 license.

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{neuro-parakeet-mlx,
  title={Neuro-Parakeet-MLX: German Medical Speech Recognition for Neurology and Neuro-oncology Optimized for Apple Silicon},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  note={Based on nvidia/parakeet-tdt-0.6b-v3},
  howpublished={\url{https://huggingface.co/NeurologyAI/neuro-parakeet-mlx}}
}
```

**Base Model Citation:**

```bibtex
@misc{sekoyan2025canary1bv2parakeettdt06bv3efficient,
  title={Canary-1B-v2 \& Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
  author={Monica Sekoyan and Nithin Rao Koluguri and Nune Tadevosyan and Piotr Zelasko and Travis Bartley and Nikolay Karpov and Jagadeesh Balam and Boris Ginsburg},
  year={2025},
  eprint={2509.14128},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.14128}
}
```

**Training Dataset Citation:**

```bibtex
@dataset{neuro-whisper-v1,
  title={neuro-whisper-v1: German Medical Speech Recognition Dataset},
  author={Riedemann, Lars},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1},
  language={de}
}
```

## Acknowledgments

This model builds upon the excellent work of:

- **NVIDIA**: The [Parakeet TDT 0.6B v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) base model (CC-BY-4.0), a 600-million-parameter multilingual ASR model with a FastConformer-TDT architecture
- **Granary Dataset**: The multilingual corpus used in base model training
- **NeMo ASR Set 3.0**: The high-quality, human-transcribed data used in base model training
- **Fine-tuning Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1) (MIT License), 52,885 German medical audio samples (114.59 hours) for neurology/neuro-oncology
- **Qwen Team**: [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507), used for medical text generation in the training dataset
- **Resemble AI**: [Chatterbox TTS](https://github.com/resemble-ai/chatterbox) (MIT License), used for synthetic audio generation in the training dataset
- **Apple MLX**: The [MLX framework](https://github.com/ml-explore/mlx), enabling efficient inference on Apple Silicon
- **NVIDIA NeMo**: The [NeMo toolkit](https://github.com/NVIDIA/NeMo) (version 2.4), used for model development and training
- **Training Infrastructure**: Google Colab with A100 GPUs
- **Training Notebook**: See [Parakeet_Training_MLX.ipynb](colab_notebooks/Parakeet_Training_MLX.ipynb) for the complete training pipeline

## Related Resources

- **API Server**: [parakeet-mlx-server](https://github.com/riedemannai/parakeet-mlx-server), an OpenAI-compatible FastAPI server for this model
- **Base Model**: [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3)
- **Training Dataset**: [NeurologyAI/neuro-whisper-v1](https://huggingface.co/datasets/NeurologyAI/neuro-whisper-v1), a German medical ASR dataset with 52,885 samples (114.59 hours, 90/10 train/val split) for neurology/neuro-oncology
- **Data Generation**: [Resemble AI Chatterbox](https://github.com/resemble-ai/chatterbox) and [Qwen3](https://github.com/QwenLM/Qwen3)
- **MLX Framework**: [Apple MLX](https://github.com/ml-explore/mlx)
- **NeMo Toolkit**: [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- **Parakeet Collection**: [Parakeet Models on Hugging Face](https://huggingface.co/collections/nvidia/parakeet)