canary-1b-v2 — MLX (bf16)

MLX-native conversion of nvidia/canary-1b-v2 (FastConformer attention-encoder-decoder ASR, 25 European languages incl. French) for Apple Silicon, runnable with mlx-audio.

Changes made: this is a format conversion only (NVIDIA NeMo .nemo → MLX safetensors, bf16). No retraining, fine-tuning or quantization — the weights are derived deterministically from the public NVIDIA checkpoint. Unofficial community conversion — not affiliated with or endorsed by NVIDIA.

Contents

/                       AED model (text)
  model.safetensors     bf16, ~1.96 GB
  config.json           mlx-audio canary schema
  tokenizer.model       SentencePiece (16384, 25 langs)
ctc/                    native timestamps (word/sentence alignment)
  model.safetensors     CTC-model FastConformer encoder (own fine-tuned weights), bf16
  config.json, tokenizer.model
  ctc_head.npz          CTC head (W 16385×1024, b; blank id 16384)
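
As a rough illustration of how the ctc/ pieces fit together, the sketch below applies a linear CTC head of the listed shape to encoder frames and takes a log-softmax over the 16385 classes (16384 tokens plus blank). The helper name ctc_log_probs and the npz key names "W" and "b" are assumptions made for illustration; inspect the file's actual keys before relying on them.

```python
import numpy as np

BLANK_ID = 16384  # blank id stated in the file listing

def ctc_log_probs(enc_frames, W, b):
    """Per-frame log-probabilities from encoder output.

    enc_frames: (T, 1024) encoder frames; W: (16385, 1024); b: (16385,).
    """
    logits = enc_frames.astype(np.float32) @ W.astype(np.float32).T + b.astype(np.float32)
    logits -= logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Hypothetical loading step (key names are assumed, not confirmed):
# head = np.load("ctc/ctc_head.npz"); W, b = head["W"], head["b"]
```

The argmax of each row gives the greedy CTC label for that frame; the full matrix is what a forced aligner consumes.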

Architecture

Unchanged from canary-1b-v2: 32-layer FastConformer encoder + 8-layer Transformer decoder, 128-mel, dw_striding ×8 subsampling, rel_pos attention. See the original model card for training data, benchmarks and full details.

Usage (mlx-audio)

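from mlx_audio.stt.utils import load

# Load the converted checkpoint by its Hugging Face repo id.
model = load("CogniSoftOrg/canary-1b-v2-mlx-bf16")
# Same source and target language means transcription; use_pnc enables punctuation and capitalization.
res = model.generate("audio_16k.wav", source_lang="fr", target_lang="fr", use_pnc=True)
print(res.text)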

The ctc/ encoder and ctc_head.npz provide native timestamps at roughly 80 ms resolution. Run the AED model to get the text, then force-align the AED tokens onto the CTC frames (WhisperX-style, but using only Canary components).
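
The alignment step can be sketched in plain NumPy as a Viterbi pass over the CTC lattice. This is a generic CTC forced-alignment routine, not code taken from mlx-audio or from this repository. The function names ctc_forced_align and spans_to_seconds are hypothetical, the 0.08 s frame duration simply restates the ~80 ms figure, and the token ids are assumed to be valid in the CTC model's vocabulary.

```python
import numpy as np

FRAME_SEC = 0.08  # ~80 ms per CTC frame, as stated in this card

def ctc_forced_align(log_probs, tokens, blank_id):
    """Viterbi alignment of `tokens` onto CTC frames.

    log_probs: (T, V) per-frame log-probabilities; tokens: list of token ids.
    Returns one inclusive (first_frame, last_frame) pair per token.
    """
    T = log_probs.shape[0]
    ext = [blank_id]
    for tok in tokens:
        ext += [tok, blank_id]                  # blank, t1, blank, t2, ..., blank
    S = len(ext)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=np.int64)
    score[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        score[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [s]                         # stay in the same state
            if s >= 1:
                cands.append(s - 1)             # advance by one state
            if s >= 2 and ext[s] != blank_id and ext[s] != ext[s - 2]:
                cands.append(s - 2)             # skip the blank between distinct tokens
            best = max(cands, key=lambda p: score[t - 1, p])
            score[t, s] = score[t - 1, best] + log_probs[t, ext[s]]
            back[t, s] = best
    # A valid path ends on the last token or on the trailing blank.
    s = S - 1
    if S > 1 and score[T - 1, S - 2] > score[T - 1, S - 1]:
        s = S - 2
    if not np.isfinite(score[T - 1, s]):
        raise ValueError("too few frames for this token sequence")
    spans = [None] * len(tokens)
    for t in range(T - 1, -1, -1):              # walk the best path backwards
        if s % 2 == 1:                          # odd states hold real tokens
            i = s // 2
            last = spans[i][1] if spans[i] else t
            spans[i] = (t, last)
        if t > 0:
            s = int(back[t, s])
    return spans

def spans_to_seconds(spans, frame_sec=FRAME_SEC):
    """Convert inclusive frame spans to (start, end) times in seconds."""
    return [(first * frame_sec, (last + 1) * frame_sec) for first, last in spans]
```

The double Python loop is written for clarity, not speed; long recordings would want a vectorised version. Word-level times follow by merging the spans of the tokens that make up each word.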

Known issues / limitations (experimental)

Tested with mlx-audio 0.4.3. The conversion itself is correct (it matches the ONNX reference on clean audio), but the mlx-audio canary inference code has several issues that need workarounds:

  • Decoder masks are float32 while q/k/v are bf16, so scaled_dot_product_attention raises a dtype error. Either run the decoder in float32 (cast the weights at load) or patch the two SDPA call sites to cast the mask to q.dtype.
  • The language= kwarg is silently filtered out by generate_transcription. Pass source_lang / target_lang directly to model.generate instead.
  • Greedy decoding can get stuck in repetition loops on hard far-field audio. Use no_repeat_ngram_size=3.
  • The standalone ctc/ model is fast but garbles far-field text (a CTC limitation). Use it for timing only and keep the AED for text.
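
For readers unfamiliar with the no_repeat_ngram_size remedy in the list above, the following pure-Python sketch shows the general idea of no-repeat n-gram blocking during greedy decoding. It illustrates the technique only; the function names banned_next_tokens and greedy_step are hypothetical, and mlx-audio's own implementation may differ.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

def greedy_step(logits, generated, ngram_size=3):
    """Pick the highest-scoring token that does not repeat an n-gram."""
    banned = banned_next_tokens(generated, ngram_size)
    allowed = [i for i in range(len(logits)) if i not in banned]
    return max(allowed, key=lambda i: logits[i])
```

With ngram_size=3, a decoder that has already emitted "5 6 7" and is again at "5 6" cannot pick 7, which is what breaks the loop.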

Inherited limitations of Canary-1B-v2 (noisy/far-field robustness, language coverage) apply — refer to the original model card.

License

CC-BY-4.0 — https://creativecommons.org/licenses/by/4.0/. Derived from nvidia/canary-1b-v2 (© NVIDIA), itself released under CC-BY-4.0 (GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license). Attribution to NVIDIA is required; this conversion is provided as-is, without warranty.

Citation

@article{canary-1b-v2,
  title={Canary-1B-v2 \& Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
  journal={arXiv:2509.14128}, year={2025}
}
@article{granary,
  title={Granary: Speech Recognition and Translation Dataset in 25 European Languages},
  journal={arXiv:2505.13404}, year={2025}
}