canary-1b-v2 — MLX (bf16)
MLX-native conversion of nvidia/canary-1b-v2 (FastConformer attention-encoder-decoder ASR, 25 European languages incl. French) for Apple Silicon, runnable with mlx-audio.
Changes made: this is a format conversion only (NVIDIA NeMo .nemo → MLX safetensors, bf16). No retraining, fine-tuning or quantization; the weights are derived deterministically from the public NVIDIA checkpoint. This is an unofficial community conversion, not affiliated with or endorsed by NVIDIA.
Contents
/                     AED model (text)
  model.safetensors   bf16, ~1.96 GB
  config.json         mlx-audio canary schema
  tokenizer.model     SentencePiece (16384, 25 langs)
ctc/                  native timestamps (word/sentence alignment)
  model.safetensors   CTC-model FastConformer encoder (own fine-tuned weights), bf16
  config.json, tokenizer.model
  ctc_head.npz        CTC head (W 16385×1024, b; blank id 16384)
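The shapes listed for ctc_head.npz suggest a single linear projection from 1024-dimensional encoder frames to 16385 classes (16384 SentencePiece tokens plus the blank). The following is a minimal sketch of that projection using synthetic arrays, so it runs without the real files. The function name ctc_log_probs is hypothetical, and the archive key names "W" and "b" are an assumption read off the listing above, not verified against the file.

```python
import numpy as np

VOCAB_WITH_BLANK = 16385  # 16384 SentencePiece tokens + 1 blank class
HIDDEN = 1024             # encoder frame width implied by W's second dimension
BLANK_ID = 16384          # blank id as listed above


def ctc_log_probs(enc_frames, W, b):
    """Project encoder frames (T, 1024) to per-frame log-probabilities (T, 16385)."""
    # Compute in float32 for numerical stability, whatever the stored dtype is.
    logits = enc_frames.astype(np.float32) @ W.astype(np.float32).T + b.astype(np.float32)
    # Numerically stable log-softmax over the class axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))


# Synthetic stand-ins for the real weights and encoder output.
# With the real archive one would presumably do something like
#   head = np.load("ctc/ctc_head.npz"); W, b = head["W"], head["b"]
# but those key names are an assumption.
rng = np.random.default_rng(0)
W = (rng.standard_normal((VOCAB_WITH_BLANK, HIDDEN)) * 0.02).astype(np.float32)
b = np.zeros(VOCAB_WITH_BLANK, dtype=np.float32)
enc_frames = rng.standard_normal((5, HIDDEN)).astype(np.float32)

log_probs = ctc_log_probs(enc_frames, W, b)
print(log_probs.shape)
```

Each row of the result is a distribution over tokens plus blank for one encoder frame, which is the input a CTC aligner expects.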
Architecture
Unchanged from canary-1b-v2: 32-layer FastConformer encoder + 8-layer Transformer decoder,
128-mel, dw_striding ×8 subsampling, rel_pos attention. See the
original model card for training data, benchmarks
and full details.
Usage (mlx-audio)
from mlx_audio.stt.utils import load
model = load("CogniSoftOrg/canary-1b-v2-mlx-bf16")
res = model.generate("audio_16k.wav", source_lang="fr", target_lang="fr", use_pnc=True)
print(res.text)
The ctc/ encoder + ctc_head.npz give native ~80 ms timestamps: run the AED for the text and
CTC-forced-align the AED tokens onto the CTC frames (WhisperX-style, all-Canary).
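The alignment step described above can be sketched as a plain Viterbi pass over the CTC log-probabilities. This is a minimal NumPy illustration of CTC forced alignment in general, not the code used by this repository or by mlx-audio: the function name ctc_forced_align, the toy three-class vocabulary, and the fixed 0.08 s frame duration are assumptions made for the example.

```python
import numpy as np

FRAME_SEC = 0.08  # ~80 ms per CTC frame, as stated above


def ctc_forced_align(log_probs, tokens, blank_id):
    """Viterbi-align `tokens` to CTC frames.

    log_probs: (T, V) array of per-frame log-probabilities.
    Returns one (start_frame, end_frame) pair per token, end exclusive.
    """
    T = log_probs.shape[0]
    # Extended label sequence: blank, tok1, blank, tok2, ..., blank.
    ext = [blank_id]
    for tok in tokens:
        ext += [tok, blank_id]
    S = len(ext)
    NEG = -1e30
    score = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=np.int64)
    score[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        score[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one, or skip a blank
            # (only between two different non-blank tokens).
            best, arg = score[t - 1, s], s
            if s >= 1 and score[t - 1, s - 1] > best:
                best, arg = score[t - 1, s - 1], s - 1
            if (s >= 2 and ext[s] != blank_id and ext[s] != ext[s - 2]
                    and score[t - 1, s - 2] > best):
                best, arg = score[t - 1, s - 2], s - 2
            score[t, s] = best + log_probs[t, ext[s]]
            back[t, s] = arg
    # A valid path ends on the final blank or on the final token.
    s = S - 1
    if S > 1 and score[T - 1, S - 2] > score[T - 1, S - 1]:
        s = S - 2
    if score[T - 1, s] < NEG / 2:
        raise ValueError("not enough frames to align all tokens")
    # Backtrace the best path, then collect the frames owned by each token.
    path = [s]
    for t in range(T - 1, 0, -1):
        s = int(back[t, s])
        path.append(s)
    path.reverse()
    spans = [None] * len(tokens)
    for t, s in enumerate(path):
        if s % 2 == 1:  # odd positions in `ext` are real tokens
            i = s // 2
            spans[i] = (t, t + 1) if spans[i] is None else (spans[i][0], t + 1)
    return spans


# Toy example: classes 0 and 1 are tokens, class 2 is the blank.
favoured = [2, 0, 0, 2, 1, 2]  # most likely class in each of 6 frames
probs = np.full((6, 3), 0.05)
for t, k in enumerate(favoured):
    probs[t, k] = 0.9
spans = ctc_forced_align(np.log(probs), tokens=[0, 1], blank_id=2)
times = [(round(a * FRAME_SEC, 2), round(b * FRAME_SEC, 2)) for a, b in spans]
print(spans, times)
```

In a real pipeline the tokens would be the AED output ids and log_probs would come from the ctc/ encoder followed by the CTC head; word and sentence timings then follow by merging the spans of the tokens that make up each word.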
Known issues / limitations (experimental)
Tested with mlx-audio 0.4.3. The conversion is correct (matches the ONNX reference on clean audio), but the mlx-audio canary inference code has rough edges you must work around:
- Decoder masks are float32 vs bf16 qkv → scaled_dot_product_attention raises a dtype error. Run the decoder in float32 (cast weights at load) or patch the two SDPA call sites to cast the mask to q.dtype.
- language= kwarg is silently filtered by generate_transcription → pass source_lang/target_lang directly to model.generate.
- Greedy decoding loops on hard far-field audio → use no_repeat_ngram_size=3.
- The standalone ctc/ model recognizes fast but garbles far-field text (CTC limitation): use it for timing only, keep the AED for text.
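To show why no_repeat_ngram_size=3 breaks decoding loops, here is a minimal pure-Python sketch of n-gram blocking during greedy decoding. It illustrates the general technique only and is not mlx-audio's implementation; the helper names banned_next_tokens and greedy_step, and the toy scores, are hypothetical.

```python
def banned_next_tokens(generated, n=3):
    """Token ids that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):])
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


def greedy_step(scores, generated, n=3):
    """Pick the highest-scoring token that does not repeat an existing n-gram."""
    banned = banned_next_tokens(generated, n)
    # Assumes at least one token is not banned, which holds for any realistic vocabulary.
    candidates = [i for i in range(len(scores)) if i not in banned]
    return max(candidates, key=lambda i: scores[i])


# Toy example: the decoder has emitted 5 6 7 5 6 and would greedily pick 7 again,
# which would start the loop 5 6 7 5 6 7 ...
generated = [5, 6, 7, 5, 6]
scores = [0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.1, 0.6]  # token 7 scores highest
banned = banned_next_tokens(generated, n=3)
next_token = greedy_step(scores, generated, n=3)
print(banned, next_token)
```

With the trigram 5 6 7 already present, token 7 is excluded and the next-best token is chosen, so the decoder cannot repeat the same three-token sequence indefinitely.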
Inherited limitations of Canary-1B-v2 (noisy/far-field robustness, language coverage) apply — refer to the original model card.
License
CC-BY-4.0 — https://creativecommons.org/licenses/by/4.0/. Derived from
nvidia/canary-1b-v2 (© NVIDIA), itself released under
CC-BY-4.0 (GOVERNING TERMS: Use of this model is governed by the CC-BY-4.0 license). Attribution
to NVIDIA is required; this conversion is provided as-is, without warranty.
Citation
@article{canary-1b-v2,
title={Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST},
journal={arXiv:2509.14128}, year={2025}
}
@article{granary,
title={Granary: Speech Recognition and Translation Dataset in 25 European Languages},
journal={arXiv:2505.13404}, year={2025}
}