| --- |
| license: other |
| license_name: nvidia-software-and-model-evaluation-license |
| license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-and-model-evaluation-license/ |
| base_model: nvidia/nemotron-3.5-asr-streaming-0.6b |
| library_name: fluidaudio |
| pipeline_tag: automatic-speech-recognition |
| tags: |
| - coreml |
| - apple-silicon |
| - ane |
| - streaming-asr |
| - rnnt |
| - on-device |
| language: |
| - en |
| - es |
| - fr |
| - it |
| - pt |
| - de |
| - zh |
| - ja |
| --- |
| |
| # Nemotron 3.5 ASR Streaming Multilingual 0.6B: CoreML |
|
|
|
|
| To request access, please join the server https://discord.gg/S6m4ET3pX and message Sisyphu |
|
|
| CoreML / Apple Neural Engine ships of |
| [nemotron-3.5-asr-streaming-0.6b](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b) |
| (Conformer encoder + RNN-T decoder), optimized for on-device streaming ASR |
| on Apple Silicon. Benchmarked on Apple M5 Pro / macOS 26.5. |
|
|
| > Built on the **2026-05-29** base-checkpoint update. |
|
|
| **Two models × 4 latency tiers = 8 bundles.** |
| - **`latin/`**: one Latin-script-pruned vocab (2828 tokens) shared by |
| **en / es / fr / it / pt / de** (smaller, faster joint). |
| - **`multilingual/`**: the **full 13087-token vocab** covering every language, |
| including **zh / ja** (and 100+ more via `prompt_id`). |
|
|
| Each is available at four chunk sizes (**0.56 s / 1 s / 2 s / 4 s**), trading latency for |
| throughput. Pick the folder by script; pass the exact language at inference |
| (`--language de-DE`). FluidAudio's downloader auto-routes the language to the |
| right folder. Per-language results are in the table below and in |
| [`manifest.json`](manifest.json). |
|
|
| ## Ship matrix (per-file RTFx, single-stream batch=1) |
|
|
| RTFx = real-time factor (audio-seconds processed per wall-second; higher is |
| faster). **WER** for Latin-script languages, **CER for zh/ja** (no word |
| boundaries). **All numbers are FLEURS test, full splits** (see methodology). |
| Each cell reads RTFx (error rate); a trailing "C" marks CER. |
| The **Folder** column shows which bundle serves that language: the en/es/fr/it/pt/de |
| rows are all the *same* `latin/` model measured per language; zh/ja and |
| Multilingual are the *same* `multilingual/` model. |
|
|
| | Language | Folder | Vocab | 0.56 s (560 ms) ⚡ | 1 s (1120 ms) | **2 s (2240 ms)** ★ | 4 s (4480 ms) | Test set | |
| |---|---|--:|--:|--:|--:|--:|---| |
| | **English** | `latin` | 2828 | 58 (9.43%) | 103 (8.89%) | **130 (8.96%)** | 122 (9.02%) | FLEURS en_us | |
| | **Spanish** | `latin` | 2828 | 58 (4.95%) | 106 (4.76%) | **140 (4.80%)** | 136 (4.77%) | FLEURS es_419 | |
| | **French** | `latin` | 2828 | 57 (9.68%) | 105 (9.44%) | **130 (9.52%)** | 124 (9.42%) | FLEURS fr_fr | |
| | **Italian** | `latin` | 2828 | 59 (5.68%) | 109 (5.45%) | **147 (5.41%)** | 150 (5.40%) | FLEURS it_it | |
| | **Portuguese** | `latin` | 2828 | 59 (6.38%) | 108 (6.11%) | **141 (6.14%)** | 141 (6.18%) | FLEURS pt_br | |
| | **German** | `latin` | 2828 | 59 (10.83%) | 107 (9.78%) | **144 (9.83%)** | 142 (9.83%) | FLEURS de_de | |
| | **Chinese** | `multilingual` | 13087 | 22 (19.48% C) | 27 (18.75% C) | **89 (18.57% C)** | 90 (18.05% C) | FLEURS cmn_hans_cn | |
| | **Japanese** | `multilingual` | 13087 | 21 (14.61% C) | 26 (13.77% C) | **84 (13.79% C)** | 89 (13.82% C) | FLEURS ja_jp | |
| | **Multilingual** | `multilingual` | 13087 | 23 (9.15%) | 71 (8.64%) | **80 (8.76%)** | 78 (8.78%) | FLEURS en_us | |
|
|
| ⚡ **560 ms is the lowest-latency tier but off the trained attention tiling**: |
| lower throughput and a small quality cost vs 1120 ms. Use 1120 ms or larger unless |
| sub-second latency is required. |
|
|
| > **Full-vocab models (zh / ja / multilingual) are tier-sensitive.** The |
| > 13087-vocab joint matmul only fits the ANE working set efficiently from the |
| > **2 s** tier up. At 560 ms the per-chunk joint overhead dominates and throughput |
| > collapses to ≈ 21–23 RTFx; **use the 2 s tier for zh/ja/multilingual** |
| > (zh/ja ≈ 84–89, multilingual-en ≈ 80). Throughput at 1 s depends on output |
| > density: sparse Latin text (multilingual-en ≈ 71 RTFx) fares far better than |
| > dense CJK (zh/ja ≈ 26–27), since CJK hits the big joint on more decode steps. |
| > The Latin-script ships (small joint) are fast at every tier. |
|
|
| ### Which tier to use |
|
|
| - **2 s (2240 ms) is the recommended default for every model.** Latin-script |
| ships run ≈ 130–147 RTFx; zh/ja/multilingual reach ≈ 80–89 RTFx, at or near |
| their peak. WER/CER is at or near its best, at 2.5 s latency. |
| - **1 s (1120 ms)** for lower latency (1.25 s) on the Latin-script ships at |
| near-full quality (≈ 103–109 RTFx). Avoid for zh/ja (≈ 26–27 RTFx); |
| multilingual-en holds up better (≈ 71 RTFx). |
| - **0.56 s (560 ms)** only when sub-second latency is mandatory; off the trained |
| tiling, so throughput and quality both dip. Not recommended for |
| zh/ja/multilingual (≈ 21–23 RTFx). |
| - **4 s (4480 ms)** for offline/long-form. Within noise of 2 s for the |
| Latin-script ships, so 2 s usually dominates. |
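The guidance above can be condensed into a small lookup. This is a minimal sketch, not part of FluidAudio: the function name `pick_tier` and the need labels (`default`, `low-latency`, `sub-second`, `offline`) are hypothetical and only encode the recommendations in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a FluidAudio API): pick a tier in ms.
# Usage: pick_tier <latin|multilingual> [default|low-latency|sub-second|offline]
pick_tier() {
  local model="$1" need="${2:-default}"
  case "$need" in
    offline)
      echo 4480 ;;
    sub-second)
      # 560 ms is off the trained tiling; the full-vocab model is slow here.
      if [ "$model" != latin ]; then
        echo "warning: 560 ms is not recommended for $model" >&2
      fi
      echo 560 ;;
    low-latency)
      # 1120 ms is advised for the Latin-script bundles only.
      if [ "$model" = latin ]; then echo 1120; else echo 2240; fi ;;
    *)
      echo 2240 ;;   # recommended default for every model
  esac
}

pick_tier latin low-latency         # prints 1120
pick_tier multilingual low-latency  # prints 2240
```

The result can be interpolated straight into a bundle path such as `latin/$(pick_tier latin)ms`.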
|
|
| ## Recipe |
|
|
| All ships share: **LAYERPOS [42,13] mixed-precision encoder** (first/last 3 |
| Conformer layers INT8, middle 18 layers 6-bit palettized; ~55% encoder size |
| cut vs FP16, WER-neutral) + **B1 decoder-joint fusion** + **triple-stage |
| pipelining**. |
|
|
| Vocab handling differs by script: |
| - **Latin-script languages (en/es/fr/it/pt/de)** share **one Latin-script-pruned |
| joint**. The keep-set is derived from the **writing system** (all Latin + |
| shared punctuation/digit tokens kept; CJK/Hangul/Cyrillic/Arabic/etc. |
| dropped), **not from any test corpus**. 2828 tokens, ~5× smaller joint, no |
| test-set overfit and no in-script OOV. One model file serves all six |
| languages. |
| - **Chinese / Japanese / multilingual** keep the **full 13087-vocab joint**: no |
| pruning, no OOV, full character coverage. |
|
|
| The encoder is **shared across all languages** (a multilingual encoder that |
| selects language via `prompt_id`) and is byte-identical across the Latin-script |
| and full-vocab ships at each tier; only the decode stack differs. |
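If you have both bundles downloaded, the byte-identical claim can be spot-checked locally. This is a rough sketch assuming the folder layout described in this card; the helper name `same_encoder` and its optional root argument are illustrative, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical check: a compiled .mlmodelc is a directory, so compare trees.
# Usage: same_encoder <tier_ms> [repo_root]
same_encoder() {
  local tier_ms="$1" root="${2:-.}"
  diff -rq "$root/latin/${tier_ms}ms/encoder.mlmodelc" \
           "$root/multilingual/${tier_ms}ms/encoder.mlmodelc" >/dev/null 2>&1
}

for tier in 560 1120 2240 4480; do
  if same_encoder "$tier"; then
    echo "${tier}ms: encoders identical"
  else
    echo "${tier}ms: encoders differ or are missing"
  fi
done
```

A practical consequence, if the check passes, is that an app bundling both models could store one encoder per tier and share it.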
|
|
| ## Usage (FluidAudio) |
|
|
| Each `<model>/<tier>ms/` directory is a self-contained bundle. Pick the folder |
| by script (`latin` for en/es/fr/it/pt/de, `multilingual` for everything else) |
| and pass the exact language: |
|
|
| ```bash |
| fluidaudiocli nemotron-multilingual-transcribe \ |
| --input audio.wav \ |
| --model-dir latin/2240ms \ |
| --language de-DE |
| ``` |
|
|
| The FluidAudio auto-downloader routes `--language` to the correct folder |
| automatically. Models are shipped as compiled `.mlmodelc` (immediate load on |
| Apple Silicon). |
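When pointing `--model-dir` at a local copy yourself, the routing rule is simple enough to script. The sketch below is an illustration of that rule, not the downloader's actual logic: `bundle_folder` and `model_dir` are hypothetical names, and `ja-JP` is an assumed language tag in the same style as `de-DE`. The loop only prints the commands (a dry run).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of FluidAudio): language tag -> bundle folder.
bundle_folder() {
  # Strip any region suffix: "de-DE" -> "de", "pt_BR" -> "pt".
  local lang="${1%%[-_]*}"
  case "$lang" in
    en|es|fr|it|pt|de) echo "latin" ;;
    *)                 echo "multilingual" ;;
  esac
}

model_dir() {
  local language="$1" tier_ms="${2:-2240}"   # 2240 ms is the recommended default
  echo "$(bundle_folder "$language")/${tier_ms}ms"
}

# Dry run: print the command for each language instead of executing it.
for language in de-DE ja-JP; do
  echo fluidaudiocli nemotron-multilingual-transcribe \
    --input audio.wav \
    --model-dir "$(model_dir "$language")" \
    --language "$language"
done
```

Dropping the leading `echo` in the loop would run the transcription for real.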
|
|
| ## Folder layout |
|
|
| ``` |
| <model>/<tier>ms/ |
| preprocessor.mlmodelc |
| encoder.mlmodelc # LAYERPOS [42,13], byte-identical across both models per tier |
| decoder.mlmodelc |
| joint.mlmodelc |
| decoder_joint.mlmodelc # B1 fusion (default decode path) |
| metadata.json |
| tokenizer.json |
| ``` |
| `<model>` ∈ {latin, multilingual}; `<tier>` ∈ {560, 1120, 2240, 4480}. |
| `latin` serves en/es/fr/it/pt/de (shared Latin-script vocab); `multilingual` |
| serves zh/ja and 100+ languages via `prompt_id` (full vocab). A top-level |
| [`manifest.json`](manifest.json) indexes both models, all tiers, and per-language |
| benchmark numbers. |
|
|
| ### iOS 17 |
|
|
| The default `latin/` and `multilingual/` bundles target **iOS 18+** (they use an |
| iOS 18-only quantization op). A parallel **`ios17/`** tree |
| (`ios17/latin/<tier>ms/`, `ios17/multilingual/<tier>ms/`) mirrors them for |
| **iOS 17**, built from the same recipe re-targeted to iOS 17. WER is identical; |
| on iOS 18 hardware the iOS 17 build runs ~4% slower (it uses the older dequant |
| op), which is why both are shipped. Use `ios17/` only if you need iOS 17 support. |
|
|
| ## Notes |
|
|
| - **Latin-script ships are domain-general.** The vocab keep-set is defined by |
| the Latin writing system, not derived from any evaluation corpus, so there is |
| no test-set overfit and no out-of-vocabulary loss for any Latin-script text. |
| - **zh/ja use the full-vocab model** (no pruned keep-set), so they have no OOV |
| limitation and cover the full character inventory, at the cost of throughput |
| below the 2 s tier (use 2 s). |
| - The **multilingual** full-vocab model (13087) supports 100+ languages via |
| `prompt_id`; use it when broad coverage matters more than per-language speed. |
|
|
| ## Benchmark methodology |
|
|
| Apple M5 Pro, macOS 26.5, coremltools 9.0, CoreML iOS18 target, |
| `.cpuAndNeuralEngine` routing. Single-stream, batch=1, per-file sum-aggregate |
| RTFx (matches the Open ASR Leaderboard convention). **All languages evaluated |
| on FLEURS test, full splits.** WER for Latin-script languages, CER for zh/ja, |
| via HuggingFace normalization. No inverse text normalization is applied, so |
| FLEURS' digit-bearing utterances inflate WER by ~1–2 pp relative to |
| number-normalized references; FLEURS is also multi-domain, so these numbers run |
| higher than LibriSpeech/MLS would for the same model. |
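For readers reproducing the throughput figures, sum-aggregate RTFx is read here as total audio seconds divided by total wall-clock seconds over all files, rather than a mean of per-file ratios. That reading is an assumption. A minimal sketch follows; the `rtfx` helper and the two-column input format are hypothetical, and the sample numbers are illustrative, not measurements.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum-aggregate RTFx from lines of
#   audio_seconds<TAB>wall_seconds
rtfx() {
  awk -F'\t' '{ audio += $1; wall += $2 }
              END { if (wall > 0) printf "%.1f\n", audio / wall }' "$@"
}

# Illustrative values only: per-file RTFx would be 100 and 200 (mean 150),
# while the sum-aggregate is 40 / 0.25.
printf '10.0\t0.10\n30.0\t0.15\n' | rtfx   # prints 160.0
```

The sum-aggregate weights long files more heavily, which is why it can differ from the per-file mean.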
|
|
| ## License & attribution |
|
|
| Derived from the base model |
| [nemotron-3.5-asr-streaming-0.6b](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b), |
| governed by the [NVIDIA Software and Model Evaluation License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-and-model-evaluation-license/). |
| Weights are quantized/pruned post-training only: **no retraining, no |
| fine-tuning, no calibration-data fitting.** |
|
|