	---
	license: other
	license_name: nvidia-software-and-model-evaluation-license
	license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-and-model-evaluation-license/
	base_model: nvidia/nemotron-3.5-asr-streaming-0.6b
	library_name: fluidaudio
	pipeline_tag: automatic-speech-recognition
	tags:
	- coreml
	- apple-silicon
	- ane
	- streaming-asr
	- rnnt
	- on-device
	language:
	- en
	- es
	- fr
	- it
	- pt
	- de
	- zh
	- ja
	---

	# Nemotron 3.5 ASR Streaming Multilingual 0.6B — CoreML


	To request access, join the Discord server at https://discord.gg/S6m4ET3pX and message Sisyphu.

	CoreML / Apple Neural Engine builds of
	[nemotron-3.5-asr-streaming-0.6b](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b)
	(Conformer encoder + RNN-T decoder), optimized for on-device streaming ASR
	on Apple Silicon. Benchmarked on Apple M5 Pro / macOS 26.5.

	> Built on the 2026-05-29 base-checkpoint update.

	Two models × 4 latency tiers = 8 bundles.
	- `latin/` — one Latin-script-pruned vocab (2828 tokens) shared by
	en / es / fr / it / pt / de (smaller, faster joint).
	- `multilingual/` — the full 13087-token vocab covering every language,
	including zh / ja (and 100+ more via `prompt_id`).

	Each model ships at four chunk sizes (0.56 s / 1 s / 2 s / 4 s), trading latency for
	throughput. Pick the folder by script; pass the exact language at inference
	(`--language de-DE`). FluidAudio's downloader auto-routes the language to the
	right folder. Per-language results are in the table below and in
	[`manifest.json`](manifest.json).

	## Ship matrix (per-file RTFx, single-stream batch=1)

	RTFx = inverse real-time factor (audio-seconds processed per wall-second; higher is
	faster). Each cell reads `RTFx (error rate)`: WER for Latin-script languages, CER
	(marked `C`) for zh/ja, which have no word boundaries. ⭐ marks the recommended
	default tier. All numbers are FLEURS test, full splits (see methodology).
	The Folder column shows which bundle serves that language: the en/es/fr/it/pt/de
	rows are all the same `latin/` model measured per language; zh/ja and
	Multilingual are the same `multilingual/` model, with the Multilingual row
	measured on English.

	| Language | Folder | Vocab | 0.56 s (560 ms) ‡ | 1 s (1120 ms) | 2 s (2240 ms) ⭐ | 4 s (4480 ms) | Test set |
	|---|---|--:|--:|--:|--:|--:|---|
	| English | `latin` | 2828 | 58 (9.43%) | 103 (8.89%) | 130 (8.96%) | 122 (9.02%) | FLEURS en_us |
	| Spanish | `latin` | 2828 | 58 (4.95%) | 106 (4.76%) | 140 (4.80%) | 136 (4.77%) | FLEURS es_419 |
	| French | `latin` | 2828 | 57 (9.68%) | 105 (9.44%) | 130 (9.52%) | 124 (9.42%) | FLEURS fr_fr |
	| Italian | `latin` | 2828 | 59 (5.68%) | 109 (5.45%) | 147 (5.41%) | 150 (5.40%) | FLEURS it_it |
	| Portuguese | `latin` | 2828 | 59 (6.38%) | 108 (6.11%) | 141 (6.14%) | 141 (6.18%) | FLEURS pt_br |
	| German | `latin` | 2828 | 59 (10.83%) | 107 (9.78%) | 144 (9.83%) | 142 (9.83%) | FLEURS de_de |
	| Chinese | `multilingual` | 13087 | 22 (19.48% C) | 27 (18.75% C) | 89 (18.57% C) | 90 (18.05% C) | FLEURS cmn_hans_cn |
	| Japanese | `multilingual` | 13087 | 21 (14.61% C) | 26 (13.77% C) | 84 (13.79% C) | 89 (13.82% C) | FLEURS ja_jp |
	| Multilingual | `multilingual` | 13087 | 23 (9.15%) | 71 (8.64%) | 80 (8.76%) | 78 (8.78%) | FLEURS en_us |

	‡ 560 ms is the lowest-latency tier but off the trained attention tiling —
	lower throughput and a small quality cost vs 1120 ms. Use 1120 ms+ unless
	sub-second latency is required.

	> Full-vocab models (zh / ja / multilingual) are tier-sensitive. The
	> 13087-vocab joint matmul only fits the ANE working set efficiently at the
	> 2 s tier. At 560 ms the per-chunk joint overhead dominates and throughput
	> collapses to ≈ 21–23 RTFx; use the 2 s tier for zh/ja/multilingual
	> (zh/ja ≈ 84–89, multilingual-en ≈ 80). Throughput at 1 s depends on output
	> density: sparse Latin text (multilingual-en ≈ 71 RTFx) fares far better than
	> dense CJK (zh/ja ≈ 26–27), since CJK hits the big joint on more decode steps.
	> The Latin-script ships (small joint) are fast at every tier.

	### Which tier to use

	- 2 s (2240 ms) is the recommended default for every model. Latin-script
	ships run ≈ 130–147 RTFx; zh/ja/multilingual reach ≈ 80–89 RTFx, at or close to
	their best. WER/CER is at or near its best, at 2.5 s latency.
	- 1 s (1120 ms) for lower latency (1.25 s) on the Latin-script ships at
	near-full quality (≈ 103–109 RTFx). Avoid for zh/ja (≈ 26–27 RTFx); the
	multilingual model holds up better on sparse Latin text (≈ 71 RTFx on English).
	- 0.56 s (560 ms) only when sub-second latency is mandatory; off the trained
	tiling, so throughput and quality both dip. Not recommended for
	zh/ja/multilingual (≈ 21–23 RTFx).
	- 4 s (4480 ms) for offline/long-form. Within noise of 2 s for the
	Latin-script ships, so 2 s usually dominates.
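
The tier guidance can be approximated with a small lookup over the ship matrix. The sketch below is illustrative only: `RTFX` copies the throughput figures from the table, and `pick_tier` is a hypothetical helper (not part of FluidAudio) that returns the highest-throughput chunk size within a chunk-size budget. It ranks purely by RTFx and treats the budget as chunk length, not end-to-end latency, so WER/CER still needs to be checked against the table.

```python
# RTFx per chunk size in ms, copied from the ship matrix.
RTFX = {
    "English":      {560: 58, 1120: 103, 2240: 130, 4480: 122},
    "Spanish":      {560: 58, 1120: 106, 2240: 140, 4480: 136},
    "French":       {560: 57, 1120: 105, 2240: 130, 4480: 124},
    "Italian":      {560: 59, 1120: 109, 2240: 147, 4480: 150},
    "Portuguese":   {560: 59, 1120: 108, 2240: 141, 4480: 141},
    "German":       {560: 59, 1120: 107, 2240: 144, 4480: 142},
    "Chinese":      {560: 22, 1120: 27, 2240: 89, 4480: 90},
    "Japanese":     {560: 21, 1120: 26, 2240: 84, 4480: 89},
    "Multilingual": {560: 23, 1120: 71, 2240: 80, 4480: 78},
}


def pick_tier(language: str, max_chunk_ms: int) -> int:
    """Return the chunk size (ms) with the highest RTFx that fits the budget.

    Hypothetical helper: ties go to the smaller chunk size.
    """
    candidates = {ms: rtfx for ms, rtfx in RTFX[language].items()
                  if ms <= max_chunk_ms}
    if not candidates:
        raise ValueError(f"no tier has a chunk size <= {max_chunk_ms} ms")
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    for lang in ("English", "Chinese"):
        print(lang, pick_tier(lang, max_chunk_ms=2240))
```

With a 2240 ms budget this selects the 2 s tier for both English and Chinese; with a 1500 ms budget, Chinese falls back to 1120 ms at 27 RTFx, which is the throughput cliff described above.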

	## Recipe

	All ships share: LAYERPOS [42,13] mixed-precision encoder (first/last 3
	Conformer layers INT8, middle 18 layers 6-bit palettized — ~55% encoder size
	cut vs FP16, WER-neutral) + B1 decoder⊕joint fusion + **triple-stage
	pipelining**.
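
As a rough sanity check on the ~55% figure, the average bit width of the mixed-precision layout can be worked out directly. This is a back-of-envelope sketch under stated assumptions: all 24 Conformer layers (3 + 18 + 3) are taken to hold the same number of weights, and palette lookup tables, quantization scales, and any parameters left unquantized are ignored.

```python
FP16_BITS = 16

# (layer count, bits per weight) for the layout described above:
# first 3 + last 3 layers at INT8, middle 18 layers at 6-bit.
layout = [(6, 8), (18, 6)]

total_layers = sum(count for count, _ in layout)
avg_bits = sum(count * bits for count, bits in layout) / total_layers
size_cut = 1 - avg_bits / FP16_BITS

print(f"{total_layers} layers, {avg_bits:.2f} bits/weight on average")
print(f"ideal size cut vs FP16: {size_cut:.1%}")
```

Under these assumptions the layers average 6.5 bits per weight, an ideal cut of about 59%. The reported ~55% sits a little below that, which is the direction one would expect once the ignored overheads are counted.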

	Vocab handling differs by script:
	- Latin-script languages (en/es/fr/it/pt/de) share **one Latin-script-pruned
	joint — the keep-set is derived from the writing system** (all Latin +
	shared punctuation/digit tokens kept; CJK/Hangul/Cyrillic/Arabic/etc.
	dropped), not from any test corpus. 2828 tokens, ~5× smaller joint, no
	test-set overfit and no in-script OOV. One model file serves all six
	languages.
	- Chinese / Japanese / multilingual keep the full 13087-vocab joint — no
	pruning, no OOV, full character coverage.
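
The idea of deriving the keep-set from the writing system rather than from a corpus can be sketched with the standard-library `unicodedata` module. This is an illustration only: the actual rule used to build the 2828-token set is not documented here, and `is_latin_script_token`, `build_keep_set`, and the toy vocabulary are all hypothetical.

```python
import unicodedata


def is_latin_script_token(token: str) -> bool:
    """Keep a token if every letter in it is a Latin-script letter.

    Tokens without letters (digits, punctuation, word-boundary markers)
    are kept, mirroring the "shared punctuation/digit tokens" rule.
    """
    for ch in token:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return False
    return True


def build_keep_set(vocab: list[str]) -> list[int]:
    """Return the indices of the tokens that survive script-based pruning."""
    return [i for i, tok in enumerate(vocab) if is_latin_script_token(tok)]


# Toy vocabulary (assumption): Latin, digits, punctuation, then other scripts.
vocab = ["\u2581the", "stra\u00dfe", "caf\u00e9", "123", ",",
         "\u4f60\u597d", "\u3053\u3093", "\u043f\u0440\u0438", "\ud55c\uad6d"]
keep = build_keep_set(vocab)

if __name__ == "__main__":
    print(keep)
```

Because the test is applied per character class and never looks at evaluation text, the resulting set cannot overfit a test corpus. One caveat of this particular check is that it also keeps compatibility forms such as fullwidth Latin letters, which a production rule might want to exclude.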

	The encoder is shared across all languages (a multilingual encoder that
	selects language via `prompt_id`) and is byte-identical across the Latin-script
	and full-vocab ships at each tier — only the decode stack differs.

	## Usage (FluidAudio)

	Each `<model>/<tier>ms/` directory is a self-contained bundle. Pick the folder
	by script (`latin` for en/es/fr/it/pt/de, `multilingual` for everything else)
	and pass the exact language:

	```bash
	fluidaudiocli nemotron-multilingual-transcribe \
	  --input audio.wav \
	  --model-dir latin/2240ms \
	  --language de-DE
	```

	The FluidAudio auto-downloader routes `--language` to the correct folder
	automatically. Models are shipped as compiled `.mlmodelc` (immediate load on
	Apple Silicon).
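
For scripts that build the `--model-dir` value themselves, the folder rule above reduces to a few lines. The sketch below is a hypothetical stand-alone equivalent, not FluidAudio's own routing code; `model_dir` and its defaults are assumptions made for illustration.

```python
LATIN_LANGS = {"en", "es", "fr", "it", "pt", "de"}
TIERS_MS = (560, 1120, 2240, 4480)


def model_dir(language: str, tier_ms: int = 2240) -> str:
    """Map a language tag such as 'de-DE' to a '<model>/<tier>ms' folder."""
    if tier_ms not in TIERS_MS:
        raise ValueError(f"tier must be one of {TIERS_MS}, got {tier_ms}")
    # Use only the primary subtag: 'de-DE' and 'de_DE' both become 'de'.
    primary = language.replace("_", "-").split("-")[0].lower()
    folder = "latin" if primary in LATIN_LANGS else "multilingual"
    return f"{folder}/{tier_ms}ms"


if __name__ == "__main__":
    print(model_dir("de-DE"))        # latin/2240ms
    print(model_dir("ja-JP", 4480))  # multilingual/4480ms
```

The default of 2240 ms follows the recommended tier; any language outside the six Latin-script ones falls through to `multilingual`.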

	## Folder layout

	```
	<model>/<tier>ms/
	  preprocessor.mlmodelc
	  encoder.mlmodelc        # LAYERPOS [42,13], byte-identical across both models per tier
	  decoder.mlmodelc
	  joint.mlmodelc
	  decoder_joint.mlmodelc  # B1 fusion (default decode path)
	  metadata.json
	  tokenizer.json
	```
	`<model>` ∈ {latin, multilingual}; `<tier>` ∈ {560, 1120, 2240, 4480}.
	`latin` serves en/es/fr/it/pt/de (shared Latin-script vocab); `multilingual`
	serves zh/ja and 100+ languages via `prompt_id` (full vocab). A top-level
	[`manifest.json`](manifest.json) indexes both models, all tiers, and per-language
	benchmark numbers.
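
After a manual download it can help to confirm that a bundle directory is complete before loading it. The following is a minimal sketch that checks only for the entry names listed in the layout above; `missing_entries` is a hypothetical helper, and it does not inspect file contents or the `manifest.json` schema.

```python
from pathlib import Path

# Entry names taken from the folder layout; .mlmodelc entries are directories.
EXPECTED_ENTRIES = (
    "preprocessor.mlmodelc",
    "encoder.mlmodelc",
    "decoder.mlmodelc",
    "joint.mlmodelc",
    "decoder_joint.mlmodelc",
    "metadata.json",
    "tokenizer.json",
)


def missing_entries(bundle_dir: str) -> list[str]:
    """Return the expected entries that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("latin/2240ms")
    print("bundle complete" if not missing else f"missing: {missing}")
```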

	### iOS 17

	The default `latin/` and `multilingual/` bundles target iOS 18+ (they use an
	iOS 18-only quantization op). A parallel `ios17/` tree
	(`ios17/latin/<tier>ms/`, `ios17/multilingual/<tier>ms/`) mirrors them for
	iOS 17, built from the same recipe re-targeted to iOS 17. WER is identical;
	on devices running iOS 18 the iOS 17 build runs ~4% slower (it uses the older dequant
	op), which is why both are shipped. Use `ios17/` only if you need iOS 17 support.
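
Choosing between the two trees comes down to the app's minimum deployment target. A small sketch of that decision, assuming the tree names above; `bundle_path` is hypothetical, and rejecting targets below iOS 17 is an assumption, since no bundles for older versions are described here.

```python
def bundle_path(model: str, tier_ms: int, min_ios_major: int) -> str:
    """Return the bundle folder for a minimum iOS major version."""
    if min_ios_major < 17:
        # Assumption: nothing in this repository targets iOS 16 or earlier.
        raise ValueError("no bundles are described for iOS versions below 17")
    # The default tree needs iOS 18+; the ios17/ tree mirrors it for iOS 17.
    prefix = "" if min_ios_major >= 18 else "ios17/"
    return f"{prefix}{model}/{tier_ms}ms"


if __name__ == "__main__":
    print(bundle_path("latin", 2240, min_ios_major=18))
    print(bundle_path("multilingual", 2240, min_ios_major=17))
```

Keying the choice on the deployment target rather than the OS version at runtime keeps a single tree per app build; an app that supports both could instead select at runtime to avoid the ~4% slowdown on iOS 18.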

	## Notes

	- Latin-script ships are domain-general. The vocab keep-set is defined by
	the Latin writing system, not derived from any evaluation corpus, so there is
	no test-set overfit and no out-of-vocabulary loss for any Latin-script text.
	- zh/ja use the full-vocab model (no pruned keep-set), so they have no OOV
	limitation and cover the full character inventory — at the cost of throughput
	below the 2 s tier (use 2 s).
	- The multilingual full-vocab model (13087) supports 100+ languages via
	`prompt_id` — use it when broad coverage matters more than per-language speed.

	## Benchmark methodology

	Apple M5 Pro, macOS 26.5, coremltools 9.0, CoreML iOS18 target,
	`.cpuAndNeuralEngine` routing. Single-stream, batch=1, per-file sum-aggregate
	RTFx (matches the Open ASR Leaderboard convention). **All languages evaluated
	on FLEURS test, full splits.** WER for Latin-script languages, CER for zh/ja,
	via HuggingFace normalization. No inverse text normalization is applied, so
	FLEURS' digit-bearing utterances inflate WER by ~1–2 pp relative to
	number-normalized references; FLEURS is also multi-domain, so these numbers run
	higher than LibriSpeech/MLS would for the same model.
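
The two metrics can be reproduced in a few lines. The sketch below reads "per-file sum-aggregate RTFx" as total audio seconds divided by total wall seconds, and computes corpus-level WER/CER with a plain Levenshtein distance. It is an illustration, not the evaluation script: text normalization is omitted entirely, stripping spaces before CER is an assumption, and all function names are hypothetical.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def error_rate(pairs: list[tuple[str, str]], unit: str = "word") -> float:
    """Corpus-level WER (unit='word') or CER (unit='char') over (ref, hyp) pairs."""
    def tokens(text: str) -> list[str]:
        return text.split() if unit == "word" else list(text.replace(" ", ""))

    errors = sum(edit_distance(tokens(r), tokens(h)) for r, h in pairs)
    total = sum(len(tokens(r)) for r, _ in pairs)
    return errors / total


def sum_aggregate_rtfx(files: list[tuple[float, float]]) -> float:
    """files holds (audio_seconds, wall_seconds) per file."""
    return sum(a for a, _ in files) / sum(w for _, w in files)


if __name__ == "__main__":
    timings = [(10.0, 0.25), (30.0, 0.5)]
    print(f"RTFx: {sum_aggregate_rtfx(timings):.1f}")
    print(f"WER:  {error_rate([('the cat sat', 'the cat sat down')]):.3f}")
```

Summing before dividing weights each file by its duration, so it differs from averaging per-file ratios: the two files above give about 53.3 RTFx in aggregate, while the mean of their individual ratios (40 and 60) is 50.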

	## License & attribution

	Derived from the base model
	[nemotron-3.5-asr-streaming-0.6b](https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b),
	governed by the [NVIDIA Software and Model Evaluation License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-and-model-evaluation-license/).
	Weights are quantized/pruned post-training only — **no retraining, no
	fine-tuning, no calibration-data fitting.**