---
license: mit
library_name: onnxruntime
tags:
- wav2vec2
- forced-alignment
- whisperx
- onnx
---

# wav2vec2 forced-alignment models: ONNX mirror for WhisperX

Parity-validated ONNX exports of the wav2vec2-CTC models WhisperX uses for **word-level forced alignment**, so the C++ engine core can run them under ONNX Runtime with **no PyTorch and no Python** at runtime.

These are produced by [`golden/export_align_onnx.py`](https://github.com/Cptweirdo/whisperX) in the WhisperX repo (**our own export, not a re-host**) because parity to the committed torch emission goldens (`emission_atol = 0.006`) is only guaranteed by a pinned `opset 17` / raw-logits export. Each model is self-checked against the golden emissions *before* upload.

## Layout

```
/               # '/' in HF ids sanitized to '--'
  model.onnx    # masked graph (contract v2), opset 18, dynamic {batch,time}
  meta.json     # dictionary, blank_id, contract, provenance, sha256
```

## Artifact contract (v2: masked, batchable)

- **`model.onnx`** takes **two inputs**, `waveform (B,N) float32` (raw 16 kHz mono, no feature normalization) and `attention_mask (B,N) int64` (1 = real sample, 0 = padding), and emits **two outputs**: `emissions (B,T,V)` **raw logits** and `frame_lengths (B,)` (valid output frames per row, to trim padded batches). The consumer applies `log_softmax` and the OOV **wildcard column** (max non-blank per frame), matching `whisperx/alignment.py`. Exported via the PyTorch **dynamo** path.
- **Batched inference** (pad to a common length + mask) is parity-safe **only for `layer_norm` feature extractors**, as flagged by `meta["batchable"]`. A `group_norm` extractor (the torchaudio base bundles) normalizes each channel **over time**, so right-padding shifts every valid frame's statistics and no mask recovers it; those models must be run **per-segment** (batch 1, all-ones mask). The HF xls-r models are batchable.
- **`meta.json`**: `{model_name, language, pipeline_type (torchaudio|huggingface), opset, contract_version, blank_id, n_labels, dictionary (char→id), batchable, inputs, outputs, source_revision, onnx_sha256, versions}`. This is everything needed to run and tokenize without a torch model.

## Published models

| Folder | Lang | Source | Loader |
|---|---|---|---|
| `WAV2VEC2_ASR_BASE_960H` | en | torchaudio bundle | torchaudio |
| `VOXPOPULI_ASR_BASE_10K_DE` | de | torchaudio bundle | torchaudio |
| `jonatasgrosman--wav2vec2-large-xlsr-53-russian` | ru | HF | huggingface |

More languages are added on demand: the exporter resolves any code in WhisperX's `DEFAULT_ALIGN_MODELS_{TORCH,HF}` tables (43 total) with no code change: `python golden/export_align_onnx.py --lang `.

## Usage (ONNX Runtime)

```python
import json

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

folder = "WAV2VEC2_ASR_BASE_960H"
with open(hf_hub_download("KonstantK/wav2vec2-align-onnx", f"{folder}/meta.json")) as f:
    meta = json.load(f)
sess = ort.InferenceSession(hf_hub_download("KonstantK/wav2vec2-align-onnx", f"{folder}/model.onnx"))

# waveform_16k_mono: caller-supplied 1-D array of raw 16 kHz mono samples
wav = waveform_16k_mono[None].astype("float32")   # (1, N)
mask = np.ones_like(wav, dtype=np.int64)          # all real (no padding)
logits, frame_lengths = sess.run(["emissions", "frame_lengths"],
                                 {"waveform": wav, "attention_mask": mask})
logits = logits[0, :int(frame_lengths[0])]        # trim to valid frames
# then: log_softmax over the last axis, optional wildcard extension, Viterbi forced-align.
# To batch: pad rows to a common length, set mask=0 on padding, but only when
# meta["batchable"] (layer_norm); run group_norm models one segment at a time.
```
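
The consumer-side post-processing named in the artifact contract (`log_softmax`, then a wildcard score equal to the max non-blank value per frame) can be sketched in NumPy as follows. The function names are hypothetical, and appending the wildcard as an extra last column is an assumption made for illustration; `whisperx/alignment.py` remains the reference for the exact layout.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax over `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def append_wildcard_column(log_probs, blank_id):
    """Append one column holding the max non-blank log-prob of each frame.

    log_probs: (T, V) array; returns a (T, V + 1) array.
    Assumption: the wildcard is stored as an extra trailing column.
    """
    non_blank = np.delete(log_probs, blank_id, axis=-1)
    wildcard = non_blank.max(axis=-1, keepdims=True)
    return np.concatenate([log_probs, wildcard], axis=-1)


# Stand-in for trimmed model output: 5 frames, 4 labels, blank at index 0.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4)).astype(np.float32)
log_probs = log_softmax(logits)
extended = append_wildcard_column(log_probs, blank_id=0)
```

In real use, `blank_id` would come from `meta["blank_id"]` and `logits` from the trimmed `emissions` output.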
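
For models where `meta["batchable"]` is true, the pad-and-mask step and the per-row trim by `frame_lengths` might look like the sketch below. `pad_batch` and `trim_rows` are hypothetical helper names, and the ONNX Runtime call is left as a comment so the block runs without a downloaded model.

```python
import numpy as np


def pad_batch(segments):
    """Right-pad 1-D waveforms to a common length.

    Returns (waveform, attention_mask) with shapes (B, N):
    float32 samples and an int64 mask (1 = real sample, 0 = padding).
    """
    max_len = max(len(seg) for seg in segments)
    waveform = np.zeros((len(segments), max_len), dtype=np.float32)
    mask = np.zeros((len(segments), max_len), dtype=np.int64)
    for i, seg in enumerate(segments):
        waveform[i, :len(seg)] = seg
        mask[i, :len(seg)] = 1
    return waveform, mask


def trim_rows(emissions, frame_lengths):
    """Cut each row of (B, T, V) emissions down to its valid frame count."""
    return [emissions[i, :int(n)] for i, n in enumerate(frame_lengths)]


# Two dummy segments of 1.0 s and 0.5 s at 16 kHz.
segments = [np.ones(16000, dtype=np.float32), np.ones(8000, dtype=np.float32)]
waveform, mask = pad_batch(segments)

# With `sess` and `meta` loaded as in the usage example:
# if meta["batchable"]:
#     emissions, frame_lengths = sess.run(
#         ["emissions", "frame_lengths"],
#         {"waveform": waveform, "attention_mask": mask},
#     )
#     per_row = trim_rows(emissions, frame_lengths)
```

When `meta["batchable"]` is false, each segment would instead be passed on its own with an all-ones mask.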
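
The reason `group_norm` extractors are not batchable can be illustrated with a toy NumPy comparison. This is a simplified sketch, not the model's actual layers: it only contrasts statistics taken over the time axis (which padding changes) with statistics taken per frame over channels (which padding leaves alone). It ignores learned scale/shift and convolution edge effects.

```python
import numpy as np


def norm_over_time(x, eps=1e-5):
    """group_norm-style: per-channel mean/variance over the time axis. x: (C, T)."""
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def norm_over_channels(x, eps=1e-5):
    """layer_norm-style: per-frame mean/variance over the channel axis. x: (C, T)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 50))                                 # 4 channels, 50 valid frames
padded = np.concatenate([feats, np.zeros((4, 30))], axis=1)      # 30 frames of right-padding

# Largest change on the 50 valid frames caused by the padding.
time_drift = np.abs(norm_over_time(padded)[:, :50] - norm_over_time(feats)).max()
chan_drift = np.abs(norm_over_channels(padded)[:, :50] - norm_over_channels(feats)).max()
print(f"over-time drift: {time_drift:.4f}, over-channels drift: {chan_drift:.4f}")
```

The over-time variant changes every valid frame once padding is appended, while the per-frame variant does not, which matches the per-segment requirement for `group_norm` models.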