wav2vec2 forced-alignment models: ONNX mirror for WhisperX
Parity-validated ONNX exports of the wav2vec2-CTC models WhisperX uses for word-level forced alignment, so the C++ engine core can run them under ONNX Runtime with no PyTorch and no Python at runtime.
These are produced by `golden/export_align_onnx.py` in the WhisperX repo. They are our own export, not a re-host, because parity with the committed torch emission goldens (`emission_atol = 0.006`) is only guaranteed by a pinned opset 17 / raw-logits export. Each model is self-checked against the golden emissions before upload.
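The self-check described above can be pictured roughly as follows. This is a minimal sketch, assuming the tolerance is applied as a maximum absolute difference between ONNX and golden emissions; the function name `emissions_match` is hypothetical, and the exporter's actual check may differ.

```python
import numpy as np

EMISSION_ATOL = 0.006  # tolerance quoted above (emission_atol)

def emissions_match(onnx_emissions, golden_emissions, atol=EMISSION_ATOL):
    """Return (ok, max_abs_diff) for two emission arrays of identical shape.

    Assumption: parity means max |onnx - golden| <= atol.
    """
    a = np.asarray(onnx_emissions, dtype=np.float64)
    b = np.asarray(golden_emissions, dtype=np.float64)
    if a.shape != b.shape:
        # A shape mismatch can never be in parity.
        return False, float("inf")
    max_diff = float(np.max(np.abs(a - b))) if a.size else 0.0
    return max_diff <= atol, max_diff
```

Returning the maximum difference alongside the boolean makes it easy to log how close an export sits to the tolerance.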
Layout
```
<model_name>/      # '/' in HF ids sanitized to '--'
  model.onnx       # masked graph (contract v2), opset 18, dynamic {batch,time}
  meta.json        # dictionary, blank_id, contract, provenance, sha256
```
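As an illustration of the layout, the sketch below maps a model id to its folder name and checks a downloaded `model.onnx` against the hash stored in `meta.json`. The helper names are hypothetical, and it assumes `onnx_sha256` holds the hex SHA-256 digest of the `model.onnx` file.

```python
import hashlib

def folder_for(model_id: str) -> str:
    """Map a model id to its folder name: '/' in HF ids becomes '--'."""
    return model_id.replace("/", "--")

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large models fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model(model_path: str, meta: dict) -> bool:
    """Compare the file hash with meta['onnx_sha256'] (assumed hex digest)."""
    return sha256_of(model_path) == str(meta["onnx_sha256"]).lower()
```

For example, `folder_for("jonatasgrosman/wav2vec2-large-xlsr-53-russian")` yields the folder name listed in the table of published models.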
Artifact contract (v2: masked, batchable)
- `model.onnx` takes two inputs: `waveform (B,N) float32` (raw 16 kHz mono, no feature normalization) and `attention_mask (B,N) int64` (1 = real sample, 0 = padding). It emits two outputs: `emissions (B,T,V)` raw logits and `frame_lengths (B,)` (valid output frames per row, used to trim padded batches). The consumer applies `log_softmax` and the OOV wildcard column (max non-blank per frame), matching `whisperx/alignment.py`. Exported via the PyTorch dynamo path.
- Batched inference (pad to a common length + mask) is parity-safe only for `layer_norm` feature extractors; see `meta["batchable"]`. A `group_norm` extractor (the torchaudio base bundles) normalizes each channel over time, so right-padding shifts every valid frame's statistics and no mask recovers it; those models must be run per-segment (batch 1, all-ones mask). The HF xls-r models are batchable.
- `meta.json`: `{model_name, language, pipeline_type (torchaudio|huggingface), opset, contract_version, blank_id, n_labels, dictionary (char→id), batchable, inputs, outputs, source_revision, onnx_sha256, versions}`. This is everything needed to run and tokenize without a torch model.
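The consumer-side post-processing named in the contract can be sketched in NumPy as below. This is an illustration under assumptions rather than the code in `whisperx/alignment.py`: the function names are hypothetical, and the wildcard score is appended as an extra last column holding the maximum non-blank log-probability of each frame.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax over the given axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

def add_wildcard_column(log_probs, blank_id):
    """Extend (T, V) log-probs to (T, V + 1).

    The new last column is the per-frame maximum over all non-blank labels,
    so an out-of-vocabulary character can be scored against it.
    """
    masked = log_probs.copy()
    masked[:, blank_id] = -np.inf           # exclude blank from the maximum
    wildcard = masked.max(axis=1, keepdims=True)
    return np.concatenate([log_probs, wildcard], axis=1)
```

With `blank_id` taken from `meta.json`, an OOV character would then be assigned index `V` (the appended column) during alignment; that indexing choice is part of this sketch, not of the contract.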
Published models
| Folder | Lang | Source | Loader |
|---|---|---|---|
| `WAV2VEC2_ASR_BASE_960H` | en | torchaudio bundle | torchaudio |
| `VOXPOPULI_ASR_BASE_10K_DE` | de | torchaudio bundle | torchaudio |
| `jonatasgrosman--wav2vec2-large-xlsr-53-russian` | ru | HF | huggingface |
More languages are added on demand. The exporter resolves any code in WhisperX's `DEFAULT_ALIGN_MODELS_{TORCH,HF}` tables (43 total) with no code change: `python golden/export_align_onnx.py --lang <code>`.
Usage (ONNX Runtime)
```python
import json

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

repo = "KonstantK/wav2vec2-align-onnx"
folder = "WAV2VEC2_ASR_BASE_960H"

with open(hf_hub_download(repo, f"{folder}/meta.json")) as f:
    meta = json.load(f)
sess = ort.InferenceSession(hf_hub_download(repo, f"{folder}/model.onnx"))

# waveform_16k_mono: your 1-D array of raw 16 kHz mono samples
wav = waveform_16k_mono[None].astype("float32")   # (1, N)
mask = np.ones_like(wav, dtype=np.int64)          # all real (no padding)
logits, frame_lengths = sess.run(
    ["emissions", "frame_lengths"],
    {"waveform": wav, "attention_mask": mask},
)
logits = logits[0, :int(frame_lengths[0])]        # trim to valid frames
# then: log_softmax over the last axis, optional wildcard extension, Viterbi forced-align.
# To batch: pad rows to a common length and set mask=0 on padding, but only when
# meta["batchable"] (layer_norm); run group_norm models one segment at a time.
```
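The batching note above could look roughly like this for models where `meta["batchable"]` is true. It is a sketch only: `pad_batch` and `trim_rows` are hypothetical helper names, and zero is assumed as the padding value for masked samples.

```python
import numpy as np

def pad_batch(segments):
    """Right-pad 1-D waveforms to a common length.

    Returns (waveform, attention_mask) with shapes (B, N_max):
    waveform is float32, attention_mask is int64 with 1 = real, 0 = padding.
    """
    lengths = [len(s) for s in segments]
    n_max = max(lengths)
    wav = np.zeros((len(segments), n_max), dtype=np.float32)
    mask = np.zeros((len(segments), n_max), dtype=np.int64)
    for i, seg in enumerate(segments):
        wav[i, :lengths[i]] = seg
        mask[i, :lengths[i]] = 1
    return wav, mask

def trim_rows(emissions, frame_lengths):
    """Cut each row of (B, T, V) emissions down to its valid frames."""
    return [emissions[i, :int(n)] for i, n in enumerate(frame_lengths)]
```

The padded arrays are passed as the `waveform` and `attention_mask` inputs, and `trim_rows` is applied to the returned `emissions` using `frame_lengths`. For `group_norm` models, call the session once per segment with an all-ones mask instead.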