wav2vec2 forced-alignment models: ONNX mirror for WhisperX

Parity-validated ONNX exports of the wav2vec2-CTC models WhisperX uses for word-level forced alignment, so the C++ engine core can run them under ONNX Runtime with no PyTorch and no Python at runtime.

These are produced by golden/export_align_onnx.py in the WhisperX repo (our own export, not a re-host) because parity with the committed torch emission goldens (emission_atol = 0.006) is only guaranteed by a pinned opset 17 / raw-logits export. Each model is self-checked against the golden emissions before upload.

Layout

<model_name>/            # '/' in HF ids sanitized to '--'
  model.onnx             # masked graph (contract v2), opset 18, dynamic {batch,time}
  meta.json              # dictionary, blank_id, contract, provenance, sha256
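
As a minimal sketch of working with this layout, the helpers below derive the folder name from a Hugging Face model id and hash a downloaded model.onnx for comparison against meta.json. The names folder_for and sha256_of are hypothetical, and the sketch assumes the onnx_sha256 field holds the lowercase hex SHA-256 digest of model.onnx; check the exporter before relying on that.

```python
import hashlib


def folder_for(model_id: str) -> str:
    """Map a model id to its folder name by replacing '/' with '--'."""
    return model_id.replace("/", "--")


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks so large models are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed check (field semantics not confirmed by this card):
#   sha256_of(model_path) == meta["onnx_sha256"]
print(folder_for("jonatasgrosman/wav2vec2-large-xlsr-53-russian"))
```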

Artifact contract (v2: masked, batchable)

  • model.onnx takes two inputs, waveform (B,N) float32 (raw 16 kHz mono, no feature normalization) and attention_mask (B,N) int64 (1 = real sample, 0 = padding), and emits two outputs: emissions (B,T,V) raw logits and frame_lengths (B,) (valid output frames per row, used to trim padded batches). The consumer applies log_softmax and the OOV wildcard column (max non-blank per frame), matching whisperx/alignment.py. Exported via the PyTorch dynamo path.
  • Batched inference (pad to a common length + mask) is parity-safe only for layer_norm feature extractors, as recorded in meta["batchable"]. A group_norm extractor (the torchaudio base bundles) normalizes each channel over time, so right-padding shifts every valid frame's statistics and no mask recovers it; those models must be run per-segment (batch 1, all-ones mask). The HF xls-r models are batchable.
  • meta.json: {model_name, language, pipeline_type (torchaudio|huggingface), opset, contract_version, blank_id, n_labels, dictionary (char→id), batchable, inputs, outputs, source_revision, onnx_sha256, versions}. This is everything needed to run and tokenize without a torch model.
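
The consumer-side steps named in the contract can be sketched in plain NumPy. This is an illustration under stated assumptions, not the code in whisperx/alignment.py: the helper names are hypothetical, the wildcard is appended as an extra last column (index n_labels), out-of-vocabulary characters are mapped to that index, and the toy dictionary stands in for a real meta.json. Text normalization (casing, word separators) depends on each model's dictionary and is left out.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def add_wildcard_column(log_probs: np.ndarray, blank_id: int) -> np.ndarray:
    """Append one column holding, per frame, the best non-blank score."""
    masked = log_probs.copy()
    masked[..., blank_id] = -np.inf  # exclude blank from the per-frame max
    wildcard = masked.max(axis=-1, keepdims=True)
    return np.concatenate([log_probs, wildcard], axis=-1)


def tokenize(text: str, dictionary: dict, wildcard_id: int) -> list:
    """Map characters to ids; characters missing from the dictionary get wildcard_id."""
    return [dictionary.get(ch, wildcard_id) for ch in text]


# Toy stand-ins for meta.json fields and one trimmed (T, V) emission matrix.
meta = {"blank_id": 0, "n_labels": 4, "dictionary": {"-": 0, "a": 1, "b": 2, "|": 3}}
logits = np.array([[2.0, 1.0, 0.0, -1.0],
                   [0.0, 3.0, 1.0, 0.5]], dtype=np.float32)

log_probs = add_wildcard_column(log_softmax(logits), meta["blank_id"])  # (T, V + 1)
tokens = tokenize("ab|z", meta["dictionary"], wildcard_id=meta["n_labels"])
```

The resulting (T, V + 1) matrix and token ids are what a Viterbi forced-alignment pass would consume; that pass itself is outside this sketch.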

Published models

Folder                                           Lang  Source             Loader
WAV2VEC2_ASR_BASE_960H                           en    torchaudio bundle  torchaudio
VOXPOPULI_ASR_BASE_10K_DE                        de    torchaudio bundle  torchaudio
jonatasgrosman--wav2vec2-large-xlsr-53-russian   ru    HF                 huggingface

More languages are added on demand. The exporter resolves any code in WhisperX's DEFAULT_ALIGN_MODELS_{TORCH,HF} tables (43 total) with no code change: python golden/export_align_onnx.py --lang <code>.

Usage (ONNX Runtime)

import onnxruntime as ort, numpy as np, json
from huggingface_hub import hf_hub_download

folder = "WAV2VEC2_ASR_BASE_960H"
meta = json.load(open(hf_hub_download("KonstantK/wav2vec2-align-onnx", f"{folder}/meta.json")))
sess = ort.InferenceSession(hf_hub_download("KonstantK/wav2vec2-align-onnx", f"{folder}/model.onnx"))

# waveform_16k_mono: your own 1-D NumPy array of raw 16 kHz mono samples (not defined here)
wav = waveform_16k_mono[None].astype("float32")           # (1, N)
mask = np.ones_like(wav, dtype=np.int64)                   # all real (no padding)
logits, frame_lengths = sess.run(["emissions", "frame_lengths"],
                                 {"waveform": wav, "attention_mask": mask})
logits = logits[0, :int(frame_lengths[0])]                 # trim to valid frames
# then: log_softmax over the last axis, optional wildcard extension, Viterbi forced-align.
# To batch: pad rows to a common length and set mask=0 on padding, but only when
# meta["batchable"] is true (layer_norm); run group_norm models one segment at a time.
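
The batching rule in the comments above could look roughly like this. It is a sketch with hypothetical helper names (pad_batch, trim_rows, run_segments); it assumes sess is an onnxruntime InferenceSession for one of these models and meta is the parsed meta.json, and it uses only the input and output names given in the contract.

```python
import numpy as np


def pad_batch(waveforms):
    """Right-pad 1-D waveforms with zeros; return (waveform, attention_mask)."""
    longest = max(len(w) for w in waveforms)
    batch = np.zeros((len(waveforms), longest), dtype=np.float32)
    mask = np.zeros((len(waveforms), longest), dtype=np.int64)
    for row, w in enumerate(waveforms):
        batch[row, :len(w)] = w
        mask[row, :len(w)] = 1  # 1 = real sample, 0 = padding
    return batch, mask


def trim_rows(emissions, frame_lengths):
    """Cut each row of (B, T, V) emissions down to its valid frame count."""
    return [emissions[i, :int(n)] for i, n in enumerate(frame_lengths)]


def run_segments(sess, meta, waveforms):
    """One padded batch when meta['batchable'] is true, else one segment per call."""
    groups = [waveforms] if meta.get("batchable") else [[w] for w in waveforms]
    results = []
    for group in groups:
        wav, mask = pad_batch(group)
        emissions, frame_lengths = sess.run(
            ["emissions", "frame_lengths"],
            {"waveform": wav, "attention_mask": mask},
        )
        results.extend(trim_rows(emissions, frame_lengths))
    return results


# Toy check of the padding layout; no model is needed for this part.
demo_wav, demo_mask = pad_batch([np.ones(4), np.ones(2)])
```

For a group_norm model each call sees a single unpadded segment with an all-ones mask, which is the only mode the contract describes as parity-safe for those models.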