# omniASR-W2V-300M

Wav2Vec2 SSL encoder (300M) converted from the OmniLingual fairseq2 checkpoint omniASR_W2V_300M.

This is the pre-trained encoder backbone without a CTC head, suitable for feature extraction, probing, and fine-tuning on downstream speech tasks.

## Code Base

The code base for the conversion can be found here. I was only able to convert the 300M and 1B models due to GPU limitations. Contributions are welcome.

## Model details

| Property | Value |
|---|---|
| HF class | `Wav2Vec2Model` |
| Encoder layers | 24 |
| Hidden size | 1024 |
| Attention heads | 16 |
| FFN intermediate size | 4096 |
| Source framework | fairseq2 |
| Source card | omniASR_W2V_300M |
| Parity verification | ✅ Verified |
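
For planning purposes, the number of output frames per clip can be estimated from the Wav2Vec2 convolutional frontend, which strides 16 kHz audio down by a factor of roughly 320 (about one frame per 20 ms). A minimal sketch, assuming the stock Wav2Vec2 feature-encoder kernel/stride configuration (an assumption about this checkpoint's conv frontend):

```python
# Kernel sizes and strides of the stock Wav2Vec2 feature encoder
# (assumed here to match this checkpoint's conv frontend).
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples: int) -> int:
    """Number of encoder frames produced for a raw 16 kHz waveform."""
    n = num_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1
    return n

print(num_frames(10 * 16_000))  # 10 s of audio -> 499 frames (~50 fps)
```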

Numerical parity against the original fairseq2 checkpoint has been confirmed: outputs match to within atol=1e-4 on a held-out audio sample.

Embedding statistics on the held-out audio clip: embedding shape (1, 175, 1024), max_abs_diff=0.00e+00, mean_diff=0.00e+00, std_diff=0.00e+00
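
A parity check of this kind amounts to comparing the two models' hidden states elementwise. A minimal sketch using NumPy on stand-in arrays (in the real check, these would be the fairseq2 encoder output and the converted `Wav2Vec2Model` output for the same 16 kHz clip):

```python
import numpy as np

# Stand-ins for the two models' outputs; shapes match the held-out clip above.
rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 175, 1024)).astype(np.float32)
hyp = ref.copy()  # a faithful conversion should reproduce ref exactly

max_abs_diff = np.abs(ref - hyp).max()
assert np.allclose(ref, hyp, atol=1e-4), f"parity failed: {max_abs_diff=}"
print(f"max_abs_diff={max_abs_diff:.2e}")  # 0.00e+00 for identical outputs
```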

## Usage

```python
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
import torch, torchaudio

extractor = Wav2Vec2FeatureExtractor.from_pretrained("aadel4/omniASR-W2V-300M")
model = Wav2Vec2Model.from_pretrained("aadel4/omniASR-W2V-300M")
model.eval()

# Load audio and resample to the 16 kHz rate the model expects
waveform, sr = torchaudio.load("audio.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)
if waveform.shape[0] > 1:  # downmix multi-channel audio to mono
    waveform = waveform.mean(dim=0, keepdim=True)

inputs = extractor(
    waveform.squeeze().numpy(), sampling_rate=16_000,
    return_tensors="pt", padding=True
)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, T, 1024)
```
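
For probing or other utterance-level tasks, the frame-level embeddings are commonly mean-pooled over time into a single vector per clip. A sketch on a stand-in array (in practice, `embeddings` would come from the model call above):

```python
import numpy as np

# Stand-in for model(**inputs).last_hidden_state: shape (batch, T, hidden)
embeddings = np.random.default_rng(0).standard_normal((1, 175, 1024))

# Mean-pool over the time axis -> one 1024-d vector per utterance,
# usable as input to a linear probe or lightweight classifier.
utterance_vec = embeddings.mean(axis=1)
print(utterance_vec.shape)  # (1, 1024)
```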