# omniASR-W2V-300M

Wav2Vec2 SSL encoder (300M) converted from the OmniLingual fairseq2 checkpoint omniASR_W2V_300M.

This is the pre-trained encoder backbone without a CTC head, suitable for feature extraction, probing, and fine-tuning on downstream speech tasks.

## Code Base

The code base for the conversion can be found here. I was only able to convert the 300M and 1B models due to GPU limitations. Contributions are welcome.

## Model details

| Property | Value |
|---|---|
| HF class | `Wav2Vec2Model` |
| Encoder layers | 24 |
| Hidden size | 1024 |
| Attention heads | 16 |
| FFN intermediate size | 4096 |
| Source framework | fairseq2 |
| Source card | omniASR_W2V_300M |
| Parity verification | ✅ Verified |
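
For planning purposes, the number of output frames per clip can be estimated from the Wav2Vec2 convolutional frontend, which strides 16 kHz audio down by a factor of roughly 320 (about one frame per 20 ms). A minimal sketch, assuming the stock Wav2Vec2 feature-encoder kernel/stride configuration (an assumption about this checkpoint's conv frontend):

```python
# Kernel sizes and strides of the stock Wav2Vec2 feature encoder
# (assumed here to match this checkpoint's conv frontend).
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples: int) -> int:
    """Number of encoder frames produced for a raw 16 kHz waveform."""
    n = num_samples
    for kernel, stride in CONV_LAYERS:
        n = (n - kernel) // stride + 1
    return n

print(num_frames(10 * 16_000))  # 10 s of audio -> 499 frames (~50 fps)
```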

Numerical parity against the original fairseq2 checkpoint has been confirmed: outputs match to within atol=1e-4 on a held-out audio sample.

Embedding statistics on the held-out audio clip: embedding shape (1, 175, 1024), max_abs_diff=0.00e+00, mean_diff=0.00e+00, std_diff=0.00e+00
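
A parity check of this kind amounts to comparing the two models' hidden states elementwise. A minimal sketch using NumPy on stand-in arrays (in the real check, these would be the fairseq2 encoder output and the converted `Wav2Vec2Model` output for the same 16 kHz clip):

```python
import numpy as np

# Stand-ins for the two models' outputs; shapes match the held-out clip above.
rng = np.random.default_rng(0)
ref = rng.standard_normal((1, 175, 1024)).astype(np.float32)
hyp = ref.copy()  # a faithful conversion should reproduce ref exactly

max_abs_diff = np.abs(ref - hyp).max()
assert np.allclose(ref, hyp, atol=1e-4), f"parity failed: {max_abs_diff=}"
print(f"max_abs_diff={max_abs_diff:.2e}")  # 0.00e+00 for identical outputs
```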

## Usage

```python
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
import torch, torchaudio

extractor = Wav2Vec2FeatureExtractor.from_pretrained("aadel4/omniASR-W2V-300M")
model = Wav2Vec2Model.from_pretrained("aadel4/omniASR-W2V-300M")
model.eval()

# Load audio and resample to the 16 kHz rate the model expects
waveform, sr = torchaudio.load("audio.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)
if waveform.shape[0] > 1:  # downmix multi-channel audio to mono
    waveform = waveform.mean(dim=0, keepdim=True)

inputs = extractor(
    waveform.squeeze().numpy(), sampling_rate=16_000,
    return_tensors="pt", padding=True
)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, T, 1024)
```
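
For probing or other utterance-level tasks, the frame-level embeddings are commonly mean-pooled over time into a single vector per clip. A sketch on a stand-in array (in practice, `embeddings` would come from the model call above):

```python
import numpy as np

# Stand-in for model(**inputs).last_hidden_state: shape (batch, T, hidden)
embeddings = np.random.default_rng(0).standard_normal((1, 175, 1024))

# Mean-pool over the time axis -> one 1024-d vector per utterance,
# usable as input to a linear probe or lightweight classifier.
utterance_vec = embeddings.mean(axis=1)
print(utterance_vec.shape)  # (1, 1024)
```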