# omniASR-W2V-1B

A 1B-parameter Wav2Vec2 self-supervised (SSL) encoder, converted from the OmniLingual fairseq2 checkpoint `omniASR_W2V_1B`.

This is the pre-trained encoder backbone without a CTC head, suitable for feature extraction, probing, and fine-tuning on downstream speech tasks.

## Code Base

The conversion code can be found here. Due to GPU limitations, only the 300M and 1B models have been converted so far; contributions are welcome.

## Model details

| Property | Value |
|---|---|
| HF class | `Wav2Vec2Model` |
| Encoder layers | 48 |
| Hidden size | 1280 |
| Attention heads | 16 |
| FFN intermediate size | 5120 |
| Source framework | fairseq2 |
| Source card | `omniASR_W2V_1B` |
| Parity verification | ✅ Verified |

Numerical parity with the original fairseq2 checkpoint has been verified: outputs match within `atol=1e-4` on a held-out audio sample.

Embedding statistics on that clip: shape `(1, 175, 1280)`, `max_abs_diff=0.00e+00`, `mean_diff=0.00e+00`, `std_diff=0.00e+00`.
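The 175-frame sequence length follows from the Wav2Vec2 convolutional feature extractor, which (with the standard wav2vec 2.0 kernel sizes `(10, 3, 3, 3, 3, 2, 2)` and strides `(5, 2, 2, 2, 2, 2, 2)`) downsamples 16 kHz audio by a factor of 320, i.e. one frame per 20 ms. A quick sketch of the arithmetic — the 56,320-sample input length is an assumption chosen to reproduce the reported shape, not a value stated by the card:

```python
# Frame count produced by the standard Wav2Vec2 conv feature extractor.
# Each conv layer yields floor((length - kernel) / stride) + 1 output steps.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples: int) -> int:
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return length

print(num_frames(16_000))   # 1 s of 16 kHz audio -> 49 frames (~50 Hz)
print(num_frames(56_320))   # a ~3.52 s clip -> 175 frames, matching the shape above
```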

## Usage

```python
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
import torch, torchaudio

extractor = Wav2Vec2FeatureExtractor.from_pretrained("aadel4/omniASR-W2V-1B")
model = Wav2Vec2Model.from_pretrained("aadel4/omniASR-W2V-1B")
model.eval()

waveform, sr = torchaudio.load("audio.wav")
if waveform.shape[0] > 1:          # downmix multi-channel audio to mono
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16_000:                   # the model expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = extractor(
    waveform.squeeze().numpy(), sampling_rate=16_000,
    return_tensors="pt", padding=True,
)
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, T, 1280)
```
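For probing or other utterance-level downstream tasks, the frame-level embeddings are commonly mean-pooled into a single vector per clip. A minimal sketch — the `mean_pool` helper is illustrative, not part of the released code:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Average frame embeddings: (batch, frames, hidden) -> (batch, hidden)."""
    return last_hidden_state.mean(dim=1)

# e.g. with (1, 175, 1280) frame embeddings like those above:
pooled = mean_pool(torch.randn(1, 175, 1280))
print(tuple(pooled.shape))  # (1, 1280)
```

For batched, padded inputs you would want to mask out padding frames before averaging; with a single clip, plain mean pooling suffices.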
Weights are distributed as safetensors (1.0B params, F32 tensors).