# omniASR-W2V-1B

A Wav2Vec2 self-supervised (SSL) encoder (1B parameters) converted from the OmniLingual fairseq2 checkpoint `omniASR_W2V_1B`.
This is the pre-trained encoder backbone without a CTC head, suitable for feature extraction, probing, and fine-tuning on downstream speech tasks.
## Code Base

The code base for the conversion can be found here. Due to GPU limitations, only the 300M and 1B models have been converted so far; contributions are welcome.
## Model details
| Property | Value |
|---|---|
| HF class | Wav2Vec2Model |
| Encoder layers | 48 |
| Hidden size | 1280 |
| Attention heads | 16 |
| FFN intermediate | 5120 |
| Source framework | fairseq2 |
| Source card | omniASR_W2V_1B |
| Parity verification | ✅ Verified |
Numerical parity against the original fairseq2 checkpoint has been confirmed: outputs match to within `atol=1e-4` on a held-out audio sample.

Embedding statistics on the held-out clip: shape `(1, 175, 1280)`, `max_abs_diff=0.00e+00`, `mean_diff=0.00e+00`, `std_diff=0.00e+00`.
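The parity check above boils down to comparing the reference fairseq2 embeddings against the converted model's output. A minimal sketch of that comparison logic (the `parity_report` helper is hypothetical, and the random tensors below are stand-ins for the two models' real outputs):

```python
import torch

def parity_report(ref: torch.Tensor, hf: torch.Tensor, atol: float = 1e-4) -> dict:
    # Element-wise absolute differences between the two embedding tensors
    diff = (ref - hf).abs()
    return {
        "max_abs_diff": diff.max().item(),
        "mean_diff": diff.mean().item(),
        "std_diff": diff.std().item(),
        "within_atol": bool(torch.allclose(ref, hf, atol=atol)),
    }

# Stand-ins shaped like the reported embeddings (1, 175, 1280); in practice
# `ref` comes from the fairseq2 checkpoint and `hf` from Wav2Vec2Model.
ref = torch.randn(1, 175, 1280)
hf = ref.clone()
print(parity_report(ref, hf))
```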
## Usage
```python
import torch
import torchaudio
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("aadel4/omniASR-W2V-1B")
model = Wav2Vec2Model.from_pretrained("aadel4/omniASR-W2V-1B")
model.eval()

# Load audio and resample to the 16 kHz rate the model expects
waveform, sr = torchaudio.load("audio.wav")
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = extractor(
    waveform.squeeze().numpy(), sampling_rate=16_000,
    return_tensors="pt", padding=True
)

with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, T, 1280)
```
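For probing or lightweight downstream classification, a common approach is to mean-pool the frame-level embeddings into a single utterance vector. A minimal sketch, using a dummy tensor as a stand-in for the `embeddings` produced above:

```python
import torch

def mean_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    # Average over the time (frame) axis: (batch, frames, 1280) -> (batch, 1280)
    return hidden_states.mean(dim=1)

# Dummy tensor shaped like the model output; in practice pass `embeddings`
dummy = torch.randn(1, 175, 1280)
utterance_vector = mean_pool(dummy)
print(utterance_vector.shape)  # torch.Size([1, 1280])
```

The pooled vector can then feed a linear probe or small classifier without fine-tuning the encoder.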